Using PDFBox to remove Optional Content Groups that are not enabled

Using PDFBox to remove Optional Content Groups that are not enabled - java

I'm using apache PDFBox from java, and I have a source PDF with multiple optional content groups. What I am wanting to do is export a version of the PDF that includes only the standard content and the optional content groups that were enabled. It is important for my purposes that I preserve any dynamic aspects of the original.... so text fields are still text fields, vector images are still vector images, etc. The reason that this is required is because I intend to ultimately be using a pdf form editor program that does not know how to handle optional content, and would blindly render all of them, so I want to preprocess the source pdf, and use the form editing program on a less cluttered destination pdf.
I've been trying to find something that could give me any hints on how to do this with google, but to no avail. I don't know if I'm just using the wrong search terms, or if this is just something that is outside of what the PDFBox API was designed for. I rather hope it's not the latter. The info shown here does not seem to work (converting the C# code to java), because despite the pdf I'm trying to import having optional content, there does not seem to be any OC resources when I examine the tokens on each page.
for(PDPage page:pages) {
PDResources resources = page.getResources();
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
Collection tokens = parser.getTokens();
...
}
I'm truly sorry for not having any more code to show what I've tried so far, but I've just been poring over the java API docs for about 8 hours now trying to figure out what I might need to do this, and just haven't been able to figure it out.
What I DO know how to do is add text, lines, and images to a new PDPage, but I do not know how to retrieve that information from a given source page to copy it over, nor how to tell which optional content group such information is part of (if any). I am also not sure how to copy form fields in the source pdf over to the destination, nor how to copy the font information over.
Honestly, if there's a web page out there that I wasn't able to find with google with the searches that I tried, I'd be entirely happy to read up more about it, but I am really quite stuck here, and I don't know anyone personally that knows about this library.
Please help.
EDIT:
Trying what I understand from what was suggested below, I've written a loop to examine each XObject on the page as follows:
PDResources resources = pdPage.getResources();
Iterable<COSName> names = resources.getXObjectNames();
for(COSName name:names) {
PDXObject xobj = resources.getXObject(name);
PDFStreamParser parser = new PDFStreamParser(xobj.getStream().toByteArray());
parser.parse();
Object [] tokens = parser.getTokens().toArray();
for(int i = 0;i<tokens.length-1;i++) {
Object obj = tokens[i];
if (obj instanceof COSName && obj.equals(COSName.OC)) {
i++;
Object obj = tokens[i];
if (obj instanceof COSName) {
PDPropertyList props = resources.getProperties((COSName)obj);
if (props != null) {
...
However, after an OC key, the next entry in the tokens array is always an Operator tagged as "BMC". Nowhere am I finding any info that I can recognize from the named optional content groups.

Here's a robust solution for removing marked content blocks (open to feedback if anyone finds anything that isn't working right). You should be able to adjust for OC blocks...
This code properly handles nesting and removal of resources (xobject, graphics state and fonts - easy to add others if needed).
public class MarkedContentRemover {
private final MarkedContentMatcher matcher;
/**
*
*/
public MarkedContentRemover(MarkedContentMatcher matcher) {
this.matcher = matcher;
}
public int removeMarkedContent(PDDocument doc, PDPage page) throws IOException {
ResourceSuppressionTracker resourceSuppressionTracker = new ResourceSuppressionTracker();
PDResources pdResources = page.getResources();
PDFStreamParser pdParser = new PDFStreamParser(page);
PDStream newContents = new PDStream(doc);
OutputStream newContentOutput = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter newContentWriter = new ContentStreamWriter(newContentOutput);
List<Object> operands = new ArrayList<>();
Operator operator = null;
Object token;
int suppressDepth = 0;
boolean resumeOutputOnNextOperator = false;
int removedCount = 0;
while (true) {
operands.clear();
token = pdParser.parseNextToken();
while(token != null && !(token instanceof Operator)) {
operands.add(token);
token = pdParser.parseNextToken();
}
operator = (Operator)token;
if (operator == null) break;
if (resumeOutputOnNextOperator) {
resumeOutputOnNextOperator = false;
suppressDepth--;
if (suppressDepth == 0)
removedCount++;
}
if (OperatorName.BEGIN_MARKED_CONTENT_SEQ.equals(operator.getName())
|| OperatorName.BEGIN_MARKED_CONTENT.equals(operator.getName())) {
COSName contentId = (COSName)operands.get(0);
final COSDictionary properties;
if (operands.size() > 1) {
Object propsOperand = operands.get(1);
if (propsOperand instanceof COSDictionary) {
properties = (COSDictionary) propsOperand;
} else if (propsOperand instanceof COSName) {
properties = pdResources.getProperties((COSName)propsOperand).getCOSObject();
} else {
properties = new COSDictionary();
}
} else {
properties = new COSDictionary();
}
if (matcher.matches(contentId, properties)) {
suppressDepth++;
}
}
if (OperatorName.END_MARKED_CONTENT.equals(operator.getName())) {
if (suppressDepth > 0)
resumeOutputOnNextOperator = true;
}
else if (OperatorName.SET_GRAPHICS_STATE_PARAMS.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.EXT_G_STATE, operands.get(0), suppressDepth == 0);
}
else if (OperatorName.DRAW_OBJECT.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.XOBJECT, operands.get(0), suppressDepth == 0);
}
else if (OperatorName.SET_FONT_AND_SIZE.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.FONT, operands.get(0), suppressDepth == 0);
}
if (suppressDepth == 0) {
newContentWriter.writeTokens(operands);
newContentWriter.writeTokens(operator);
}
}
if (resumeOutputOnNextOperator)
removedCount++;
newContentOutput.close();
page.setContents(newContents);
resourceSuppressionTracker.updateResources(pdResources);
return removedCount;
}
private static class ResourceSuppressionTracker{
// if the boolean is TRUE, then the resource should be removed. If the boolean is FALSE, the resource should not be removed
private final Map<COSName, Map<COSName, Boolean>> tracker = new HashMap<>();
public void markForOperator(COSName resourceType, Object resourceNameOperand, boolean preserve) {
if (!(resourceNameOperand instanceof COSName)) return;
if (preserve) {
markForPreservation(resourceType, (COSName)resourceNameOperand);
} else {
markForRemoval(resourceType, (COSName)resourceNameOperand);
}
}
public void markForRemoval(COSName resourceType, COSName refId) {
if (!resourceIsPreserved(resourceType, refId)) {
getResourceTracker(resourceType).put(refId, Boolean.TRUE);
}
}
public void markForPreservation(COSName resourceType, COSName refId) {
getResourceTracker(resourceType).put(refId, Boolean.FALSE);
}
public void updateResources(PDResources pdResources) {
for (Map.Entry<COSName, Map<COSName, Boolean>> resourceEntry : tracker.entrySet()) {
for(Map.Entry<COSName, Boolean> refEntry : resourceEntry.getValue().entrySet()) {
if (refEntry.getValue().equals(Boolean.TRUE)) {
pdResources.getCOSObject().getCOSDictionary(COSName.XOBJECT).removeItem(refEntry.getKey());
}
}
}
}
private boolean resourceIsPreserved(COSName resourceType, COSName refId) {
return getResourceTracker(resourceType).getOrDefault(refId, Boolean.FALSE);
}
private Map<COSName, Boolean> getResourceTracker(COSName resourceType){
if (!tracker.containsKey(resourceType)) {
tracker.put(resourceType, new HashMap<>());
}
return tracker.get(resourceType);
}
}
}
Helper class:
public interface MarkedContentMatcher {
public boolean matches(COSName contentId, COSDictionary props);
}

Optional Content Groups are marked with BDC and EMC. You will have to navigate through all of the tokens returned from the parser and remove the "section" from the array. Here is some C# Code that was posted a while ago - [1]: How to delete an optional content group alongwith its content from pdf using pdfbox?
I investigated that (converting to Java) but couldn't get it work as expected. I managed to remove the content between BDC and EMC and then save the result using the same technique as the sample but the PDF was corrupted. Perhaps that is my lack of C# Knowledge (related to Tuples etc.)
Here is what I came up with, as I said it doesn't work perhaps you or someone else (mkl, Tilman Hausherr) can spot the flaw.
OCGDelete (PDDocument doc, int pageNum, String OCName) {
PDPage pdPage = (PDPage) doc.getDocumentCatalog().getPages().get(pageNum);
PDResources pdResources = pdPage.getResources();
PDFStreamParser pdParser = new PDFStreamParser(pdPage);
int ocgStart
int ocgLength
Collection tokens = pdParser.getTokens();
Object[] newTokens = tokens.toArray()
try {
for (int index = 0; index < newTokens.length; index++) {
obj = newTokens[index]
if (obj instanceof COSName && obj.equals(COSName.OC)) {
// println "Found COSName at "+index /// Found Optional Content
startIndex = index
index++
if (index < newTokens.size()) {
obj = newTokens[index]
if (obj instanceof COSName) {
prop = pdRes.getProperties(obj)
if (prop != null && prop instanceof PDOptionalContentGroup) {
if ((prop.getName()).equals(delLayer)) {
println "Found the Layer to be deleted"
println "prop Name was " + prop.getName()
index++
if (index < newTokens.size()) {
obj = newTokens[index]
if ((obj.getName()).equals("BDC")) {
ocgStart = index
println("OCG Start " + ocgStart)
ocgLength = -1
index++
while (index < newTokens.size()) {
ocgLength++
obj = newTokens[index]
println " Loop through relevant OCG Tokens " + obj
if (obj instanceof Operator && (obj.getName()).equals("EMC")) {
println "the next obj was " + obj
println "after that " + newTokens[index + 1] + "and then " + newTokens[index + 2]
println("OCG End " + ocgLength++)
break
}
index++
}
if (endIndex > 0) {
println "End Index was something " + (startIndex + ocgLength)
}
}
}
}
}
}
}
}
}
}
catch (Exception ex){
println ex.message()
}
for (int i = ocgStart; i < ocgStart+ ocgLength; i++){
newTokens.removeAt(i)
}
PDStream newContents = new PDStream(doc);
OutputStream output = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter writer = new ContentStreamWriter(output);
writer.writeTokens(newTokens);
output.close();
pdPage.setContents(newContents);
}

Related

How to remove Images from PDF File?

Hello ,thank you for answer my question.This proble is perplex me for a long time.
I have search this QS for a long time,I read so many article in stack overFlow and google,but those articles is outdated or fragmented,so I have to seek for help.
I hope some one can help me ,please.
public class TEST04 {
public static void main(String[] args) throws IOException {
System.out.println("Hi");
//ori pdf file
String oriPDFFile = IFileUtils.getDesktopPath().getAbsoluteFile() + "\\1.pdf";
//out pdf file
String outPDFFile = IFileUtils.getDesktopPath().getAbsoluteFile() + "\\2.pdf";
strip(oriPDFFile, outPDFFile);
}
//parse
public static void strip(String pdfFile, String pdfFileOut) throws IOException {
//load ori pdf file
PDDocument document = PDDocument.load(new File(pdfFile));
//get All pages
List<PDPage> pageList = IterUtil.toList(document.getDocumentCatalog().getPages());
for (int i = 0; i < pageList.size(); i++) {
PDPage page = pageList.get(i);
COSDictionary newDictionary = new COSDictionary(page.getCOSObject());
PDFStreamParser parser = new PDFStreamParser(page);
List tokens = parser.getTokens();
List newTokens = new ArrayList();
for (int j = 0; j < tokens.size(); j++) {
Object token = tokens.get(j);
if (token instanceof Operator) {
Operator operator = (Operator) token;
if (operator.getName().equals("Do")) {
COSName cosName = (COSName) newTokens.remove(newTokens.size() - 1);
deleteObject(newDictionary, cosName);
continue;
}
}
newTokens.add(token);
}
PDStream newContents = new PDStream(document);
try (OutputStream outputStream = newContents.createOutputStream()) {
ContentStreamWriter writer = new ContentStreamWriter(outputStream);
writer.writeTokens(newTokens);
}
page.setContents(newContents);
// ContentStreamWriter writer = new ContentStreamWriter(newContents.createOutputStream());
// writer.writeTokens( newTokens );
// page.setContents(newContents);
PDResources newResources = new PDResources(newDictionary);
page.setResources(newResources);
}
document.save(pdfFileOut);
document.close();
}
//delete
public static boolean deleteObject(COSDictionary d, COSName name) {
for(COSName key : d.keySet()) {
if( name.equals(key) ) {
d.removeItem(key);
return true;
}
COSBase object = d.getDictionaryObject(key);
if(object instanceof COSDictionary) {
if( deleteObject((COSDictionary)object, name) ) {
return true;
}
}
}
return false;
}
}
The stack trace:

It works same way like it does in example RemoveAllText.java, just with different tag.
Use code from this example, just use "Do" instead of "Tj".
Of course, if you need to load metadata, etc, you should enumerate and check images threw page resources (like in my example)

Following the tip in Ali Yavari's answer you created a test class. Unfortunately that test code produced an exception. This answer focuses on fixing your code.
According to the stack trace you posted an image of the exception occurred while saving the document; some stream was asked to provide an InputStream and it failed with the message "Cannot read while there is an open stream writer".
So, let's have a look where your code opens a stream writer but does not close it again:
PDStream newContents = new PDStream(document);
ContentStreamWriter writer = new ContentStreamWriter(newContents.createOutputStream());
writer.writeTokens( newTokens );
page.setContents(newContents);
Indeed, here you ask a stream (the PDStream newContents) for something to write to (newContents.createOutputStream()) but don't close it.
You can do that like this:
PDStream newContents = new PDStream(document);
try (OutputStream outputStream = newContents.createOutputStream()) {
ContentStreamWriter writer = new ContentStreamWriter(outputStream);
writer.writeTokens(newTokens);
}
page.setContents(newContents);
A side note, you will have to re-write what you do with the newDictionary object. Currently you
initialize it with the page dictionary entries,
recursively remove all entries with a key that is a name of an image you remove, and
set the page resources to this dictionary.
Item 2 can delete much more than you actually want, the same name in a different dictionary may refer to an entry with a completely different meaning. Furthermore, you recurse without further checks; if there is a circular relation among the dictionaries, this may result in an infinite recursion, i.e. a stack overflow exception.
Item 3 sets this manipulated page clone inappropriately as the resources of the original page. This create a completely broken page structure.
Instead you should retrieve the resources from the page (resources = page.getResources()) and remove the images by putting them to null (resources.put(cosName, (PDXObject)null)).

In my other answer I focused on advise on how to fix the code in the question. Here I focus on a different approach to the task.
In your code you try to remove the bitmap images by inspecting the page content streams, finding Do operations therein drawing XObjects, and removing both this instruction and the referenced XObject.
It is a bit easier to instead simply replace all image XObjects in the resources by an empty form XObject. This is the approach used here.
As that approach is very easy to implement, I extended it to not only go through the immediate resources of the pages but also iterate into embedded form XObjects and patterns.
void replaceBitmapImagesResources(PDDocument document) throws IOException {
PDFormXObject pdFormXObject = new PDFormXObject(document);
pdFormXObject.setBBox(new PDRectangle(1, 1));
for (PDPage pdPage : document.getPages()) {
replaceBitmapImagesResources(pdPage.getResources(), pdFormXObject);
}
}
void replaceBitmapImagesResources(PDResources resources, PDFormXObject formXObject) throws IOException {
if (resources == null)
return;
for (COSName cosName : resources.getPatternNames()) {
PDAbstractPattern pdAbstractPattern = resources.getPattern(cosName);
if (pdAbstractPattern instanceof PDTilingPattern) {
PDTilingPattern pdTilingPattern = (PDTilingPattern) pdAbstractPattern;
replaceBitmapImagesResources(pdTilingPattern.getResources(), formXObject);
}
}
List<COSName> xobjectsToReplace = new ArrayList<>();
for (COSName cosName : resources.getXObjectNames()) {
PDXObject pdxObject = resources.getXObject(cosName);
if (pdxObject instanceof PDImageXObject) {
xobjectsToReplace.add(cosName);
} else if (pdxObject instanceof PDFormXObject) {
PDFormXObject pdFormXObject = (PDFormXObject) pdxObject;
replaceBitmapImagesResources(pdFormXObject.getResources(), formXObject);
}
}
for (COSName cosName : xobjectsToReplace) {
resources.put(cosName, formXObject);
}
}
(RemoveImages helper methods)
To apply this approach to a PDDocument simply call the first replaceBitmapImagesResources with that document as parameter.
Beware: I tried to keep the code simple; for production use remember to limit the recursion here to prevent endless recursions as in some PDFs XObjects or patterns call themselves directly or indirectly. Also you may want to inspect page annotations and the resources of template pages.

Signing PDF with multiple signature fields using PDFBox 2.0.17

I am trying to sign a PDF with 2 signature fields using the example code provided by PDFBox (https://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/CreateVisibleSignature.java). But the signed PDF shows There have been changes made to this document that invalidate the signature.
I have uploaded my sample project to GitHub please find it here.
The project can be opened using IntelliJ or Eclipse.
The program argument should be set to the following to simulate the problem.
keystore/lawrence.p12 12345678 pdfs/Fillable-2.pdf images/image.jpg
Grateful if any PDFBox expert can help me. Thank you.

This answer to the question “Lock” dictionary in signature field is the reason of broken signature after signing already contains code for signing that respects the signature Lock dictionary and creates a matching FieldMDP transformations while signing.
As clarified in a comment, though, the OP wonders
is there any way to lock the corresponding textfield after signing
Thus, not only shall changes to protected form fields invalidate the signature in question but in the course of signing these protected fields shall themselves be locked.
Indeed, one can improve the code from the referenced answer to do that, too:
PDSignatureField signatureField = FIND_YOUR_SIGNATURE_FIELD_TO_SIGN;
PDSignature signature = new PDSignature();
signatureField.setValue(signature);
COSBase lock = signatureField.getCOSObject().getDictionaryObject(COS_NAME_LOCK);
if (lock instanceof COSDictionary)
{
COSDictionary lockDict = (COSDictionary) lock;
COSDictionary transformParams = new COSDictionary(lockDict);
transformParams.setItem(COSName.TYPE, COSName.getPDFName("TransformParams"));
transformParams.setItem(COSName.V, COSName.getPDFName("1.2"));
transformParams.setDirect(true);
COSDictionary sigRef = new COSDictionary();
sigRef.setItem(COSName.TYPE, COSName.getPDFName("SigRef"));
sigRef.setItem(COSName.getPDFName("TransformParams"), transformParams);
sigRef.setItem(COSName.getPDFName("TransformMethod"), COSName.getPDFName("FieldMDP"));
sigRef.setItem(COSName.getPDFName("Data"), document.getDocumentCatalog());
sigRef.setDirect(true);
COSArray referenceArray = new COSArray();
referenceArray.add(sigRef);
signature.getCOSObject().setItem(COSName.getPDFName("Reference"), referenceArray);
final Predicate<PDField> shallBeLocked;
final COSArray fields = lockDict.getCOSArray(COSName.FIELDS);
final List<String> fieldNames = fields == null ? Collections.emptyList() :
fields.toList().stream().filter(c -> (c instanceof COSString)).map(s -> ((COSString)s).getString()).collect(Collectors.toList());
final COSName action = lockDict.getCOSName(COSName.getPDFName("Action"));
if (action.equals(COSName.getPDFName("Include"))) {
shallBeLocked = f -> fieldNames.contains(f.getFullyQualifiedName());
} else if (action.equals(COSName.getPDFName("Exclude"))) {
shallBeLocked = f -> !fieldNames.contains(f.getFullyQualifiedName());
} else if (action.equals(COSName.getPDFName("All"))) {
shallBeLocked = f -> true;
} else { // unknown action, lock nothing
shallBeLocked = f -> false;
}
lockFields(document.getDocumentCatalog().getAcroForm().getFields(), shallBeLocked);
}
signature.setFilter(PDSignature.FILTER_ADOBE_PPKLITE);
signature.setSubFilter(PDSignature.SUBFILTER_ADBE_PKCS7_DETACHED);
signature.setName("blablabla");
signature.setLocation("blablabla");
signature.setReason("blablabla");
signature.setSignDate(Calendar.getInstance());
document.addSignature(signature [, ...]);
(CreateSignature helper method signAndLockExistingFieldWithLock)
with lockFields implemented like this:
boolean lockFields(List<PDField> fields, Predicate<PDField> shallBeLocked) {
boolean isUpdated = false;
if (fields != null) {
for (PDField field : fields) {
boolean isUpdatedField = false;
if (shallBeLocked.test(field)) {
field.setFieldFlags(field.getFieldFlags() | 1);
if (field instanceof PDTerminalField) {
for (PDAnnotationWidget widget : ((PDTerminalField)field).getWidgets())
widget.setLocked(true);
}
isUpdatedField = true;
}
if (field instanceof PDNonTerminalField) {
if (lockFields(((PDNonTerminalField)field).getChildren(), shallBeLocked))
isUpdatedField = true;
}
if (isUpdatedField) {
field.getCOSObject().setNeedToBeUpdated(true);
isUpdated = true;
}
}
}
return isUpdated;
}
(CreateSignature helper method lockFields)

Java - Variable number of variables

I wrote a code to find all URLs within a PDF file and replace the one(s) that matches the parameters that was passed from a PHP script.
It is working fine when a single URL is passed. But I don't know how to handle more than one URL, I'm guessing I would need a loop that reads the array length, and call the changeURL method passing the correct parameters.
I actually made it work with if Statements (if myarray.lenght < 4 do this, if it is < 6, do that, if < 8.....), but I am guessing this is not the optimal way. So I removed it and want to try something else.
Parameters passed from PHP (in this order):
args[0] - Location of original PDF
args[1] - Location of new PDF
args[2] - URL 1 (URL to be changed)
args[3] - URL 1a (URL that will replace URL 1)
args[4] - URL 2 (URL to be changed)
args[5] - URL 2a - (URL that will replace URL 2)
args...
and so on... up to maybe around 16 args, depending on how many URLs the PDF file contains.
Here's the code:
Main.java
public class Main {
public static void main(String[] args) {
if (args.length >= 4) {
URLReplacer.changeURL(args);
} else {
System.out.println("PARAMETER MISSING FROM PHP");
}
}
}
URLReplacer.java
public class URLReplacer {
public static void changeURL(String... a) {
try (PDDocument doc = PDDocument.load(a[0])) {
List<?> allPages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
PDPage page = (PDPage) allPages.get(i);
List annotations = page.getAnnotations();
for (int j = 0; j < annotations.size(); j++) {
PDAnnotation annot = (PDAnnotation) annotations.get(j);
if (annot instanceof PDAnnotationLink) {
PDAnnotationLink link = (PDAnnotationLink) annot;
PDAction action = link.getAction();
if (action instanceof PDActionURI) {
PDActionURI uri = (PDActionURI) action;
String oldURL = uri.getURI();
if (a[2].equals(oldURL)) {
//System.out.println("Page " + (i + 1) + ": Replacing " + oldURL + " with " + a[3]);
uri.setURI(a[3]);
}
}
}
}
}
doc.save(a[1]);
} catch (IOException | COSVisitorException e) {
e.printStackTrace();
}
}
}
I have tried all sort of loops, but with my limited Java skills, did not achieve any success.
Also, if you notice any dodgy code, kindly let me know so I can learn the best practices from more experienced programmers.

Your main problem - as I understand -, is the "variable number of variables". And you have to send from PHP to JAVA.
1 you can transmit one by one as your example
2 or, in a structure.
there are several structures.
JSON is rather simple at PHP: multiple examples here:
encode json using php?
and for java you have: Decoding JSON String in Java.
or others (like XML , which seems too complex for this).

I'd structure your method to accept specific parameters. I used map to accept URLs, a custom object would be another option.
Also notice the way loops are changed, might give you a hint on some Java skills.
public static void changeURL(String originalPdf, String targetPdf, Map<String, String> urls ) {
try (PDDocument doc = PDDocument.load(originalPdf)) {
List<PDPage> allPages = doc.getDocumentCatalog().getAllPages();
for(PDPage page: allPages){
List annotations = page.getAnnotations();
for(PDAnnotation annot : page.getAnnotations()){
if (annot instanceof PDAnnotationLink) {
PDAnnotationLink link = (PDAnnotationLink) annot;
PDAction action = link.getAction();
if (action instanceof PDActionURI) {
PDActionURI uri = (PDActionURI) action;
String oldURL = uri.getURI();
for (Map.Entry<String, String> url : urls.entrySet()){
if (url.getKey().equals(oldURL)) {
uri.setURI(url.getValue());
}
}
}
}
}
}
doc.save(targetPdf);
} catch (IOException | COSVisitorException e) {
e.printStackTrace();
}
}
If you have to get the URL and PDF locations from command line, then call the changeURL function like this:
public static void main(String[] args) {
if (args.length >= 4) {
String originalPdf = args[0];
String targetPdf = args[1];
Map<String, String> urls = new HashMap<String, String>();
for(int i = 2; i< args.length; i+=2){
urls.put(args[i], args[i+1]);
}
URLReplacer.changeURL(originalPdf, targetPdf, urls);
} else {
System.out.println("PARAMETER MISSING FROM PHP");
}
}

Of the top of my head, you could do something like this
public static void main(String[] args) {
if (args.length >= 4 && args.length % 2 == 0) {
for(int i = 2; i < args.length; i += 2) {
URLReplacer.changeURL(args[0], args[1], args[i], args[i+1]);
args[0] = args[1];
}
} else {
System.out.println("PARAMETER MISSING FROM PHP");
}
}

Extract page number from PDF file

I have a PDF document which might have been created by extracting few pages from another PDF document. I am wondering How do I get the page number. As the starting page number is 572, which for a complete PDF document should have been 1.
Do you think converting the PDF into an XMl will sort this issue?

Most probably the document contains /PageLabels entry in the Document Catalog. This entry specifies the numbering style for page numbers and the starting number, too.
You might have to update the starting number or remove the entry completely. The following document contains more information about /PageLabels entry:
Specifying consistent page numbering for PDF documents
The example 2 in the document might be useful if you decide to update the entry.

Finally figured it out using iText. Would not have been possible without Bovrosky's hint. Tons of thanks to him. Posting the code sample:
public void process(PdfReader reader) {
PRIndirectReference obj = (PRIndirectReference) dict.get(com.itextpdf.text.pdf.PdfName.PAGELABELS);
System.out.println(obj.getNumber());
PdfObject ref = reader.getPdfObject(obj.getNumber());
PdfArray array = (PdfArray)((PdfDictionary) ref).get(com.itextpdf.text.pdf.PdfName.NUMS);
System.out.println("Start Page: " + resolvePdfIndirectReference(array, reader));
}
private static int resolvePdfIndirectReference(PdfObject obj, PdfReader reader) {
if (obj instanceof PdfArray) {
PdfDictionary subDict = null;
PdfIndirectReference indRef = null;
ListIterator < PdfObject > itr = ((PdfArray) obj).listIterator();
while (itr.hasNext()) {
PdfObject pdfObj = itr.next();
if (pdfObj instanceof PdfIndirectReference)
indRef = (PdfIndirectReference) pdfObj;
if (pdfObj instanceof PdfDictionary) {
subDict = (PdfDictionary) pdfObj;
break;
}
}
if (subDict != null) {
return resolvePdfIndirectReference(subDict, reader);
} else if (indRef != null)
return resolvePdfIndirectReference(indRef, reader);
} else if (obj instanceof PdfIndirectReference) {
PdfObject ref = reader.getPdfObject(((PdfIndirectReference) obj).getNumber());
return resolvePdfIndirectReference(ref, reader);
} else if (obj instanceof PdfDictionary) {
PdfNumber num = (PdfNumber)((PdfDictionary) obj).get(com.itextpdf.text.pdf.PdfName.ST);
return num.intValue();
}
return 0;
}

Download attachments using Exchange Web Services Java API?

I am writing a Java application to download emails using Exchange Web Services. I am using Microsoft's ewsjava API for doing this.
I am able to fetch email headers. But, I am not able to download email attachments using this API. Below is the code snippet.
FolderId folderId = new FolderId(WellKnownFolderName.Inbox, "mailbox#example.com");
findResults = service.findItems(folderId, view);
for(Item item : findResults.getItems()) {
if (item.getHasAttachments()) {
AttachmentCollection attachmentsCol = item.getAttachments();
System.out.println(attachmentsCol.getCount()); // This is printing zero all the time. My message has one attachment.
for (int i = 0; i < attachmentsCol.getCount(); i++) {
FileAttachment attachment = (FileAttachment)attachmentsCol.getPropertyAtIndex(i);
String name = attachment.getFileName();
int size = attachment.getContent().length;
}
}
}
item.getHasAttachments() is returning true, but attachmentsCol.getCount() is 0.

You need to load property Attachments before you can use them in your code. You set it for ItemView object that you pass to FindItems method.
Or you can first find items and then call service.LoadPropertiesForItems and pass findIesults and PropertySet object with added EmailMessageSchema.Attachments

FolderId folderId = new FolderId(WellKnownFolderName.Inbox, "mailbox#example.com");
findResults = service.findItems(folderId, view);
service.loadPropertiesForItems(findResults, new PropertySet(BasePropertySet.FirstClassProperties, EmailMessageSchema.Attachments));
for(Item item : findResults.getItems()) {
if (item.getHasAttachments()) {
AttachmentCollection attachmentsCol = item.getAttachments();
System.out.println(attachmentsCol.getCount());
for (int i = 0; i < attachmentsCol.getCount(); i++) {
FileAttachment attachment = (FileAttachment)attachmentsCol.getPropertyAtIndex(i);
attachment.load(attachment.getName());
}
}
}

Honestly as painful as it is, I'd use the PROXY version instead of the Managed API. It's a pity, but the managed version for java seems riddled with bugs.

before checking for item.getHasAttachments(), you should do item.load(). Otherwise there is a chance your code will not load the attachment and attachmentsCol.getCount() will be 0.
Working code with Exchange Server 2010 :
ItemView view = new ItemView(Integer.MAX_VALUE);
view.getOrderBy().add(ItemSchema.DateTimeReceived, SortDirection.Descending);
FindItemsResults < Item > results = service.findItems(WellKnownFolderName.Inbox, new SearchFilter.IsEqualTo(EmailMessageSchema.IsRead, true), view);
Iterator<Item> itr = results.iterator();
while(itr.hasNext()) {
Item item = itr.next();
item.load();
ItemId itemId = item.getId();
EmailMessage email = EmailMessage.bind(service, itemId);
if (item.getHasAttachments()) {
System.err.println(item.getAttachments());
AttachmentCollection attachmentsCol = item.getAttachments();
for (int i = 0; i < attachmentsCol.getCount(); i++) {
FileAttachment attachment=(FileAttachment)attachmentsCol.getPropertyAtIndex(i);
attachment.load("C:\\TEMP\\" +attachment.getName());
}
}
}

Little late for the answer, but here is what I have.
HashMap<String, HashMap<String, String>> attachments = new HashMap<String, HashMap<String, String>>();
if (emailMessage.getHasAttachments() || emailMessage.getAttachments().getItems().size() > 0) {
//get all the attachments
AttachmentCollection attachmentsCol = emailMessage.getAttachments();
log.info("File Count: " +attachmentsCol.getCount());
//loop over the attachments
for (int i = 0; i < attachmentsCol.getCount(); i++) {
Attachment attachment = attachmentsCol.getPropertyAtIndex(i);
//log.debug("Starting to process attachment "+ attachment.getName());
//FileAttachment - Represents a file that is attached to an email item
if (attachment instanceof FileAttachment || attachment.getIsInline()) {
attachments.putAll(extractFileAttachments(attachment, properties));
} else if (attachment instanceof ItemAttachment) { //ItemAttachment - Represents an Exchange item that is attached to another Exchange item.
attachments.putAll(extractItemAttachments(service, attachment, properties, appendedBody));
}
}
}
} else {
log.debug("Email message does not have any attachments.");
}
//Extract File Attachments
try {
FileAttachment fileAttachment = (FileAttachment) attachment;
// if we don't call this, the Content property may be null.
fileAttachment.load();
//extract the attachment content, it's not base64 encoded.
attachmentContent = fileAttachment.getContent();
if (attachmentContent != null && attachmentContent.length > 0) {
//check the size
int attachmentSize = attachmentContent.length;
//check if the attachment is valid
ValidateEmail.validateAttachment(fileAttachment, properties,
emailIdentifier, attachmentSize);
fileAttachments.put(UtilConstants.ATTACHMENT_SIZE, String.valueOf(attachmentSize));
//get attachment name
String fileName = fileAttachment.getName();
fileAttachments.put(UtilConstants.ATTACHMENT_NAME, fileName);
String mimeType = fileAttachment.getContentType();
fileAttachments.put(UtilConstants.ATTACHMENT_MIME_TYPE, mimeType);
log.info("File Name: " + fileName + " File Size: " + attachmentSize);
if (attachmentContent != null && attachmentContent.length > 0) {
//convert the content to base64 encoded string and add to the collection.
String base64Encoded = UtilFunctions.encodeToBase64(attachmentContent);
fileAttachments.put(UtilConstants.ATTACHMENT_CONTENT, base64Encoded);
}
//Extract Item Attachment
try {
ItemAttachment itemAttachment = (ItemAttachment) attachment;
PropertySet propertySet = new PropertySet(
BasePropertySet.FirstClassProperties, ItemSchema.Attachments,
ItemSchema.Body, ItemSchema.Id, ItemSchema.DateTimeReceived,
EmailMessageSchema.DateTimeReceived, EmailMessageSchema.Body);
itemAttachment.load();
propertySet.setRequestedBodyType(BodyType.Text);
Item item = itemAttachment.getItem();
eBody = appendItemBody(item, appendedBody.get(UtilConstants.BODY_CONTENT));
appendedBody.put(UtilConstants.BODY_CONTENT, eBody);
/*
* We need to check if Item attachment has further more
* attachments like .msg attachment, which is an outlook email
* as attachment. Yes, we can attach an email chain as
* attachment and that email chain can have multiple
* attachments.
*/
AttachmentCollection childAttachments = item.getAttachments();
//check if not empty collection. move on
if (childAttachments != null && !childAttachments.getItems().isEmpty() && childAttachments.getCount() > 0) {
for (Attachment childAttachment : childAttachments) {
if (childAttachment instanceof FileAttachment) {
itemAttachments.putAll(extractFileAttachments(childAttachment, properties, emailIdentifier));
} else if (childAttachment instanceof ItemAttachment) {
itemAttachments = extractItemAttachments(service, childAttachment, properties, appendedBody, emailIdentifier);
}
}
}
} catch (Exception e) {
throw new Exception("Exception while extracting Item Attachments: " + e.getMessage());
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Using PDFBox to remove Optional Content Groups that are not enabled - java

Related

How to remove Images from PDF File?

Signing PDF with multiple signature fields using PDFBox 2.0.17

Java - Variable number of variables

Extract page number from PDF file

Download attachments using Exchange Web Services Java API?

Categories

Resources