How to remove Images from PDF File?

How to remove Images from PDF File? - java

Hello ,thank you for answer my question.This proble is perplex me for a long time.
I have search this QS for a long time,I read so many article in stack overFlow and google,but those articles is outdated or fragmented,so I have to seek for help.
I hope some one can help me ,please.
public class TEST04 {
public static void main(String[] args) throws IOException {
System.out.println("Hi");
//ori pdf file
String oriPDFFile = IFileUtils.getDesktopPath().getAbsoluteFile() + "\\1.pdf";
//out pdf file
String outPDFFile = IFileUtils.getDesktopPath().getAbsoluteFile() + "\\2.pdf";
strip(oriPDFFile, outPDFFile);
}
//parse
public static void strip(String pdfFile, String pdfFileOut) throws IOException {
//load ori pdf file
PDDocument document = PDDocument.load(new File(pdfFile));
//get All pages
List<PDPage> pageList = IterUtil.toList(document.getDocumentCatalog().getPages());
for (int i = 0; i < pageList.size(); i++) {
PDPage page = pageList.get(i);
COSDictionary newDictionary = new COSDictionary(page.getCOSObject());
PDFStreamParser parser = new PDFStreamParser(page);
List tokens = parser.getTokens();
List newTokens = new ArrayList();
for (int j = 0; j < tokens.size(); j++) {
Object token = tokens.get(j);
if (token instanceof Operator) {
Operator operator = (Operator) token;
if (operator.getName().equals("Do")) {
COSName cosName = (COSName) newTokens.remove(newTokens.size() - 1);
deleteObject(newDictionary, cosName);
continue;
}
}
newTokens.add(token);
}
PDStream newContents = new PDStream(document);
try (OutputStream outputStream = newContents.createOutputStream()) {
ContentStreamWriter writer = new ContentStreamWriter(outputStream);
writer.writeTokens(newTokens);
}
page.setContents(newContents);
// ContentStreamWriter writer = new ContentStreamWriter(newContents.createOutputStream());
// writer.writeTokens( newTokens );
// page.setContents(newContents);
PDResources newResources = new PDResources(newDictionary);
page.setResources(newResources);
}
document.save(pdfFileOut);
document.close();
}
//delete
public static boolean deleteObject(COSDictionary d, COSName name) {
for(COSName key : d.keySet()) {
if( name.equals(key) ) {
d.removeItem(key);
return true;
}
COSBase object = d.getDictionaryObject(key);
if(object instanceof COSDictionary) {
if( deleteObject((COSDictionary)object, name) ) {
return true;
}
}
}
return false;
}
}
The stack trace:

It works same way like it does in example RemoveAllText.java, just with different tag.
Use code from this example, just use "Do" instead of "Tj".
Of course, if you need to load metadata, etc, you should enumerate and check images threw page resources (like in my example)

Following the tip in Ali Yavari's answer you created a test class. Unfortunately that test code produced an exception. This answer focuses on fixing your code.
According to the stack trace you posted an image of the exception occurred while saving the document; some stream was asked to provide an InputStream and it failed with the message "Cannot read while there is an open stream writer".
So, let's have a look where your code opens a stream writer but does not close it again:
PDStream newContents = new PDStream(document);
ContentStreamWriter writer = new ContentStreamWriter(newContents.createOutputStream());
writer.writeTokens( newTokens );
page.setContents(newContents);
Indeed, here you ask a stream (the PDStream newContents) for something to write to (newContents.createOutputStream()) but don't close it.
You can do that like this:
PDStream newContents = new PDStream(document);
try (OutputStream outputStream = newContents.createOutputStream()) {
ContentStreamWriter writer = new ContentStreamWriter(outputStream);
writer.writeTokens(newTokens);
}
page.setContents(newContents);
A side note, you will have to re-write what you do with the newDictionary object. Currently you
initialize it with the page dictionary entries,
recursively remove all entries with a key that is a name of an image you remove, and
set the page resources to this dictionary.
Item 2 can delete much more than you actually want, the same name in a different dictionary may refer to an entry with a completely different meaning. Furthermore, you recurse without further checks; if there is a circular relation among the dictionaries, this may result in an infinite recursion, i.e. a stack overflow exception.
Item 3 sets this manipulated page clone inappropriately as the resources of the original page. This create a completely broken page structure.
Instead you should retrieve the resources from the page (resources = page.getResources()) and remove the images by putting them to null (resources.put(cosName, (PDXObject)null)).

In my other answer I focused on advise on how to fix the code in the question. Here I focus on a different approach to the task.
In your code you try to remove the bitmap images by inspecting the page content streams, finding Do operations therein drawing XObjects, and removing both this instruction and the referenced XObject.
It is a bit easier to instead simply replace all image XObjects in the resources by an empty form XObject. This is the approach used here.
As that approach is very easy to implement, I extended it to not only go through the immediate resources of the pages but also iterate into embedded form XObjects and patterns.
void replaceBitmapImagesResources(PDDocument document) throws IOException {
PDFormXObject pdFormXObject = new PDFormXObject(document);
pdFormXObject.setBBox(new PDRectangle(1, 1));
for (PDPage pdPage : document.getPages()) {
replaceBitmapImagesResources(pdPage.getResources(), pdFormXObject);
}
}
void replaceBitmapImagesResources(PDResources resources, PDFormXObject formXObject) throws IOException {
if (resources == null)
return;
for (COSName cosName : resources.getPatternNames()) {
PDAbstractPattern pdAbstractPattern = resources.getPattern(cosName);
if (pdAbstractPattern instanceof PDTilingPattern) {
PDTilingPattern pdTilingPattern = (PDTilingPattern) pdAbstractPattern;
replaceBitmapImagesResources(pdTilingPattern.getResources(), formXObject);
}
}
List<COSName> xobjectsToReplace = new ArrayList<>();
for (COSName cosName : resources.getXObjectNames()) {
PDXObject pdxObject = resources.getXObject(cosName);
if (pdxObject instanceof PDImageXObject) {
xobjectsToReplace.add(cosName);
} else if (pdxObject instanceof PDFormXObject) {
PDFormXObject pdFormXObject = (PDFormXObject) pdxObject;
replaceBitmapImagesResources(pdFormXObject.getResources(), formXObject);
}
}
for (COSName cosName : xobjectsToReplace) {
resources.put(cosName, formXObject);
}
}
(RemoveImages helper methods)
To apply this approach to a PDDocument simply call the first replaceBitmapImagesResources with that document as parameter.
Beware: I tried to keep the code simple; for production use remember to limit the recursion here to prevent endless recursions as in some PDFs XObjects or patterns call themselves directly or indirectly. Also you may want to inspect page annotations and the resources of template pages.

Related

Using PDFBox to remove Optional Content Groups that are not enabled

I'm using apache PDFBox from java, and I have a source PDF with multiple optional content groups. What I am wanting to do is export a version of the PDF that includes only the standard content and the optional content groups that were enabled. It is important for my purposes that I preserve any dynamic aspects of the original.... so text fields are still text fields, vector images are still vector images, etc. The reason that this is required is because I intend to ultimately be using a pdf form editor program that does not know how to handle optional content, and would blindly render all of them, so I want to preprocess the source pdf, and use the form editing program on a less cluttered destination pdf.
I've been trying to find something that could give me any hints on how to do this with google, but to no avail. I don't know if I'm just using the wrong search terms, or if this is just something that is outside of what the PDFBox API was designed for. I rather hope it's not the latter. The info shown here does not seem to work (converting the C# code to java), because despite the pdf I'm trying to import having optional content, there does not seem to be any OC resources when I examine the tokens on each page.
for(PDPage page:pages) {
PDResources resources = page.getResources();
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
Collection tokens = parser.getTokens();
...
}
I'm truly sorry for not having any more code to show what I've tried so far, but I've just been poring over the java API docs for about 8 hours now trying to figure out what I might need to do this, and just haven't been able to figure it out.
What I DO know how to do is add text, lines, and images to a new PDPage, but I do not know how to retrieve that information from a given source page to copy it over, nor how to tell which optional content group such information is part of (if any). I am also not sure how to copy form fields in the source pdf over to the destination, nor how to copy the font information over.
Honestly, if there's a web page out there that I wasn't able to find with google with the searches that I tried, I'd be entirely happy to read up more about it, but I am really quite stuck here, and I don't know anyone personally that knows about this library.
Please help.
EDIT:
Trying what I understand from what was suggested below, I've written a loop to examine each XObject on the page as follows:
PDResources resources = pdPage.getResources();
Iterable<COSName> names = resources.getXObjectNames();
for(COSName name:names) {
PDXObject xobj = resources.getXObject(name);
PDFStreamParser parser = new PDFStreamParser(xobj.getStream().toByteArray());
parser.parse();
Object [] tokens = parser.getTokens().toArray();
for(int i = 0;i<tokens.length-1;i++) {
Object obj = tokens[i];
if (obj instanceof COSName && obj.equals(COSName.OC)) {
i++;
Object obj = tokens[i];
if (obj instanceof COSName) {
PDPropertyList props = resources.getProperties((COSName)obj);
if (props != null) {
...
However, after an OC key, the next entry in the tokens array is always an Operator tagged as "BMC". Nowhere am I finding any info that I can recognize from the named optional content groups.

Here's a robust solution for removing marked content blocks (open to feedback if anyone finds anything that isn't working right). You should be able to adjust for OC blocks...
This code properly handles nesting and removal of resources (xobject, graphics state and fonts - easy to add others if needed).
public class MarkedContentRemover {
private final MarkedContentMatcher matcher;
/**
*
*/
public MarkedContentRemover(MarkedContentMatcher matcher) {
this.matcher = matcher;
}
public int removeMarkedContent(PDDocument doc, PDPage page) throws IOException {
ResourceSuppressionTracker resourceSuppressionTracker = new ResourceSuppressionTracker();
PDResources pdResources = page.getResources();
PDFStreamParser pdParser = new PDFStreamParser(page);
PDStream newContents = new PDStream(doc);
OutputStream newContentOutput = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter newContentWriter = new ContentStreamWriter(newContentOutput);
List<Object> operands = new ArrayList<>();
Operator operator = null;
Object token;
int suppressDepth = 0;
boolean resumeOutputOnNextOperator = false;
int removedCount = 0;
while (true) {
operands.clear();
token = pdParser.parseNextToken();
while(token != null && !(token instanceof Operator)) {
operands.add(token);
token = pdParser.parseNextToken();
}
operator = (Operator)token;
if (operator == null) break;
if (resumeOutputOnNextOperator) {
resumeOutputOnNextOperator = false;
suppressDepth--;
if (suppressDepth == 0)
removedCount++;
}
if (OperatorName.BEGIN_MARKED_CONTENT_SEQ.equals(operator.getName())
|| OperatorName.BEGIN_MARKED_CONTENT.equals(operator.getName())) {
COSName contentId = (COSName)operands.get(0);
final COSDictionary properties;
if (operands.size() > 1) {
Object propsOperand = operands.get(1);
if (propsOperand instanceof COSDictionary) {
properties = (COSDictionary) propsOperand;
} else if (propsOperand instanceof COSName) {
properties = pdResources.getProperties((COSName)propsOperand).getCOSObject();
} else {
properties = new COSDictionary();
}
} else {
properties = new COSDictionary();
}
if (matcher.matches(contentId, properties)) {
suppressDepth++;
}
}
if (OperatorName.END_MARKED_CONTENT.equals(operator.getName())) {
if (suppressDepth > 0)
resumeOutputOnNextOperator = true;
}
else if (OperatorName.SET_GRAPHICS_STATE_PARAMS.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.EXT_G_STATE, operands.get(0), suppressDepth == 0);
}
else if (OperatorName.DRAW_OBJECT.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.XOBJECT, operands.get(0), suppressDepth == 0);
}
else if (OperatorName.SET_FONT_AND_SIZE.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.FONT, operands.get(0), suppressDepth == 0);
}
if (suppressDepth == 0) {
newContentWriter.writeTokens(operands);
newContentWriter.writeTokens(operator);
}
}
if (resumeOutputOnNextOperator)
removedCount++;
newContentOutput.close();
page.setContents(newContents);
resourceSuppressionTracker.updateResources(pdResources);
return removedCount;
}
private static class ResourceSuppressionTracker{
// if the boolean is TRUE, then the resource should be removed. If the boolean is FALSE, the resource should not be removed
private final Map<COSName, Map<COSName, Boolean>> tracker = new HashMap<>();
public void markForOperator(COSName resourceType, Object resourceNameOperand, boolean preserve) {
if (!(resourceNameOperand instanceof COSName)) return;
if (preserve) {
markForPreservation(resourceType, (COSName)resourceNameOperand);
} else {
markForRemoval(resourceType, (COSName)resourceNameOperand);
}
}
public void markForRemoval(COSName resourceType, COSName refId) {
if (!resourceIsPreserved(resourceType, refId)) {
getResourceTracker(resourceType).put(refId, Boolean.TRUE);
}
}
public void markForPreservation(COSName resourceType, COSName refId) {
getResourceTracker(resourceType).put(refId, Boolean.FALSE);
}
public void updateResources(PDResources pdResources) {
for (Map.Entry<COSName, Map<COSName, Boolean>> resourceEntry : tracker.entrySet()) {
for(Map.Entry<COSName, Boolean> refEntry : resourceEntry.getValue().entrySet()) {
if (refEntry.getValue().equals(Boolean.TRUE)) {
pdResources.getCOSObject().getCOSDictionary(COSName.XOBJECT).removeItem(refEntry.getKey());
}
}
}
}
private boolean resourceIsPreserved(COSName resourceType, COSName refId) {
return getResourceTracker(resourceType).getOrDefault(refId, Boolean.FALSE);
}
private Map<COSName, Boolean> getResourceTracker(COSName resourceType){
if (!tracker.containsKey(resourceType)) {
tracker.put(resourceType, new HashMap<>());
}
return tracker.get(resourceType);
}
}
}
Helper class:
public interface MarkedContentMatcher {
public boolean matches(COSName contentId, COSDictionary props);
}

Optional Content Groups are marked with BDC and EMC. You will have to navigate through all of the tokens returned from the parser and remove the "section" from the array. Here is some C# Code that was posted a while ago - [1]: How to delete an optional content group alongwith its content from pdf using pdfbox?
I investigated that (converting to Java) but couldn't get it work as expected. I managed to remove the content between BDC and EMC and then save the result using the same technique as the sample but the PDF was corrupted. Perhaps that is my lack of C# Knowledge (related to Tuples etc.)
Here is what I came up with, as I said it doesn't work perhaps you or someone else (mkl, Tilman Hausherr) can spot the flaw.
OCGDelete (PDDocument doc, int pageNum, String OCName) {
PDPage pdPage = (PDPage) doc.getDocumentCatalog().getPages().get(pageNum);
PDResources pdResources = pdPage.getResources();
PDFStreamParser pdParser = new PDFStreamParser(pdPage);
int ocgStart
int ocgLength
Collection tokens = pdParser.getTokens();
Object[] newTokens = tokens.toArray()
try {
for (int index = 0; index < newTokens.length; index++) {
obj = newTokens[index]
if (obj instanceof COSName && obj.equals(COSName.OC)) {
// println "Found COSName at "+index /// Found Optional Content
startIndex = index
index++
if (index < newTokens.size()) {
obj = newTokens[index]
if (obj instanceof COSName) {
prop = pdRes.getProperties(obj)
if (prop != null && prop instanceof PDOptionalContentGroup) {
if ((prop.getName()).equals(delLayer)) {
println "Found the Layer to be deleted"
println "prop Name was " + prop.getName()
index++
if (index < newTokens.size()) {
obj = newTokens[index]
if ((obj.getName()).equals("BDC")) {
ocgStart = index
println("OCG Start " + ocgStart)
ocgLength = -1
index++
while (index < newTokens.size()) {
ocgLength++
obj = newTokens[index]
println " Loop through relevant OCG Tokens " + obj
if (obj instanceof Operator && (obj.getName()).equals("EMC")) {
println "the next obj was " + obj
println "after that " + newTokens[index + 1] + "and then " + newTokens[index + 2]
println("OCG End " + ocgLength++)
break
}
index++
}
if (endIndex > 0) {
println "End Index was something " + (startIndex + ocgLength)
}
}
}
}
}
}
}
}
}
}
catch (Exception ex){
println ex.message()
}
for (int i = ocgStart; i < ocgStart+ ocgLength; i++){
newTokens.removeAt(i)
}
PDStream newContents = new PDStream(doc);
OutputStream output = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter writer = new ContentStreamWriter(output);
writer.writeTokens(newTokens);
output.close();
pdPage.setContents(newContents);
}

iText 7.0.5: How to combine PDF and have existing bookmarks indented under new bookmarks for each document?

Problem:
com.itextpdf.kernel.PdfException: Pdf indirect object belongs to other PDF document. Copy object to current pdf document.
I want to combine PDF documents with an edited set of bookmarks that keeps a clear pairing of the bookmarks with each original document. I also want a new top level bookmark describing the set as a whole to improve combining with yet other documents later if the user chooses. The number of documents combined and the number of bookmarks in each is unknown and some documents might not have any bookmarks.
For simplicity, assume I have two documents with two pages and a bookmark to the second page in each. I would want the combined document to have a bookmark structure like this where "NEW" are the ones I am creating based on meta data I have about each source document and "EXISTING" are whatever I copy from the individual documents:
-- NEW Combined Document meta(page 1)
---- NEW Document one meta (page 1)
------ EXISTING Doc one link (page 2)
---- NEW Document two meta (page 3)
------ EXISTING Doc two link (page 4)
Code:
private static String combinePdf(List<String> allFile, LinkedHashMap<String, String> bookmarkMetaMap, Connection conn) throws IOException {
System.out.println("=== combinePdf() ENTER"); // TODO REMOVE
File outFile = File.createTempFile("combinePdf", "pdf", new File(DocumentObj.TEMP_DIR_ON_SERVER));
if (!outFile.exists() || !outFile.canWrite()) {
throw new IOException("Unable to create writeable file in " + DocumentObj.TEMP_DIR_ON_SERVER);
}
if (bookmarkMetaMap == null || bookmarkMetaMap.isEmpty()) {
bookmarkMetaMap = new LinkedHashMap<>(); // prevent NullPointer below
bookmarkMetaMap.put("Documents", "Documents");
}
try ( PdfDocument allPdfDoc = new PdfDocument(new PdfWriter(outFile)) ) {
allPdfDoc.initializeOutlines();
allPdfDoc.getCatalog().setPageMode(PdfName.UseOutlines);
PdfMerger allPdfMerger = new PdfMerger(allPdfDoc, true, false); // build own outline
Iterator<Map.Entry<String, String>> itr = bookmarkMetaMap.entrySet().iterator();
PdfOutline rootOutline = allPdfDoc.getOutlines(false);
PdfOutline mainOutline;
mainOutline = rootOutline.addOutline(itr.next().getValue());
mainOutline.addDestination(PdfExplicitDestination.createFit(allPdfDoc.getNumberOfPages() + 1));
int fileNum = 0;
for (String oneFile : allFile) {
PdfDocument onePdfDoc = new PdfDocument(new PdfReader(oneFile));
PdfAcroForm oneForm = PdfAcroForm.getAcroForm(onePdfDoc, false);
if (oneForm != null) {
oneForm.flattenFields();
}
allPdfMerger.merge(onePdfDoc, 1, onePdfDoc.getNumberOfPages());
fileNum++;
String bookmarkLabel = itr.hasNext() ? itr.next().getKey() : "Document " + fileNum;
PdfOutline linkToDoc = mainOutline.addOutline(bookmarkLabel);
linkToDoc.addDestination(PdfExplicitDestination.createFit(allPdfDoc.getNumberOfPages() + 1));
PdfOutline srcDocOutline = onePdfDoc.getOutlines(false);
if (srcDocOutline != null) {
List<PdfOutline> outlineList = srcDocOutline.getAllChildren();
if (!outlineList.isEmpty()) {
for (PdfOutline p : outlineList) {
linkToDoc.addOutline(p); // if I comment this out, no error, but links wrong order
}
}
}
onePdfDoc.close();
}
System.out.println("=== combinePdf() DONE ADDING PAGES ==="); //TODO REMOVE
}
return outFile.getAbsolutePath();
}
Problem:
com.itextpdf.kernel.PdfException: Pdf indirect object belongs to other PDF document. Copy object to current pdf document.
Error occurs after the debug line "=== combinePdf() DONE ADDING PAGES ===" so the for loop completes as expected.
This means the error occurs when allPdfDoc is automagically closed.
If I remove the line linkToDoc.addOutline(p); I get all of my links and they go to the correct pages but they are not nested/ordered as I want:
-- NEW Combined Document meta(page 1)
---- NEW Document one meta (page 1)
---- NEW Document two meta (page 3)
-- EXISTING Doc one link (page 2)
-- EXISTING Doc two link (page 4)
With the aforementioned line commented out, I am not even sure how the EXISTING links are included at all. I have the mergeOutlines flag set to false in the PdfMerger constructor since I thought I had to construct my own outline. I get similar results no matter whether I set the getOutlines() to true or false as well as if I take out my arbitrary top level new bookmark.
I know how to create a flattened list of new and existing bookmarks in the desired order. So my question is about how to get both the indenting and ordering as desired.
Thanks for taking a look!

Rather than shift bookmarks in the combined PDF, I did it in the component PDF before merging.
Feedback welcome, especially if something is horribly inefficient as PDF size increases:
private static void shiftPdfBookmarksUnderNewBookmark(PdfDocument pdfDocument, String bookmarkLabel) {
if (pdfDocument == null || pdfDocument.getWriter() == null) {
log.warn("shiftPdfBookmarksUnderNewBookmark(): no writer linked to PDFDocument, cannot modify bookmarks");
return;
}
pdfDocument.initializeOutlines();
try {
PdfOutline rootOutline = pdfDocument.getOutlines(false);
PdfOutline subOutline = rootOutline.addOutline(bookmarkLabel);
subOutline.addDestination(PdfExplicitDestination.createFit(pdfDocument.getFirstPage())); // Not sure why this is needed, but problems if omitted.
List<PdfOutline> pdfOutlineChildren = rootOutline.getAllChildren();
if (pdfOutlineChildren.size() == 1) {
return;
}
int i = 0;
for (PdfOutline p : rootOutline.getAllChildren()) {
if (p != subOutline) {
if (p.getDestination() == null) {
continue;
}
subOutline.addOutline(p);
}
}
rootOutline.getAllChildren().clear();
rootOutline.addOutline(subOutline);
subOutline.addDestination(PdfExplicitDestination.createFit(pdfDocument.getFirstPage())); // not sure why duplicate line above seems to be needed
}
catch (Exception logAndIgnore) {
log.warn("shiftPdfBookmarksUnderNewBookmark ignoring error and not shifting bookmarks: " +logAndIgnore, logAndIgnore);
}
}

How to remove a specific image from a PDF with PDFBox

I need to remove a specific image from PDF file according its metadata. Sadly. all examples I can find in Internet are using discarded methods.
I write it something like this:
try (PDDocument doc = PDDocument.load(new ByteArrayInputStream(pdf))) {
doc.getPages().forEach(page ->
{
PDResources resources = page.getResources();
List<COSName> itemsToRemove = new ArrayList<>();
resources.getXObjectNames().forEach(propertyName -> {
if(!resources.isImageXObject(propertyName)) {
return;
}
PDXObject pdxObject = resources.getXObject(propertyName);
PDImageXObject pdImageXObject = (PDImageXObject)pdxObject;
PDMetadata metadata = pdImageXObject.getMetadata();
if(checkMetadata(metadata)){
// What should I use here?
page.getCOSObject().removeItem(propertyName);
}
});
// Should I use page.setResources(resources); ?
});
doc.save(baos);
} catch (Exception e) {
//Code here
}

It works same way like it does in example RemoveAllText.java, just with different tag.
Use code from this example, just use "Do" instead of "Tj".
Of course, if you need to load metadata, etc, you should enumerate and check images threw page resources (like in my example)

Error: org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

I am trying to extract image from the pdf using pdfbox. I have taken help from this post . It worked for some of the pdfs but for others/most it did not. For example, I am not able to extract the figures in this file
After doing some research I found that PDResources.getImages is deprecated. So, I am using PDResources.getXObjects(). With this, I am not able to extract any image from the PDF and instead get this message at the console:
org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage
Now I am stuck and unable to find the solution. Please assist if anyone can.
//////UPDATE AS REPLY ON COMMENTS///
I am using pdfbox-1.8.10
Here is the code:
public void getimg ()throws Exception {
try {
String sourceDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/inputs/Yavaa.pdf";
String destinationDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/outputs/";
File oldFile = new File(sourceDir);
if (oldFile.exists()){
PDDocument document = PDDocument.load(sourceDir);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
String fileName = oldFile.getName().replace(".pdf", "_cover");
int totalImages = 1;
for (PDPage page : list) {
PDResources pdResources = page.getResources();
Map pageImages = pdResources.getXObjects();
if (pageImages != null){
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext()){
String key = (String) imageIter.next();
Object obj = pageImages.get(key);
if(obj instanceof PDXObjectImage) {
PDXObjectImage pdxObjectImage = (PDXObjectImage) obj;
pdxObjectImage.write2file(destinationDir + fileName+ "_" + totalImages);
totalImages++;
}
}
}
}
} else {
System.err.println("File not exist");
}
}
catch (Exception e){
System.err.println(e.getMessage());
}
}
//// PARTIAL SOLUTION/////
I have solved the problem of the error message. I have updated the correct code in the post as well. However, the problem remains the same. I am still not able to extract the images from few of the files. Like the one, I have mentioned in this post. Any solution in that regards.

The first problem with the original code is that XObjects can be PDXObjectImage or PDXObjectForm, so it is needed to check the instance. The second problem is that the code doesn't walk PDXObjectForm recursively, forms can have resources too. The third problem (only in 1.8) is that you used getResources() instead of findResources(), getResources() doesn't check higher levels.
Code for 1.8 can be found here:
https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup
Code for 2.0 can be found here:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup&sortby=date
(Even these are not always perfect, see this answer)
The fourth problem is that your file doesn't have any XObjects at all. All "graphics" were really vector drawings, these can't be "extracted" like embedded images. All you could do is to convert the PDF pages to images, and then mark and cut what you need.

Extract page number from PDF file

I have a PDF document which might have been created by extracting few pages from another PDF document. I am wondering How do I get the page number. As the starting page number is 572, which for a complete PDF document should have been 1.
Do you think converting the PDF into an XMl will sort this issue?

Most probably the document contains /PageLabels entry in the Document Catalog. This entry specifies the numbering style for page numbers and the starting number, too.
You might have to update the starting number or remove the entry completely. The following document contains more information about /PageLabels entry:
Specifying consistent page numbering for PDF documents
The example 2 in the document might be useful if you decide to update the entry.

Finally figured it out using iText. Would not have been possible without Bovrosky's hint. Tons of thanks to him. Posting the code sample:
public void process(PdfReader reader) {
PRIndirectReference obj = (PRIndirectReference) dict.get(com.itextpdf.text.pdf.PdfName.PAGELABELS);
System.out.println(obj.getNumber());
PdfObject ref = reader.getPdfObject(obj.getNumber());
PdfArray array = (PdfArray)((PdfDictionary) ref).get(com.itextpdf.text.pdf.PdfName.NUMS);
System.out.println("Start Page: " + resolvePdfIndirectReference(array, reader));
}
private static int resolvePdfIndirectReference(PdfObject obj, PdfReader reader) {
if (obj instanceof PdfArray) {
PdfDictionary subDict = null;
PdfIndirectReference indRef = null;
ListIterator < PdfObject > itr = ((PdfArray) obj).listIterator();
while (itr.hasNext()) {
PdfObject pdfObj = itr.next();
if (pdfObj instanceof PdfIndirectReference)
indRef = (PdfIndirectReference) pdfObj;
if (pdfObj instanceof PdfDictionary) {
subDict = (PdfDictionary) pdfObj;
break;
}
}
if (subDict != null) {
return resolvePdfIndirectReference(subDict, reader);
} else if (indRef != null)
return resolvePdfIndirectReference(indRef, reader);
} else if (obj instanceof PdfIndirectReference) {
PdfObject ref = reader.getPdfObject(((PdfIndirectReference) obj).getNumber());
return resolvePdfIndirectReference(ref, reader);
} else if (obj instanceof PdfDictionary) {
PdfNumber num = (PdfNumber)((PdfDictionary) obj).get(com.itextpdf.text.pdf.PdfName.ST);
return num.intValue();
}
return 0;
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to remove Images from PDF File? - java

It works same way like it does in example RemoveAllText.java, just with different tag. Use code from this example, just use "Do" instead of "Tj". Of course, if you need to load metadata, etc, you should enumerate and check images threw page resources (like in my example)

Related

Using PDFBox to remove Optional Content Groups that are not enabled

iText 7.0.5: How to combine PDF and have existing bookmarks indented under new bookmarks for each document?

How to remove a specific image from a PDF with PDFBox

Error: org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

Extract page number from PDF file

Categories

Resources