How to remove a specific image from a PDF with PDFBox

How to remove a specific image from a PDF with PDFBox - java

I need to remove a specific image from PDF file according its metadata. Sadly. all examples I can find in Internet are using discarded methods.
I write it something like this:
try (PDDocument doc = PDDocument.load(new ByteArrayInputStream(pdf))) {
doc.getPages().forEach(page ->
{
PDResources resources = page.getResources();
List<COSName> itemsToRemove = new ArrayList<>();
resources.getXObjectNames().forEach(propertyName -> {
if(!resources.isImageXObject(propertyName)) {
return;
}
PDXObject pdxObject = resources.getXObject(propertyName);
PDImageXObject pdImageXObject = (PDImageXObject)pdxObject;
PDMetadata metadata = pdImageXObject.getMetadata();
if(checkMetadata(metadata)){
// What should I use here?
page.getCOSObject().removeItem(propertyName);
}
});
// Should I use page.setResources(resources); ?
});
doc.save(baos);
} catch (Exception e) {
//Code here
}

It works same way like it does in example RemoveAllText.java, just with different tag.
Use code from this example, just use "Do" instead of "Tj".
Of course, if you need to load metadata, etc, you should enumerate and check images threw page resources (like in my example)

Related

How to remove Images from PDF File?

Hello ,thank you for answer my question.This proble is perplex me for a long time.
I have search this QS for a long time,I read so many article in stack overFlow and google,but those articles is outdated or fragmented,so I have to seek for help.
I hope some one can help me ,please.
public class TEST04 {
public static void main(String[] args) throws IOException {
System.out.println("Hi");
//ori pdf file
String oriPDFFile = IFileUtils.getDesktopPath().getAbsoluteFile() + "\\1.pdf";
//out pdf file
String outPDFFile = IFileUtils.getDesktopPath().getAbsoluteFile() + "\\2.pdf";
strip(oriPDFFile, outPDFFile);
}
//parse
public static void strip(String pdfFile, String pdfFileOut) throws IOException {
//load ori pdf file
PDDocument document = PDDocument.load(new File(pdfFile));
//get All pages
List<PDPage> pageList = IterUtil.toList(document.getDocumentCatalog().getPages());
for (int i = 0; i < pageList.size(); i++) {
PDPage page = pageList.get(i);
COSDictionary newDictionary = new COSDictionary(page.getCOSObject());
PDFStreamParser parser = new PDFStreamParser(page);
List tokens = parser.getTokens();
List newTokens = new ArrayList();
for (int j = 0; j < tokens.size(); j++) {
Object token = tokens.get(j);
if (token instanceof Operator) {
Operator operator = (Operator) token;
if (operator.getName().equals("Do")) {
COSName cosName = (COSName) newTokens.remove(newTokens.size() - 1);
deleteObject(newDictionary, cosName);
continue;
}
}
newTokens.add(token);
}
PDStream newContents = new PDStream(document);
try (OutputStream outputStream = newContents.createOutputStream()) {
ContentStreamWriter writer = new ContentStreamWriter(outputStream);
writer.writeTokens(newTokens);
}
page.setContents(newContents);
// ContentStreamWriter writer = new ContentStreamWriter(newContents.createOutputStream());
// writer.writeTokens( newTokens );
// page.setContents(newContents);
PDResources newResources = new PDResources(newDictionary);
page.setResources(newResources);
}
document.save(pdfFileOut);
document.close();
}
//delete
public static boolean deleteObject(COSDictionary d, COSName name) {
for(COSName key : d.keySet()) {
if( name.equals(key) ) {
d.removeItem(key);
return true;
}
COSBase object = d.getDictionaryObject(key);
if(object instanceof COSDictionary) {
if( deleteObject((COSDictionary)object, name) ) {
return true;
}
}
}
return false;
}
}
The stack trace:

It works same way like it does in example RemoveAllText.java, just with different tag.
Use code from this example, just use "Do" instead of "Tj".
Of course, if you need to load metadata, etc, you should enumerate and check images threw page resources (like in my example)

Following the tip in Ali Yavari's answer you created a test class. Unfortunately that test code produced an exception. This answer focuses on fixing your code.
According to the stack trace you posted an image of the exception occurred while saving the document; some stream was asked to provide an InputStream and it failed with the message "Cannot read while there is an open stream writer".
So, let's have a look where your code opens a stream writer but does not close it again:
PDStream newContents = new PDStream(document);
ContentStreamWriter writer = new ContentStreamWriter(newContents.createOutputStream());
writer.writeTokens( newTokens );
page.setContents(newContents);
Indeed, here you ask a stream (the PDStream newContents) for something to write to (newContents.createOutputStream()) but don't close it.
You can do that like this:
PDStream newContents = new PDStream(document);
try (OutputStream outputStream = newContents.createOutputStream()) {
ContentStreamWriter writer = new ContentStreamWriter(outputStream);
writer.writeTokens(newTokens);
}
page.setContents(newContents);
A side note, you will have to re-write what you do with the newDictionary object. Currently you
initialize it with the page dictionary entries,
recursively remove all entries with a key that is a name of an image you remove, and
set the page resources to this dictionary.
Item 2 can delete much more than you actually want, the same name in a different dictionary may refer to an entry with a completely different meaning. Furthermore, you recurse without further checks; if there is a circular relation among the dictionaries, this may result in an infinite recursion, i.e. a stack overflow exception.
Item 3 sets this manipulated page clone inappropriately as the resources of the original page. This create a completely broken page structure.
Instead you should retrieve the resources from the page (resources = page.getResources()) and remove the images by putting them to null (resources.put(cosName, (PDXObject)null)).

In my other answer I focused on advise on how to fix the code in the question. Here I focus on a different approach to the task.
In your code you try to remove the bitmap images by inspecting the page content streams, finding Do operations therein drawing XObjects, and removing both this instruction and the referenced XObject.
It is a bit easier to instead simply replace all image XObjects in the resources by an empty form XObject. This is the approach used here.
As that approach is very easy to implement, I extended it to not only go through the immediate resources of the pages but also iterate into embedded form XObjects and patterns.
void replaceBitmapImagesResources(PDDocument document) throws IOException {
PDFormXObject pdFormXObject = new PDFormXObject(document);
pdFormXObject.setBBox(new PDRectangle(1, 1));
for (PDPage pdPage : document.getPages()) {
replaceBitmapImagesResources(pdPage.getResources(), pdFormXObject);
}
}
void replaceBitmapImagesResources(PDResources resources, PDFormXObject formXObject) throws IOException {
if (resources == null)
return;
for (COSName cosName : resources.getPatternNames()) {
PDAbstractPattern pdAbstractPattern = resources.getPattern(cosName);
if (pdAbstractPattern instanceof PDTilingPattern) {
PDTilingPattern pdTilingPattern = (PDTilingPattern) pdAbstractPattern;
replaceBitmapImagesResources(pdTilingPattern.getResources(), formXObject);
}
}
List<COSName> xobjectsToReplace = new ArrayList<>();
for (COSName cosName : resources.getXObjectNames()) {
PDXObject pdxObject = resources.getXObject(cosName);
if (pdxObject instanceof PDImageXObject) {
xobjectsToReplace.add(cosName);
} else if (pdxObject instanceof PDFormXObject) {
PDFormXObject pdFormXObject = (PDFormXObject) pdxObject;
replaceBitmapImagesResources(pdFormXObject.getResources(), formXObject);
}
}
for (COSName cosName : xobjectsToReplace) {
resources.put(cosName, formXObject);
}
}
(RemoveImages helper methods)
To apply this approach to a PDDocument simply call the first replaceBitmapImagesResources with that document as parameter.
Beware: I tried to keep the code simple; for production use remember to limit the recursion here to prevent endless recursions as in some PDFs XObjects or patterns call themselves directly or indirectly. Also you may want to inspect page annotations and the resources of template pages.

Error: org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

I am trying to extract image from the pdf using pdfbox. I have taken help from this post . It worked for some of the pdfs but for others/most it did not. For example, I am not able to extract the figures in this file
After doing some research I found that PDResources.getImages is deprecated. So, I am using PDResources.getXObjects(). With this, I am not able to extract any image from the PDF and instead get this message at the console:
org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage
Now I am stuck and unable to find the solution. Please assist if anyone can.
//////UPDATE AS REPLY ON COMMENTS///
I am using pdfbox-1.8.10
Here is the code:
public void getimg ()throws Exception {
try {
String sourceDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/inputs/Yavaa.pdf";
String destinationDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/outputs/";
File oldFile = new File(sourceDir);
if (oldFile.exists()){
PDDocument document = PDDocument.load(sourceDir);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
String fileName = oldFile.getName().replace(".pdf", "_cover");
int totalImages = 1;
for (PDPage page : list) {
PDResources pdResources = page.getResources();
Map pageImages = pdResources.getXObjects();
if (pageImages != null){
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext()){
String key = (String) imageIter.next();
Object obj = pageImages.get(key);
if(obj instanceof PDXObjectImage) {
PDXObjectImage pdxObjectImage = (PDXObjectImage) obj;
pdxObjectImage.write2file(destinationDir + fileName+ "_" + totalImages);
totalImages++;
}
}
}
}
} else {
System.err.println("File not exist");
}
}
catch (Exception e){
System.err.println(e.getMessage());
}
}
//// PARTIAL SOLUTION/////
I have solved the problem of the error message. I have updated the correct code in the post as well. However, the problem remains the same. I am still not able to extract the images from few of the files. Like the one, I have mentioned in this post. Any solution in that regards.

The first problem with the original code is that XObjects can be PDXObjectImage or PDXObjectForm, so it is needed to check the instance. The second problem is that the code doesn't walk PDXObjectForm recursively, forms can have resources too. The third problem (only in 1.8) is that you used getResources() instead of findResources(), getResources() doesn't check higher levels.
Code for 1.8 can be found here:
https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup
Code for 2.0 can be found here:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup&sortby=date
(Even these are not always perfect, see this answer)
The fourth problem is that your file doesn't have any XObjects at all. All "graphics" were really vector drawings, these can't be "extracted" like embedded images. All you could do is to convert the PDF pages to images, and then mark and cut what you need.

Converting a docx containing a chart to PDF

I've got a docx4j generated file which contains several tables, titles and, finally, an excel-generated curve chart.
I have tried many approaches in order to convert this file to PDF, but did not get to any successful result.
Docx4j with xsl-fo did not work, most of the things included in the docx file are not yet implemented and show up in red text as "not implemented".
JODConverter did not work either, I got a resulting PDF in which everything was pretty good (just little formatting/styling issues) BUT the graph did not show up.
Finally, the closest approach was using Apache POI: The resulting PDF was identical to my docx file, but still no chart showing up.
I already know Aspose would solve this pretty easily, but I am looking for an open source, free solution.
The code I am using with Apache POI is as follows:
public static void convert(String inputPath, String outputPath)
throws XWPFConverterException, IOException {
PdfConverter converter = new PdfConverter();
converter.convert(new XWPFDocument(new FileInputStream(new File(
inputPath))), new FileOutputStream(new File(outputPath)),
PdfOptions.create());
}
I do not know what to do to get the chart inside the PDF, could anybody tell me how to proceed?
Thanks in advance.

I don't know if this helps you but you could use "jacob" (I don't know if its possible with apache poi or docx4j)
With this solution you open "Word" yourself and export it as pdf.
!Word needs to be installed on the computer!
Heres the download-page: http://sourceforge.net/projects/jacob-project/
try {
if (System.getProperty("os.arch").contains("64")) {
System.load(DLL_64BIT_PATH);
} else {
System.load(DLL_32BIT_PATH);
}
} catch (UnsatisfiedLinkError e) {
//TODO
} catch (IOException e) {
//TODO
}
ActiveXComponent oleComponent = new ActiveXComponent("Word.Application");
oleComponent.setProperty("Visible", false);
Variant var = Dispatch.get(oleComponent, "Documents");
Dispatch document = var.getDispatch();
Dispatch activeDoc = Dispatch.call(document, "Open", fileName).toDispatch();
// https://msdn.microsoft.com/EN-US/library/office/ff845579.aspx
Dispatch.call(activeDoc, "ExportAsFixedFormat", new Object[] { "path to pdfFile.pdf", new Integer(17), false, 0 });
Object args[] = { new Integer(0) };//private static final int DO_NOT_SAVE_CHANGES = 0;
Dispatch.call(activeDoc, "Close", args);
Dispatch.call(oleComponent, "Quit");

How to create image from PDF using PDFBox in JAVA

I want to create an image from first page of PDF . I am using PDFBox . After researching in web , I have found the following snippet of code :
public class ExtractImages
{
public static void main(String[] args)
{
ExtractImages obj = new ExtractImages();
try
{
obj.read_pdf();
}
catch (IOException ex)
{
System.out.println("" + ex);
}
}
void read_pdf() throws IOException
{
PDDocument document = null;
try
{
document = PDDocument.load("H:\\ct1_answer.pdf");
}
catch (IOException ex)
{
System.out.println("" + ex);
}
List<PDPage>pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
int i =1;
String name = null;
while (iter.hasNext())
{
PDPage page = (PDPage) iter.next();
PDResources resources = page.getResources();
Map pageImages = resources.getImages();
if (pageImages != null)
{
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext()) {
String key = (String) imageIter.next();
PDXObjectImage image = (PDXObjectImage) pageImages.get(key);
image.write2file("H:\\image" + i);
i ++;
}
}
}
}
}
In the above code there is no error . But the output of this code is nothing . I have expected that the above code will produce a series of image which will be saved in H drive . But there is no image in that code produced from this code . Why ?

Without trying to be rude, here is what the code you posted does inside its main working loop:
PDPage page = (PDPage) iter.next();
PDResources resources = page.getResources();
Map pageImages = resources.getImages();
It's getting each page from the PDF file, getting the resources from the page, and extracting the embedded images. It then writes those to disk.
If you are to be a competent software developer you need to be able to research and read documentation. With Java, that means Javadocs. Googling PDPage (or explicitly going to the apache site) turns up the Javadoc for PDPage.
On that page you find two versions of the method convertToImage() for converting the PDPage to an image. Problem solved.
Except ...
Unfortunately, they return a java.awt.image.BufferedImage which based on other questions you have asked is a problem because it is not supported on the Android platform which is what you're working on.
In short, you can't use Apache's PDFBox on Android to do what you're trying to do.
Searching on StackOverflow you find this same question posed several times in different forms, which will lead you to this: https://stackoverflow.com/questions/4665957/pdf-parsing-library-for-android/4766335#4766335 with the following answer that would be of interest to you: https://stackoverflow.com/a/4779852/302916
Unfortunately even the one that the aforementioned answer says will work ... is not very user friendly; there's no "How to" or docs that I can find. It's also labeled as "alpha". This is probably not something for the feint hearted as it's going to require reading and understanding their code to even start using it.

I copied your above code and added following libs to my buildpath in eclipse. It is working.
Apache PDFBox 1.7.1 libs
Commons Logging 1.1.1 libs

Relative paths in Flying Saucer XHTML?

I am using Flying Saucer to render some PDF documents from strings to XHTML. My code is something like:
iTextRenderer.setDocument(documentGenerator.generate(xhtmlDocumentAsString));
iTextRenderer.layout();
iTextRenderer.createPDF(outputStream);
What I'm trying to understand is, when using this method, where are relative paths in the XHTML resolved from? For example, for images or stylesheets. I am able to use this method to successfully generate a text-based document, but I need to understand how to reference my images and CSS.

The setDocument() method takes two parameters: document and url.
The url parameter indicates the base url used to prepend to relative paths that appear in the xhtml, such as in img tags.
Suppose you have:
<img src="images/img1.jpg">
Now suppose the folder "images" is located at:
C:/physical/route/to/app/images/
You may use setDocument() as:
renderer.setDocument(xhtmlDoc, "file:///C:/physical/route/to/app/");
Notice the trailing slash, it won't work without it.
This is the way it worked for me. I assume you could use other types of urls such as "http://...".

This week I worked on this, and I give you what worked fine for me.
In real life, your XHTML document points to multiple resources (images, css, etc.) with relative paths.
You also have to explain to Flying Saucer where to find them. They can be in your classpath, or in your file system. (If they are on the network, you can just set the base url, so this won't help)
So you have to extend the ITextUserAgent like this:
private static class ResourceLoaderUserAgent extends ITextUserAgent {
public ResourceLoaderUserAgent(ITextOutputDevice outputDevice) {
super(outputDevice);
}
protected InputStream resolveAndOpenStream(String uri) {
InputStream is = super.resolveAndOpenStream(uri);
String fileName = "";
try {
String[] split = uri.split("/");
fileName = split[split.length - 1];
} catch (Exception e) {
return null;
}
if (is == null) {
// Resource is on the classpath
try{
is = ResourceLoaderUserAgent.class.getResourceAsStream("/etc/images/" + fileName);
} catch (Exception e) {
}
if (is == null) {
// Resource is in the file system
try {
is = new FileInputStream(new File("C:\\images\\" + fileName));
} catch (Exception e) {
}
}
return is;
}
}
And you use it like this:
// Output stream containing the result
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ITextRenderer renderer = new ITextRenderer();
ResourceLoaderUserAgent callback = new ResourceLoaderUserAgent(renderer.getOutputDevice());
callback.setSharedContext(renderer.getSharedContext());
renderer.getSharedContext().setUserAgentCallback(callback);
renderer.setDocumentFromString(htmlSourceAsString);
renderer.layout();
renderer.createPDF(baos);
renderer.finishPDF();
Cheers.

The best solution for me was:
renderer.setDocumentFromString(htmlContent, new ClassPathResource("/META-INF/pdfTemplates/").getURL().toExternalForm());
Then all the provided styles and images in html (like
<img class="logo" src="images/logo.png" />
<link rel="stylesheet" type="text/css" media="all" href="css/style.css"></link>
) were rendered as expected.

AtilaUy's answer is spot-on for the default way things work in Flying Saucer.
The more general answer is that it asks the UserAgentContext. It will call setBaseURL() on the UserAgentContext when the document is set in. Then it will call resolveURL() to resolve relative URLs and ultimately resolveAndOpenStream() when it wants to read the actual resource data.
Well, this answer is probably way too late for you to make use of it anyway, but I needed an answer like this when I set out, and setting a custom user agent context is the solution I ended up using.

You can either have file paths, which should be absolute, or http:// urls. Relative paths can work but aren't portable because it depends on what directory you ran your program from

I think a easier approach would be:
DomNodeList<DomElement> images = result.getElementsByTagName("img");
for (DomElement e : images) {
e.setAttribute("src", result.getFullyQualifiedUrl(e.getAttribute("src")).toString());
}

Another way to resolve paths is to override UserAgentCallback#resolveURI, which offers a more dynamic behavior than a fixed URL (as in AtilaUy's answer, which looks quite valid for most cases).
This is how I make an XHTMLPane use dynamically-generated stylesheets:
public static UserAgentCallback interceptCssResourceLoading(
final UserAgentCallback defaultAgentCallback,
final Map< URI, CSSResource > cssResources
) {
return new UserAgentCallback() {
#Override
public CSSResource getCSSResource( final String uriAsString ) {
final URI uri = uriQuiet( uriAsString ) ; // Just rethrow unchecked exception.
final CSSResource cssResource = cssResources.get( uri ) ;
if( cssResource == null ) {
return defaultAgentCallback.getCSSResource( uriAsString ) ;
} else {
return cssResource ;
}
}
#Override
public String resolveURI( final String uriAsString ) {
final URI uri = uriQuiet( uriAsString ) ;
if( cssResources.containsKey( uri ) ) {
return uriAsString ;
} else {
return defaultAgentCallback.resolveURI( uriAsString ) ;
}
}
// Delegate all other methods to defaultUserAgentCallback.
} ;
}
Then I use it like that:
final UserAgentCallback defaultAgentCallback =
xhtmlPanel.getSharedContext().getUserAgentCallback() ;
xhtmlPanel.getSharedContext().setUserAgentCallback(
interceptCssResourceLoading( defaultAgentCallback, cssResources ) ) ;
xhtmlPanel.setDocumentFromString( xhtml, null, new XhtmlNamespaceHandler() ) ;

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.