How to create image from PDF using PDFBox in JAVA


I want to create an image from the first page of a PDF. I am using PDFBox. After searching the web, I found the following snippet of code:
public class ExtractImages
{
public static void main(String[] args)
{
ExtractImages obj = new ExtractImages();
try
{
obj.read_pdf();
}
catch (IOException ex)
{
System.out.println("" + ex);
}
}
void read_pdf() throws IOException
{
PDDocument document = null;
try
{
document = PDDocument.load("H:\\ct1_answer.pdf");
}
catch (IOException ex)
{
System.out.println("" + ex);
}
List<PDPage>pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
int i =1;
String name = null;
while (iter.hasNext())
{
PDPage page = (PDPage) iter.next();
PDResources resources = page.getResources();
Map pageImages = resources.getImages();
if (pageImages != null)
{
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext()) {
String key = (String) imageIter.next();
PDXObjectImage image = (PDXObjectImage) pageImages.get(key);
image.write2file("H:\\image" + i);
i ++;
}
}
}
}
}
The code above runs without errors, but it produces no output. I expected it to produce a series of images saved on the H drive, but no image is written at all. Why?

Without trying to be rude, here is what the code you posted does inside its main working loop:
PDPage page = (PDPage) iter.next();
PDResources resources = page.getResources();
Map pageImages = resources.getImages();
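It's getting each page from the PDF file, getting the resources from the page, and extracting the embedded images. It then writes those to disk. In other words, it extracts images that are embedded in a page; it does not render the page itself as an image.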
If you are to be a competent software developer you need to be able to research and read documentation. With Java, that means Javadocs. Googling PDPage (or explicitly going to the apache site) turns up the Javadoc for PDPage.
On that page you find two versions of the method convertToImage() for converting the PDPage to an image. Problem solved.
Except ...
Unfortunately, they return a java.awt.image.BufferedImage, which, based on other questions you have asked, is a problem: it is not supported on the Android platform, which is what you're working on.
In short, you can't use Apache's PDFBox on Android to do what you're trying to do.
Searching on StackOverflow you find this same question posed several times in different forms, which will lead you to this: https://stackoverflow.com/questions/4665957/pdf-parsing-library-for-android/4766335#4766335 with the following answer that would be of interest to you: https://stackoverflow.com/a/4779852/302916
Unfortunately, even the one that the aforementioned answer says will work is not very user friendly; there's no "How to" or docs that I can find. It's also labeled as "alpha". This is probably not something for the faint-hearted, as it's going to require reading and understanding their code to even start using it.
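For readers on a desktop JVM, where java.awt is available, the step after convertToImage() is ordinary javax.imageio. The sketch below is only an illustration and not code from this answer: PageImageWriter and its helpers are hypothetical names, and blankPage() is a stand-in for the BufferedImage that convertToImage() would return, so that the example runs without PDFBox on the classpath.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox.
final class PageImageWriter {

    // Writes an already rendered page image to disk as PNG and returns the file.
    static File writePng(BufferedImage pageImage, File target) {
        try {
            // ImageIO.write returns false when no writer for the format is found.
            if (!ImageIO.write(pageImage, "png", target)) {
                throw new IOException("No PNG writer available");
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for a rendered page (assumption): a plain white image.
    static BufferedImage blankPage(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                image.setRGB(x, y, 0xFFFFFF);
            }
        }
        return image;
    }

    // Creates a temporary .png file to write into.
    static File tempPng() {
        try {
            return File.createTempFile("page1-", ".png");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File out = writePng(blankPage(200, 300), tempPng());
        System.out.println("Wrote " + out.length() + " bytes to " + out);
    }
}
```

In real use, the image passed to writePng() would be the result of convertToImage() on the first page rather than blankPage(). None of this helps on Android, for the reason given above.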

I copied your code above and added the following libraries to my build path in Eclipse. It works.
Apache PDFBox 1.7.1 libs
Commons Logging 1.1.1 libs

Related

How to append a PDF file to an existing one with iText?

In an application I am trying to append multiple PDF files to a single, already existing file.
Using iText I found this tutorial, which in my case doesn't seem to work.
Here are some ways I've tried to make it work.
String path = "path/to/destination.pdf";
PdfCopy mergedFile = new PdfCopy(pdf, new FileOutputStream(path));
PdfReader reader;
for(String toMergePath : toMergePaths){
reader = new PdfReader(toMergePath);
mergedFile.addDocument(reader);
mergedFile.freeReader(reader);
reader.close();
}
mergedFile.close();
When I try to add the document, logcat tells me that the document is not open.
But pdf (the original document) has already been opened by other methods and is closed only after this one. And mergedFile is exactly as in the tutorial, which, I believe, must be right.
Has anyone experienced the same problem? Otherwise, does anyone know a better method to do what I want to do?
I've seen other solutions that copy the bytes from one page and append them to a new file, but I'm afraid this will "compile" the annotations, which I need.
Thank you for your help,
Cordially,
Matthieu Meunier

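I hope this code will help you. Note the order of the calls: the Document is opened only after the PdfCopy has been created on it. As far as I can tell, a PdfCopy attached to a Document that was already opened elsewhere is itself never opened, which would explain the "document is not open" message.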
public static void mergePdfs() {
    String[] files = { "D:\\1.pdf", "D:\\2.pdf", "D:\\3.pdf", "D:\\4.pdf" };
    try {
        Document document = new Document();
        // Attach the PdfCopy writer first, then open the document.
        PdfCopy copy = new PdfCopy(document, new FileOutputStream("D:\\CombinedFile.pdf"));
        document.open();
        for (String file : files) {
            PdfReader reader = new PdfReader(file);
            copy.addDocument(reader);  // append every page of this input file
            copy.freeReader(reader);   // release the reader once its pages are copied
        }
        // Closing the document finishes the merged file.
        document.close();
    } catch (Exception e) {
        System.out.println(e);
    }
}

How to remove a specific image from a PDF with PDFBox

I need to remove a specific image from a PDF file according to its metadata. Sadly, all the examples I can find on the Internet use deprecated methods.
I wrote something like this:
try (PDDocument doc = PDDocument.load(new ByteArrayInputStream(pdf))) {
doc.getPages().forEach(page ->
{
PDResources resources = page.getResources();
List<COSName> itemsToRemove = new ArrayList<>();
resources.getXObjectNames().forEach(propertyName -> {
if(!resources.isImageXObject(propertyName)) {
return;
}
PDXObject pdxObject = resources.getXObject(propertyName);
PDImageXObject pdImageXObject = (PDImageXObject)pdxObject;
PDMetadata metadata = pdImageXObject.getMetadata();
if(checkMetadata(metadata)){
// What should I use here?
page.getCOSObject().removeItem(propertyName);
}
});
// Should I use page.setResources(resources); ?
});
doc.save(baos);
} catch (Exception e) {
//Code here
}

It works the same way as in the RemoveAllText.java example, just with a different operator.
Use the code from that example, but look for "Do" instead of "Tj".
Of course, if you need to load metadata etc., you should enumerate and check the images through the page resources (like in my example).

Jsoup code that works for Eclipse but not Android Studio (httpurlconnectionimpl)

I am working on a small app for myself and I just don't understand why my code is working in Eclipse but not on my phone using Android Studio.
public static ArrayList<Link> getLinksToChoose(String searchUrl) {
ArrayList<Link> linkList = new ArrayList<Link>();
try {
System.out.println(searchUrl);
Document doc = Jsoup.connect(searchUrl).timeout(3000).userAgent("Chrome").get();
Elements links = doc.select("tr");
links.remove(0);
Elements newLinks = new Elements();
for(Element link : links) {
Link newLink = new Link(getURL(link),getName(link),getLang(link));
linkList.add(newLink);
}
} catch(IOException e){
e.printStackTrace();
}
return linkList;
}
The problem is I can't even get the Document. I always get an httpurlconnectionimpl in the line where I try to get the html doc. I have read a bit about Jsoup in Android. Some people suggest using AsyncTask but it doesn't seem like that would solve my problem.

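The loading of the content must happen outside the main thread, e.g. in an AsyncTask. For apps targeting API level 11 or higher, Android throws a NetworkOnMainThreadException when network access is attempted on the main (UI) thread.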
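To illustrate the idea without any Android classes, here is a minimal sketch using plain java.util.concurrent. BackgroundLoader and its methods are hypothetical names; the lambda is a placeholder for the blocking Jsoup call from the question and merely reports which thread it ran on.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: runs blocking work on a worker thread.
final class BackgroundLoader {

    // Submits the task to a single worker thread and returns its result.
    static <T> T runOnWorker(Callable<T> task) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(task);
            // get() blocks the caller. That is acceptable in this demo, but on
            // Android the result should be handed back to the UI thread through
            // a callback instead of blocking it.
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Background task failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    // Placeholder for the blocking download: returns the name of the thread
    // the task actually ran on.
    static String workerThreadName() {
        return runOnWorker(() -> Thread.currentThread().getName());
    }

    public static void main(String[] args) {
        System.out.println("Caller thread: " + Thread.currentThread().getName());
        System.out.println("Task ran on:   " + workerThreadName());
    }
}
```

The point of the sketch is only that the task body executes on a different thread than the caller. In an app, the body of the lambda would be the Jsoup.connect(...).get() call, and the result would be delivered to the UI asynchronously.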

Error: org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

I am trying to extract images from a PDF using PDFBox. I have taken help from this post. It worked for some of the PDFs, but for most of the others it did not. For example, I am not able to extract the figures in this file.
After doing some research I found that PDResources.getImages is deprecated. So, I am using PDResources.getXObjects(). With this, I am not able to extract any image from the PDF and instead get this message at the console:
org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage
Now I am stuck and unable to find the solution. Please assist if anyone can.
//////UPDATE AS REPLY ON COMMENTS///
I am using pdfbox-1.8.10
Here is the code:
public void getimg ()throws Exception {
try {
String sourceDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/inputs/Yavaa.pdf";
String destinationDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/outputs/";
File oldFile = new File(sourceDir);
if (oldFile.exists()){
PDDocument document = PDDocument.load(sourceDir);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
String fileName = oldFile.getName().replace(".pdf", "_cover");
int totalImages = 1;
for (PDPage page : list) {
PDResources pdResources = page.getResources();
Map pageImages = pdResources.getXObjects();
if (pageImages != null){
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext()){
String key = (String) imageIter.next();
Object obj = pageImages.get(key);
if(obj instanceof PDXObjectImage) {
PDXObjectImage pdxObjectImage = (PDXObjectImage) obj;
pdxObjectImage.write2file(destinationDir + fileName+ "_" + totalImages);
totalImages++;
}
}
}
}
} else {
System.err.println("File not exist");
}
}
catch (Exception e){
System.err.println(e.getMessage());
}
}
//// PARTIAL SOLUTION/////
I have solved the problem of the error message and updated the code in the post accordingly. However, the underlying problem remains: I am still not able to extract the images from some of the files, such as the one mentioned in this post. Is there any solution for that?

The first problem with the original code is that XObjects can be PDXObjectImage or PDXObjectForm, so the instance type has to be checked. The second problem is that the code doesn't walk PDXObjectForm recursively; forms can have resources too. The third problem (only in 1.8) is that you used getResources() instead of findResources(); getResources() doesn't check higher levels.
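The first two points (check the type, then recurse into forms) can be sketched independently of PDFBox. The types below are hypothetical stand-ins, not PDFBox classes: Image plays the role of PDXObjectImage and Form the role of PDXObjectForm with its own nested XObjects. The sketch assumes the nesting contains no cycles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the traversal; none of these types belong to PDFBox.
final class XObjectWalker {

    interface XObject {}

    // Stand-in for an image XObject.
    static final class Image implements XObject {
        final String name;
        Image(String name) { this.name = name; }
    }

    // Stand-in for a form XObject, which carries its own XObjects.
    static final class Form implements XObject {
        final Map<String, XObject> xobjects = new LinkedHashMap<>();
    }

    // Collects all images, including those nested inside forms.
    static List<Image> collectImages(Map<String, XObject> xobjects) {
        List<Image> found = new ArrayList<>();
        collect(xobjects, found);
        return found;
    }

    private static void collect(Map<String, XObject> xobjects, List<Image> found) {
        if (xobjects == null) {
            return; // a page or form without XObjects
        }
        for (XObject xobject : xobjects.values()) {
            if (xobject instanceof Image) {
                found.add((Image) xobject);              // type check instead of a blind cast
            } else if (xobject instanceof Form) {
                collect(((Form) xobject).xobjects, found); // forms can contain images too
            }
        }
    }

    // Demo data: one image on the page, one inside a form, one inside a nested form.
    static Map<String, XObject> demoPage() {
        Form inner = new Form();
        inner.xobjects.put("Im3", new Image("Im3"));
        Form outer = new Form();
        outer.xobjects.put("Im2", new Image("Im2"));
        outer.xobjects.put("Fm2", inner);
        Map<String, XObject> page = new LinkedHashMap<>();
        page.put("Im1", new Image("Im1"));
        page.put("Fm1", outer);
        return page;
    }

    public static void main(String[] args) {
        for (Image image : collectImages(demoPage())) {
            System.out.println(image.name);
        }
    }
}
```

A loop that only looks at the top-level map would find Im1 alone; the recursion is what reaches Im2 and Im3.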
Code for 1.8 can be found here:
https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup
Code for 2.0 can be found here:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup&sortby=date
(Even these are not always perfect, see this answer)
The fourth problem is that your file doesn't have any XObjects at all. All "graphics" were really vector drawings, these can't be "extracted" like embedded images. All you could do is to convert the PDF pages to images, and then mark and cut what you need.
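The "cut what you need" step can be done with plain java.awt once a page has been rendered to a BufferedImage. This is a minimal sketch under that assumption: FigureCropper is a hypothetical name, and the region coordinates are made up; in practice they would come from wherever the figure is marked on the rendered page.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper for cutting a region out of a rendered page image.
final class FigureCropper {

    // Returns an independent copy of the given region, clipped to the page bounds.
    static BufferedImage crop(BufferedImage page, Rectangle region) {
        Rectangle bounds = new Rectangle(0, 0, page.getWidth(), page.getHeight());
        Rectangle clipped = region.intersection(bounds);
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("Region lies outside the page: " + region);
        }
        // getSubimage shares pixel data with the page, so copy the pixels out.
        BufferedImage view = page.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
        BufferedImage copy = new BufferedImage(clipped.width, clipped.height, BufferedImage.TYPE_INT_RGB);
        int[] pixels = view.getRGB(0, 0, clipped.width, clipped.height, null, 0, clipped.width);
        copy.setRGB(0, 0, clipped.width, clipped.height, pixels, 0, clipped.width);
        return copy;
    }

    public static void main(String[] args) {
        // Stand-in for a rendered page (assumption): an empty 600 x 800 image.
        BufferedImage page = new BufferedImage(600, 800, BufferedImage.TYPE_INT_RGB);
        BufferedImage figure = crop(page, new Rectangle(50, 100, 300, 200));
        System.out.println(figure.getWidth() + " x " + figure.getHeight());
    }
}
```

The result is a raster cut-out of the rendered page, not the original vector drawing, so its quality depends on the resolution the page was rendered at.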

Converting a docx containing a chart to PDF

I've got a docx4j generated file which contains several tables, titles and, finally, an excel-generated curve chart.
I have tried many approaches in order to convert this file to PDF, but did not get to any successful result.
Docx4j with xsl-fo did not work, most of the things included in the docx file are not yet implemented and show up in red text as "not implemented".
JODConverter did not work either, I got a resulting PDF in which everything was pretty good (just little formatting/styling issues) BUT the graph did not show up.
Finally, the closest approach was using Apache POI: The resulting PDF was identical to my docx file, but still no chart showing up.
I already know Aspose would solve this pretty easily, but I am looking for an open source, free solution.
The code I am using with Apache POI is as follows:
public static void convert(String inputPath, String outputPath)
throws XWPFConverterException, IOException {
PdfConverter converter = new PdfConverter();
converter.convert(new XWPFDocument(new FileInputStream(new File(
inputPath))), new FileOutputStream(new File(outputPath)),
PdfOptions.create());
}
I do not know what to do to get the chart inside the PDF, could anybody tell me how to proceed?
Thanks in advance.

I don't know if this helps you, but you could use "jacob" (I don't know if it's possible with Apache POI or docx4j).
With this solution you open Word yourself and export the document as PDF.
Note: Word needs to be installed on the computer!
Here's the download page: http://sourceforge.net/projects/jacob-project/
// DLL_64BIT_PATH and DLL_32BIT_PATH are the paths to the jacob DLLs.
// System.load throws UnsatisfiedLinkError (unchecked), never IOException,
// so catching IOException here would not even compile.
try {
    if (System.getProperty("os.arch").contains("64")) {
        System.load(DLL_64BIT_PATH);
    } else {
        System.load(DLL_32BIT_PATH);
    }
} catch (UnsatisfiedLinkError e) {
    //TODO
}
// Start Word invisibly and open the document.
ActiveXComponent oleComponent = new ActiveXComponent("Word.Application");
oleComponent.setProperty("Visible", false);
Variant var = Dispatch.get(oleComponent, "Documents");
Dispatch document = var.getDispatch();
Dispatch activeDoc = Dispatch.call(document, "Open", fileName).toDispatch();
// https://msdn.microsoft.com/EN-US/library/office/ff845579.aspx
// 17 = wdExportFormatPDF
Dispatch.call(activeDoc, "ExportAsFixedFormat", new Object[] { "path to pdfFile.pdf", Integer.valueOf(17), false, 0 });
Object args[] = { Integer.valueOf(0) }; // DO_NOT_SAVE_CHANGES = 0
Dispatch.call(activeDoc, "Close", args);
Dispatch.call(oleComponent, "Quit");
