I have an existing PDF from which I want to retrieve images
NOTE:
In the Documentation, this is the RESULT variable
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
I don't understand why this image is needed; I just want to extract the images from my PDF file.
Now, when I use MyImageRenderListener listener = new MyImageRenderListener(RESULT);
I am getting the error:
results\part4\chapter15\Img16.jpg (The system
cannot find the path specified)
This is the code that I have:
package part4.chapter15;
import java.io.IOException;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
/**
* Extracts images from a PDF file.
*/
public class ExtractImages {
/** The source PDF from which images will be extracted. */
public static final String RESOURCE = "resources/pdfs/samplefile.pdf";
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
/**
* Parses a PDF and extracts all the images.
* @param src the source PDF
* @param dest the resulting PDF
*/
public void extractImages(String filename)
throws IOException, DocumentException {
PdfReader reader = new PdfReader(filename);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
MyImageRenderListener listener = new MyImageRenderListener(RESULT);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
parser.processContent(i, listener);
}
reader.close();
}
/**
* Main method.
* @param args no arguments needed
* @throws DocumentException
* @throws IOException
*/
public static void main(String[] args) throws IOException, DocumentException {
new ExtractImages().extractImages(RESOURCE);
}
}
You have two questions and the answer to the first question is the key to the answer of the second.
Question 1:
You refer to:
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
And you ask: why is this image needed?
That question is wrong, because Img%s.%s is not a filename of an image, it's a pattern of the filename of an image. While parsing, iText will detect images in the PDF. These images are stored in numbered objects (e.g. object 16) and these images can be exported in different formats (e.g. jpg, png,...).
Suppose that an image is stored in object 16 and that this image is a jpg, then the pattern will resolve to Img16.jpg.
Question 2:
Why do I get an error:
results\part4\chapter15\Img16.jpg (The system cannot find the path specified)
In your PDF, there's a jpg stored in object 16. You are asking iText to store that image using this path: results\part4\chapter15\Img16.jpg (as explained in my answer to Question 1). However, your working directory doesn't have the subdirectories results\part4\chapter15\, hence an IOException (more precisely, a FileNotFoundException) is thrown.
What is the general problem?
You have copy/pasted the ExtractImages example I wrote for my book "iText in Action - Second Edition", but:
You didn't read that book, so you have no idea what that code is supposed to do.
You aren't telling the readers on StackOverflow that this example depends on the MyImageRenderListener class, which is where all the magic happens.
How can you solve your problem?
Option 1:
Change RESULT like this:
public static final String RESULT = "Img%s.%s";
Now the images will be stored in your working directory.
Option 2:
Adapt the MyImageRenderListener class, more specifically this method:
public void renderImage(ImageRenderInfo renderInfo) {
try {
String filename;
FileOutputStream os;
PdfImageObject image = renderInfo.getImage();
if (image == null) return;
filename = String.format(path,
renderInfo.getRef().getNumber(), image.getFileType());
os = new FileOutputStream(filename);
os.write(image.getImageAsBytes());
os.flush();
os.close();
} catch (IOException e) {
System.out.println(e.getMessage());
}
}
iText calls this method whenever an image is encountered and passes it an ImageRenderInfo object that contains plenty of information about that image.
In this implementation, we store the image bytes as a file. This is how we create the path to that file:
String.format(path,
renderInfo.getRef().getNumber(), image.getFileType())
As you can see, the pattern stored in RESULT is used in such a way that the first occurrence of %s is replaced with a number and the second occurrence with a file extension.
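For illustration only (this snippet is not part of the original example), the pattern resolves like this for a jpg stored in object 16:
String path = String.format("results/part4/chapter15/Img%s.%s", 16, "jpg");
// path is now "results/part4/chapter15/Img16.jpg"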
You could easily adapt this method so that it stores the images as byte[] in a List if that is what you want.
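For example, here is a minimal sketch of such an adaptation; the class name and the images field are my own, not part of the book's code, but the RenderListener callbacks and PdfImageObject calls are the same iText 5 API used above:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
public class ImageCollectingListener implements RenderListener {
    // collects the raw bytes of every image encountered while parsing
    private final List<byte[]> images = new ArrayList<byte[]>();
    public List<byte[]> getImages() {
        return images;
    }
    public void renderImage(ImageRenderInfo renderInfo) {
        try {
            PdfImageObject image = renderInfo.getImage();
            if (image == null) return;
            // keep the image bytes in memory instead of writing a file
            images.add(image.getImageAsBytes());
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
    // the remaining RenderListener callbacks are not needed for image extraction
    public void beginTextBlock() { }
    public void renderText(TextRenderInfo renderInfo) { }
    public void endTextBlock() { }
}
You would pass this listener to PdfReaderContentParser.processContent() exactly as in the example above and call getImages() once parsing is done.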
We have a few forms created in Adobe LiveCycle. Users fill in the dynamic forms and submit the documents to our office, where we stamp them with our signature and flatten them (at least most of the time; I've seen a few documents in our system that haven't been flattened yet, but that can be a separate question, so I'll focus on the flattened documents here because that's most of what we have).
I'm trying to use iText 7 to parse/extract the users' answers to our forms, for a migration to an electronic solution that will happen a few months from now. I was able to make the example work in Java, but I don't understand the process.
/*
This file is part of the iText (R) project.
Copyright (c) 1998-2020 iText Group NV
Authors: iText Software.
For more information, please contact iText Software at this address:
sales@itextpdf.com
*/
/**
* Example written by Bruno Lowagie in answer to:
* http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
*/
package ca.umanitoba.ad.research;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.filter.TextRegionEventFilter;
import com.itextpdf.kernel.pdf.canvas.parser.listener.FilteredEventListener;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.FileOutputStream;
import java.io.Writer;
import java.io.BufferedWriter;
public class Main {
public static final String DEST = "./target/txt/parse_custom.txt";
public static final String SRC = "./src/main/resources/pdfs/nameddestinations.pdf";
public static void main(String[] args) throws IOException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new Main().manipulatePdf(DEST);
}
protected void manipulatePdf(String dest) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
Rectangle rect = new Rectangle(36, 750, 523, 56);
CustomFontFilter fontFilter = new CustomFontFilter(rect);
FilteredEventListener listener = new FilteredEventListener();
// Create a text extraction renderer
LocationTextExtractionStrategy extractionStrategy = listener
.attachEventListener(new LocationTextExtractionStrategy(), fontFilter);
// Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.processPageContent(pdfDoc.getFirstPage());
// Get the resultant text after applying the custom filter
String actualText = extractionStrategy.getResultantText();
pdfDoc.close();
// See the resultant text in the console
System.out.println(actualText);
try (Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dest)))) {
writer.write(actualText);
}
}
/*
* The custom filter filters only the text of which the font name ends with Bold or Oblique.
*/
protected class CustomFontFilter extends TextRegionEventFilter {
public CustomFontFilter(Rectangle filterRect) {
super(filterRect);
}
@Override
public boolean accept(IEventData data, EventType type) {
if (type.equals(EventType.RENDER_TEXT)) {
TextRenderInfo renderInfo = (TextRenderInfo) data;
PdfFont font = renderInfo.getFont();
if (null != font) {
String fontName = font.getFontProgram().getFontNames().getFontName();
return fontName.endsWith("Bold") || fontName.endsWith("Oblique");
}
}
return false;
}
}
}
Why is there a need to specify a Rectangle? Our forms are dynamic, so users can add more fields as needed, and we also accept paragraph answers for some of the questions, so the length will always vary and it's unlikely that the coordinates of the text will ever be the same.
How can I change the flow so that I can perhaps just search for the question and then get the text right after it (presumably the answer)? I don't really know what the best way to parse a PDF is. If there's no other way except providing a Rectangle, can I programmatically determine the coordinates/dimensions of the rectangles?
From the example it looks like the text is filtered based on whether it's bold or italic, which I probably don't need, but that looks easy enough to change by modifying or removing the accept() method.
Please take a look at what that example is for: In the JavaDoc comment you can read
/**
* Example written by Bruno Lowagie in answer to:
* http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
*/
and that stack overflow question starts with
I used the following code to get data in PDF from a particular location. I want to get bold text present in that location
When you wonder, therefore,
Why is there a need to specify a Rectangle?
the answer is: because the example is about finding bold text in a particular location.
You mention your forms were dynamic before flattening and fields don't have fixed positions. Thus, this filter probably is not optimal for your use case.
How can I change the flow so that I can perhaps just search for the question and then get the text right after it
In that case simply don't filter at all but use a plain LocationTextExtractionStrategy to extract text, search for the question text in the extracted text, and use the text thereafter up to the next question text.
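As a minimal sketch of that approach (the file name and the question label "Date of birth:" are made-up placeholders, and the answer is assumed to end at the next line break):
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;
import java.io.IOException;
public class ExtractAnswer {
    public static void main(String[] args) throws IOException {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader("flattened-form.pdf"));
        // extract the full text of the first page, in reading order
        String pageText = PdfTextExtractor.getTextFromPage(
                pdfDoc.getFirstPage(), new LocationTextExtractionStrategy());
        pdfDoc.close();
        String question = "Date of birth:";
        int start = pageText.indexOf(question);
        if (start >= 0) {
            int answerStart = start + question.length();
            int nextLineBreak = pageText.indexOf('\n', answerStart);
            String answer = (nextLineBreak < 0
                    ? pageText.substring(answerStart)
                    : pageText.substring(answerStart, nextLineBreak)).trim();
            System.out.println(answer);
        }
    }
}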
Alternatively, if you still have the unflattened dynamic forms, you may consider extracting the XFA XML and reading the filled-in data from that XML.
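A rough sketch of that alternative, assuming you still have an unflattened XFA form and the iText 7 forms module on the classpath (the input path is a placeholder):
import com.itextpdf.forms.PdfAcroForm;
import com.itextpdf.forms.xfa.XfaForm;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import java.io.IOException;
public class ExtractXfaData {
    public static void main(String[] args) throws IOException {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader("dynamic-form.pdf"));
        XfaForm xfa = PdfAcroForm.getAcroForm(pdfDoc, false).getXfaForm();
        if (xfa.isXfaPresent()) {
            // the XFA package as a DOM document; the filled-in values live
            // under the xfa:datasets element
            org.w3c.dom.Document xfaDom = xfa.getDomDocument();
            System.out.println(xfaDom.getDocumentElement().getNodeName());
        }
        pdfDoc.close();
    }
}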
I'm working on .opus music library software which converts audio/video files to .opus files and tags them with metadata automatically.
Previous versions of the program have apparently saved the album art as binary data, as revealed by exiftool.
The thing is that when I run the command to output the data as binary using the -b option, the entire thing seems to be binary. I'm not sure how to get the program to parse it. I was kind of expecting an entry like Picture : 11010010101101101011....
The output looks similar to this though:
How can I parse the picture data so I can reconstruct the image for newer versions of the program? (I'm using Java8_171 on Kubuntu 18.04)
It looks like you're trying to open the raw bytes in a text editor, which will of course give you gobbledygook, since those raw bytes don't represent characters that any text editor can display. I can see from your exiftool output that you know the length of the image in bytes. Provided you also know the starting byte position of the image in the file, this should make your task relatively easy with a little bit of Java code. If you can get the starting position of the image inside your file, you should be able to do something like:
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.*;
public class SaveImage {
public static void main(String[] args) throws IOException {
byte[] imageBytes;
try (RandomAccessFile binaryReader =
new RandomAccessFile("your-file.xxx", "r")) {
int dataLength = 0; // Assign this the byte length shown in your
// post instead of zero
long startPos = 0; // I assume you can find this somehow.
                   // If it's not at the beginning,
                   // change it accordingly.
imageBytes = new byte[dataLength];
binaryReader.seek(startPos);        // jump to where the image data starts in the file
binaryReader.readFully(imageBytes); // read exactly dataLength bytes into the buffer
}
try (InputStream in = new ByteArrayInputStream(imageBytes)) {
BufferedImage bImageFromConvert = ImageIO.read(in);
ImageIO.write(bImageFromConvert,
"jpg", // or whatever file format is appropriate
new File("/path/to/your/file.jpg"));
}
}
}
I'm writing a Java program to swap images inside a PDF. Due to the generation process they are stored as high-DPI RGB images, although they are really bitonal/monochrome images. I'm using iText 7.1.1, but have also tested the latest dev version (7.1.2 snapshot).
I'm already able to extract the images from the PDF and convert them to PNG or TIFF using indexed colours or gray (0 & 255 only) in ImageMagick (I also tested GIMP).
I modified some code from iText to replace the images inside the PDF, which works for DeviceRGB and DeviceGray images, but not for bitonal ones:
public static Image readPng(String pImageFolder, int pImageNumber) throws IOException {
String url = "./" + pImageFolder + "/" + pImageNumber + ".png";
File ifile = new File(url);
if (ifile.exists() && ifile.isFile()) {
return new Image(ImageDataFactory.create(url));
} else {
return null;
}
}
public static void replaceStream(PdfStream orig, PdfStream stream) throws IOException {
orig.clear();
orig.setData(stream.getBytes());
for (PdfName name : stream.keySet()) {
orig.put(name, stream.get(name));
}
}
public static void replaceImages(String pFilename, String pImagefolder, String pOutputFilename) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(pFilename), new PdfWriter(pOutputFilename));
for (int i = 0; i < pdfDoc.getNumberOfPages(); i++) {
PdfDictionary page = pdfDoc.getPage(i + 1).getPdfObject();
PdfDictionary resources = page.getAsDictionary(PdfName.Resources);
PdfDictionary xobjects = resources.getAsDictionary(PdfName.XObject);
Iterator<PdfName> iter = xobjects.keySet().iterator();
PdfName imgRef;
PdfStream stream;
Image img;
int number;
while (iter.hasNext()) {
imgRef = iter.next();
number = xobjects.get(imgRef).getIndirectReference().getObjNumber();
stream = xobjects.getAsStream(imgRef);
img = readPng(pImagefolder, number);
if (img != null) {
replaceStream(stream, img.getXObject().getPdfObject());
}
}
}
pdfDoc.close();
}
If I convert the images to TIFF and use them as replacements, the images in the PDF are dark (all pixels are black). If I try to use PNG images, they are not shown and pdfimages complains "Unknown compression method in flate stream".
FYI:
There was an error in my replaceStream: getBytes() returns the decoded (inflated) bytes of a PdfStream, while all stream attributes were copied as well, including the Filter entry saying that FlateDecoding is still required.
I had to tell getBytes() not to decode the stream by setting the decoded parameter to false: getBytes(false)
public static void replaceStream(PdfStream orig, PdfStream stream) throws IOException {
orig.clear();
orig.setData(stream.getBytes(false));
for (PdfName name : stream.keySet()) {
orig.put(name, stream.get(name));
}
}
Now everything works fine, except:
Bitonal images are not CCITT4-encoded, as they should be. (That doesn't matter, because they are converted to JBIG2.)
Acrobat says the images have an error, but every other viewer displays them just fine: there seems to be an error in the ColorSpace information. It should be DeviceGray, but it is CalGray with some Gamma information and a missing WhitePoint. Changing it to DeviceGray by hand makes it work. A workaround is to strip the gAMA and cHRM chunks from the PNGs.
Both are conversion errors in iText7:
CCITT4: PNGImageHelper line 254 should be RawImageHelper.updateRawImageParameters(png.image, png.width, png.height, components, bpc, png.idat.toByteArray(), null); to trigger conversion.
WhitePoint is correctly read from the file and stored inside the ImageData class, but is discarded inside PdfImageXObject -> createPdfStream.
I searched for how to detect that a file is .xls and found a solution like this (but I need one that is not deprecated):
POIFSFileSystem:
@Deprecated
@Removal(version="4.0")
public static boolean hasPOIFSHeader(InputStream inp) throws IOException {
return FileMagic.valueOf(inp) == FileMagic.OLE2;
}
But this one also returns true for Microsoft Word documents, for example .doc files.
Is there a way to detect .xls document?
Both .doc and .xls documents are stored in the OLE2 storage format. org.apache.poi.poifs.filesystem.FileMagic only helps you detect the file storage format and is not sufficient on its own to distinguish between .doc and .xls files.
It also does not appear that there is any direct API in the POI library to determine the document type (spreadsheet or word document) for a given input stream/file.
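For illustration, a small sketch showing why FileMagic alone is not enough (the file names are placeholders; both a .doc and a .xls report OLE2):
import org.apache.poi.poifs.filesystem.FileMagic;
import java.io.*;
public class FileMagicDemo {
    public static void main(String[] args) throws IOException {
        for (String name : new String[]{"report.doc", "budget.xls"}) {
            // prepareToCheckMagic wraps the stream so that mark/reset is supported
            try (InputStream in = FileMagic.prepareToCheckMagic(
                    new BufferedInputStream(new FileInputStream(name)))) {
                System.out.println(name + " -> " + FileMagic.valueOf(in)); // OLE2 for both
            }
        }
    }
}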
The example below may be helpful to determine whether a given stream is a valid .xls (or .xlsx) file, with the caveat that it reads the given input stream and closes it.
// slurps content from the given input stream and closes it
public static boolean isExcelFile(InputStream in) throws IOException {
try {
// this slurps the input stream
Workbook workbook = org.apache.poi.ss.usermodel.WorkbookFactory.create(in);
workbook.close();
return true;
} catch (java.lang.IllegalArgumentException | org.apache.poi.openxml4j.exceptions.InvalidFormatException e) {
return false;
}
}
You may find more information on the Excel file format via this link.
Update
Solution based on Apache Tika as suggested by gagravarr:
public class TikaBasedFileTypeDetector {
private Tika tika;
private TemporaryResources temporaryResources;
public void init() {
this.tika = new Tika();
this.temporaryResources = new TemporaryResources();
}
// clean up all the temporary resources
public void destroy() throws IOException {
temporaryResources.close();
}
// return content mime type
public String detectType(InputStream in) throws IOException {
TikaInputStream tikaInputStream = TikaInputStream.get(in, temporaryResources);
return tika.detect(tikaInputStream);
}
public boolean isExcelFile(InputStream in) throws IOException{
// see https://stackoverflow.com/a/4212908/1700467 for information on mimetypes
String type = detectType(in);
return type.startsWith("application/vnd.ms-excel") || // for the Microsoft binary (.xls) format
type.startsWith("application/vnd.openxmlformats-officedocument.spreadsheetml"); // for the Office Open XML (.xlsx) format
}
}
See this answer on mime types.
You can work with Apache POI's HSSF module.
That module (library) is written to read and write .xls files (and there is XSSF for the newer .xlsx as well, although those are different formats).
With this code...
InputStream excelFileToRead = new FileInputStream("FileNameWithLink.xls");
HSSFWorkbook wb = new HSSFWorkbook(excelFileToRead);
HSSFSheet sheet = wb.getSheetAt(0);
...you can detect whether it is a readable .xls file.
Going deeper, you can use this code to try reading it, etc. That module is really easy to use.
There can be situations where a file technically is an .xls file but is not readable (there can be various problems with it).
Extra - XSSF is for .xlsx and HSSF is for .xls.
I haven't used other techniques, as I always want to be sure that I will be able to read that file later.
You can use docx4j. Load the file with OpcPackage.load() and then check the content type.
OpcPackage.load()
* Convenience method to create a WordprocessingMLPackage
* or PresentationMLPackage
* from an inputstream (.docx/.docxm, .pptx or Flat OPC .xml).
* It detects the convenient format inspecting two first bytes of stream (magic bytes).
* For office 2007 'x' formats, these two bytes are 'PK' (same as zip file)
load() returns an OpcPackage, which is the abstract class that GloxPackage, PresentationMLPackage, SpreadsheetMLPackage and WordprocessingMLPackage are based on, so this should work for Word, Excel and PowerPoint documents.
A basic check
public final String XLSX_FILE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml";
public final String WORD_FILE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";
public final String UNKNOWN_FILE = "UNKNOWN";
public boolean isFileXLSX(String fileLocation) {
return getContentTypeFromFile(fileLocation).equals(XLSX_FILE);
}
public String getContentTypeFromFile(String fileLocation) {
try {
return OpcPackage.load(new File(fileLocation)).getContentType();
} catch (Docx4JException e) {
return UNKNOWN_FILE;
}
}
You should see values like
application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml
application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml
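Hypothetical usage of the helpers above, assumed to run inside the same class that declares them (the path is a made-up placeholder):
String path = "/data/uploads/quarterly-report.xlsx";
System.out.println(getContentTypeFromFile(path));
System.out.println(isFileXLSX(path) ? "OOXML spreadsheet" : "not an OOXML spreadsheet");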
I'm using PDFBox to extract information from a PDF, and the information I'm currently trying to find relates to the x-position of the first character in each line. I can't find anything about how to get that information, though. I know PDFBox has a class called TextPosition, but I can't find out how to get a TextPosition object from the PDDocument either. How do I get the location information of a line of text in a PDF?
In general
To extract text (with or without extra information like positions, colors, etc.) using PDFBox, you instantiate a PDFTextStripper or a class derived from it and use it like this:
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
(There are a number of PDFTextStripper attributes allowing you to restrict the pages from which text is extracted.)
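For example (a small sketch, not part of the original answer; page numbers are 1-based and document is the PDDocument you loaded):
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(2); // only extract text from pages 2 and 3
stripper.setEndPage(3);
String text = stripper.getText(document);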
In the course of the execution of getText the content streams of the pages in question (and those of form xObjects referenced from those pages) are parsed and text drawing commands are processed.
If you want to change the text extraction behavior, you have to change this processing of the text drawing commands, which you most often should do by overriding this method:
/**
* Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
* and just calls {@link #writeString(String)}.
*
* @param text The text to write to the stream.
* @param textPositions The TextPositions belonging to the text.
* @throws IOException If there is an error when writing the text.
*/
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
writeString(text);
}
If you additionally need to know when a new line starts, you may also want to override
/**
* Write the line separator value to the output stream.
* @throws IOException
* If there is a problem writing out the lineseparator to the document.
*/
protected void writeLineSeparator( ) throws IOException
{
output.write(getLineSeparator());
}
writeString can be overridden to channel the text information into separate members (e.g. if you might want a result in a more structured format than a mere String) or it can be overridden to simply add some extra information into the result String.
writeLineSeparator can be overridden to trigger some specific output between lines.
There are more methods which can be overridden but you are less likely to need them in general.
In the case at hand
I'm using PDFBox to extract information from a pdf, and the information I'm currently trying to find is related to the x-position of the first character in the line.
This can be implemented as follows (simply adding the information at the start of each line):
PDFTextStripper stripper = new PDFTextStripper()
{
@Override
protected void startPage(PDPage page) throws IOException
{
startOfLine = true;
super.startPage(page);
}
@Override
protected void writeLineSeparator() throws IOException
{
startOfLine = true;
super.writeLineSeparator();
}
@Override
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
if (startOfLine)
{
TextPosition firstPosition = textPositions.get(0);
writeString(String.format("[%s]", firstPosition.getXDirAdj()));
startOfLine = false;
}
super.writeString(text, textPositions);
}
boolean startOfLine = true;
};
String text = stripper.getText(document);
(ExtractText.java method extractLineStart tested by testExtractLineStartFromSampleFile)