extracting names of the images embedded inside a pdf - java

I have a pdf document containing several images.
I want to retrieve names of these images.
How to achieve this using either iText or pdfbox?
I know that ExtractImages extracts images from a PDF, and I suspect it also has some way to fetch the name of each image. However, I don't know how to use ExtractImages.
The actual reason I need the image names is to compress these images and so reduce the size of the PDF. Is my approach correct?

What you can get with pdfbox is the key of the image and its suffix (type). You can also save that image.
String prefix = new File(pdfFilename).getName();
prefix = prefix.substring(0, prefix.indexOf(".pdf"));
PDDocument document = null;
try
{
    // the non-sequential parser is the better choice
    document = PDDocument.loadNonSeq(new File(pdfFilename), null);
    List<PDPage> pages = document.getDocumentCatalog().getAllPages();
    System.out.println(pdfFilename + ": Total pages: " + pages.size());
    int p = 0;
    for (PDPage page : pages)
    {
        ++p;
        PDResources resources = page.getResources();
        Map<String, PDXObjectImage> imageResources = resources.getImages();
        for (String key : imageResources.keySet())
        {
            PDXObjectImage objectImage = imageResources.get(key);
            System.out.printf("image key '%s': %d x %d, type %s%n", key, objectImage.getHeight(), objectImage.getWidth(), objectImage.getSuffix());
            // write that image
            String fname = String.format("%s-%04d-%s", prefix, p, key);
            objectImage.write2file(fname);
        }
    }
}
// add a catch block here if IOException should be handled locally
finally
{
    if (document != null)
    {
        document.close();
    }
}
However, this won't help you unless you are sure that all these images were converted directly to PDF, i.e. without rotation, translation or scaling. If you need that information, have a look at the PrintImageLocations.java example in the PDFBox source download.

Related

How to Extract Diagonal watermark from pdf using PDFBOX and Extract Text by maintaining alignment

How can I extract diagonal watermark text from PDF using PDFBox ?
After referring to ExtractText's rotationMagic option, I am now extracting vertical and horizontal watermarks but not diagonal. This is my code so far.
class AngleCollector extends PDFTextStripper {
private final Set<Integer> angles = new TreeSet<>();
AngleCollector() throws IOException {}
Set<Integer> getAngles() {
return angles;
}
@Override
protected void processTextPosition(TextPosition text) {
int angle = ExtractText.getAngle(text);
angle = (angle + 360) % 360;
angles.add(angle);
}
}
class FilteredTextStripper extends PDFTextStripper {
FilteredTextStripper() throws IOException {
}
@Override
protected void processTextPosition(TextPosition text) {
int angle = ExtractText.getAngle(text);
if (angle == 0) {
super.processTextPosition(text);
}
}
}
final class ExtractText {
static int getAngle(TextPosition text) {
//The Matrix containing the starting text position
Matrix m = text.getTextMatrix().clone();
m.concatenate(text.getFont().getFontMatrix());
return (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
}
private List<String> getAnnots(PDPage page) throws IOException {
List<String> returnList = new ArrayList<>();
for (PDAnnotation pdAnnot : page.getAnnotations()) {
if(pdAnnot.getContents() != null && !pdAnnot.getContents().isEmpty()) {
returnList.add(pdAnnot.getContents());
}
}
return returnList;
}
public void extractPages(int startPage, int endPage, PDFTextStripper stripper, PDDocument document, Writer output) {
for (int p = startPage; p <= endPage; ++p) {
stripper.setStartPage(p);
stripper.setEndPage(p);
try {
PDPage page = document.getPage(p - 1);
for (var annot : getAnnots(page)) {
output.write(annot);
}
int rotation = page.getRotation();
page.setRotation(0);
var angleCollector = new AngleCollector();
angleCollector.setStartPage(p);
angleCollector.setEndPage(p);
angleCollector.writeText(document, output);
for (int angle : angleCollector.getAngles()) {
// prepend a transformation
try (var cs = new PDPageContentStream(document, page,
PDPageContentStream.AppendMode.PREPEND, false)) {
cs.transform(Matrix.getRotateInstance(-Math.toRadians(angle), 0, 0));
}
stripper.writeText(document, output);
// remove prepended transformation
((COSArray) page.getCOSObject().getItem(COSName.CONTENTS)).remove(0);
}
page.setRotation(rotation);
} catch (IOException ex) {
System.err.println("Failed to process page " + p + ": " + ex);
}
}
}
}
public class pdfTest {
private pdfTest() {
}
public static void main(String[] args) throws IOException {
var pdfFile = "test-resources/pdf/pdf_sample_2.pdf";
Writer output = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
var etObj = new ExtractText();
var rawDoc = PDDocument.load(new File(pdfFile));
PDFTextStripper stripper = new FilteredTextStripper();
if(rawDoc.getDocumentCatalog().getAcroForm() != null) {
rawDoc.getDocumentCatalog().getAcroForm().flatten();
}
etObj.extractPages(1, rawDoc.getNumberOfPages(), stripper, rawDoc, output);
output.flush();
}
}
Edit 1:
I am also unable to detect form (Acro, XFA) field contents via TextExtractor code with correct Alignment. How can I do that ?
I am attaching the sample PDFs for references.
Sample PDF 1
Sample PDF 2
I require the following things using PDFBox:
Diagonal text detection. (including watermarks).
Form fields extraction by maintaining Proper alignment.
In your "question" you actually ask multiple distinct questions. I'll look into each of them. The answers will be less specific than you'd probably wish because your questions are based on assumptions that are not all true.
"How can I extract diagonal watermark text from PDF using PDFBox ?"
First of all, PDF text extraction works by inspecting the instructions in content streams of a page and contained XObjects, finding text drawing instructions therein, taking the coordinates and orientations and the string parameters thereof, mapping the strings to Unicode, and arranging the many individual Unicode strings by their coordinates and orientations in a single content string.
In case of PDFBox the PDFTextStripper as-is does this with a limited support for orientation processing, but it can be extended to filter the text pieces by orientation for better orientation support as shown in the ExtractText example with rotation magic activated.
double_watermark.pdf
In case of your double_watermark.pdf example PDF, though, the diagonal text "Top Secret" is not created using text drawing instructions but instead path construction and painting instructions, as Tilman already remarked. (Actually the paths here all are sequences of very short lines, no curves are used, which you can see using a high zoom factor.) Thus, PDF text extraction cannot extract this text.
To answer your question
How can I extract diagonal watermark text from PDF using PDFBox ?
in this context, therefore: You can not.
(You can of course use PDFBox as a PDF processing framework on top of which you collect the paths yourself and try to match them to characters, but that would be a larger project in itself. Or you can use PDFBox to render the pages as bitmaps and apply OCR to those bitmaps.)
"I am also unable to detect form (Acro, XFA) field contents via TextExtractor code with correct Alignment. How can I do that ?"
Form data in AcroForm or XFA form definitions are not part of the page content streams or the XObject content streams referenced from therein. Thus, they are not immediately subject to text extraction.
AcroForm forms
AcroForm form fields are abstract PDF data objects which may or may not have associated content streams for display. To include them into the content streams text extraction operates on, you can first flatten the form. As you mentioned in your own answer, you also have to activate sorting to extract the field contents in context.
Beware, PDF renderers do have certain freedoms when creating the visualization of a form field. Thus, text extraction order may be slightly different from what you expect.
XFA forms
XFA form definitions are a cuckoo's egg in PDF. They are XML streams which are not related to regular PDF objects; furthermore, XFA in PDFs was deprecated a number of years ago. Thus, most PDF libraries don't support XFA forms.
PDFBox only allows you to extract or replace the XFA XML stream. Thus, there is no immediate support for XFA form contents during text extraction.
Form fields extraction by maintaining Proper alignment.
This is solved by calling setSortByPosition(true) on the PDFTextStripper.
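To make the idea of arranging text pieces by their coordinates more concrete, here is a minimal sketch in plain Java. It is not PDFBox code and not what PDFTextStripper does internally: the `Chunk` class, the `readingOrder` method and the top-left coordinate convention are all assumptions made for illustration. It only shows why flattened field values, which tend to be drawn after the static page text, end up next to their labels once the pieces are sorted by position.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; this is not PDFBox code. */
final class PositionSortSketch {

    /** A piece of text with the coordinates of its start (assumed origin: top-left, y grows downwards). */
    static final class Chunk {
        final double x;
        final double y;
        final String text;

        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Orders chunks top to bottom, then left to right, and joins them with single spaces. */
    static String readingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.<Chunk>comparingDouble(c -> c.y).thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        for (Chunk c : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }
}
```

The sketch compares y values exactly; a real implementation has to tolerate small baseline differences within a line, which is one reason the extracted order can still differ slightly from what you expect.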

PDF Box: extract images from PDF document and keeping the image orientation

I found some pretty good solutions in this forum for extracting images from PDF documents using PDFBox. I used the following code snippet, which I found in one post:
PDPageTree list = document.getPages();
for (PDPage page : list) {
PDResources pdResources = page.getResources();
for (COSName c : pdResources.getXObjectNames()) {
try {
PDXObject imageObj = pdResources.getXObject(c);
if (imageObj instanceof PDImageXObject) {
// save image to list
BufferedImage bImage = ((PDImageXObject) imageObj).getImage();
acceptedImages.add(bImage);
}
} catch (MissingImageReaderException mex) {
log.warn("Missing Image Reader for format: ", mex);
}
}
}
But I have the problem that, in rare cases, some extracted images have the wrong orientation. When I look at the PDF document, the pictures are displayed correctly, but some of the extracted images are rotated by n × 90°. I guess the rotation information is stored somewhere in the PDF?
Run the PrintImageLocations.java example from the source code download (or here) and analyse the CTM ("current transformation matrix") to extract the rotation with Math.round(Math.toDegrees(Math.atan2(ctmNew.getShearY(), ctmNew.getScaleY()))).
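The angle formula above can be tried out without PDFBox. The following is a minimal sketch that uses `java.awt.geom.AffineTransform` as a stand-in for the PDFBox `Matrix`; that substitution, the class name `CtmRotation` and the normalisation to the range 0..359 are assumptions made for this illustration. The accessors `getShearY()` and `getScaleY()` are used on the assumption that they address the same matrix entries in both classes.

```java
import java.awt.geom.AffineTransform;

/** Hypothetical helper: derives the rotation of an image from its transformation matrix. */
final class CtmRotation {

    /** Returns the rotation in whole degrees, normalised to 0..359. */
    static int rotationDegrees(AffineTransform ctm) {
        // Same formula as in the answer: atan2(shearY, scaleY)
        double degrees = Math.toDegrees(Math.atan2(ctm.getShearY(), ctm.getScaleY()));
        // atan2 yields values in -180..180, so shift negative angles into 0..359
        return (int) ((Math.round(degrees) + 360) % 360);
    }

    public static void main(String[] args) {
        // An image scaled to 200 x 100 units and rotated by 90 degrees
        AffineTransform ctm = AffineTransform.getRotateInstance(Math.toRadians(90));
        ctm.scale(200, 100);
        System.out.println(rotationDegrees(ctm));
    }
}
```

Because an image CTM usually scales width and height differently, this formula is only reliable for rotations by multiples of 90°, which is the case described in the question. With the angle in hand, the extracted BufferedImage can be rotated accordingly.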

iTextPDF(Java/Android): How to convert Image to bitmap [duplicate]

Please let me know what method can be used to convert pdf to image in iText7.
In iTextSharp, there was an option to convert a PDF file to images. Following is the link: PDF to Image Using iTextSharp
http://www.c-sharpcorner.com/UploadFile/a0927b/create-pdf-document-and-convert-to-image-programmatically/
Below is the sample code, created using the following reference link:
itext7 pdf to image
This is not working as expected. It does not convert the PDF to an image; it creates a 1 KB blank image.
string fileName = System.IO.Path.GetFileNameWithoutExtension(inputFilePath);
var pdfReader = new PdfReader(inputFilePath);
var pdfDoc = new iText.Kernel.Pdf.PdfDocument(pdfReader);
int pagesLength = pdfDoc.GetNumberOfPages()+1;
for (int i = 1; i < pagesLength; i++)
{
if (!File.Exists(System.IO.Path.Combine(imageFileDir, fileName + "_" +
(startIndex + i) + ".png")) && i < pagesLength)
{
PdfPage pdfPages = pdfDoc.GetPage(i);
PdfWriter writer = new PdfWriter(System.IO.Path.Combine(imageFileDir, fileName + "_" + (startIndex + i) + ".png"), new WriterProperties().SetFullCompressionMode(true));
PdfDocument pdf = new PdfDocument(writer);
PdfFormXObject pageCopy = pdfPages.CopyAsFormXObject(pdf);
iText.Layout.Element.Image image = new iText.Layout.Element.Image(pageCopy);
}
}
Quoting Bruno:
iText does not convert PDFs to raster images (such as .jpg, .png,...). You are misinterpreting the examples that create an Image instance based on an existing page. Those examples create an XObject that can be reused in a new PDF as if it were a vector image; they don't convert a PDF page to a raster image.
What you can use for this (which is what we at iText internally use for testing) is GhostScript. It takes a pdf as input and converts it to a series of images (one image per page).
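As a rough sketch of the Ghostscript route from Java: the helper below only assembles the command line, and the caller decides how to run it. The class name `GhostscriptCommand`, the executable name, the file names and the resolution are assumptions; on Windows the console executable is typically called `gswin64c` rather than `gs`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds a Ghostscript command for rendering a PDF to PNG files. */
final class GhostscriptCommand {

    static List<String> build(String gsExecutable, String inputPdf, String outputPattern, int dpi) {
        List<String> cmd = new ArrayList<>();
        cmd.add(gsExecutable);
        cmd.add("-dNOPAUSE");                     // do not pause after each page
        cmd.add("-dBATCH");                       // exit when the input is processed
        cmd.add("-sDEVICE=png16m");               // 24-bit colour PNG output
        cmd.add("-r" + dpi);                      // rendering resolution in dpi
        cmd.add("-sOutputFile=" + outputPattern); // e.g. page-%03d.png, one file per page
        cmd.add(inputPdf);
        return cmd;
    }
}
```

To run it, pass the list to a ProcessBuilder, for example `new ProcessBuilder(GhostscriptCommand.build("gs", "input.pdf", "page-%03d.png", 150)).inheritIO().start().waitFor()`, and check the exit code. This requires Ghostscript to be installed and on the path.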

How to identify tables, images and lists in a PDF file using Java?

I am new to Java programming. I need to extract each and every table and image from the source PDF. I tried extracting text with PDFBox, but I get only the text and its properties. How can I identify tables, images, lists, etc. using a Java program?
Is it possible to identify these in PDF files?
I am using PDFBox; any ideas on how to proceed?
The code below can be used to extract images:
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while( iter.hasNext() )
{
PDPage page = (PDPage)iter.next();
PDResources resources = page.getResources();
Map images = resources.getImages();
if( images != null )
{
Iterator imageIter = images.keySet().iterator();
while( imageIter.hasNext() )
{
String key = (String)imageIter.next();
PDXObjectImage image = (PDXObjectImage)images.get( key );
String name = getUniqueFileName( key, image.getSuffix() );
System.out.println( "Writing image:" + name );
image.write2file( name );
}
}
}
See here for a similar issue.

Can't get img tags

I have a question about getting images from a web page using Jsoup in Java.
Here is the code:
String link = "http://truyentranhtuan.com/detective-conan/856/doc-truyen/";
Document docs = Jsoup.connect(link).timeout(60000).get();
Elements comics = docs.select("#hienthitruyen img");
System.out.println(comics.size());
int i = 0; // the counter must live outside the loop, otherwise every image is written to 0.jpg
for (Element comic : comics) {
System.out.println(comic);
String linkImage = comic.attr("src");
if (!"".equals(linkImage)) {
URL url = new URL(linkImage);
BufferedImage image = ImageIO.read(url);
ImageIO.write(image, "jpg", new File(i + ".jpg"));
i++;
}
}
The problem is that I can't get any img tag from this web page. The size of Elements is always zero.
But when I view the source of this web page, the img tags are always there.
If you look at the real source, not the DOM structure (for example, save the HTML page and open it in Notepad), you will see that there are no img tags there. They are all populated dynamically by means of JavaScript.
The problem is that Jsoup is not meant to execute JavaScript, so you can only parse the original DOM structure, before it is modified (filled with images) by JavaScript.
To do what you want, you can use HtmlUnit, which can execute most JavaScript.
