I am trying to extract images from a PDF. PDFBox is able to extract images from most PDFs, but there are some PDFs whose images it does not extract.
For extracting the images I am using the following code:
Not able to extract images from a PDF/A-1a format document
You can download a sample pdf with this problem from this link :
http://myslams.com/test/2.pdf
Is there something wrong with the code, maybe something I forgot to handle, or is there something wrong with the PDF altogether?
As the OP has not yet replaced his stale sample PDF link by a working one, the question can only be answered in general terms.
The code referenced by the OP (with the corrections in the answer of @Tilman) iterates the immediate image resources of each page and stores the respective files.
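In the PDFBox 1.8.x line used by the OP, such a resource-iterating loop typically looks like the following sketch (the output file naming here is my own assumption):

```java
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public class ExtractImages {
    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load("input.pdf");
        try {
            List<?> pages = document.getDocumentCatalog().getAllPages();
            int pageNum = 0;
            for (Object p : pages) {
                pageNum++;
                PDPage page = (PDPage) p;
                // only the image resources immediately attached to this page
                Map<String, PDXObjectImage> images = page.getResources().getImages();
                for (Map.Entry<String, PDXObjectImage> e : images.entrySet()) {
                    // write2file appends a suitable suffix (e.g. ".jpg" or ".png") itself
                    e.getValue().write2file("page" + pageNum + "_" + e.getKey());
                }
            }
        } finally {
            document.close();
        }
    }
}
```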
Thus, the code may store too many images because image resources of a page may not necessarily be used on the page in question:
On the one hand, an image resource may not be used at all in the file, or at least nowhere visibly; it may merely be a left-over from some prior PDF editing session.
On the other hand, multiple pages may share a resources dictionary containing the images of all these pages; in this case the OP's code exports many duplicates.
And the code may store too few images because there are other places where images may be put:
Image data may be directly included in the page content stream, aka inline images.
Constructs with their own resources (form XObjects, patterns, Type 3 font glyphs) used from the page content may provide their own image resources or inline images.
Annotations, e.g. AcroForm form fields, may also have their own appearance streams with their own resources and, therefore, may provide their own image resources or inline images, too.
XFA forms may provide their own images, too.
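To also catch image resources hidden in nested constructs, the per-page iteration can be extended to recurse into form XObjects (annotation appearance streams can be walked analogously). A sketch against the PDFBox 1.8.x API; the helper name and prefixing scheme are mine:

```java
import java.io.IOException;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public class RecursiveImageCollector {
    // Collects image XObjects from the given resources and from any
    // form XObjects nested inside them, prefixing names to avoid clashes.
    static void collect(PDResources resources, String prefix,
                        Map<String, PDXObjectImage> out) throws IOException {
        if (resources == null) {
            return;
        }
        Map<String, PDXObject> xObjects = resources.getXObjects();
        if (xObjects == null) {
            return;
        }
        for (Map.Entry<String, PDXObject> e : xObjects.entrySet()) {
            if (e.getValue() instanceof PDXObjectImage) {
                out.put(prefix + e.getKey(), (PDXObjectImage) e.getValue());
            } else if (e.getValue() instanceof PDXObjectForm) {
                // form XObjects carry their own resources dictionary; recurse
                PDXObjectForm form = (PDXObjectForm) e.getValue();
                collect(form.getResources(), prefix + e.getKey() + "_", out);
            }
        }
    }
}
```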
As soon as the OP provides a representative sample file, the type of images he misses can be determined and a specific solution may be outlined.
EDIT
According to a comment by the OP, his image extraction problems have been resolved by making use of the information from this answer to his question "pdfbox and itext extracting image with incorrect dpi". Especially the pointer to example code appropriate for the PDFBox version 1.8.8 used by the OP seems to have been important.
Thus, wrong output may also be the result of orchestration issues in the software stack, e.g. running example code written for a different version of the library.
Related
Starting with an apology if I am breaking some process here.
I am aware that there is a question with exactly the same problem
"PDFBox returns missing descendant font dictionary", but the thread ends abruptly because, unfortunately, the author wasn't able to give the details. Also, due to low reputation, I wasn't able to continue that thread.
It states the problem of the missing composite font very well. I wanted to know if there is some way to fix it, since the PDF opens fine in our browser but we are not able to deal with it programmatically.
I tried it on a variety of versions, including the latest, 2.0.21.
I will share the PDF.
Looking forward to your reply, @mkl, @Tilman Hausherr.
Please let me know if you need more details.
My code trying to convert the PDF to images:
PDDocument document = PDDocument.load(new File(pdfPath, fileName));
PDFRenderer pdfRenderer = new PDFRenderer(document);
for (int page = 0; page < document.getNumberOfPages(); ++page) {
    BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
    // actually persist the rendered page; the original discarded bim
    ImageIO.write(bim, "png", new File(pdfPath, fileName + "-" + (page + 1) + ".png"));
}
document.close(); // avoid leaking the file handle
Having downloaded the file when the link was available, I analyzed it.
Adobe Acrobat Reader shows error messages when opening the document. iText RUPS reports cross reference issues. First impression, therefore: That PDF is broken.
Nonetheless I looked closer but the result of that closer look was not better...
According to the cross references and trailers the PDF should contain 58 indirect objects with IDs 1 through 58. It turned out, though, that objects 32 through 49 are missing albeit most of them are referenced, some as descendant fonts. This explains why PDFBox reports missing descendant fonts.
Furthermore, objects 50 through 57 and 1 through 10 are not at the locations they should be according to the cross reference tables. Also the second cross reference table is at a wrong location and the file length is incorrect according to the linearization dictionary.
The way this is broken leaves the impression that the file is a mix of two slightly different versions of the same file: as if a download of the file was attempted, interrupted at some point, and continued from a newer version of the file; or as if some PDF processor changed the file and tried to save the changed copy over the original but was interrupted.
Summarized: The PDF is utterly broken.
If a PDF processor tries to repair it, you cannot be sure which version of the file the recovered information will come from; different PDF processors (if they can make sense of the file at all) are likely to interpret it differently.
If possible, you should reject the file and request a non-broken version of it.
If not possible, copy the data from a viewer that appears to best repair it, manually check the copy for accuracy, and then check the whole extracted data for plausibility in regard to other information you have on the accounts in question. A little prayer won't hurt either.
I have tried with PDFTextStripperByArea and PDPageContentStream classes to extract the number values from my pdf file. They work fine!
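For reference, the region-based extraction that works for me looks roughly like this (the file name, region name and coordinates are placeholders):

```java
import java.awt.geom.Rectangle2D;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public class ReadRegion {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("invoice.pdf"))) {
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            stripper.setSortByPosition(true);
            // x, y, width, height in PDF user-space units (placeholder values)
            stripper.addRegion("grossPremium", new Rectangle2D.Double(400, 650, 150, 20));
            stripper.extractRegions(document.getPage(0));
            System.out.println(stripper.getTextForRegion("grossPremium"));
        }
    }
}
```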
But my requirement is to use a PDFTable or PDFTableExtractor class to read the PDF contents. Can you tell me which Maven dependency and JAR file I need in order to access those classes?
Also mention the required methods to get the values from a particular position.
I have another doubt. Can we extract table-formatted data from a PDF file as it is? I mean the data in rows and columns with table lines. If a page contains some text and a table, can we read only the table headers and the rows? I have uploaded my page on GitHub. Click here! From that image, I only need the values of Gross premium, GST and Total Payable. Please let me know whether it's possible.
First, don't use classes from the com.lowagie packages.
That code is old, obsolete and no longer supported. Furthermore, it belonged to a very early version of iText.
Afterwards, a thorough investigation was done into the intellectual property rights of all the code (since iText has had a lot of contributors). When you use the old code, you may (unknowingly) be using code for which you do not have the copyright.
Second, if you just want to solve the problem of extracting numbers and tables from a PDF document, have a look at pdf2Data. It's an iText add-on that makes things a lot easier.
It gives you a nice UI where you can build templates for data extraction. Then you can call a single method to match an existing (XML) template against an input PDF document, and you get a data structure that contains all the information about the match.
http://pdf2data.online/
PDFTable
I have found two PdfPTable classes:
com.lowagie.text.pdf.PdfPTable
com.itextpdf.text.pdf.PdfPTable
Documentation for both of these classes (this may help you learn the methods you need):
https://www.coderanch.com/how-to/javadoc/itext-2.1.7/com/lowagie/text/pdf/PdfPTable.html
http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/PdfPTable.html
If you want to use these classes, you can copy the dependency into your pom.xml from:
https://mvnrepository.com/artifact/com.itextpdf/itextpdf
https://mvnrepository.com/artifact/com.lowagie/itext - as mentioned at this link, this artifact was moved to com.itextpdf
Examples of how to use these classes can be found here:
https://developers.itextpdf.com/examples/itext-action-second-edition/chapter-4
https://www.programcreek.com/java-api-examples/index.php?api=com.lowagie.text.pdf.PdfPTable
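Note that both PdfPTable classes are for building tables when creating a PDF, not for reading tables out of an existing one. A minimal creation example against the maintained com.itextpdf artifact (iText 5 API; file name and cell values are placeholders):

```java
import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfPTable;
import com.itextpdf.text.pdf.PdfWriter;

public class TableExample {
    public static void main(String[] args) throws Exception {
        Document document = new Document();
        PdfWriter.getInstance(document, new FileOutputStream("table.pdf"));
        document.open();
        PdfPTable table = new PdfPTable(2); // two columns
        table.addCell("Gross premium");     // cells are added row by row
        table.addCell("100.00");
        table.addCell("GST");
        table.addCell("7.00");
        document.add(table);
        document.close();
    }
}
```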
I have a PDF that contains placeholders like <%DATE_OF_BIRTH%>. I want to be able to read in the PDF and change the placeholder values to text using iText.
So: read in the PDF, use maybe a replaceString() method to change the placeholders, then generate the new PDF.
Is this possible?
Thanks.
The use of placeholders in PDF is very, very limited. Theoretically it can be done, and there are some instances where it would be feasible to do what you say, but because PDF knows very little about structure, it's hard:
Simply extracting words is difficult, so recognizing your placeholders in the PDF would already be hard in many cases.
Replacing text in PDF is a nightmare because PDF files generally don't have a concept of words, lines and paragraphs. Hence, for example, no nice reflow of text.
Like I said, it could theoretically work under special conditions, but it's not a very good solution.
What would be a better approach depends on your use case:
1) For some forms it may be acceptable to have the complete form as a background image or PDF file and then generate your text as an overlay to that background (filling in the blanks, so to speak). As pointed out by Bruno and mlk in the comments, in this case you can also look into using form fields, which can be filled dynamically.
2) For other forms it may be better to have your template in a structured format such as XML or HTML, do the text replacement in that format and then convert it into PDF.
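If you go the form-field route from option 1, filling the fields with iText 5 takes one call per field. A sketch, where the file names and the field name are assumptions:

```java
import java.io.FileOutputStream;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

public class FillForm {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("template.pdf");
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("filled.pdf"));
        AcroFields fields = stamper.getAcroFields();
        fields.setField("DATE_OF_BIRTH", "1990-01-01"); // hypothetical field name
        stamper.setFormFlattening(true); // bake the values into the page content
        stamper.close();
        reader.close();
    }
}
```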
First off, let me thank the SO community for helping me so many times in the past; you guys are an amazing resource!
At my job I work on a web application that uses PDF templates created in Scribus and the iText Java library to populate the templates with data from our database. Sometimes, a user-supplied field is required and not touched by iText. When the .pdf is downloaded, a field is edited, and a copy is saved with Evince, the resulting file will not display the edited text upon reopening. However, upon focus of an edited field, it will show the saved text. Unfocus, and the text disappears. Cut the text and paste it back into the field; it stays visible - until you save and reopen the document. After save and reopen, the original problem manifests. I've found many extremely similar posts regarding this problem, but none of their solutions seem to work for me.
Also, the problem is quirky. If I open the Scribus template (the .pdf file untouched by iText) with Evince, then edit fields and save, they will show up properly upon reopen. Once the library touches the template, however, the problem occurs. Similarly, I can reproduce the issue with PDF files I have found while searching for the cause of this problem; like this one:
http://www.quask.com/samples/pdfforms/pcpurchase.pdf
This leads me to believe that the misbehaving files may be corrupted in some way and that iText may be the cause of my problem, but iText isn't the only avenue by which I can reproduce the issue, so I'm not sure what to think. I can't seem to find a working solution among the many I've seen. Is anybody familiar enough with this issue to tell me how I can get to the bottom of this, or offer some insight regarding the tools I'm using? Chances are good that if you search for the issue using Google, I've seen it.
I'm using Ubuntu 12.04 (Precise), Evince 3.4.0, iText 2.1.5, and can try to fill you in on any other relevant details upon request. I'm apprehensive to post any code as I'm not sure it is kosher, and it works fine for constructing forms except for this particular issue; besides, I can reproduce the problem without the use of our webapp.
This is my first post here, and I am a novice programmer (still in school!) so please do let me know if I have violated any conventions or could improve my future inquiries in any way.
Thanks for any help you can offer!
An inspection of the files supplied by jbowman in the comments to his question, with special regard to the password field (which is one of the fields eventually filled in by Evince), shows:
Template.pdf
is the original form which was generated by Scribus PDF Library 1.4.1.svn;
contains an AcroForm with 9 fields and the flag NeedAppearances set to true;
has the password field (named passwordField) which contains an empty value and a normal appearance stream painting a rectangle with an empty text.
after_itext.pdf
is the original form edited by iText 2.1.5, unfortunately not in append mode which would have made analysis easier;
contains an AcroForm with 8 fields (the member number field has been filled in and flattened) without a NeedAppearances flag;
has the password field (now named passwordField:u4woYY1FK9) with value and appearances left untouched.
after_itext_edited.pdf
is the form formerly edited by iText, now edited by some other software (Evince) in append mode;
contains an AcroForm with 8 fields without a NeedAppearances flag; the only changes have been made to the fields passwordField:u4woYY1FK9 and memberPrefix:u4woYY1FK9:
has the password field (named passwordField:u4woYY1FK9) with a new associated value asdf but has left its appearances untouched;
has the member prefix field (named memberPrefix:u4woYY1FK9) with a new associated value asdf but has left its appearances untouched.
Thus, the observed behavior, that the value by default is not shown, is to be expected:
The final Acroform has no NeedAppearances flag. This flag is defined in the specification ISO 32000-1:2008 as:
A flag specifying whether to construct appearance streams and
appearance dictionaries for all widget annotations in the document
(see 12.7.3.3, “Variable Text”). Default value: false.
Thus, your PDF document in its final form says: no appearances for widgets (e.g. AcroForm field visualizations) need to be generated; take the appearances from the document.
The appearance of the password field from the document is the original one, the rectangle with the empty text.
So you see this empty rectangle.
When you click into the field, the PDF viewer prepares for editing its contents and therefore displays the value as it sees fit.
If editing PDF files with Evince is intended to have visible results, Evince must, upon changing the value of a field, also add an updated appearance stream or make sure the AcroForm NeedAppearances flag is set. This is where Evince failed.
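A processor that does not want to generate appearance streams itself can instead set that flag, so that viewers rebuild the appearances from the field values. With PDFBox 2.x, for instance, this could look like the following sketch (file names are placeholders):

```java
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;

public class SetNeedAppearances {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("form.pdf"))) {
            PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
            if (acroForm != null) {
                // tell viewers to (re)build widget appearances from the field values
                acroForm.setNeedAppearances(true);
            }
            document.save("form-needappearances.pdf");
        }
    }
}
```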
I have accepted mkl's answer as it hits the nail on the head regarding why the fields do not display properly, and contains much more information than I can provide regarding the issue. However, the suggested fix in the answer's comments did not work because the documents are generated (in this particular case) using iText 2.1.5's PdfCopyFields, which does not respect (it strips) the original document's NeedAppearances flag; because of this, calling setNeedAppearances(true) on the AcroForm did not solve the issue.
Hacking the createAcroForms() method in PdfCopyFieldsImp to include the line
form.put(PdfName.NEEDAPPEARANCES, PdfBoolean.PDFTRUE);
is what ultimately seems to have solved the issue for me. With this addition, evince properly displays changes to fields after saving and reopening the document.
I have a use case in which I need to render an unformatted text in the format of a given web page programmatically in Java. i.e. The text should automatically be formatted like the web page with styles, paragraphs, bullet points etc.
As I see first I will have to analyze the piece of unformatted text to find out the candidates for paragraphs, bullet points, headings etc. I intend to use Lucene analyzers/tokenizers for this task. Are there any alternatives?
The second problem is to convert the formatted web page into some kind of template (e.g. velocity template) with place holders for various entities like titles, bullet points etc.
Is there any text analysis/templating library in Java that can help me do this? Preferably open source.
Are there any other suggestions for doing this sort of task in a better way in Java?
Thanks for your help.
There are a lot of hard parts to what you're doing.
The user input
If you don't ask your user to provide any context, you're never going to guess the structure of the text. At least, you should ask them to provide a title and a series of paragraphs in your GUI.
Ideally, you could ask them to follow a well-known markup language (Markdown, Textile, etc.) and use an open-source parser to extract the structure.
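As an illustration (the choice of library is mine), the open-source commonmark-java parser (org.commonmark:commonmark) exposes exactly this structure through its parse tree:

```java
import org.commonmark.node.AbstractVisitor;
import org.commonmark.node.Heading;
import org.commonmark.node.Node;
import org.commonmark.parser.Parser;

public class ExtractStructure {
    public static void main(String[] args) {
        Parser parser = Parser.builder().build();
        Node document = parser.parse("# Title\n\nA paragraph.\n\n* a bullet\n");
        document.accept(new AbstractVisitor() {
            @Override
            public void visit(Heading heading) {
                // the heading level distinguishes title, subtitle, etc.
                System.out.println("heading, level " + heading.getLevel());
                visitChildren(heading);
            }
        });
    }
}
```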
The external page
If an arbitrary page is used, the only thing you can rely on is the structural markup. So, assuming you know the title of the page should be "Hello World" and there is an "h1" element somewhere in the page, you can maybe assume that this is where the header could go.
But if the page is a div tag-soup, and only CSS is used to differentiate the rendering of the header from the bulk of the text, you're going to have to guess how the styling is done: that's plain impossible if you don't know how the page is made.
I don't think Lucene would help for this (as far as I know, Lucene is made to create an index of the words used in a bulk of text; I don't think it can help you guess which part of the text is meant to be a title, a subtitle, etc.).
Generating templates from external page
Assuming you have "guessed" right, you could generate the content by
copy-pasting the page
replacing the parts to change with tags of your template language of choice
storing the template somewhere the templating system can access it
configuring your template/view system (a viewResolver for Velocity) to use the right template for the right person
That would of course pose terrible legal questions, since your templates would incorporate works by the original website author (most probably copyrighted material)
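The replace-and-render steps above, with Velocity as the template language of choice, would amount to something like this sketch (the inline template string stands in for a copied page):

```java
import java.io.StringWriter;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class RenderTemplate {
    public static void main(String[] args) {
        VelocityEngine engine = new VelocityEngine();
        engine.init();
        VelocityContext context = new VelocityContext();
        context.put("title", "Hello World");
        StringWriter out = new StringWriter();
        // last argument is the template source; inline here for brevity,
        // normally loaded from wherever the templating system can access it
        engine.evaluate(context, out, "example", "<h1>$title</h1>");
        System.out.println(out);
    }
}
```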
A more realistic solution
I would suggest you constrain your problem to :
using input that has some structure information available (use a GUI to enter it, use a markup language, whatever)
using templates that you provide, know the structure of (and can reuse very easily)
Note that none of those points are related to the template system.
Otherwise, I'm afraid you're heading for an unreasonable amount of work...