Missing descendant font dictionary - java

Apologies in advance if I am breaking some process here.
I am aware that there is already a question about exactly the same problem,
"PDFBox returns missing descendant font dictionary", but that thread ends abruptly because the author was unfortunately unable to provide the details. Because of my low reputation I could not continue that thread.
That question describes the problem of the missing composite font very well. I would like to know whether there is some way to fix it, since the PDF opens fine in our browser but we cannot process it programmatically.
I tried a variety of PDFBox versions, including the latest, 2.0.21.
I will share the PDF.
Looking forward to hearing from you,
@mkl, @Tilman Hausherr
Please let me know if you need more details.
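My code, which tries to convert the PDF to images:
// Needs java.io.File, java.awt.image.BufferedImage, org.apache.pdfbox.pdmodel.PDDocument,
// org.apache.pdfbox.rendering.PDFRenderer and org.apache.pdfbox.rendering.ImageType
try (PDDocument document = PDDocument.load(new File(pdfPath, fileName))) { // closes the document when done
    PDFRenderer pdfRenderer = new PDFRenderer(document);
    for (int page = 0; page < document.getNumberOfPages(); ++page) {
        // Render each page at 300 DPI as an RGB image
        BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
    }
}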

Having downloaded the file when the link was available, I analyzed it.
Adobe Acrobat Reader shows error messages when opening the document. iText RUPS reports cross reference issues. First impression, therefore: That PDF is broken.
Nonetheless I looked closer but the result of that closer look was not better...
According to the cross references and trailers the PDF should contain 58 indirect objects with IDs 1 through 58. It turned out, though, that objects 32 through 49 are missing albeit most of them are referenced, some as descendant fonts. This explains why PDFBox reports missing descendant fonts.
Furthermore, objects 50 through 57 and 1 through 10 are not at the locations they should be according to the cross reference tables. Also the second cross reference table is at a wrong location and the file length is incorrect according to the linearization dictionary.
The way this is broken leaves the impression that the file is a mix of two slightly different versions of the same file; as if a download of the file was attempted but interrupted at some point and continued from a new version of the file; or as if some PDF processor somehow changed the file and tried to save the changed copy into the same file but was interrupted.
Summarized: The PDF is utterly broken.
If a PDF processor tries to repair it, you cannot be sure which version of the file the information you get comes from; different PDF processors (if they can make sense of it at all) are likely to interpret the file differently.
If possible, you should reject the file and request a non-broken version of it.
If not possible, copy the data from a viewer that appears to best repair it, manually check the copy for accuracy, and then check the whole extracted data for plausibility in regard to other information you have on the accounts in question. A little prayer won't hurt either.
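The kind of damage described above (objects that are referenced but never defined) can be spotted with a rough pre-check before a file is handed to PDFBox. The following is only a heuristic sketch, not how PDFBox or RUPS work internally: the class name `MissingObjectScan` is made up for illustration, it scans the raw bytes with regular expressions, and it assumes objects are not hidden inside compressed object streams (where they would be falsely reported as missing). Binary stream data may also produce accidental matches.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MissingObjectScan {
    // "12 0 obj" starts the definition of indirect object 12.
    private static final Pattern DEFINITION =
            Pattern.compile("(?<![0-9])(\\d{1,9})\\s+\\d+\\s+obj\\b");
    // "12 0 R" is a reference to indirect object 12.
    private static final Pattern REFERENCE =
            Pattern.compile("(?<![0-9])(\\d{1,9})\\s+\\d+\\s+R\\b");

    /** Returns object numbers that are referenced but never defined in the raw bytes. */
    public static Set<Integer> findMissing(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data cannot break decoding.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Set<Integer> defined = collect(DEFINITION, raw);
        Set<Integer> missing = collect(REFERENCE, raw);
        missing.removeAll(defined);
        return missing;
    }

    private static Set<Integer> collect(Pattern pattern, String raw) {
        Set<Integer> numbers = new TreeSet<>();
        Matcher m = pattern.matcher(raw);
        while (m.find()) {
            numbers.add(Integer.parseInt(m.group(1)));
        }
        return numbers;
    }
}
```

A non-empty result from `MissingObjectScan.findMissing(Files.readAllBytes(path))` on a PDF without object streams would be a reason to reject the file and ask for a fresh copy, as recommended above.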

Related

PDFtk throws a Java Exception when attempting to use 'fill_form' function

I have a PHP application that fills out a form from a database call. At present I am putting this together using PDFtk. I am able to run a number of PDFtk commands with no issue, and I am currently working out the desired command at the command line.
My call is currently this:
pdftk /var/www/html/CSR/template/job_card.pdf fill_form /var/www/html/CSR/template/wwwwu7mMH.fdf output /var/www/html/CSR/template/filled4.pdf
This exact call run multiple times generates this error sometimes:
Unhandled Java Exception in create_output():
java.lang.ClassCastException: pdftk.com.lowagie.text.pdf.PdfNull cannot be cast to pdftk.com.lowagie.text.pdf.PdfDictionary
at pdftk.com.lowagie.text.pdf.FdfReader.readFields(pdftk)
at pdftk.com.lowagie.text.pdf.FdfReader.readPdf(pdftk)
at pdftk.com.lowagie.text.pdf.PdfReader.<init>(pdftk)
at pdftk.com.lowagie.text.pdf.PdfReader.<init>(pdftk)
at pdftk.com.lowagie.text.pdf.FdfReader.<init>(pdftk)
and this error sometimes:
Unhandled Java Exception in create_output():
Unhandled Java Exception in main():
java.lang.NullPointerException
at gnu.gcj.runtime.NameFinder.lookup(libgcj.so.10)
at java.lang.Throwable.getStackTrace(libgcj.so.10)
at java.lang.Throwable.stackTraceString(libgcj.so.10)
at java.lang.Throwable.printStackTrace(libgcj.so.10)
at java.lang.Throwable.printStackTrace(libgcj.so.10)
The error message alternates, but the command never works and the form is never filled. As I said, though, PDFtk works with other commands: I have been able to generate encrypted PDFs and run the fixed commands successfully.
My question is what is causing this error and how do I fix it?
I see my name in the StackTrace. That's not a coincidence: PdfTk is based on a mighty old version of iText. iText is a Java PDF library that was originally written by me, but used by a third party to create PdfTk.
The error tells you that iText is parsing a PDF that has either an error, or an unexpected feature.
A PDF consists of PDF objects such as PDF string objects, PDF number objects, PDF array objects, PDF dictionary objects, PDF stream objects, and so on. iText is able to retrieve these objects and to reuse them to create a new PDF. In your case, a new PDF with some form fields that are filled out is created based on the objects of the original PDF.
It is impossible to answer your question without seeing the PDF that causes the problem, but let's say that your PDF contains an /AcroForm entry with a /Fields array. In this fields array, there is a reference to a field dictionary. Suppose that one of the field dictionaries in your PDF isn't a dictionary, but a PDF null object. The form shows up perfectly in Adobe Reader, but internally, there is a flaw that prevents proper processing of the form.
In that case, iText will loop over the entries in the fields array, and one of those entries won't return a field dictionary, but a PdfNull object. In that case, you'll get a ClassCastException, because you can't cast PdfNull to PdfDictionary.
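The failure mode described here can be illustrated with a small sketch. The types below (`PdfObj`, `NullObj`, `DictObj`) are stand-ins invented for this example; they are not the iText classes and say nothing about how any iText version actually handles the fields array. The point is only the difference between a blind cast and a type check.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldArrayCheck {
    // Stand-in types for illustration only; these are NOT the iText classes.
    interface PdfObj {}
    static final class NullObj implements PdfObj {}
    static final class DictObj implements PdfObj {
        final String name;
        DictObj(String name) { this.name = name; }
    }

    /** Keeps only real dictionaries instead of blindly casting every array entry. */
    static List<DictObj> fieldDictionaries(List<PdfObj> fields) {
        List<DictObj> result = new ArrayList<>();
        for (PdfObj entry : fields) {
            // A blind "(DictObj) entry" would throw ClassCastException for a NullObj entry.
            if (entry instanceof DictObj) {
                result.add((DictObj) entry);
            }
        }
        return result;
    }
}
```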
This being said:
If I see my name in your stack trace, this triggers an alarm, because it means that you're using an iText version that predates iText 5. Such a version should no longer be used. You should use a more recent version of iText. There is a high chance that a more recent version of iText gives you either a better error message, or tolerates (and maybe even fixes) the error in the PDF.
If you find a PdfTk version that uses a more recent version of iText, that would surprise me, because as far as I know, PdfTk isn't available under the AGPL, nor is PDF Labs (the owner of PdfTk) a customer of iText Software.
If you want to keep on using PdfTk, you shouldn't expect an answer as long as you don't share the PDF document that you're trying to fill.
One thing you could try: open the form in Adobe Acrobat. Save the form in Adobe Acrobat. There is a chance that the saved form no longer has the problem. Adobe Acrobat is very tolerant towards errors in PDFs. It tries to fix as many as it can. Then when you save the form, the error is gone.
As it turns out the issue was not as Bruno Lowagie suggested regarding the consistency of the PDF.
I had run out of ideas and just thought I would try generating the FDF a different way. By running the command:
pdftk /full/path/to/template.pdf generate_fdf output /full/path/to/output.fdf
And then inspecting the resulting file, I was able to get a more accurate FDF and then when I ran the fill_form command:
pdftk /full/path/to/template.pdf fill_form /full/path/to/output.fdf output /full/path/to/output.pdf
I got a proper response and everything worked. So the problem I was getting was in fact caused by the FDF being malformed in some way.
My final solution was this, if anyone is interested. It takes a template PDF with fields, generates an FDF to fill it, creates a new PDF by combining the data from the FDF with the template PDF, and redirects the browser to the PDF's location.
Big thanks to Bruno Lowagie for helping understand the system better and rule out a few things.
It looks like PDFtk was not able to process strings that contained the characters ( and ). I replaced them with \( and \) to escape them, and it worked well.
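For anyone generating the FDF from Java, the escaping could look roughly like the sketch below. The class and method names are made up for this example. It escapes the backslash as well as both parentheses, because the backslash is the escape character of PDF literal strings; the `/T (name) /V (value)` layout is the usual FDF field form, but compare it with what `generate_fdf` produces for your own template.

```java
public class FdfEscape {
    /** Escapes backslashes and parentheses for use inside a PDF/FDF literal string "( ... )". */
    public static String escapeLiteral(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 8);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' || c == '(' || c == ')') {
                sb.append('\\'); // prefix the special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds one field entry with an escaped name and value. */
    public static String fieldEntry(String name, String value) {
        return "<< /T (" + escapeLiteral(name) + ") /V (" + escapeLiteral(value) + ") >>";
    }
}
```

With this, a value such as `call (later` ends up in the FDF as `call \(later`, so the unclosed parenthesis no longer terminates or unbalances the string.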
I had the same problem. In my case changing the string encoding solved it.
Previously I was encoding it as UTF-8; then I changed it to utf_16_be.
The root cause is that form field data is stored in the FDF file with each value enclosed in brackets (parentheses), so if your data itself contains brackets, it throws an error.
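One way to sidestep both the encoding and the bracket problem is to write each value as a PDF hex string, which is delimited by angle brackets and needs no escaping. This is a sketch under the assumption that your pdftk build accepts hex strings in FDF values (the PDF syntax allows them, but test it with your version); the class name is hypothetical. The text is encoded as UTF-16BE and prefixed with the FE FF byte order mark that marks a PDF text string as Unicode.

```java
import java.nio.charset.StandardCharsets;

public class FdfHexString {
    /** Encodes text as a PDF hex string: FE FF byte order mark followed by UTF-16BE bytes. */
    public static String toHexString(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE); // no BOM is added by this charset
        StringBuilder sb = new StringBuilder("<FEFF");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }
}
```

A field entry would then be written as `/V <FEFF...>` instead of `/V (...)`.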
Font issue:
https://stackoverflow.com/a/44442957/2150220
The link above is a better solution than just changing your font.
I was receiving the same error, however, none of the above solutions worked for me.
As I was testing:
pdftk a.pdf fill_form a.fdf output b.pdf
I was able to generate a PDF if my original PDF had not been altered, i.e. all of the Acrobat settings were default.
Only when I changed the font to "Arial" for a fill_form element did I receive the error.
I changed the font, and it was working again.
I just wanted to follow up for anyone else who encounters this. In our case, the problem was in the contents of the FDF file. Specifically, we were automating the process of filling in PDFs and user-generated content sometimes includes an unclosed ( [ or { character. These cause this same exception. If this is happening to you, verify that the contents of your FDF file do not contain "unclosed" parens, brackets, or curly braces.

pdfbox and itext not able to extract image

I am trying to extract images from a PDF. PDFBox is able to extract images from most PDFs, but there are some PDFs whose images are not extracted by PDFBox.
For extracting the images I am using the code from the following question:
"Not able to extract images from PDFA1-a format document"
You can download a sample PDF with this problem from this link:
http://myslams.com/test/2.pdf
Is there something wrong with the code, maybe something I forgot to handle, or is there something wrong with the PDF altogether?
As the OP has not yet replaced his stale sample PDF link by a working one, the question can only be answered in general terms.
The code referenced by the OP (with the corrections in the answer of @Tilman) iterates the immediate image resources of each page and stores the respective files.
Thus, the code may store too many images because image resources of a page may not necessarily be used on the page in question:
On one hand, an image resource may not be used at all in the file, or at least nowhere visible, being merely a left-over from some prior PDF editing session.
On the other hand multiple pages may have a shared resources dictionary containing all images on all these pages; in this case the OP's code exports many duplicates.
And the code may store too few images because there are other places where images may be put:
Image data may be directly included in the page content stream, aka inline images.
Constructs with their own resources (form xobjects, patterns, Type 3 font glyphs) used from the page content may provide their own image resources or inline images.
Annotations, e.g. AcroForm form fields, may also have their own appearance streams with their own resources and, therefore, may provide their own image resources or inline images, too.
XFA forms may provide their own images, too.
As soon as the OP provides a representative sample file, the type of images he misses can be determined and a specific solution may be outlined.
EDIT
According to a comment by the OP, his image extraction problems have been resolved by making use of the information from this answer to his question "pdfbox and itext extracting image with incorrect dpi". Especially the pointer to example code appropriate for the PDFBox version 1.8.8 used by the OP seems to have been important.
Thus, any kind of wrong output may also occur as a result of software orchestration issues.

Android pdf writer APW high resolution images cause out of memory exception

I am using android pdf writer (apw) in my app successfully for the most part. However, when I try to include a high-resolution image in a PDF document, I get an out of memory exception.
Immediately before creating the pdf file, the library must have the content itself converted into a string (representing the raw pdf content), which is then converted to a byte array. The byte array is written to the file in a file output stream (see example via website).
The out of memory exception occurs when the string is generated, because representing all the pixels of a bitmap image in string format is very memory intensive. I could downsample the image using the Android API; however, it is essential that the images are put into the PDF at high resolution (~2000 x 1000).
There are many scanner-type apps which seem to be able to generate PDFs with high-res images, so there must be a way around it, surely. Granted, they may be using other libraries, but surely there is someone who has figured out a way around it with this library, given that it is free and therefore popular(?)
I emailed the developer, but there was no response.
Potential solutions (I can think of) include:
Modifying the library to load a string representing e.g. the first 10% of the PDF, and writing to file chunk by chunk. (edit)
Modifying the library to output a stringoutput stream, or other output stream to a temp file (or final file) as the actual pdf content is being written in the pdfwriter object.
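The second idea could be sketched as follows. This is not APW code and does not use APW's API: `ChunkedPixelWriter` and `RowSource` are hypothetical names, and the surrounding PDF object syntax (stream dictionary, length, cross references) is left out. It only shows the memory pattern of writing pixel data to an output stream row by row instead of building one large string first. For a 2000 x 1000 RGB image the raw data is 6,000,000 bytes, while the reusable row buffer here is 6,000 bytes.

```java
import java.io.IOException;
import java.io.OutputStream;

public class ChunkedPixelWriter {
    /** Hypothetical callback that fills one row of RGB bytes (3 bytes per pixel). */
    public interface RowSource {
        void readRow(int y, byte[] rowBuffer);
    }

    /** Streams the image row by row and returns the number of bytes written. */
    public static long writeRows(OutputStream out, int width, int height, RowSource source)
            throws IOException {
        byte[] row = new byte[width * 3]; // the only sizeable buffer: one row, reused
        long written = 0;
        for (int y = 0; y < height; y++) {
            source.readRow(y, row);
            out.write(row);
            written += row.length;
        }
        out.flush();
        return written;
    }
}
```

On Android, the `RowSource` could be backed by reading one row of pixels at a time from the bitmap, and `out` could be a buffered file output stream for the final or a temporary file.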
However as a relative java noob (and even more of a pdf specification noob), I am unable to understand the library well enough to do this myself.
Has anyone come across this problem and found a way around it? Anyone willing to hazard a suggestion, or take a look at the library itself even to see if there is a fix of some sort.
Thanks for your help.
nme32
Edit:
Logcat says the heap size is in the range of 40 to 60 MB before the crash. I understand (do correct me if not) that Android limits the memory available to apps depending on what else is running, though it is in the 50 MB ballpark, depending on the device.
When loading the image, I think APW essentially converts it to bitmap, that is represents the image pixel by pixel then puts it into string format, meaning it doesn't matter which image format you use, it may as well be bitmap.
First of all, the resolution you are mentioning is very high. I have already described the issues related to images in Android in this answer.
Secondly, in case the first solution doesn't work for you, I would suggest a disk-based LruCache: store the chunks in that disk-based cache, then retrieve and use them. Here is an example of that.
Hope this helps. If it doesn't, comment on this answer and I will add more solutions.

Editable .pdf fields disappear (but visible on field focus) after save with evince

First off, let me thank the SO community for helping me so many times in the past; you guys are an amazing resource!
At my job I work on a web application that uses PDF templates created in Scribus and the iText Java library to populate the templates with data from our database. Sometimes, a user-supplied field is required and not touched by iText. When the .pdf is downloaded, a field is edited, and a copy is saved with Evince, the resulting file will not display the edited text upon reopening. However, when an edited field receives focus, it shows the saved text. Unfocus, and the text disappears. Cut the text, paste it back into the field, and it stays visible until you save and reopen the document; after that the original problem manifests again. I've found many extremely similar posts regarding this problem, but none of their solutions seem to work for me.
Also, the problem is quirky. If I open the Scribus template (the .pdf file untouched by iText) with Evince, then edit fields and save, they will show up properly upon reopen. Once the library touches the template, however, the problem occurs. Similarly, I can reproduce the issue with PDF files I have found while searching for the cause of this problem; like this one:
http://www.quask.com/samples/pdfforms/pcpurchase.pdf
This leads me to believe that the misbehaving files may be corrupted in some way, and that iText may be the cause of my problem, but iText isn't the only avenue in which I can reproduce the issue so I'm not sure what to think. I can't seem to find a working solution among the many I've seen. Is anybody familiar enough with this issue to be able to tell me where I can get to the bottom of this or offer some insight regarding the tools I'm using? Chances are good that if you search for the issue using google I've seen it..
I'm using Ubuntu 12.04 (precise), Evince 3.4.0, iText 2.1.5, and can try to fill you in on any other relevant details upon request. I'm apprehensive to post any code as I'm not sure it is Kosher, and it works fine for constructing forms except with this particular issue; let alone the fact that I can reproduce the problem without the use of our webapp.
This is my first post here, and I am a novice programmer (still in school!) so please do let me know if I have violated any conventions or could improve my future inquiries in any way.
Thanks for any help you can offer!
An inspection of the files supplied by jbowman in the comments to his question --- with special regard to the password field (which is one of the fields eventually filled in by evince) --- shows:
Template.pdf
is the original form which was generated by Scribus PDF Library 1.4.1.svn;
contains an AcroForm with 9 fields and the flag NeedAppearances set to true;
has the password field (named passwordField) which contains an empty value and a normal appearance stream painting a rectangle with an empty text.
after_itext.pdf
is the original form edited by iText 2.1.5, unfortunately not in append mode which would have made analysis easier;
contains an AcroForm with 8 fields (the member number field has been filled in and flattened) without a NeedAppearances flag;
has the password field (named passwordField:u4woYY1FK9) value and appearances left untouched.
after_itext_edited.pdf
is the form formerly edited by iText now edited by some other software (evince) in append mode;
contains an AcroForm with 8 fields without a NeedAppearances flag; the only changes have been made to the fields passwordField:u4woYY1FK9 and memberPrefix:u4woYY1FK9:
has the password field (named passwordField:u4woYY1FK9) with a new associated value asdf but has left its appearances untouched;
has the member prefix field (named memberPrefix:u4woYY1FK9) with a new associated value asdf but has left its appearances untouched.
Thus, the observed behavior that the value by default is not shown, is to be expected:
The final AcroForm has no NeedAppearances flag. This flag is defined in the specification ISO 32000-1:2008 as:
A flag specifying whether to construct appearance streams and
appearance dictionaries for all widget annotations in the document
(see 12.7.3.3, “Variable Text”). Default value: false.
Thus, your PDF document in its final form says: No appearances for widgets (e.g. AcroForm field visualizations) need to be generated, take the appearances from the document.
The appearance of the password field from the document is the original one, the rectangle with the empty text.
So you see this empty rectangle.
When you click into the field, the PDF viewer prepares for editing its contents and therefore displays the value as it sees fit.
If editing PDF files with evince is intended to have visible results, evince must, upon changing the value of a field, also add updated appearance streams or make sure the AcroForm NeedAppearances flag is set. Therefore, this is where evince failed.
I have accepted mkl's answer as it hits the nail on the head regarding why the fields do not display properly, and contains much more information than I can provide regarding the issue. However, the suggested fix in the answer's comments did not work because the documents are generated (in this particular case) using iText 2.1.5's PdfCopyFields, which does not respect (strips) the original document's NeedAppearances flag, and calling setNeedAppearances(true) for AcroForm did not solve the issue because of this.
Hacking the createAcroForms() method in PdfCopyFieldsImp to include the line
form.put(PdfName.NEEDAPPEARANCES, PdfBoolean.PDFTRUE);
is what ultimately seems to have solved the issue for me. With this addition, evince properly displays changes to fields after saving and reopening the document.

flying saucer (xhtmlrenderer) requests image 4 times

in my xhtml i have the following:
...
<img src="myImage.jpg" />
...
and I render like so:
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(XMLResource.load(in).getDocument(), url);
renderer.layout();
renderer.createPDF(out);
the resulting PDF is as expected, however I notice that the image (which is included only once in the xhtml and renders only once) is being requested 4 times.
now, besides the obvious problem of the extra data download this wouldn't really be a problem for most people.
however, i need to implement an 'expire on use' image cache for dynamic images and this is becoming a real headache...
why does flying saucer need to make 4 requests for the image if it only renders it once?
This is fixed in the latest version of FlyingSaucer. I've confirmed myself with 9.0.3, although I believe several minor versions prior to that also contain the fix.
I've just gone through the code and there is no solution here (without a re-write of itext & flying saucer).
the first time the stream is open is just to test whether it can be opened, the data is not read.
the second time is itext reading the header to determine the file type, only the first 4 bytes are read.
the third time is itext determining the dimensions of the image it seems - i'm not sure but i don't think much other than headers is read here either.
the last read is to render the image.
so the download impact is not great, 4 url connections - yes, but the entire stream is only transferred once
and my 'expire on use' cache will have to be 'expire on 4th use' instead.
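an 'expire on nth use' cache could be sketched like this. the class name is made up, and the count of 4 comes only from the behaviour observed above; since newer flying saucer versions reportedly open the stream fewer times, the number of allowed reads is a constructor parameter rather than a constant.

```java
import java.util.HashMap;
import java.util.Map;

public class ExpireAfterNReads<K, V> {
    private static final class Entry<V> {
        final V value;
        int remaining;
        Entry(V value, int remaining) {
            this.value = value;
            this.remaining = remaining;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final int readsAllowed;

    public ExpireAfterNReads(int readsAllowed) {
        if (readsAllowed < 1) {
            throw new IllegalArgumentException("readsAllowed must be at least 1");
        }
        this.readsAllowed = readsAllowed;
    }

    public synchronized void put(K key, V value) {
        entries.put(key, new Entry<>(value, readsAllowed));
    }

    /** Returns the value, or null once it has been read the allowed number of times. */
    public synchronized V get(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (--entry.remaining <= 0) {
            entries.remove(key); // this was the last permitted read
        }
        return entry.value;
    }
}
```

with `new ExpireAfterNReads<String, byte[]>(4)` the image survives the four requests described above and is dropped afterwards.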
