flying saucer (xhtmlrenderer) requests image 4 times - java

In my XHTML I have the following:
...
<img src="myImage.jpg" />
...
and I render like so:
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(XMLResource.load(in).getDocument(), url);
renderer.layout();
renderer.createPDF(out);
The resulting PDF is as expected. However, I notice that the image (which is included only once in the XHTML and renders only once) is being requested 4 times.
Besides the obvious cost of the extra data download, this wouldn't really be a problem for most people.
However, I need to implement an 'expire on use' image cache for dynamic images, and this is becoming a real headache...
Why does Flying Saucer need to make 4 requests for the image if it only renders it once?

This is fixed in the latest version of Flying Saucer. I've confirmed this myself with 9.0.3, although I believe several minor versions prior to that also contain the fix.

I've just gone through the code and there is no solution here (short of a rewrite of iText and Flying Saucer).
The first time, the stream is opened just to test whether it can be opened; no data is read.
The second time, iText reads the header to determine the file type; only the first 4 bytes are read.
The third time, iText seems to be determining the dimensions of the image. I'm not sure, but I don't think much other than headers is read here either.
The last read is to render the image.
So the download impact is not great: 4 URL connections, yes, but the entire stream is only transferred once,
and my 'expire on use' cache will have to be 'expire on 4th use' instead.
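The 'expire on 4th use' idea could be sketched as a small use-counting cache built from plain JDK classes. The class name `UseCountingCache` and its methods are hypothetical, and the count of 4 is only the number of stream opens observed above; a different Flying Saucer or iText version may open the stream a different number of times, so the count is better kept configurable.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache whose entries expire after a fixed number of reads. */
final class UseCountingCache<K, V> {

    /** A cached value together with the number of reads it has left. */
    private static final class Entry<T> {
        final T value;
        int remainingUses;

        Entry(T value, int remainingUses) {
            this.value = value;
            this.remainingUses = remainingUses;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final int usesBeforeExpiry;

    UseCountingCache(int usesBeforeExpiry) {
        if (usesBeforeExpiry < 1) {
            throw new IllegalArgumentException("usesBeforeExpiry must be >= 1");
        }
        this.usesBeforeExpiry = usesBeforeExpiry;
    }

    /** Stores a value with a fresh use budget, replacing any previous entry. */
    synchronized void put(K key, V value) {
        entries.put(key, new Entry<>(value, usesBeforeExpiry));
    }

    /** Returns the value, or null if it is absent or its uses are exhausted. */
    synchronized V get(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        entry.remainingUses--;
        if (entry.remainingUses == 0) {
            entries.remove(key); // last permitted read: drop the entry
        }
        return entry.value;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

With `new UseCountingCache<String, byte[]>(4)`, the first four `get` calls for a key return the image bytes and the fifth returns null.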

Missing descendant font dictionary

Apologies in advance if I am breaking some process here.
I am aware that there is a question with exactly the same problem, "PDFBox returns missing descendant font dictionary", but that thread ends abruptly because the author wasn't able to give the details. Also, due to low reputation, I wasn't able to continue that thread.
It states the problem of the missing composite font very well. I want to know if there is some way to fix it, since the PDF opens fine in our browser but we are not able to process it programmatically.
I tried a variety of versions, including the latest, 2.0.21.
I will share the PDF.
Looking forward to hearing from you, @mkl and @Tilman Hausherr.
Please let me know if you need more details.
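My code, which tries to convert the PDF to images:
// Uses PDFBox 2.x (PDDocument, PDFRenderer, ImageType) plus java.io.File and java.awt.image.BufferedImage
try (PDDocument document = PDDocument.load(new File(pdfPath, fileName))) {
    PDFRenderer pdfRenderer = new PDFRenderer(document);
    for (int page = 0; page < document.getNumberOfPages(); ++page) {
        // Render each page at 300 DPI as an RGB image
        BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
    }
}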
Having downloaded the file when the link was available, I analyzed it.
Adobe Acrobat Reader shows error messages when opening the document. iText RUPS reports cross reference issues. First impression, therefore: That PDF is broken.
Nonetheless I looked closer but the result of that closer look was not better...
According to the cross references and trailers the PDF should contain 58 indirect objects with IDs 1 through 58. It turned out, though, that objects 32 through 49 are missing albeit most of them are referenced, some as descendant fonts. This explains why PDFBox reports missing descendant fonts.
Furthermore, objects 50 through 57 and 1 through 10 are not at the locations they should be according to the cross reference tables. Also the second cross reference table is at a wrong location and the file length is incorrect according to the linearization dictionary.
The way this is broken leaves the impression that the file is a mix of two slightly different versions of the same file; as if a download of the file was attempted but interrupted at some point and continued from a new version of the file; or as if some PDF processor somehow changed the file and tried to save the changed copy into the same file but was interrupted.
Summarized: The PDF is utterly broken.
If a PDF processor tries to repair it, you cannot be sure which version of the file the recovered information comes from; different PDF processors (if they can make sense of it at all) are likely to interpret the file differently.
If possible, you should reject the file and request a non-broken version of it.
If not possible, copy the data from a viewer that appears to best repair it, manually check the copy for accuracy, and then check the whole extracted data for plausibility in regard to other information you have on the accounts in question. A little prayer won't hurt either.
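Before rendering, a cheap plausibility check on the cross-reference offset can help decide whether to reject a file like this one. The sketch below is only a heuristic using plain JDK classes: the class and method names are made up for illustration, it looks only at the last `startxref` entry, and it says nothing about missing or misplaced objects such as those described above. A file that passes may still be broken.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical triage helper: does the last startxref offset point at an xref section? */
final class XrefSanityCheck {

    // "startxref", the byte offset, then the end-of-file marker.
    private static final Pattern START_XREF =
            Pattern.compile("startxref\\s+(\\d{1,18})\\s+%%EOF");

    static boolean lastXrefOffsetLooksValid(byte[] pdf) {
        // ISO-8859-1 maps each byte to one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);

        long offset = -1;
        Matcher m = START_XREF.matcher(text);
        while (m.find()) {
            offset = Long.parseLong(m.group(1)); // keep the last occurrence
        }
        if (offset < 0 || offset >= text.length()) {
            return false; // no startxref at all, or it points past the end of the file
        }

        int start = (int) offset;
        String target = text.substring(start, Math.min(text.length(), start + 32));
        // Either a classic table ("xref") or an object header of a cross-reference stream.
        return target.startsWith("xref") || target.matches("(?s)\\d+\\s+\\d+\\s+obj.*");
    }
}
```

A caller could read the file with `java.nio.file.Files.readAllBytes` and treat a `false` result as a reason to request a fresh copy rather than attempt a repair.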

Jasper Reports cutting large String between pages

I don't know if "cutting" is the right term...
I have to finish a large and complex report based on a legacy applet system. A colleague and I decided to try reusing all the logic in the applet to avoid the complexity of building a lot of sub-reports. We copied all the logic from the applet, which includes a lot of conditionals and SQL, and used it to build one huge, properly formatted String, so that our Jasper file only needs to call "myVo.getBody()" besides the header and footer stuff.
Unfortunately, we found that some of the text gets lost between pages. I think that as the text grows and reaches the Jasper page limit, it keeps being written into a non-visible area for some reason, and by the time the next page's content starts, part of it has been lost.
For example, there is a list of 19 items and what happens is:
End of 2nd page
1 - item
2 - item
Beginning of 3rd page
18th - item
19th - item
Items 3 to 17 are not shown.
Is there any Jasper configuration for this situation?
We tried:
Position Type: Fix Relative to Top and Float
Stretch Type: Relative to Tallest Object and Relative to Band Height
Stretch With Overflow: true or false
I don't think showing Java code would be useful, as it just uses a StringBuffer to build the String and puts it in the body property of a PreparedDocumentVO so that the Jasper model can consume it. It seems to be some Jasper setting, or else the idea of creating a huge String is not as good as we thought.
I would consider breaking the result up.
Jasper formats information based on a relative page size. This means that at some point in time, when dealing with information that is not likely to fit on a page, Jasper will probably make an assumption that doesn't hold (and your data will likely not be formatted into the page).
If you have an exceptionally long string, consider splitting it up. Besides, people scroll web pages down, not the side, so a heavy side-scrolling document is likely to cause user issues unless every record scrolls to the side just as heavily.
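Splitting the body could look something like the sketch below, which cuts the text into chunks of at most a given number of lines. `BodySplitter`, `splitByLines` and the chunk size are hypothetical; how the chunks are then handed to Jasper (for example, one chunk per data-source row, so that each lands in its own detail band) is an assumption that is not shown here and depends on the report design.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that breaks one huge report body into smaller pieces. */
final class BodySplitter {

    /** Splits the body into chunks holding at most maxLinesPerChunk lines each. */
    static List<String> splitByLines(String body, int maxLinesPerChunk) {
        if (maxLinesPerChunk < 1) {
            throw new IllegalArgumentException("maxLinesPerChunk must be >= 1");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int linesInCurrent = 0;

        // \R matches any line break; limit -1 keeps trailing empty lines.
        for (String line : body.split("\\R", -1)) {
            if (linesInCurrent == maxLinesPerChunk) {
                chunks.add(current.toString()); // chunk is full, start a new one
                current.setLength(0);
                linesInCurrent = 0;
            }
            if (linesInCurrent > 0) {
                current.append('\n');
            }
            current.append(line);
            linesInCurrent++;
        }
        if (linesInCurrent > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Choosing a chunk size well below what fits on one page leaves the report engine free to break between chunks instead of inside a single oversized text field.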

pdfbox and itext not able to extract image

I am trying to extract images from a PDF. PDFBox is able to extract images from most PDFs, but there are some PDFs whose images are not extracted by PDFBox.
For extracting the image I am using following code :
Not able to extract images from PDFA1-a format document
You can download a sample pdf with this problem from this link :
http://myslams.com/test/2.pdf
Is there something wrong with the code, maybe something I forgot to handle, or is there something wrong with the PDF altogether?
As the OP has not yet replaced his stale sample PDF link by a working one, the question can only be answered in general terms.
The code referenced by the OP (with the corrections in the answer of @Tilman) iterates the immediate image resources of each page and stores the respective files.
Thus, the code may store too many images because image resources of a page may not necessarily be used on the page in question:
On one hand it may not be used at all in the file or at least nowhere visible, merely a left-over from some prior PDF editing session.
On the other hand multiple pages may have a shared resources dictionary containing all images on all these pages; in this case the OP's code exports many duplicates.
And the code may store too few images because there are other places where images may be put:
Image data may be directly included in the page content stream, aka inline images.
Constructs with their own resources (form XObjects, patterns, Type 3 font glyphs) used from the page content may provide their own image resources or inline images.
Annotations, e.g. AcroForm form fields, may also have their own appearance streams with their own resources and, therefore, may provide their own image resources or inline images, too.
XFA forms may provide their own images, too.
As soon as the OP provides a representative sample file, the type of images he misses can be determined and a specific solution may be outlined.
EDIT
According to a comment by the OP, his image extraction problems have been resolved by making use of the information from this answer to his question "pdfbox and itext extracting image with incorrect dpi". In particular, the pointer to example code appropriate for PDFBox version 1.8.8, which the OP uses, seems to have been important.
Thus, any kind of wrong output may also occur as a result of software orchestration issues.

How to check if a PDF document contains an image

I am reading text from PDF documents using the iText library. However, some PDF documents might have an image embedded within them in addition to text.
I'm wondering whether there is any way, through iText or something else, to determine if the pdf document contains an image?
You can do a correct and 100% reliable check using a PDF library.
However you can probably do a fairly reliable check just by reading the PDF as text and processing it that way. You need to first check it is a PDF by looking for the PDF header at the start,
%PDF...
Then scan through looking for the phrase,
/XObject
When you hit this tag you need to check backwards and forwards in the stream to the << and >> dictionary boundaries to pull out the full XObject dictionary. There may be nested << and >> so you might want to check back to the 'obj' and forwards to the 'stream' entry. Anyhow you'll end up with something that looks like this,
<<
/Type /XObject /Subtype /Image /Name /I1
/Width 800 /Height 128
/BitsPerComponent 1 /ImageMask true
/Filter [/FlateDecode]
/Length 2302 >>
The thing you need to check here is that there is this /Subtype entry and an /Image separated by some whitespace. If you hit that then you have an image.
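The text scan described above could be sketched roughly as follows with plain JDK classes. `PdfImageHeuristic` and `probablyContainsImage` are hypothetical names, and the sketch simplifies the description: instead of walking back and forth to the dictionary boundaries, it just searches for a `/Subtype` key directly followed by `/Image`, allowing optional whitespace between them since PDF names are self-delimiting.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical heuristic: scan raw PDF bytes for an image XObject dictionary entry. */
final class PdfImageHeuristic {

    // "/Subtype" followed by "/Image"; the lookahead rejects longer names such as "/ImageMask".
    private static final Pattern IMAGE_SUBTYPE =
            Pattern.compile("/Subtype\\s*/Image(?![A-Za-z0-9])");

    static boolean probablyContainsImage(byte[] fileBytes) {
        // ISO-8859-1 keeps every byte as one char, so binary stream data cannot break decoding.
        String text = new String(fileBytes, StandardCharsets.ISO_8859_1);
        if (!text.startsWith("%PDF-")) {
            return false; // not a PDF, or the header is not at the very start
        }
        return IMAGE_SUBTYPE.matcher(text).find();
    }
}
```

The bytes could come from `java.nio.file.Files.readAllBytes`; for very large files, a streaming scan would avoid holding the whole document in memory.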
So what are the limits of this approach?
Well it is possible to embed an image in the document but not use it. That would result in a false positive. I think this is pretty unlikely though. It would be very inefficient to do so and only a really skanky producer would do it.
Images can be embedded in page content streams as mentioned by Hugo above. That would result in a false negative. These are pretty uncommon though. It's one of those bits of the spec which was never a good idea and it's not widely used. If you have documents from a single producer (as is often the case) it will become apparent very quickly if it does this or not. However I think it would be pretty uncommon. At a guess I can't imagine that more than 1% of wild PDFs would contain this construct.
It is possible to embed these XObject tags as references rather than direct objects. But I think you can completely discount that. While legal it would be absolutely bizarre. I don't think you'll ever see that.
The correct way involves scanning and parsing all the content streams in the PDF. It's what we do in ABCpdf (which I work on) but it is a lot more work and a lot more processing power. It could be many seconds on a large document.
Think if 99% reliability is going to be good enough. :-)
Images in a PDF are either image XObjects (possibly nested inside form XObjects) or inline images embedded in content streams using the BI/EI operators.
So you have to parse the Resources dictionary of the page and recursively examine its XObjects to check whether they contain images as well (each form XObject has the same kind of Resources dictionary). You will also have to parse all content streams and check whether an inline image is present. Additionally, images may be defined in patterns. That is the way to go if you are going to implement your own image presence checker. Read the spec first and estimate the time cost. A third-party library might not be that expensive in the end.

Reduce PDF file size in itext (java)

I'm creating a web-based label printing system. Every label should have a unique s/n. So when a user decides to create 1000 labels (with the same data), each of them should have a unique s/n; therefore the PDF will have 1000 pages, which increases the file size.
My problem is that when the user decides to create more copies, the file size gets bigger.
Is there any way to reduce the file size of the PDF using iText? Or is there any way to generate the PDF and output it in the browser without saving it to either the server's or the client's HDD?
Thanks for the help!
One approach is to compress the file. It should be highly compressible.
(I imagine that you should be able to generate the PDF on the server side without writing it to disc, though you could use a lot of memory / Java heap in the process. I don't think it is possible to deliver a PDF to the browser without the file going to the client PC's hard drive in some form.)
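As a rough illustration of the compression idea, the sketch below gzips a byte array in memory using only `java.util.zip`. `PdfGzip` and its methods are hypothetical names, and the achievable ratio is an open question: it depends on how repetitive the generated PDF is and on whether its streams are already compressed, so the result should be measured on real output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: gzip a generated document entirely in memory. */
final class PdfGzip {

    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so read the buffer only afterwards.
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = gz.read(chunk)) != -1) {
                out.write(chunk, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because everything happens in byte arrays, nothing needs to be written to the server's disk; the cost is the heap needed to hold the document.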
If everything except the s/n is the same for the thousands of labels, you only have to add the equal things one time as a template and put the s/n text on top of it.
Take a look at PdfTemplate in iText. If I recall correctly, that creates an XObject for the recurring drawing/label/image, and it is exactly the same object every time you use it.
Even with thousands of labels, the only thing that grows your document size is the s/n (and every page) but the graphics or text of the 'label' is only added once. That should reduce your file size.
