I am reading text from PDF documents using the iText library. However, some PDF documents might have an image embedded within them in addition to text.
I'm wondering whether there is any way, through iText or something else, to determine if the pdf document contains an image?
You can do a correct and 100% reliable check using a PDF library.
However, you can probably do a fairly reliable check just by reading the PDF as text and processing it that way. You first need to check that it is a PDF by looking for the PDF header at the start,
%PDF...
Then scan through looking for the phrase,
/XObject
When you hit this tag you need to check backwards and forwards in the stream to the << and >> dictionary boundaries to pull out the full XObject dictionary. There may be nested << and >> pairs, so you might want to check back to the 'obj' keyword and forward to the 'stream' entry. Anyhow, you'll end up with something that looks like this:
<<
/Type /XObject /Subtype /Image /Name /I1
/Width 800 /Height 128
/BitsPerComponent 1 /ImageMask true
/Filter [/FlateDecode]
/Length 2302 >>
The thing you need to check here is that there is a /Subtype entry followed by /Image, separated by some whitespace. If you hit that, then you have an image.
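For illustration, here is a minimal Java sketch of this heuristic, assuming the file is small enough to read fully into memory; the class and method names are just placeholders, not a definitive implementation.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ImageHeuristic {
    // Rough check: the file must start with the %PDF header, and somewhere in the
    // raw bytes there must be an /XObject plus a /Subtype entry followed by /Image.
    public static boolean probablyContainsImage(String path) throws Exception {
        String raw = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.ISO_8859_1);
        if (!raw.startsWith("%PDF")) {
            return false; // not a PDF at all
        }
        return raw.contains("/XObject") && raw.matches("(?s).*/Subtype\\s*/Image.*");
    }
}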
So what are the limits of this approach?
Well it is possible to embed an image in the document but not use it. That would result in a false positive. I think this is pretty unlikely though. It would be very inefficient to do so and only a really skanky producer would do it.
Images can be embedded in page content streams, as mentioned by Hugo above. That would result in a false negative. These are pretty uncommon though. It's one of those bits of the spec which was never a good idea and it's not widely used. If you have documents from a single producer (as is often the case) it will become apparent very quickly whether it does this or not. At a guess I can't imagine that more than 1% of wild PDFs would contain this construct.
It is possible to embed these XObject tags as references rather than direct objects, but I think you can completely discount that. While legal, it would be absolutely bizarre. I don't think you'll ever see that.
The correct way involves scanning and parsing all the content streams in the PDF. It's what we do in ABCpdf (which I work on) but it is a lot more work and a lot more processing power. It could be many seconds on a large document.
Think about whether 99% reliability is going to be good enough. :-)
Images in a PDF are either image XObjects (possibly nested inside form XObjects) or inline images embedded in the content with the BI-EI operators.
So you have to parse the Resources dictionary of the page and recursively examine its XObjects to check whether they contain an image as well (each has its own Resources dictionary). You will also have to parse all content streams and check whether an inline image is present. Additionally, images may be defined in Patterns. That's the way to go if you are going to implement your own image-presence checker. Read the spec first and estimate the time it will take; a third-party lib might not be that expensive in the end.
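If you prefer the library route the question mentions, here is a rough sketch using iText 5's parser package (not a definitive implementation; whether inline BI-EI images are also reported depends on the iText version). It flags a document as soon as any image is actually drawn on a page.

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class ImageFinder {
    public static boolean hasImage(String path) throws Exception {
        PdfReader reader = new PdfReader(path);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        final boolean[] found = {false};
        for (int i = 1; i <= reader.getNumberOfPages() && !found[0]; i++) {
            parser.processContent(i, new RenderListener() {
                public void beginTextBlock() {}
                public void endTextBlock() {}
                public void renderText(TextRenderInfo info) {}
                public void renderImage(ImageRenderInfo info) {
                    found[0] = true; // an image was drawn on this page
                }
            });
        }
        reader.close();
        return found[0];
    }
}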
Starting with an apology if I am breaking some process here.
I am aware that there is a question with exactly the same problem,
PDFBox returns missing descendant font dictionary, but that thread ends abruptly because the author wasn't able to give the details, unfortunately. Due to low reputation I also wasn't able to continue that thread.
It states the problem of the missing composite font very well. I wanted to know if there is some way to fix it, since the PDF opens fine in our browser but we are not able to deal with it programmatically.
I tried it on a variety of versions, including the latest 2.0.21.
I will share the PDF.
Looking forward to your response.
#mkl, #Tilman Hausherr
Please let me know if you need more details.
My code trying to convert the PDF to images
PDDocument document = PDDocument.load(new File(pdfPath + "//" + fileName));
PDFRenderer pdfRenderer = new PDFRenderer(document);
for (int page = 0; page < document.getNumberOfPages(); ++page) {
    // PDFBox reports the missing descendant font dictionary here
    BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
}
document.close();
Having downloaded the file when the link was available, I analyzed it.
Adobe Acrobat Reader shows error messages when opening the document. iText RUPS reports cross reference issues. First impression, therefore: That PDF is broken.
Nonetheless I looked closer but the result of that closer look was not better...
According to the cross references and trailers the PDF should contain 58 indirect objects with IDs 1 through 58. It turned out, though, that objects 32 through 49 are missing albeit most of them are referenced, some as descendant fonts. This explains why PDFBox reports missing descendant fonts.
Furthermore, objects 50 through 57 and 1 through 10 are not at the locations they should be according to the cross reference tables. Also the second cross reference table is at a wrong location and the file length is incorrect according to the linearization dictionary.
The way this is broken leaves the impression that the file is a mix of two slightly different versions of the same file; as if a download of the file was attempted but interrupted at some point and continued from a new version of the file; or as if some PDF processor somehow changed the file and tried to save the changed copy into the same file but was interrupted.
Summarized: The PDF is utterly broken.
If a PDF processor tries to repair it, you cannot be sure which version of the file the information you get will come from; different PDF processors (if they can make sense of it at all) are likely to interpret the file differently.
If possible, you should reject the file and request a non-broken version of it.
If not possible, copy the data from a viewer that appears to best repair it, manually check the copy for accuracy, and then check the whole extracted data for plausibility in regard to other information you have on the accounts in question. A little prayer won't hurt either.
I have a component that converts PDF documents to images, one image per page. Since the component uses converters producing in-memory images, it hits the JVM heap heavily and takes some time to finish conversions.
I'm trying to improve the overall performance of the conversion process, and found a native library with a JNI binding to convert PDFs to TIFFs. That library can only convert PDFs to single TIFF files (it requires intermediate file system storage and does not even consume conversion streams), so the resulting TIFF files have the converted pages embedded in them, rather than per-page images on the file system. Using the native library improves conversion performance drastically, but there is a real bottleneck: since I have to make a source-page to destination-page conversion, I now must extract every page from the result file and write each of them elsewhere. A simple and naive approach with RenderedImages:
final SeekableStream seekableStream = new FileSeekableStream(tempFile);
final ImageDecoder imageDecoder = createImageDecoder("tiff", seekableStream, null);
...
// V--- heap is wasted here
final RenderedImage renderedImage = imageDecoder.decodeAsRenderedImage(pageNumber);
// ... do the rest stuff ...
Frankly, I would really just like to extract a specific page's input stream from the TIFF container file (tempFile) and redirect it elsewhere without having to store it as an in-memory image. I imagine an approach similar to container processing, where I seek to a specific entry and extract its data (say, something like ZIP file processing, etc.). But I couldn't find anything like that in ImageDecoder, or perhaps my expectations are wrong and I'm just missing something important here...
Is it possible to extract TIFF container page input streams using JAI API or probably third-party alternatives? Thanks in advance.
I could be wrong, but I don't think JAI has support for splitting TIFFs without decoding the files to in-memory images. And, sorry for promoting my own library, but I think it does exactly what you need (the main part of the solution used to split TIFFs was contributed by a third party).
By using the TIFFUtilities class from com.twelvemonkeys.contrib.tiff, you should be able to split your multi-page TIFF to multiple single-page TIFFs like this:
TIFFUtilities.split(tempFile, new File("output"));
No decoding of the images is done; each IFD is simply split into a separate file, and the streams are written with corrected offsets and byte counts.
Files will be named output/0001.tif, output/0002.tif etc. If you need more control over the output name or have other requirements, you can easily modify the code. The code comes with a BSD-style license.
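For completeness, a small usage sketch built around that call; the output directory name and the destination handling are placeholders, not part of the library.

import java.io.File;
import java.io.IOException;
import com.twelvemonkeys.contrib.tiff.TIFFUtilities;

// Split the multi-page TIFF from the native converter into single-page TIFFs
// without decoding any pixel data, then hand each page file on as a plain stream.
void splitPages(File tempFile) throws IOException {
    File outputDir = new File("output");
    outputDir.mkdirs();
    TIFFUtilities.split(tempFile, outputDir); // writes output/0001.tif, output/0002.tif, ...
    for (File page : outputDir.listFiles()) {
        // each 'page' is a complete single-page TIFF; copy its bytes wherever needed,
        // e.g. Files.copy(page.toPath(), destinationStream)
    }
}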
I am using android pdf writer (APW) in my app successfully for the most part. However, when I try to include a high-resolution image in a PDF document, I get an out of memory exception.
Immediately before creating the PDF file, the library must have the content itself converted into a string (representing the raw PDF content), which is then converted to a byte array. The byte array is written to the file with a file output stream (see the example on the website).
The out of memory exception occurs when the string is generated, because representing all the pixels of a bitmap image in string format is very memory intensive. I could downsample the image using the Android API; however, it is essential that the images are put into the PDF at high resolution (~2000 x 1000).
There are many scanner-type apps which seem to be able to generate PDFs with high-resolution images, so there must be a way around it, surely. Granted, they may be using other libraries, but surely someone has figured out a way around it with this library, given that it is free and therefore popular(?)
I emailed the developer, but there was no response.
Potential solutions (I can think of) include:
Modifying the library to load a string representing e.g. the first 10% of the PDF, and writing to the file chunk by chunk.
Modifying the library to output a string output stream, or another output stream, to a temp file (or the final file) as the actual PDF content is being written by the PdfWriter object (see the rough sketch after this list).
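To make the second idea concrete, here is a hypothetical sketch (this is not the actual APW API): the Iterator stands in for whatever hook a modified PdfWriter would expose to hand out content piece by piece.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;

// Each generated piece is converted and flushed immediately, so only one
// piece of the PDF content lives on the heap at any time.
void writeIncrementally(Iterator<String> contentPieces, File target) throws IOException {
    try (OutputStream out = new BufferedOutputStream(new FileOutputStream(target))) {
        while (contentPieces.hasNext()) {
            out.write(contentPieces.next().getBytes("ISO-8859-1"));
        }
    }
}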
However as a relative java noob (and even more of a pdf specification noob), I am unable to understand the library well enough to do this myself.
Has anyone come across this problem and found a way around it? Is anyone willing to hazard a suggestion, or even take a look at the library itself to see if there is a fix of some sort?
Thanks for your help.
nme32
Edit:
Logcat says the heap size is in the range of 40 to 60 MB before the crash. I understand (do correct me if not) that Android limits the memory available to apps depending on what else is running, though it is in the 50 MB ballpark, depending on the device.
When loading the image, I think APW essentially converts it to a bitmap, that is, it represents the image pixel by pixel and then puts it into string format, meaning it doesn't matter which image format you use; it may as well be a bitmap.
First of all, the resolution you are mentioning is very high, and I have already mentioned the issues related to images in Android in this answer.
Secondly, in case the first solution doesn't work for you, I would suggest a disk-based LruCache. Store the chunks in that disk-based cache, then retrieve and use them. Here is an example of that.
Hope this helps. If it doesn't, comment on this answer and I will add more solutions.
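A rough sketch of the disk-cache idea, assuming Jake Wharton's DiskLruCache library is on the classpath; the cache directory, key and chunk bytes are placeholders.

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import com.jakewharton.disklrucache.DiskLruCache;

// Park each generated chunk on disk instead of keeping it on the heap,
// then stream it back when assembling the final PDF.
void cacheChunk(File cacheDir, byte[] chunkBytes) throws IOException {
    DiskLruCache cache = DiskLruCache.open(cacheDir, 1, 1, 50L * 1024 * 1024);
    DiskLruCache.Editor editor = cache.edit("chunk-0");
    OutputStream out = editor.newOutputStream(0);
    out.write(chunkBytes);
    out.close();
    editor.commit();

    DiskLruCache.Snapshot snapshot = cache.get("chunk-0");
    InputStream in = snapshot.getInputStream(0); // stream this into the output file
    in.close();
    cache.close();
}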
A PDF file can be verified by its magic signature: 25 50 44 46 ('%PDF').
However, I want to detect whether a PDF contains text or images (i.e. whether the PDF contains text that can be searched with Ctrl+F, or whether it contains scanned documents).
Is there a way to do this?
Well technically, you could parse the PDF document structure and look for elements that contain text. I imagine this would require a big effort to implement.
So you may want to use a premade PDF package to do the parsing for you (PDFBox, BfoPDF or something similar). Still, I think it will require some effort to implement.
The simplest way that I know of would be to use a package that can extract the plain text for you. Apache TIKA can do this. Just feed it the document and see if you get something back.
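A minimal sketch of that check, assuming the Tika core and PDF parser dependencies are on the classpath:

import java.io.File;
import org.apache.tika.Tika;

// If Tika extracts essentially no text, the PDF is most likely a scanned document.
boolean hasSearchableText(File pdf) throws Exception {
    String text = new Tika().parseToString(pdf);
    return text != null && !text.trim().isEmpty();
}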
In any case it will be hard to classify PDFs that contain both images and text.
We need to load and display large rich-text files using Swing, about 50 MB. The problem is that the performance when rendering the files is incredibly poor. We tried both JTextPane and JEditorPane with no luck.
Does someone have experience with this and could give me some advice?
thanks,
I don't have any experience with this, but if you really need to load big files I suggest you do some kind of lazy loading with JTextPane/JEditorPane.
Define a limit that JTextPane/JEditorPane can handle well (like 500 KB or 1 MB). You'll only ever need to load a chunk of the file of this size into the control.
Start by loading the 1st partition of the file.
Then you need to interact with the scroll container and see if it has reached the end/beginning of the current chunk of the file. If so, show a nice waiting cursor and load the previous/next chunk to memory and into the text control.
The loading chunk is calculated from your current cursor position in the file (offset).
loading chunk = offset - limit/2 to offset + limit/2
The text shown in the JTextPane/JEditorPane must not move when loading chunks, or else the user will feel like they are at another position in the file.
This is not a trivial solution but if you don't find any other 3rd party control to do this I would go this way.
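A rough sketch of the chunk calculation, assuming a single-byte encoding for simplicity; the method and variable names are placeholders.

import java.io.IOException;
import java.io.RandomAccessFile;

// Load 'limit' bytes centred on the current offset into the text component.
static String loadChunk(RandomAccessFile file, long offset, int limit) throws IOException {
    long start = Math.max(0, offset - limit / 2);
    long end = Math.min(file.length(), offset + limit / 2);
    byte[] buffer = new byte[(int) (end - start)];
    file.seek(start);
    file.readFully(buffer);
    return new String(buffer, java.nio.charset.StandardCharsets.ISO_8859_1);
}

The returned string is what goes into setText(...) on the text component, while the scroll listener decides when to call loadChunk again.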
You could use Memory Mapped File I/O to create a 'window' into the file and let the operating system handle the reading of the file.
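Something along these lines, as a sketch of the memory-mapping idea (the file name and window size are arbitrary):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Map only a window of the file; the OS pages the bytes in on demand.
try (RandomAccessFile raf = new RandomAccessFile("big.rtf", "r");
     FileChannel channel = raf.getChannel()) {
    long windowStart = 0;                                        // position of the current window
    long windowSize = Math.min(1024 * 1024, channel.size() - windowStart);
    MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, windowStart, windowSize);
    // decode the bytes in 'window' and feed only that slice to the text component
}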
Writing an efficient WYSIWYG text editor that can handle large documents is a pretty hard problem. Even Word has problems when you get into large books.
Swing is general purpose, but you have to build up a toolset around it involving managing documents separately and paging them.
You might look at Open Office, you can embed an OO document editor screen right into your app. I believe it's called OOBean...
JTextPane/JEditorPane do not handle even 1 MB of text well (especially text with long lines).
You can try JEdit (StandaloneTextArea) - it is much faster than the Swing text components, but I doubt it will handle this much text. I tried with a 45 MB file, and while it loaded (~25 seconds) and I could scroll down, I started getting OutOfMemoryErrors with a 1700 MB heap.
In order to build a really scalable solution there are two obvious options really:
Use pagination. You can do just fine with standard Swing by displaying text in pages.
Build a custom text renderer. It can be as simple as a scrollable pane where only the visible part is drawn, using a BufferedReader to skip to the desired line in the file and read a limited number of lines to display. I did this before and it is a workable solution. If you need 'text selection' capabilities, that is a little more work, of course.
For really large files you could build an index file that contains the offset of each line in characters, so getting the offset is a quick 'RandomAccess' lookup by line number, and reading the text is a 'skip' to this offset. Very large files can be viewed with this technique.
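A sketch of the index idea (buffering omitted for brevity; in practice you would wrap the reads):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// One pass over the file records the byte offset at which every line starts.
static List<Long> buildLineIndex(RandomAccessFile file) throws IOException {
    List<Long> offsets = new ArrayList<>();
    offsets.add(0L);
    file.seek(0);
    int b;
    while ((b = file.read()) != -1) {
        if (b == '\n') {
            offsets.add(file.getFilePointer());
        }
    }
    return offsets;
}
// To show line n: file.seek(offsets.get(n)) and read just the lines the viewport needs.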