Read PDF text and/or all content - java

I have a scenario where I need a Java app to be able to extract content from a PDF file in one of 2 modes: TEXT_ONLY or ALL. In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings. In all mode, all content (text, images, etc.) is read out of the file.
For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:
while (page.hasMoreText()) {
    textList.add(page.nextTextChunk());
}
if (allMode) {
    while (page.hasMoreImages()) {
        imageList.add(page.nextImage());
    }
}
I know Apache Tika uses PDFBox under the hood, but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from PDFBox).
So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?

To expound on some aspects of why @markStephens points you towards resources giving some background on PDF:
In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings.
Your definition, "visible" as if a human being was reading the PDF, is not yet very well-defined:
Is text 1 pt in size visible? When zooming in, a human can read it; at standard magnification, though, one cannot. Which size would be the limit?
Is text in RGB (128, 129, 128) on a background of (128, 128, 128) visible? How different do the colors have to be?
Is text displayed in some white noise pattern on a background of some other white noise pattern visible? How different do the patterns have to be?
Is text only partially on-screen visible? If yes, is one visible pixel enough? And what about some character 'I' in a giant size where the visible page area fits into the dot on the letter?
What about text covered by some annotation which can easily be moved, probably even by some automatically executed JavaScript code in the file?
What about text in some optional content group only visible when printing?
...
I would expect most available PDF text parsing libraries to ignore all these circumstances and extract the text, at most respecting a crop box. In the case of images with added invisible OCR'ed text, extracting that text is generally desired anyway.
For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:
PDF (in general) does not know about paragraphs, just groups of glyphs positioned somewhere on the page. Recognizing paragraphs is a task which cannot be guaranteed to work properly, as heuristics are at work. If, furthermore, you have multicolumn text with an irregular separation, maybe even some image in between (making it hard to decide whether there are two columns divided by the image or whether there is one column with an integrated image), you can count on recognition of the text flow, let alone of text elements like paragraphs, sections, etc., to fail miserably.
If your PDFs are either properly tagged or all generated by a tool chain whose patterns in the created PDF content streams betray text structures, you may be luckier. In case of the latter, though, your solution would have to be custom-made for that tool chain.
but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from PDFBox).
There you point towards another point of interest: PDFs can be marked so that text extraction is forbidden while they otherwise can be displayed by anyone. While technically PDFs marked like that can be handled just like documents without that mark after just one decoding step (essentially they are encrypted with a publicly known password), doing so clearly acts against the declared intention of the author and violates their copyright.
So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?
As long as you expect 100% accuracy for generic input, you should reconsider your architecture.
If, on the other hand, the PDFs are all you have and a solution that is as effective as possible is OK, there are multiple libraries for you, iText and PDFBox to name but two among more. Which is best for you depends on further factors, e.g. on whether you need a generic solution or whether all PDFs are created by a tool chain as above.
In any case you'll have to do some programming yourself, though, to fine-tune them for your use case.
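As a rough illustration of that fine-tuning, here is a minimal sketch of both modes, assuming PDFBox 2.x. The Mode enum and the extract method are invented for illustration, and inline images drawn directly in the content stream are not covered:

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfExtractor {
    public enum Mode { TEXT_ONLY, ALL }

    public static void extract(File file, Mode mode,
                               List<String> textList,
                               List<BufferedImage> imageList) throws Exception {
        try (PDDocument doc = PDDocument.load(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            for (int p = 1; p <= doc.getNumberOfPages(); p++) {
                stripper.setStartPage(p);
                stripper.setEndPage(p);
                textList.add(stripper.getText(doc)); // one string per page

                if (mode == Mode.ALL) {
                    // Collect the image XObjects referenced by the page's resources.
                    PDPage page = doc.getPage(p - 1);
                    for (COSName name : page.getResources().getXObjectNames()) {
                        PDXObject xobj = page.getResources().getXObject(name);
                        if (xobj instanceof PDImageXObject) {
                            imageList.add(((PDImageXObject) xobj).getImage());
                        }
                    }
                }
            }
        }
    }
}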

Related

Best way to extract text from PDF in java

I want to make a program that is able to read PDF files and parse their contents.
Thus I need to extract the text using some kind of library. I found 3 ways to do so.
OCR libraries (like Tesseract)
ScanPdf libraries (like iText)
Converters from PDF to text.
I fail to understand the big differences between them since all of them will produce in the end a text file from the PDF. So which is the best way to go about this?
PDF is a complex format. If you open a PDF and you're staring at a bunch of text, that doesn't really tell you much. It could be that you're staring at an image file someone decided to wrap into a PDF file. This is 99%+ certain what you have if someone scanned a document and told their scanner to 'scan to PDF', and 100% certain what you got if someone took a PNG or JPG and did 'save as PDF', or tried to 'print to PDF' such a thing.
There is no text in the PDF then. There are pixels.
To turn pixels into text, that's where OCR libraries come in. That's what they do. That is all they do. It's an AI bonanza and error prone. No guarantees.
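For the scanned case, here is a minimal sketch using the tess4j wrapper around Tesseract; it assumes tess4j is on the classpath and that the traineddata path matches your installation:

import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

public class OcrExample {
    public static void main(String[] args) throws Exception {
        ITesseract ocr = new Tesseract();
        ocr.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata"); // adjust to your install
        // doOCR works on image files; a scanned PDF page would first
        // have to be rendered to an image (e.g. with PDFBox's PDFRenderer).
        String text = ocr.doOCR(new File("scan.png"));
        System.out.println(text);
    }
}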
However, PDF is more complex than that, it isn't like PNG/JPG: It's more like HTML. You can put actual text in there.
This has different issues, though. You can place text blobs (i.e. a 'rectangle', with coordinates, and then the text that is supposed to go inside). Again a lot like HTML: You can do something like:
<p class="foo">
World!
</p>
<p class="bar">
Hello,
</p>
and then create CSS so that the foo is rendered after the bar block (can be as simple as .foo, .bar { display: block; } .foo {float: right}).
Turning that HTML into "World! Hello," is not all that tricky. Realizing that during a render, you end up seeing "Hello, World!", and thus writing code that returns "Hello, World!", that's way more complicated.
The same problem applies to PDF. For simple PDFs, extracting the raw text inside is not too difficult, but be aware that for even mildly complex PDFs, the text can arrive in a jumbled mess.
iText is trying to give you enough power, at least, to provide the latter: To give you a full hierarchical breakdown. It returns 'here is a text box, here is its positioning, and here is the text inside. and now here is another text box, etc'. It does not return a big string.
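As a rough sketch (assuming iText 5's parser package, not the author's code), even the simpler string-returning API can fix the ordering problem from the HTML example above by sorting text chunks by position rather than by content-stream order:

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class ITextExtract {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("input.pdf");
        for (int p = 1; p <= reader.getNumberOfPages(); p++) {
            // LocationTextExtractionStrategy orders text chunks by their
            // position on the page, not their order in the content stream.
            String text = PdfTextExtractor.getTextFromPage(
                    reader, p, new LocationTextExtractionStrategy());
            System.out.println(text);
        }
        reader.close();
    }
}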
In other words: The answer depends a lot on what PDFs you have / what PDFs you expect to be able to read, and how complex they are. If they are scans, you need an OCR library. If they are simple, a basic pdf2text converter will do fine. If you want to attempt to take into account fancily positioned PDFs with forms inside and 'popups' that can be opened and closed, oof. Probably all these tools are insufficient and you're signing up for many person-weeks worth of effort.
There definitely IS text embedded in PDFs; it is NOT just pixels.
It depends on whether the PDF is a "true" PDF (i.e. you can highlight the text and copy and paste it elsewhere) or a scanned image.
With scanned images, you'll have to use an OCR API. All of the major cloud providers have OCR APIs (e.g. Amazon Textract, Google Document AI, Microsoft Form Recognizer). If it's a true PDF, then I've found the pdf.js library (https://mozilla.github.io/pdf.js/) quite helpful in doing a direct text extraction.
Just know that doing this only gets you the text that is literally on the page, and there's quite a bit of work still to do to get key/value data fields programmatically across many documents.
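Since this thread is Java-centric, the same routing decision (true PDF vs. scan) can be sketched with PDFBox instead of pdf.js; the 20-character threshold here is an arbitrary guess, not an established rule:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class TruePdfCheck {
    // Heuristic: if text extraction yields almost nothing, the PDF is
    // probably a scanned image and should be routed to OCR instead.
    public static boolean isTruePdf(File file) throws Exception {
        try (PDDocument doc = PDDocument.load(file)) {
            String text = new PDFTextStripper().getText(doc);
            return text.trim().length() > 20;
        }
    }
}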
This is something that my startup is working on (www.sensible.so/) too if you're interested in something more powerful!

setBackgroundColor pdfBox Java

I want to change the background color of an already present PDF to transparent or white,
and I am using PDFBox for performing other tasks on the PDF. I found some documentation here:
setBackgroundColor - PDFBox
But I am completely unaware of how to use it, as I am not accustomed to Java.
Can someone possibly provide some example code for doing it?
I want to change the background color of an already present pdf to transparent or white
According to the PDF specification ISO 32000-1, section 11.4.7:
Ordinarily, the page shall be imposed directly on an output medium, such as paper or a display screen. The page group shall be treated as an isolated group, whose results shall then be composited with a backdrop colour appropriate for the medium. The backdrop is nominally white, although varying according to the actual properties of the medium. However, some conforming readers may choose to provide a different backdrop, such as a checker board or grid to aid in visualizing the effects of transparency in the artwork.
PDF viewers most often do use this white backdrop. Thus, if your PDF on standard viewers displays a different color in the back, this normally is due to some area filling operation(s) somewhere in the page content stream.
Thus, there is not a simple single attribute of the PDF to set somewhere; instead you have to parse the page content, find the operations which paint what you perceive as background, and change them. There are numerous different operations which may be used for this task, though, and these operations may also be used for purposes other than background coloring. Thus, there is no one method to change backgrounds.
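Only to give an idea of what "parse the content, find the operations, change them" can look like, here is a very rough sketch with PDFBox 2.x that rewrites every non-stroking RGB color (the rg operator) to white. It would also hit text and other fills in that color, so it is not a general solution:

import java.io.OutputStream;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSFloat;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

public class WhitenFills {
    static void whiten(PDDocument doc, PDPage page) throws Exception {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        for (int i = 3; i < tokens.size(); i++) {
            Object t = tokens.get(i);
            // "r g b rg" sets the non-stroking (fill) color; overwrite
            // its three operands with 1 1 1, i.e. white.
            if (t instanceof Operator && "rg".equals(((Operator) t).getName())) {
                tokens.set(i - 3, new COSFloat(1));
                tokens.set(i - 2, new COSFloat(1));
                tokens.set(i - 1, new COSFloat(1));
            }
        }
        PDStream newContents = new PDStream(doc);
        try (OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE)) {
            new ContentStreamWriter(out).writeTokens(tokens);
        }
        page.setContents(newContents);
    }
}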
If you have a single specific PDF or PDFs generated alike, please provide a sample document to allow helping you to find a way to change the perceived background color.
PS: The PDLayoutAttributeObject.setBackgroundColor method you found refers to the creation of so-called Layout Attributes which
specify parameters of the layout process used to produce the appearance described by a
document’s PDF content. [...]
NOTE The intent is that these parameters can be used to reflow the content or export it to some other document format with at least basic styling preserved.
(section 14.8.5.4 in the PDF specification ISO 32000-1)
Thus, they are provided only in PDFs intended for content reflow or content export and are not used by regular PDF viewers.

If an image is tampered with some additional content, how to remove that additional content from the image in Java?

I want to know if there is any solution for the following scenario:
I have an application which uploads files, after scanning and transcoding them, onto a server. Suppose an image file is being uploaded which has been tampered with by adding some content over it. Now, as the uploaded file is illegitimate, I want to remove the added content and upload just the original part of this image file. Is it possible to do so in Java?
Thanks.
It's not possible to detect in the general case, but there are some heuristic methods available to determine whether an image has been edited. Try using the tools at http://imageedited.com/ to get an idea of what's possible.
Removing the edit is a much more difficult problem, which is probably impossible with current methods.
I'm just speculating here, and I don't know how well it would work in practice, but you could do it if you limit it to specific sources of tampering. E.g., suppose you want to remove the logo added to an image by memegenerator.net.
You know in advance what the text looks like and where it is. Create a transparent PNG template that matches the text. Then sum the differences between the image and template pixel colors, multiplying each by the alpha of the template pixel. Since this particular logo is basically white (although it seems to have a thin black shadow), you would get false positives for a picture with a white part there, so you'd also need to verify that the surrounding pixels are (within a tolerance) not white. It's not clever, but it could work for certain sites.
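A minimal sketch of that alpha-weighted difference; the threshold you would compare the score against, and the template offsets, are up to you:

import java.awt.image.BufferedImage;

public class LogoMatcher {
    // Sums the per-channel color differences between an image region and a
    // transparent PNG template, weighting each pixel by the template's alpha.
    // A low score means the template (logo) is probably present there.
    static double score(BufferedImage image, BufferedImage template, int offX, int offY) {
        double sum = 0;
        for (int y = 0; y < template.getHeight(); y++) {
            for (int x = 0; x < template.getWidth(); x++) {
                int t = template.getRGB(x, y);
                int alpha = (t >>> 24) & 0xFF;
                if (alpha == 0) continue; // fully transparent: ignore
                int i = image.getRGB(offX + x, offY + y);
                int dr = Math.abs(((t >> 16) & 0xFF) - ((i >> 16) & 0xFF));
                int dg = Math.abs(((t >> 8) & 0xFF) - ((i >> 8) & 0xFF));
                int db = Math.abs((t & 0xFF) - (i & 0xFF));
                sum += (dr + dg + db) * (alpha / 255.0);
            }
        }
        return sum;
    }
}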
For anything more flexible (e.g., logos on images which have subsequently been resized) you're into the territory of OCR and TinEye-like image matching, which are more advanced than I could advise you on.
To correctly detect all kinds of "tampering" and filter "illegitimate" from "legitimate" in general, you'd need an artificial intelligence that could understand the meaning and context of what it's seeing. The short answer is: you can't. That's what humans are for.
If this is for a website, probably the best thing you can do is a report button that lets users of your site report images that don't fit with your site's rules.

Printing data to a pre printed form/stationery

We have a requirement where we already have pre-printed stationery and want the user to put data into an HTML form and be able to print that data onto the stationery. Alignment/text size etc. are very important since the pre-printed stationery already has boxes for each character. What could be a good way to achieve this in Java? I have been thinking of using Jasper Reports. Any other options? Maybe overlay an image with text or something?
Also, we might need the capability to print on plain paper, in which case the boxes need to be printed by our application as well, so that the printed output containing the data matches the pre-printed blank stationery.
Do we have some open source framework to do such stuff?
JasperReports -- http://sourceforge.net/projects/jasperreports/
You create XML templates, and then you will be able to produce a report in PDF, HTML, CSV, XLS, TXT, RTF, and more. It has all the necessary options to customize the report. Used it before and recommend it.
You will create the templates with iReport then write the code for the engine to pass the data in different possible ways.
Check http://www.jaspersoft.com/jasperreports
Edit:
You can have background images, overlay the boxes over them, set a limit on the max character count... and much more.
It is very powerful and gives you plenty of options
Here is one of iReport's tutorial for a background image http://ireport-tutorial.blogspot.com/2008/12/background-image-in-ireport.html
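As a hedged sketch of the fill-and-export step (the template name, parameter, and output path are hypothetical; the JRXML would hold the stationery scan as a background image with text fields placed over the boxes):

import java.util.HashMap;
import java.util.Map;
import net.sf.jasperreports.engine.JREmptyDataSource;
import net.sf.jasperreports.engine.JasperCompileManager;
import net.sf.jasperreports.engine.JasperExportManager;
import net.sf.jasperreports.engine.JasperFillManager;
import net.sf.jasperreports.engine.JasperPrint;
import net.sf.jasperreports.engine.JasperReport;

public class FillForm {
    public static void main(String[] args) throws Exception {
        // Compile the JRXML template, fill it with the user's data,
        // and export a PDF ready for printing.
        JasperReport report = JasperCompileManager.compileReport("form.jrxml");
        Map<String, Object> params = new HashMap<>();
        params.put("accountNumber", "123456"); // hypothetical field
        JasperPrint print = JasperFillManager.fillReport(
                report, params, new JREmptyDataSource());
        JasperExportManager.exportReportToPdfFile(print, "filled-form.pdf");
    }
}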
The big problem when printing form content that has been filled in electronically, is aligning it correctly on the pre-printed form. You may get content to align for one printer, but when you use another it is completely misaligned.
Fly Software have a form design product called InForm Designer that gets around the problem nicely by allowing users to specify and save vertical and horizontal offsets for printers. This ensures filled in form content is always aligned. I've tried it and it works perfectly. Take a look for yourself here...
http://www.flysoftware.com/products/inform_designer/overview.asp
It might be worth implementing a printer offset similar to InForm's in your own application (if possible).
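A sketch of what such a per-printer offset could look like with Java's printing API; the calibration values would come from user-saved settings, and the wrapped Printable is whatever renders your form:

import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.print.PageFormat;
import java.awt.print.Printable;
import java.awt.print.PrinterException;

public class OffsetPrintable implements Printable {
    private final Printable inner;
    private final double offsetX, offsetY; // calibration offsets in 1/72 inch

    public OffsetPrintable(Printable inner, double offsetX, double offsetY) {
        this.inner = inner;
        this.offsetX = offsetX;
        this.offsetY = offsetY;
    }

    @Override
    public int print(Graphics g, PageFormat pf, int pageIndex) throws PrinterException {
        // Shift all rendering by the per-printer calibration before
        // delegating to the real form renderer.
        ((Graphics2D) g).translate(offsetX, offsetY);
        return inner.print(g, pf, pageIndex);
    }
}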
Some things to think about.
First, in terms of the web page, do you want to use the stationery as the form layout?
Does it have to be exact?
Combed boxes (one for each character):
Do you want to show them like that on the web page, or deal with the combing later?
How are you going to deal with, say, a combed 6-digit number? Is it right-aligned? What if they enter 7 digits? Same for text: what if it won't fit?
Font choices, we had a lot of fun with W...
How aligned do you want the character within the box, and what font limitations does that imply? Some of the automagic software we looked at did crap like change the size of each character.
Combed editing is a nightmare; we display combed, but raise an edit surface the size of the full box on selection.
Another thing that might drive you barking mad: you'll find small differences in the size and layout of the boxes, so they look okay from a distance but a column of boxes sort of shifts about by a pixel. Some of the testing guys had to lend us their electron microscopes, so we could see how many ink molecules we were out by. :(
Expect to spend a lot of time on the UI side of things, and remember printed stationery changes, so giving yourself some sort of meta description of the form to start with will save you loads of trouble later on.
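For instance, such a meta description could start as simply as this (all names hypothetical):

// One combed field on the stationery, in millimetres, so that the
// on-screen form, plain-paper box printing, and data printing can all
// be driven from the same description.
public class CombedField {
    String name;           // e.g. "accountNumber"
    double xMm, yMm;       // top-left corner of the first box
    double cellWidthMm;    // width of one character box
    double cellHeightMm;
    int cells;             // number of boxes
    boolean rightAligned;  // e.g. numeric fields fill from the right
}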

Read Text and Image Locations (x.y coordinates) using PDFBox

I am writing a Java program to read encrypted PDF files and extract the contents of the file page by page, including the text, images and their positions (x,y coordinates) in the file. Now I'm using PDFBox for this purpose and I'm getting the text and images. But I couldn't get the text position and image position. Also there are some problems reading some encrypted PDF files.
Take a look at org.apache.pdfbox.examples.util.PrintTextLocations. I've used it quite a bit and it's very helpful for analyzing the layout of elements and bounding boxes in PDF documents. It also revealed items printed in white ink, or outside the printable area (presumably document watermarks, or "forgotten" items pushed out of sight by the author).
Usage example:
java -cp app/target/pdfbox-app-1.5.0.jar org.apache.pdfbox.examples.util.PrintTextLocations ~/tmp/mydoc.pdf >~/tmp/out-text-locations.txt
You'll get something like that:
Processing page: 0
String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f
String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e
...
Which you can easily parse and use to plot each element's position, bounding box, and the "flow" (trajectory through all the elements), etc. for each page. As I'm sure you are already aware, you'll find that PDF can be almost impossible to convert to text. It is really just a graphic description format (i.e. for the printer or the screen), not a markup language. You could easily make a PDF that prints "Hello world", but that jumps randomly through the character positions (and that uses different glyphs than any ISO char encoding, if you so choose), making the PDF very hard to convert to text. There is no notion of "word" or "paragraph". A two-column document, for example, can be a nightmare to parse into text.
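If you'd rather collect the positions programmatically than parse that output, the same idea is a small PDFTextStripper subclass. This is a sketch against the newer PDFBox 2.x API (the example above used 1.5), not the example's own code:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class TextLocations extends PDFTextStripper {
    public TextLocations() throws IOException {
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        // Called once per glyph; coordinates are in PDF user-space units.
        System.out.printf("%s at (%.2f, %.2f) size %.2f%n",
                text.getUnicode(), text.getXDirAdj(), text.getYDirAdj(),
                text.getFontSizeInPt());
        super.processTextPosition(text);
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            TextLocations stripper = new TextLocations();
            stripper.setSortByPosition(true);
            stripper.getText(doc); // triggers the per-glyph callbacks
        }
    }
}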
For the second part of your question, I had good results using xpdf version 3.02, after fixing Xref.cc (make XRef::okToPrint(), XRef::okToChange(), XRef::okToCopy() and XRef::okToAddNotes() all return gTrue). That's to handle locked documents, not encrypted ones (there are other utils out there for that).
