How to get given paragraph content of pdf file using iText library?

How to get given paragraph content of pdf file using iText library? - java

Is there any way to get number of paragraphs or content of given paragraph in pdf file using iText library ?. I saw some classes like Paragraph, Chunk in some code to create new pdf file but I can not find any way to get these classes in reading file. Every idea is appreciated

Is the PDF you're talking about a Tagged PDF? If not, you are making the wrong assumptions about PDF. In a PDF, content is drawn on a page. For instance: an iText PdfPTable is converted into text state operators that draw snippets of text to a canvas, as well as graphics state operators that draw paths and shapes. If the PDF isn't tagged, the lines don't know that they are borders of a table; a word doesn't know to which cell it belongs.
The same goes for paragraphs: a snippet of text doesn't know whether it belongs to a sentence, to a paragraph, to a title line,...
Due to the very nature of PDF, what you're looking for may be impossible (using iText or any other software product), or may require heuristics (artificial intelligence) to examine all text state operators and the semantics of the content to get a result that mimics how humans would interpret text.
It's very easy to achieve if your PDF is tagged correctly. See the ParseTaggedPdf example.

Related

How to add a row in a itext PdfPTable dynamically?

I have to create a pdf using itext which will contain a button, when clicked should add a row in an existing PdfPTable. I wrote some code to create a PushbuttonField. While trying to set action I can only find PdfAction.javaScript. I am not able to figure out how to add a row in a table. I tried searching online but all I could find is PdfAction.javaScript
Any help would be greatly appreciated. Thank you.

When you create a PDF file, you draw text, lines and shapes to a canvas. That is also what happens when you add a PdfPTable to a Document. If you look at the syntax of the PDF page, you won't recognize a table. You'll find text (the content of the cells), lines (the borders), and shapes (the backgrounds), but you won't find a table. If the table is distributed over different pages, the "table" on one page won't know that it is related to the "table" on the other page.
Sure, you can add semantic structure to the document by introducing marked content, and by creating a structure tree, but that mechanism which we call Tagged PDF can't be used to make the PDF "editable" the same way a Word document is editable. Tagged PDF is (among others) used to allow assistive technology to present the content to the visually impaired (e.g. in the context of PDF/UA). The presence of structure doesn't change the fact that all text, all lines, and all shapes are added at absolute positions.
This is very different from HTML where the position on a page of a <table>, <tr>, <th>, or <td> is calculated at the moment the page is rendered. In HTML this position can even change when you resize the browser window.
There is no such thing in PDF (except if you use XFA (*), a technology that is deprecated since ISO 32000-2). All content on a page has a fixed position, hardcoded into the page's content stream. Changing the size of the PDF viewer window won't change anything to the position of the page content.
Because of all of this, your question is invalid. It is impossible to create a button in PDF that adds a row to a table, because:
In many cases there is no table: there is just a bunch of text, lines, and shapes at absolute positions,
Even if there is the notion of a table (using Tagged PDF): the visual represenation of that table is fixed at creation time, it can't be changed at consumption time.
You want to use an ordinary PDF viewer as if it were a PDF editor. That is impossible for all the reasons listed above.
(*) XFA was deprecated for different reasons. One of the most important reasons it is the lack of support for XFA. There aren't many viewers that support XFA. If you would post a follow-up question asking *"How can I create an XFA document?", the answer would be: "Don't do this!" Creating XFA is extremely complex, and once you've succeeded in creating an XFA form, you'll discover that many of your customers won't be able to consume the file because their viewer doesn't support the format.

How to hide text in an PDF file?

How can I add text to a pdf document, which is not visible?
The document manipulation should be done in java. The usecase is to add further metadata to a document (in a proprietary format, about 40kb), before the document is signed and archived.
I tried:
annotation field with size 0,0
.txt file attachment
but, this annoys readers of the PDF, because they see a difference (comment / attachment bar).
Is there a comment object or a syntax to comment out lines in a PDF document?
EDIT:
I've tried adding text between PDF objects. This works, the problem is: acrobat reader asks to resave the file when closing window.
Adding the text after %EOF is not a solution, because signing is not applied to the metadata, which is a needed feature.

The proper way to add metadata to a PDF would be through XMP. It allows you to add arbitrary metadata and allows defining the metadata types inside of the same PDF file (which you really should do if you're archiving and which is a requirement in archival standards such as PDF/A).
XMP data can be extracted by readers who don't understand the PDF format using a simple text scanning algorithm yet at the same time it will be inside of the document so will be protected by the digital signature you apply.
You can read more about it here: http://www.adobe.com/products/xmp/

I have seen PDF's who had a bunch of metadata in the footer, just in color white while the background was also white, so normally you wouldn't recognize it when you're looking at the PDF. But that's quite nasty..

Read PDF text and/or all content

I have a scenario where I need a Java app to be able to extract content from a PDF file in one of 2 modes: TEXT_ONLY or ALL. In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings. In all mode, all content (text, images, etc.) is read out of the file.
For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:
while(page.hasMoreText())
textList.add(page.nextTextChunk());
if(allMode)
while(page.hasMoreImages())
imageList.add(page.nextImage());
I know Apache Tika uses PDFBox under the hood, but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from PDFBox).
So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?

To expound some aspects of why #markStephens points you towards some resources giving some background on PDF.
In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings.
Your definition "visible" as if a human being was reading the PDF is not yet very well-defined:
Is text 1 pt in size visible? When zooming in, a human can read it; in standard magnification not, though. Which size would be the limit?
Is text in RGB (128, 129, 128) in a background of (128, 128, 128) visible? How different have the colors to be?
Is text displayed in some white noise pattern on a background of some other white noise pattern visible? How different have patterns to be?
Is text only partially on-screen visible? If yes, is one visible pixel enough? And what about some character 'I' in a giant size where the visible page area fits into the dot on the letter?
What about text covered by some annotation which can easily be moved, probably even by some automatically executed JavaScript code in the file?
What about text in some optional content group only visible when printing?
*...
I would expect most available PDF text parsing libraries to ignore all these circumstances and extract the text, at most respecting a crop box. In case of images with added, invisible OCR'ed text the extraction of that text in general is desired.
For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:
PDF (in general) does not know about paragraphs, just some groups of glyphs positioned somewhere on the page. Recognizing paragraphs is a task which cannot be guaranteed to work properly as there are heuristics at work. If, furthermore, you have multicolumn text with an irregular separation, maybe even some image in between (making it hard to decide whether there are two columns divided by the image or whether there is one column with an integrated image), you can count on recognition of the text flow let alone text elements like paragraphs, sections, etc. to fail miserably.
If your PDFs are either properly tagged or all generated by a tool chain for which patterns in the created PDF content streams betray text structures, you may be more lucky. In case of the latter, though, your solution would have to be custom-made for that tool chain.
but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from pdfBox).
There you point towards another point of interest: PDFs can be marked that text extraction is forbidden while they otherwise can be displayed by anyone. While technically PDFs marked like that can be handled just like documents without that mark with just one decoding step (essentially they are encrypted with a publicly known password), doing so is clearly acting against the declared intention of the author and violating his copyright.
So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?
As long as you expect 100% accuracy for generic input, you should reconsider your architecture.
If the PDFs are all you have and a solution as effective is possible is OK, on the other hand, there are multiple possible libraries for you, iText, and PDFBox to name but two while there are more. Which is best for you depends on more factors, e.g. on whether you need some generic solution or all PDFs are created by a tool chain as above.
In any case you'll have to do some programming yourself, though, to fine-tune them for your use case.

iText PDF Text Extraction with fonts and styles

I am using iText to extract text from PDF to a String but I have encountered a problem
with some PDF. When I tried to extract text, the reader extract only blanks/destroyed text
on SOME pdfs.
Example of destroyed text:
"th isbe long to t he t est fo r extr act ion tex t"
What is the cause of this problem?
I am thinking of removing the fonts and change the font to a suitable one to be read by
the reader. I have tried researching about this, but what I found does not help me.

This is caused by the way text is stored in the PDF file. It just puts letters with information for rendering and location. The text extraction algorithm is smart in that it finds letters that seem to be close together and, if so, it puts them together. If they aren't that close, it puts in some space.
I can't tell you what to do about it, though.

Read Text and Image Locations (x.y coordinates) using PDFBox

I am doing a java program to read encrypted PDF files and extract the contents of the file page by page including the text, images and their positions(x,y coordinates) in the file. Now I'm using PDFBox for this purpose and I'm getting the text and images. But I couldn't get the text position and image position. Also there are some problems reading some encrypted PDF files.

Take a look at org.apache.pdfbox.examples.util.PrintTextLocations. I've used it quite a bit and it's very helpful to make analyses on the layout of elements and bounding boxes in PDF documents. It also revealed items printed in white ink, or outside the printable area (presumably document watermarks, or "forgotten" items pushed out of sight by the author).
Usage example:
java -cp app/target/pdfbox-app-1.5.0.jar org.apache.pdfbox.examples.util.PrintTextLocations ~/tmp/mydoc.pdf >~/tmp/out-text-locations.txt
You'll get something like that:
Processing page: 0
String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f
String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e
...
Which you can easily parse and use to plot element's position, bounding-box, and the "flow" (trajectory through all the elements), etc. for each page. As I'm sure you are already aware of, you'll find that PDF can be almost impossible to convert to text. It is really just a graphic description format (i.e. for the printer or the screen), not a markup language. You could easily make a PDF that prints "Hello world", but that jumps randomly through the character positions (and that uses different glyphs than any ISO char encoding, if you so choose), making the PDF very hard to convert to text. There is no notion of "word" or "paragraph". A two-column document, for example, can be a nightmare to parse into text.
For the second part of your question, I had good results using xpdf version 3.02, after fixing Xref.cc (make XRef::okToPrint(),XRef::okToChange(),XRef::okToCopy() and XRef::okToAddNotes() all return gTrue). That's to handle locked documents, not encrypted ones (there are other utils out there for that).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.