Extract "Graphic" Line positions with PDFBox

Extract "Graphic" Line positions with PDFBox - java

I am currently working on a little java application to transform some PDF bound data and I am using PDFBox for this. The pdf itself is very simple and just contains some headers and a table which seperates rows with a line. I am trying to find the coordinates of this line so that I can dynamically extract by area as some rows vary with their height. I have not really found any information on this during my search as almost all results deal with "text lines" and not actual lines. Is this even possible with PDFBox or will I have to look for another pdf library.
Any information would be greatly appreciated.

Related

opencv find block of text areas / detect document layout

I have color image document with text and images and tables.
Document can have two columns.
Document is composite from areas: area header and text (bigger font, can have different font color and something like sub-header additional data).
This is exemplary image but real one can be color:
What i need to do.
I need find on image document this areas of text with headers.
What i need to know.
Method how to divide document to divide document on particular parts.
I try with opencv in java(if someone have python and c++ version i can convert it for java version by myself). I found few similar problem on stack overflow, but none of them can help me. You must know that my opencv knowledge is not very well and it is only from on-line tutorials and stack overflow.
Is there any fine solution on my problem in opencv way or i need use something else, different library or application to achieve this?
One and only requirement is that it must be done from command line.
If i had this areas i can do what i need next, but this is step which stops me.

have you solved the problem?
I'm working on a similar problem.
My solution is to use HoughLines https://docs.opencv.org/3.4.0/d9/db0/tutorial_hough_lines.html

You can use text detection combined with dilation to detect bold text i.e. headers and then group the text boxes between two consecutive headers as the text under first header.

How to add a row in a itext PdfPTable dynamically?

I have to create a pdf using itext which will contain a button, when clicked should add a row in an existing PdfPTable. I wrote some code to create a PushbuttonField. While trying to set action I can only find PdfAction.javaScript. I am not able to figure out how to add a row in a table. I tried searching online but all I could find is PdfAction.javaScript
Any help would be greatly appreciated. Thank you.

When you create a PDF file, you draw text, lines and shapes to a canvas. That is also what happens when you add a PdfPTable to a Document. If you look at the syntax of the PDF page, you won't recognize a table. You'll find text (the content of the cells), lines (the borders), and shapes (the backgrounds), but you won't find a table. If the table is distributed over different pages, the "table" on one page won't know that it is related to the "table" on the other page.
Sure, you can add semantic structure to the document by introducing marked content, and by creating a structure tree, but that mechanism which we call Tagged PDF can't be used to make the PDF "editable" the same way a Word document is editable. Tagged PDF is (among others) used to allow assistive technology to present the content to the visually impaired (e.g. in the context of PDF/UA). The presence of structure doesn't change the fact that all text, all lines, and all shapes are added at absolute positions.
This is very different from HTML where the position on a page of a <table>, <tr>, <th>, or <td> is calculated at the moment the page is rendered. In HTML this position can even change when you resize the browser window.
There is no such thing in PDF (except if you use XFA (*), a technology that is deprecated since ISO 32000-2). All content on a page has a fixed position, hardcoded into the page's content stream. Changing the size of the PDF viewer window won't change anything to the position of the page content.
Because of all of this, your question is invalid. It is impossible to create a button in PDF that adds a row to a table, because:
In many cases there is no table: there is just a bunch of text, lines, and shapes at absolute positions,
Even if there is the notion of a table (using Tagged PDF): the visual represenation of that table is fixed at creation time, it can't be changed at consumption time.
You want to use an ordinary PDF viewer as if it were a PDF editor. That is impossible for all the reasons listed above.
(*) XFA was deprecated for different reasons. One of the most important reasons it is the lack of support for XFA. There aren't many viewers that support XFA. If you would post a follow-up question asking *"How can I create an XFA document?", the answer would be: "Don't do this!" Creating XFA is extremely complex, and once you've succeeded in creating an XFA form, you'll discover that many of your customers won't be able to consume the file because their viewer doesn't support the format.

Itext get special letters from pdf

I am trying to extract accented words from pdf e book . The best results are produced when using itext library , but I fail to get accents from words .
example :
побеђивање -should come out as- побеђи́ва̄ње (accents are missing)
The letters are Cyrillic Serbian .
I tried many of the ocr solutions but they all give bad results . Is there a way for me to extract all of this pdf data the way they are in the pdf using itext. I know that this has a lot to do with the way pdf works and that this is a hard thing to get , but again I realy need this , the alternative is to retype all of the data.
The pdf file pdf example file

The sample document actually contains one big image, a scanned page, and invisible text information on top of the scanned printed letters. Most likely this text information is the result of some OCR process.
Unfortunately already this text information is missing the accents in question. E.g. the text for the first entry
is added as
(\340\361\362\340\353\367\355)Tj 0 Tc (\236)Tj
...
As you can see, the same letter \340 is used at position 1 and 4 while according to the scanned page one of the matching printed letters has an accent and one not.
This happens throughout the whole page.
Thus, any attempt at regular text extraction will fail to return the accents in question. The only chance you have is to use OCR.
You say you
tried many of the ocr solutions but they all give bad results
Probably you applied the OCR applications to the PDF or a rendered version of it. I would suggest you instead extract the scanned images; this way you get all the quality there is. iText can help you with image extraction.

How to get given paragraph content of pdf file using iText library?

Is there any way to get number of paragraphs or content of given paragraph in pdf file using iText library ?. I saw some classes like Paragraph, Chunk in some code to create new pdf file but I can not find any way to get these classes in reading file. Every idea is appreciated

Is the PDF you're talking about a Tagged PDF? If not, you are making the wrong assumptions about PDF. In a PDF, content is drawn on a page. For instance: an iText PdfPTable is converted into text state operators that draw snippets of text to a canvas, as well as graphics state operators that draw paths and shapes. If the PDF isn't tagged, the lines don't know that they are borders of a table; a word doesn't know to which cell it belongs.
The same goes for paragraphs: a snippet of text doesn't know whether it belongs to a sentence, to a paragraph, to a title line,...
Due to the very nature of PDF, what you're looking for may be impossible (using iText or any other software product), or may require heuristics (artificial intelligence) to examine all text state operators and the semantics of the content to get a result that mimics how humans would interpret text.
It's very easy to achieve if your PDF is tagged correctly. See the ParseTaggedPdf example.

Read Text and Image Locations (x.y coordinates) using PDFBox

I am doing a java program to read encrypted PDF files and extract the contents of the file page by page including the text, images and their positions(x,y coordinates) in the file. Now I'm using PDFBox for this purpose and I'm getting the text and images. But I couldn't get the text position and image position. Also there are some problems reading some encrypted PDF files.

Take a look at org.apache.pdfbox.examples.util.PrintTextLocations. I've used it quite a bit and it's very helpful to make analyses on the layout of elements and bounding boxes in PDF documents. It also revealed items printed in white ink, or outside the printable area (presumably document watermarks, or "forgotten" items pushed out of sight by the author).
Usage example:
java -cp app/target/pdfbox-app-1.5.0.jar org.apache.pdfbox.examples.util.PrintTextLocations ~/tmp/mydoc.pdf >~/tmp/out-text-locations.txt
You'll get something like that:
Processing page: 0
String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f
String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e
...
Which you can easily parse and use to plot element's position, bounding-box, and the "flow" (trajectory through all the elements), etc. for each page. As I'm sure you are already aware of, you'll find that PDF can be almost impossible to convert to text. It is really just a graphic description format (i.e. for the printer or the screen), not a markup language. You could easily make a PDF that prints "Hello world", but that jumps randomly through the character positions (and that uses different glyphs than any ISO char encoding, if you so choose), making the PDF very hard to convert to text. There is no notion of "word" or "paragraph". A two-column document, for example, can be a nightmare to parse into text.
For the second part of your question, I had good results using xpdf version 3.02, after fixing Xref.cc (make XRef::okToPrint(),XRef::okToChange(),XRef::okToCopy() and XRef::okToAddNotes() all return gTrue). That's to handle locked documents, not encrypted ones (there are other utils out there for that).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.