How to determine coordinates of multi-line text in PDF

I am using Apache PDFBox 2.0 to parse a PDF file. Using some fixed strings, I was able to create a system based on:
A fixed text, as a starting point
The next cell/text position, or null
The bottom area, to determine the height of the rectangle.
Using the starting point, I compute the x and y coordinates.
Using the "next" text block (which is another fixed value, for example a field or a table header), I determine the width of the desired region, using the formula:
width = second.x - first.x
or something similar. So in a table, for example, when the header names are known in advance, it is easy to detect the columns.
What I am trying to do (and have so far failed to do accurately) is to determine the lines in a PDF table. This table sometimes has missing values in some columns and also multi-line values in some rows/columns. I have extended my "system" (first, next, bottom) to work dynamically with table rows, and this works great for "normalized" tables (e.g. no whitespace and, at least, no multi-line values). It does not work with real-world data, because so far I have not found a way to determine the location (x, y, width, height) of a multi-line value.
Is this even possible with PDFBox? Some people suggested converting the PDF to HTML first and then parsing the HTML instead. Is this a viable option? Has anyone worked with this library? I will try to use this next.
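For illustration, here is a minimal sketch of the region arithmetic described above. It assumes the anchors and the text fragments are already available as rectangles with y growing downward (with PDFBox these would typically be built from the TextPosition values collected in a PDFTextStripper subclass). The class and method names are made up for this example and are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

/** Illustrative helper for the "first, next, bottom" idea; not a PDFBox API. */
class AnchorRegion {

    /**
     * Region starting at the first anchor, ending at the left edge of the
     * next anchor (or at pageWidth when next is null) and at the top edge
     * of the bottom anchor.
     */
    static Rectangle2D between(Rectangle2D first, Rectangle2D next,
                               Rectangle2D bottom, double pageWidth) {
        double right = (next != null) ? next.getX() : pageWidth;
        double width = right - first.getX();      // width = second.x - first.x
        double height = bottom.getY() - first.getY();
        return new Rectangle2D.Double(first.getX(), first.getY(), width, height);
    }

    /**
     * Union of all fragments whose center lies inside the region, or null if
     * there is none. A multi-line value consists of several fragments, so the
     * union gives its x, y, width and height. The caller is expected to leave
     * the anchor texts themselves out of the fragment list.
     */
    static Rectangle2D boundsOfTextIn(Rectangle2D region, List<Rectangle2D> fragments) {
        Rectangle2D result = null;
        for (Rectangle2D fragment : fragments) {
            if (!region.contains(fragment.getCenterX(), fragment.getCenterY())) {
                continue;
            }
            result = (result == null) ? fragment.getBounds2D() : result.createUnion(fragment);
        }
        return result;
    }
}
```

Testing the fragment center rather than the whole fragment is a design choice of this sketch: it tolerates text that slightly overshoots a column edge.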

Like I said in my previous comments, I have found a partial solution for my issue. This is based on two things:
First, I assume that one column for each table contains only distinct values which never occupy more than 1 row.
Next, since I also have some fixed texts in the document, I determine the coordinates of these texts and use them as delimiters of the area that contains the text I want to extract. For example, the "current, next, bottom" system (as I call it) can contain: "Column name A", "Column name B", "Fixed text C" (or the second row of the same table, determined from the unique single-row values).
It is not perfect, and problems may occur if a fixed text appears more than once in the document. Of course, improvements can be made by filtering for the correct occurrence using the vertical coordinates and so on, but for the moment I will close this question, as it seems that this problem has no standard answer and currently there is no open source library able to extract tabular data from PDFs.
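A rough sketch of the row-delimiting idea, assuming the y coordinate of every value in the single-line "key" column is known, along with the y coordinate of the fixed text below the table. Each key value opens a row band, and every text line of another column is assigned to the band it falls into, so wrapped lines end up in the same cell. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper: splits a column into rows using a single-line key column. */
class RowBands {

    /**
     * Index of the row band containing y, or -1 if y lies outside the table.
     * rowTops holds the top y of each key-column value in ascending order.
     */
    static int rowIndexFor(double y, double[] rowTops, double tableBottom) {
        if (rowTops.length == 0 || y < rowTops[0] || y >= tableBottom) {
            return -1;
        }
        int row = 0;
        while (row + 1 < rowTops.length && y >= rowTops[row + 1]) {
            row++;
        }
        return row;
    }

    /** Joins the text lines of one column into one string per row (empty if no text). */
    static List<String> joinByRow(double[] lineYs, String[] lineTexts,
                                  double[] rowTops, double tableBottom) {
        List<StringBuilder> cells = new ArrayList<>();
        for (int i = 0; i < rowTops.length; i++) {
            cells.add(new StringBuilder());
        }
        for (int i = 0; i < lineYs.length; i++) {
            int row = rowIndexFor(lineYs[i], rowTops, tableBottom);
            if (row < 0) {
                continue; // text above or below the table
            }
            StringBuilder cell = cells.get(row);
            if (cell.length() > 0) {
                cell.append(' ');
            }
            cell.append(lineTexts[i]);
        }
        List<String> result = new ArrayList<>();
        for (StringBuilder cell : cells) {
            result.add(cell.toString());
        }
        return result;
    }
}
```

In real documents the first line of a wrapped cell may sit a fraction of a point above the key value, so a small tolerance on the comparison is probably needed.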

Related

JTextArea: how to translate row, column to offset? [duplicate]

Closed 10 years ago.
Possible Duplicate:
Rows in JTextArea
In JTextArea, operations such as select, highlight and so on seem all to depend on offset from beginning of text. In an app that displays line-oriented text, I need to select, highlight (based on info from elsewhere, not caret) based on row and column.
Is there some functionality built-in, or in some helper class, to get offset from row,col? I realize I could maintain separate data on line-start offsets etc and calculate row,col-->offset, but surely JTextArea (or its model) already knows this in order to display the text, so I'm persuaded there must already be a way to do this.
I did see examples that use something like this one, using textarea.viewToModel(new Point(x,y));, where x and y were purportedly row,col, but so far as I can see, x, and y are pixel coordinates, not row,col... so not sure what to make of that.
Clues? Thanks!
Edited: So the question has been closed under the impression of five commenters that this is a duplicate of other questions, which it is not. I did not ask about how to convert offset to row, col, nor about how to convert screen pixel coordinates to offset, which are the subjects covered in the other articles.
In case someone else stumbles in here looking for the answer to what I actually did ask, I've now discovered that it's as follows.
JTextArea has the functionality I expected to find, but evidently overlooked on earlier browsing: getLineStartOffset(int line) which will give the offset-from-start-of-text for a particular line (row) of text. To this one can easily add the char-posn-in-line, and thus arrive at the offset of a particular character.
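A small sketch of that calculation. The helper class is hypothetical; getLineStartOffset and getLineEndOffset are JTextArea methods. Rows and columns are zero-based here, and a "line" means a newline-delimited line of the document, not a visually wrapped row.

```java
import javax.swing.JTextArea;
import javax.swing.text.BadLocationException;

/** Illustrative helper: converts a zero-based (row, col) pair to a text offset. */
class RowColOffset {

    static int toOffset(JTextArea area, int row, int col) {
        try {
            int start = area.getLineStartOffset(row);
            int end = area.getLineEndOffset(row);
            if (col < 0 || start + col > end) {
                throw new IllegalArgumentException("Column out of range: " + col);
            }
            return start + col;
        } catch (BadLocationException e) {
            // Thrown by JTextArea when the row does not exist.
            throw new IllegalArgumentException("No such row: " + row, e);
        }
    }
}
```

The result can then be passed to methods such as select(start, end) or used with a Highlighter.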

A textArea contains a stream of text. There's no magic method that will locate a row/column in your text model, since that can be ambiguous if your text contains variable-width characters or different font sizes. You must maintain the data necessary to map from your idea of what a particular (row,column) means with regard to your data.

Here is a simple example of rectangular fragment selection: http://java-sl.com/tip_vertical_selection.html
You can use javax.swing.text.Utilities.getRowStart()/getRowEnd() methods. First find start row offset for the row number. Then just add col number to get the offset.

open office api text size

I am using open office API with Java UNO. I need to get size of selected text in the document content (for example embedded pictures have own size in mm but text inserted via XText.insertString(...) method doesn't have any size).
In other words: I want to get size (preferably in mm) of the box which surrounds part of text (it can be whole paragraph or selected text via some type of cursor). Is there any possibility to achieve that?

After searching, I think there is currently no way to achieve this. For my purposes I wrote a small method that returns the height of a paragraph in 1/100 mm.
Here is how this method works:
Get the XTextViewCursor from the XTextDocument controller, for moving left/right.
Go to the paragraph to measure.
Loop through the paragraph, getting each character. For each character: check its height (CharHeight property of the paragraph); get the XLineCursor from the XTextViewCursor and check whether it is at the end of the line. If it is, add the largest character height in that line to the result.
This is a temporary solution (I am still waiting for something better) and it has a number of bugs (for example, line spacing other than single is not handled, and the paragraph should contain only text), but maybe it will be helpful for someone.
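The summing step can be sketched without the UNO calls. This assumes the character heights have already been collected line by line, and that CharHeight values are in points, so they are converted to 1/100 mm at the end. The class name and the array layout are my own. It shares the limitations listed above, since it ignores line spacing.

```java
/** Illustrative helper: sums the tallest character of each line of a paragraph. */
class ParagraphHeight {

    /** One point is 1/72 inch and one inch is 2540 hundredths of a millimetre. */
    static final double HUNDREDTH_MM_PER_POINT = 2540.0 / 72.0;

    /**
     * charHeightsByLine[i] holds the character heights (in points) found on
     * line i of the paragraph. Returns the estimated height in 1/100 mm.
     */
    static long heightInHundredthMm(float[][] charHeightsByLine) {
        double totalPoints = 0;
        for (float[] line : charHeightsByLine) {
            float tallest = 0;
            for (float height : line) {
                tallest = Math.max(tallest, height);
            }
            totalPoints += tallest; // the tallest character decides the line height
        }
        return Math.round(totalPoints * HUNDREDTH_MM_PER_POINT);
    }
}
```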

How can I present multiline data in grid like component in Java and AWT

I have a few records of data (fewer than 10). Each record consists of a few lines of text.
I want to present the records to the user in a kind of grid, where the user can select one of the records.
I was thinking about the List component or JTable, but I couldn't make them display more than one line of text. What component should I use then, or how should I approach this?
I mentioned AWT in the subject because size does matter, i.e. I want to use this functionality in an applet and would like to avoid any extra libraries.
Thanks in advance

Thanks to maksimov's link I found examples of how to tackle this issue, and also very interesting link I missed somehow - http://docs.oracle.com/javase/tutorial/uiswing/components/html.html
To specify that a component's text has HTML formatting, just put the
<html> tag at the beginning of the text, then use any valid HTML in
the remainder. Here is an example of using HTML in a button's text:
button = new JButton("<html><b><u>T</u>wo</b><br>lines</html>");
In my case it was enough to set the height of the row and add the <html> tag just before the string data to be displayed. HTML tagging also lets me use extra formatting, colors, etc.
Brilliant,
Thank you maksimov
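For plain multi-line records, the conversion could look roughly like this. It is a sketch with a hypothetical helper name: it wraps the text in <html>, turns newlines into <br>, and escapes the characters that would otherwise be read as markup.

```java
/** Illustrative helper: turns plain multi-line text into Swing HTML label text. */
class HtmlCellText {

    static String toHtml(String text) {
        StringBuilder sb = new StringBuilder("<html>");
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&':  sb.append("&amp;"); break;
                case '<':  sb.append("&lt;");  break;
                case '>':  sb.append("&gt;");  break;
                case '\n': sb.append("<br>");  break; // one visual line per text line
                default:   sb.append(c);
            }
        }
        return sb.append("</html>").toString();
    }
}
```

The resulting string can be used as the cell value of a JTable or as the text of a JLabel; the row height still has to be set large enough to show all lines.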

Printing data to a pre printed form/stationery

We have a requirement where we already have pre-printed stationery and want the user to enter data in an HTML form and then print that data onto the stationery. Alignment, text size, etc. are very important, since the pre-printed stationery already has a box for each character. What would be a good way to achieve this in Java? I have been thinking of using JasperReports. Are there any other options? Maybe overlaying an image with text, or something similar?
We might also need the capability to print on plain paper, in which case our application needs to print the boxes as well, and the result should match the pre-printed stationery once it is filled in.
Is there an open source framework for this kind of thing?

Jaspersoft reports -- http://sourceforge.net/projects/jasperreports/
You will then create XML templates, then you will be able to produce a report in PDF, HTML, CSV, XLS, TXT, RTF, and more. It has all the necessary options to customize the report. Used it before and recommend it.
You will create the templates with iReport then write the code for the engine to pass the data in different possible ways.
check http://www.jaspersoft.com/jasperreports
Edit:
You can have background images and overlay the boxes over it and set a limit on the max character size ... and many more
It is very powerful and gives you plenty of options
Here is one of iReport's tutorial for a background image http://ireport-tutorial.blogspot.com/2008/12/background-image-in-ireport.html

The big problem when printing form content that has been filled in electronically, is aligning it correctly on the pre-printed form. You may get content to align for one printer, but when you use another it is completely misaligned.
Fly Software have a form design product called InForm Designer that gets around the problem nicely by allowing users to specify and save vertical and horizontal offsets for printers. This ensures filled in form content is always aligned. I've tried it and it works perfectly. Take a look for yourself here...
http://www.flysoftware.com/products/inform_designer/overview.asp
It might be worth implementing a printer offset similar to InForm's in your own application (if possible).
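As a rough idea of what such a per-printer offset could look like in your own code (this is my own sketch, not how InForm Designer implements it): store a horizontal and vertical correction in millimetres for each printer name, and apply it when converting field positions to points (1/72 inch) for drawing.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative helper: per-printer alignment offsets for pre-printed forms. */
class PrinterOffsets {

    static final double POINTS_PER_MM = 72.0 / 25.4;

    // printer name -> {dx, dy} in millimetres
    private final Map<String, double[]> offsetsMm = new HashMap<>();

    /** Saves the correction measured for one printer, e.g. from a test print. */
    void save(String printerName, double dxMm, double dyMm) {
        offsetsMm.put(printerName, new double[] {dxMm, dyMm});
    }

    /** Field position in mm -> corrected position in points; unknown printers get no offset. */
    double[] toPagePoints(String printerName, double xMm, double yMm) {
        double[] offset = offsetsMm.getOrDefault(printerName, new double[] {0, 0});
        return new double[] {
            (xMm + offset[0]) * POINTS_PER_MM,
            (yMm + offset[1]) * POINTS_PER_MM
        };
    }
}
```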

Some things to think about.
First in terms of the web page, do you want use the stationery as the form layout?
Does it have to be exact?
Combed boxes (one for each character)
Do you want to show it like that on the web page, or deal with the combing later?
How are you going to deal with, say, a combed 6-digit number? Is it right-aligned? What if they enter 7 digits? The same goes for text: what if it won't fit?
Font choices, we had a lot of fun with W...
How well aligned do you want each character within its box, and what font limitations does that imply? Some of the auto-magic software we looked at did crap like changing the size of each character.
Combed editing is a nightmare. We display combed, but raise an edit surface the size of the full box on selection.
Another thing that might drive you barking mad: you may find small differences in the size and layout of the boxes, so they look okay from a distance but a column of boxes sort of shifts about by a pixel. Some of the testing guys had to lend us their electron microscopes, so we could see how many ink molecules we were out by. :(
Expect to spend a lot of time in the UI side of things, and remember printed stationery changes, so giving yourself some sort of meta description of the form to start with will save you loads of trouble later on.
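To make the combing questions concrete, here is one possible policy as a small sketch (the names and the policy itself are assumptions): one character per box, numbers right-aligned, text left-aligned, and anything that does not fit rejected rather than shrunk or truncated.

```java
import java.util.Arrays;

/** Illustrative helper: distributes a value over a fixed number of comb boxes. */
class CombField {

    /**
     * Returns one character per box, padded with spaces. Values longer than
     * the number of boxes are rejected.
     */
    static char[] fill(String value, int boxes, boolean rightAlign) {
        if (value.length() > boxes) {
            throw new IllegalArgumentException(
                "Value needs " + value.length() + " boxes but only " + boxes + " exist");
        }
        char[] cells = new char[boxes];
        Arrays.fill(cells, ' ');
        int start = rightAlign ? boxes - value.length() : 0;
        value.getChars(0, value.length(), cells, start);
        return cells;
    }
}
```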

How to implement jtable with variable row-height

None of the answers to two previous questions (here and here) resolve my problem.
I have a multi-column JTable in which I want to display the string content of some columns over more than one line within the cell, based on newline characters ("\n") within the string. The number of newlines per string is arbitrary and known only at run time. Only the affected row must be adjusted, across all columns, to the new height. There may be a different number of lines per affected column, and the row height needs to be set to the maximum of these across the columns.
How do I do this? If possible, some sample code would be very much appreciated. TIA

If I got you right, I think you need a MultilineCellRenderer. There are already plenty of examples around. Normally they are based on a JTextArea to get the line wrap functionality.
I haven't used it myself yet, but here is an example, which looks kinda good at first view:
MultilineCellRenderer
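The renderer only takes care of drawing; the row height has to be set separately. A minimal sketch of that part, assuming a fixed pixel height per text line is good enough (the helper names are made up; setRowHeight(int, int) is a JTable method): count the newlines in every cell of a row and size the row for the cell with the most lines.

```java
import javax.swing.JTable;

/** Illustrative helper: sizes each row for its cell with the most text lines. */
class RowHeights {

    /** Number of text lines in a cell value; null and empty values count as one line. */
    static int lineCount(Object value) {
        if (value == null) {
            return 1;
        }
        String text = value.toString();
        int lines = 1;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                lines++;
            }
        }
        return lines;
    }

    /** Sets every row to (max line count across its columns) * lineHeight pixels. */
    static void fitRowHeights(JTable table, int lineHeight) {
        for (int row = 0; row < table.getRowCount(); row++) {
            int maxLines = 1;
            for (int col = 0; col < table.getColumnCount(); col++) {
                maxLines = Math.max(maxLines, lineCount(table.getValueAt(row, col)));
            }
            table.setRowHeight(row, maxLines * lineHeight);
        }
    }
}
```

The line height could be taken from the FontMetrics of the table's font; the method would need to be called again whenever the data changes.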
