opencv find block of text areas / detect document layout

opencv find block of text areas / detect document layout - java

I have color image document with text and images and tables.
Document can have two columns.
Document is composite from areas: area header and text (bigger font, can have different font color and something like sub-header additional data).
This is exemplary image but real one can be color:
What i need to do.
I need find on image document this areas of text with headers.
What i need to know.
Method how to divide document to divide document on particular parts.
I try with opencv in java(if someone have python and c++ version i can convert it for java version by myself). I found few similar problem on stack overflow, but none of them can help me. You must know that my opencv knowledge is not very well and it is only from on-line tutorials and stack overflow.
Is there any fine solution on my problem in opencv way or i need use something else, different library or application to achieve this?
One and only requirement is that it must be done from command line.
If i had this areas i can do what i need next, but this is step which stops me.

have you solved the problem?
I'm working on a similar problem.
My solution is to use HoughLines https://docs.opencv.org/3.4.0/d9/db0/tutorial_hough_lines.html

You can use text detection combined with dilation to detect bold text i.e. headers and then group the text boxes between two consecutive headers as the text under first header.

Related

Extract "Graphic" Line positions with PDFBox

I am currently working on a little java application to transform some PDF bound data and I am using PDFBox for this. The pdf itself is very simple and just contains some headers and a table which seperates rows with a line. I am trying to find the coordinates of this line so that I can dynamically extract by area as some rows vary with their height. I have not really found any information on this during my search as almost all results deal with "text lines" and not actual lines. Is this even possible with PDFBox or will I have to look for another pdf library.
Any information would be greatly appreciated.

Why does JavaFX add extra spacing between letters when using the Text component and how do I fix it?

I'm trying to use the text component of JavaFX to do some nice headline typography in my application. How ever the letters in the text are not spaced evenly. For example in the word "visiting", the "iting" part seems disconnected from the first part.
In the sample image I'm using Arial but this kind of bad spacing happens with every font I tried.
This only happens when "gray" anti-aliasing is used (-fx-font-smoothing-type: gray;). One obvious solution would be to change -fx-font-smoothing-type to lcd, but that would result in the text having jagged edges.
The only thing remotely mentioning something like this is the jira issue RT-14187, but that seem to have been resolved in javafx 8 (jre 8).

Read PDF text and/or all content

I have a scenario where I need a Java app to be able to extract content from a PDF file in one of 2 modes: TEXT_ONLY or ALL. In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings. In all mode, all content (text, images, etc.) is read out of the file.
For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:
while(page.hasMoreText())
textList.add(page.nextTextChunk());
if(allMode)
while(page.hasMoreImages())
imageList.add(page.nextImage());
I know Apache Tika uses PDFBox under the hood, but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from PDFBox).
So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?

To expound some aspects of why #markStephens points you towards some resources giving some background on PDF.
In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings.
Your definition "visible" as if a human being was reading the PDF is not yet very well-defined:
Is text 1 pt in size visible? When zooming in, a human can read it; in standard magnification not, though. Which size would be the limit?
Is text in RGB (128, 129, 128) in a background of (128, 128, 128) visible? How different have the colors to be?
Is text displayed in some white noise pattern on a background of some other white noise pattern visible? How different have patterns to be?
Is text only partially on-screen visible? If yes, is one visible pixel enough? And what about some character 'I' in a giant size where the visible page area fits into the dot on the letter?
What about text covered by some annotation which can easily be moved, probably even by some automatically executed JavaScript code in the file?
What about text in some optional content group only visible when printing?
*...
I would expect most available PDF text parsing libraries to ignore all these circumstances and extract the text, at most respecting a crop box. In case of images with added, invisible OCR'ed text the extraction of that text in general is desired.
For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:
PDF (in general) does not know about paragraphs, just some groups of glyphs positioned somewhere on the page. Recognizing paragraphs is a task which cannot be guaranteed to work properly as there are heuristics at work. If, furthermore, you have multicolumn text with an irregular separation, maybe even some image in between (making it hard to decide whether there are two columns divided by the image or whether there is one column with an integrated image), you can count on recognition of the text flow let alone text elements like paragraphs, sections, etc. to fail miserably.
If your PDFs are either properly tagged or all generated by a tool chain for which patterns in the created PDF content streams betray text structures, you may be more lucky. In case of the latter, though, your solution would have to be custom-made for that tool chain.
but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from pdfBox).
There you point towards another point of interest: PDFs can be marked that text extraction is forbidden while they otherwise can be displayed by anyone. While technically PDFs marked like that can be handled just like documents without that mark with just one decoding step (essentially they are encrypted with a publicly known password), doing so is clearly acting against the declared intention of the author and violating his copyright.
So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?
As long as you expect 100% accuracy for generic input, you should reconsider your architecture.
If the PDFs are all you have and a solution as effective is possible is OK, on the other hand, there are multiple possible libraries for you, iText, and PDFBox to name but two while there are more. Which is best for you depends on more factors, e.g. on whether you need some generic solution or all PDFs are created by a tool chain as above.
In any case you'll have to do some programming yourself, though, to fine-tune them for your use case.

Detect background color of a website

I am trying to detect color of different elements in a webpage(saved on machine). Currently I am trying to write a code in python. The initial approach which I followed is:
find color word in html file in different tags using regular expressions.
try to read the hex value.
But this approach is very stupid. I am new to website design, can you please help me with this.

There can be multiple stylesheets, and many cascading styles. You don't know which elements visually end up being the "background" elements. I think if you're looking for something robust that will work on most webpages, you need to leverage a browsers rendering engine and focus on identifying what a user would see.
Consider using a web browser to render the page, taking a screen shot, and then doing image processing to find the most frequent color near the sides of the page. You can use a scriptable browser like phantomjs.
If you're new to programming, this approach is going to be wayyyyy over your head.

In java you can use JSOUP. Its quite good
Document doc = Jsoup.connect("http://YourPage.html").get();
Elements colors = doc.select("[bgcolor]");

I don't know anything about Java or Python, but could you have it parse the html code and look for something like 'background-color: < color >'?

Printing data to a pre printed form/stationery

We have a requirement where we already have pre printed stationery and want user to put data in a HTML form and be able to print data on that form. Alignment/text size etc are very important since the pre-printed stationery already has boxes for each character. What could be a good way to achieve this in java? I have thinking of using jasper reports. Any other options? May be overlay image with text or something?
Also we might need to capability to print on plain paper in which case the boxes needs to be printed by our application and the form should match after the printed with the already printed blank stationery containing data.
Do we have some open source framework to do such stuff?

Jaspersoft reports -- http://sourceforge.net/projects/jasperreports/
You will then create XML templates, then you will be able to produce a report in PDF, HTML, CSV, XLS, TXT, RTF, and more. It has all the necessary options to customize the report. Used it before and recommend it.
You will create the templates with iReport then write the code for the engine to pass the data in different possible ways.
check http://www.jaspersoft.com/jasperreports
Edit:
You can have background images and overlay the boxes over it and set a limit on the max character size ... and many more
It is very powerful and gives you plenty of options
Here is one of iReport's tutorial for a background image http://ireport-tutorial.blogspot.com/2008/12/background-image-in-ireport.html

The big problem when printing form content that has been filled in electronically, is aligning it correctly on the pre-printed form. You may get content to align for one printer, but when you use another it is completely misaligned.
Fly Software have a form design product called InForm Designer that gets around the problem nicely by allowing users to specify and save vertical and horizontal offsets for printers. This ensures filled in form content is always aligned. I've tried it and it works perfectly. Take a look for yourself here...
http://www.flysoftware.com/products/inform_designer/overview.asp
It might be worth implementing a printer offset similar to InForm's in your own application (if possible).

Some things to think about.
First in terms of the web page, do you want use the stationery as the form layout?
Does it have to be exact?
Combed boxes (one for each character)
Do you want to show it like that on the web page, or deal with the combing later.
How are you going to deal with say a combed 6 digit number. Is this right aligned. What if they enter 7 digits. Same for text. what if it won't fit.
Font choices, we had a lot of fun with W...
How aligned do you want the character within the box, what font limitations does that imply, some of the auto magic software we looked at did crap like change the size of each character.
Combed editing is a nightmare, we display combed, but raise an edit surface the size of the full box on selection.
Another thing that might drive you barking mad, you find find small differences in the size and layout of the boxes, so they look okay from a distance but a column of boxes sort of shifts about by a pixel. Some of testing guys had to lend us their electron microscopes, so we could see how many ink molecules we were out by. :(
Expect to spend a lot of time in the UI side of things, and remember printed stationery changes, so giving yourself some sort of meta description of the form to start with will save you loads of trouble later on.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.