I'm extracting text from an image, but when I try to process a form, the program fails at character extraction because of the form boundaries. How can I extract characters from a form that contains boundaries?
Recognize the lines in the form, collect their positions in an array, and then write out a copy of the image (with ImageIO.write) that skips the pixels at those positions before running OCR.
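A minimal sketch of that idea, using only the JDK's BufferedImage: rows and columns that are mostly dark are assumed to be form borders and are painted white. The 0.8 threshold and the brightness cutoff are assumptions you would tune for your scans.

```java
import java.awt.image.BufferedImage;

public class LineRemover {
    // Paints white over any row or column whose dark-pixel ratio exceeds
    // the threshold -- a crude way to erase long form borders before OCR.
    static void removeLines(BufferedImage img, double threshold) {
        int w = img.getWidth(), h = img.getHeight();
        // Erase long horizontal lines (rows that are mostly dark).
        for (int y = 0; y < h; y++) {
            int dark = 0;
            for (int x = 0; x < w; x++) if (isDark(img.getRGB(x, y))) dark++;
            if ((double) dark / w > threshold)
                for (int x = 0; x < w; x++) img.setRGB(x, y, 0xFFFFFFFF);
        }
        // Erase long vertical lines (columns that are mostly dark).
        for (int x = 0; x < w; x++) {
            int dark = 0;
            for (int y = 0; y < h; y++) if (isDark(img.getRGB(x, y))) dark++;
            if ((double) dark / h > threshold)
                for (int y = 0; y < h; y++) img.setRGB(x, y, 0xFFFFFFFF);
        }
    }

    static boolean isDark(int rgb) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (r + g + b) / 3 < 128;
    }

    public static void main(String[] args) {
        // Synthetic 100x100 white image with one black horizontal line at y=50.
        BufferedImage img = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 100; y++)
            for (int x = 0; x < 100; x++) img.setRGB(x, y, 0xFFFFFFFF);
        for (int x = 0; x < 100; x++) img.setRGB(x, 50, 0xFF000000);

        removeLines(img, 0.8);
        System.out.println(isDark(img.getRGB(50, 50)));  // line erased -> prints false
    }
}
```

The cleaned image can then be saved with ImageIO.write and handed to the OCR engine. Real forms usually need a more robust detector (e.g. Hough transform in an image library), so treat this as a starting point.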
I am using the Tesseract Java API (tess4J) to convert TIFF files to readable PDFs.
When I have a single source TIFF file, the results have been quite pleasing:
TessResultRenderer renderer = TessAPI1.TessPDFRendererCreate("outpath/my_new_pdf.pdf", dataPath, 0);
int result = TessAPI1.TessBaseAPIProcessPages(handle, sourceTiffFile.getAbsolutePath(), null, 0, renderer);
However, the API documentation states that you should be able to supply a list of files as well as a single file: "Recognizes all the pages in the named file, as a multi-page tiff or list of filenames, or single image..."
This would be very handy, as I would like to pass in several TIFFs to produce a multi-page PDF, one page per image, but I haven't yet worked out how to pass in a list of images. The obvious first attempt was to pass in a comma-separated list of absolute file paths to the TIFFs (where the example above passes sourceTiffFile.getAbsolutePath()), but the result is a very small, apparently corrupt PDF file.
Any suggestions would be most welcome.
Try a file list with each entry on a separate line (i.e., delimited by the \n character).
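A small sketch of building such a file list with the JDK alone. The TIFF paths here are placeholders; the commented TessBaseAPIProcessPages call shows where the list file's path would replace the single-file path from the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class FileListBuilder {
    // Writes one absolute TIFF path per line -- the newline-delimited
    // "list of filenames" format that Tesseract expects.
    static Path writeFileList(List<String> tiffPaths, Path listFile) throws IOException {
        return Files.write(listFile, tiffPaths);
    }

    public static void main(String[] args) throws IOException {
        Path list = writeFileList(
            List.of("/scans/page1.tif", "/scans/page2.tif", "/scans/page3.tif"),
            Files.createTempFile("tiff-list", ".txt"));

        // Then pass list.toString() where the question passed
        // sourceTiffFile.getAbsolutePath():
        // TessAPI1.TessBaseAPIProcessPages(handle, list.toString(), null, 0, renderer);
        System.out.println(Files.readString(list));
    }
}
```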
I have an image of a form which contains different fields like name, number, address, etc. I want to recognize the data in these fields and save it to a database. My OCR is working fine, but I don't know how to extract the specific field data (name, address) from the image before running OCR. Simply put: I want to know how to tell whether the characters in the output files come from the name field, the address field, or some other field.
Since you know the exact areas of the form where the different fields are, you can use an image manipulation library to crop the image and send only specific regions to the OCR engine.
Check this SO question.
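A minimal sketch of the cropping approach with BufferedImage.getSubimage from the JDK. The field coordinates here are hypothetical; you would measure them once from your fixed form layout.

```java
import java.awt.image.BufferedImage;

public class FieldCropper {
    // Hypothetical field regions for a fixed form layout: {x, y, width, height}.
    static final int[] NAME_REGION    = { 0,  0, 60, 20 };
    static final int[] ADDRESS_REGION = { 0, 30, 60, 40 };

    static BufferedImage crop(BufferedImage form, int[] r) {
        return form.getSubimage(r[0], r[1], r[2], r[3]);
    }

    public static void main(String[] args) {
        BufferedImage form = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        BufferedImage name = crop(form, NAME_REGION);
        BufferedImage addr = crop(form, ADDRESS_REGION);
        // Each sub-image can now be sent separately to the OCR engine,
        // so you know which field the recognized text belongs to.
        System.out.println(name.getWidth() + "x" + name.getHeight());  // 60x20
        System.out.println(addr.getWidth() + "x" + addr.getHeight());  // 60x40
    }
}
```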
You have two solutions for getting the data you want: either use @osiris's solution, or add a text mining layer.
First solution: cut the image into pieces (the pieces that contain the needed data).
For example, cut the image into two pieces, one containing the name and the other containing the address, by cropping the original image based on the fields' positions (x and y). For that you need an image library to manipulate the original image.
The second solution is to add a text mining layer, without doing any cropping.
In this solution you use models that detect names and addresses (e.g. duckling.ai). You can train your own model, or you can use a chatbot engine and train it to detect names and addresses as entities (recast.ai or Rasa, for example).
We have a Java application which generates PDF forms with two-dimensional bar codes in the footer. The bar codes contain the company's form ID and the customer ID.
Now we are looking for a way to test the generated PDFs automatically, without a physical scanner. That is, we would like to read the PDF via a Java API, render it, capture the bar codes internally, and read their content. That is the rough idea.
The question is whether this approach is achievable. Does anybody know of Java APIs that can do such things? e.g.
reading a PDF and rendering its content as an image in the background?
reading a bar code image and extracting its content?
processing an image and extracting its text content (OCR)?
How can I convert an ASCII print file (a text file with line feed and form feed control characters) into a PDF document, with the pre-printed stationery as a template or background image? How can this be done in Java?
You can create a PDF in either of two ways:
In Java code using iText
By creating an FO using something like Velocity (mapping your text data into a template) and running it through an FO transformer to create a PDF.
That gets you the PDF. You can print it either by opening it in Adobe Reader and printing from there, or by sending it to a printer using the Java print API.
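Whichever rendering route you take, you first need to split the print stream into pages. A minimal sketch, assuming pages are separated by the form-feed control character (\f); each page's lines would then be drawn over the stationery template:

```java
import java.util.List;

public class PrintFilePager {
    // Splits an ASCII print stream into pages on the form-feed (\f)
    // control character; line feeds within a page are kept as-is.
    static List<String> splitPages(String printFile) {
        return List.of(printFile.split("\f", -1));
    }

    public static void main(String[] args) {
        String data = "Invoice 1\nTotal: 10\fInvoice 2\nTotal: 20";
        List<String> pages = splitPages(data);
        System.out.println(pages.size());  // 2
        // Each entry in `pages` becomes one PDF page rendered onto the
        // background template (e.g. via iText's page events).
    }
}
```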
I am writing a Java program to read encrypted PDF files and extract their contents page by page, including the text, the images, and their positions (x, y coordinates) in the file. I am using PDFBox for this, and I am getting the text and images, but I cannot get the text and image positions. There are also problems reading some encrypted PDF files.
Take a look at org.apache.pdfbox.examples.util.PrintTextLocations. I've used it quite a bit and it's very helpful to make analyses on the layout of elements and bounding boxes in PDF documents. It also revealed items printed in white ink, or outside the printable area (presumably document watermarks, or "forgotten" items pushed out of sight by the author).
Usage example:
java -cp app/target/pdfbox-app-1.5.0.jar org.apache.pdfbox.examples.util.PrintTextLocations ~/tmp/mydoc.pdf >~/tmp/out-text-locations.txt
You'll get something like this:
Processing page: 0
String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f
String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e
...
You can easily parse this output and use it to plot each element's position and bounding box, the "flow" (the trajectory through all the elements), and so on, for each page. As I'm sure you are already aware, PDF can be almost impossible to convert to text. It is really just a graphic description format (i.e. for the printer or the screen), not a markup language. You could easily make a PDF that prints "Hello world" but jumps randomly through the character positions (and uses different glyphs than any ISO character encoding, if you so choose), making the PDF very hard to convert to text. There is no notion of a "word" or a "paragraph". A two-column document, for example, can be a nightmare to parse into text.
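A small sketch of parsing those output lines with a regular expression, using only the JDK. The pattern mirrors the sample lines above; field order or spacing may differ across PDFBox versions, so treat the regex as an assumption to verify against your own output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextLocationParser {
    // Matches lines like:
    // String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
    static final Pattern LINE = Pattern.compile(
        "String\\[([-\\d.]+),([-\\d.]+) fs=([-\\d.]+) xscale=([-\\d.]+)" +
        " height=([-\\d.]+) space=([-\\d.]+) width=([-\\d.]+)\\](.*)");

    // Returns {x, y} for a PrintTextLocations line, or null if it doesn't match.
    static double[] position(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return null;
        return new double[] { Double.parseDouble(m.group(1)),
                              Double.parseDouble(m.group(2)) };
    }

    public static void main(String[] args) {
        String sample = "String[53.9,59.856995 fs=-6.0 xscale=6.0 "
            + "height=-3.666 space=1.3320001 width=4.6679993]A";
        double[] p = position(sample);
        System.out.println(p[0] + " " + p[1]);  // 53.9 59.856995
    }
}
```

From the (x, y) pairs you can reconstruct reading order, cluster characters into lines, or detect multi-column layouts.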
For the second part of your question, I had good results using xpdf version 3.02, after patching XRef.cc (making XRef::okToPrint(), XRef::okToChange(), XRef::okToCopy() and XRef::okToAddNotes() all return gTrue). That handles locked documents, not encrypted ones (there are other utilities out there for those).