I have an MCA final-year project that needs to extract data from images (JPG, GIF, etc.).
I want to recognize the text in an image.
I tried Java OCR, but it is not working.
Are there any open source libraries which can help me?
Have a look at zxing, http://code.google.com/p/zxing/downloads/list
MATLAB has a trainable OCR that has been used to break CAPTCHAs. Unfortunately, the group that broke the CAPTCHAs didn't release their source code. However, here is example code for training MATLAB's OCR.
The MATLAB code will easily compile into your Java project.
Here is a Java-based OCR tool. The page claims that the tool can recognize triangles and other patterns as well as letters, and it gives sample images too. The code is open source and downloadable.
Did you try Asprise?
Tesseract is an open source OCR tool, but it's not written in Java. See tesseract in action
Related
I'm trying to read a PDF file with images of a scanned document.
Asprise OCR is giving me this text as the output of the OCR.
<error: failed to read pdf. error code: ÿ>
There could be a few possible causes: 1) the PDF file doesn't exist; 2) the PDF file is malformed.
We recommend that you should get the latest version of Asprise OCR and Barcode Recognition Library for Java, C# VB.NET, Python, C/C++ and Delphi and refer to the Developer’s Guide to Asprise OCR SDK for Java, C# VB.NET, Python, C/C++ & Delphi Pascal. Alternatively, you can contact our support team directly.
Currently, I am using Tess4J to recognize the text in my image. It accurately reads 95% of the characters I throw at it, but I would like to get 100%. I know the font that the image uses, and I was wondering whether there is a way to get Tess4J to learn that font.
Thanks in advance!
Tess4J is just a wrapper around the Tesseract OCR engine. You can train Tesseract and use the resulting .traineddata files with Tess4J.
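As a rough sketch of how a trained file gets picked up: Tesseract looks for a file named `<language>.traineddata` in its tessdata directory, and Tess4J's `Tesseract` class exposes `setDatapath` and `setLanguage` to point at it. The helper below uses only the JDK and just derives the language name and checks that the file is present; the class name `TrainedDataLocator` and the file name `myfont.traineddata` are made up for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Tess4J): works out which language name
// to pass to the OCR engine for a custom-trained font.
class TrainedDataLocator {
    static final String SUFFIX = ".traineddata";

    // "myfont.traineddata" -> "myfont", the name Tesseract expects as the language.
    static String languageName(String fileName) {
        if (!fileName.endsWith(SUFFIX) || fileName.length() == SUFFIX.length()) {
            throw new IllegalArgumentException("Not a .traineddata file: " + fileName);
        }
        return fileName.substring(0, fileName.length() - SUFFIX.length());
    }

    // True if <tessdataDir>/<language>.traineddata exists as a regular file.
    static boolean isInstalled(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + SUFFIX));
    }
}
```

With Tess4J on the classpath, the resulting name would then be passed to `setLanguage`, and the directory to `setDatapath`, before calling `doOCR` on the image file. Which directory level `setDatapath` expects has varied between Tess4J versions, so check the documentation for the one you use.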
Are there any Java APIs or tools that can convert handwritten scanned documents to text files?
I have tried Google's Tesseract and a few other tools, but I am not getting satisfactory results for handwritten scanned documents.
Strange that the other answers here point to OCR tools when the question clearly asks about handwriting recognition.
Handwriting recognition is even more difficult than OCR, and very few technologies are available. I don't think you will be able to find an open source tool for it, but there are a few commercial vendors:
http://www.a2ia.com
http://www.parascript.com/
I don't know whether they have a Java API, but contacting them is a good place to start your research.
You can try the Java OCR Project. I think you will have to write the text-file output part yourself, though.
Also, handwriting tends to vary from one individual to another, so I guess you will need to select some good training data to get good results.
Have a look at these:
Java OCR
Java OCR is a suite of pure Java libraries for image processing and character recognition. It provides a modular structure for easier deployment.
GOCR
GOCR is an OCR program, developed under the GNU Public License. It converts scanned images of text back to text files.
Well, I DID READ ALMOST ALL THE QUESTIONS HERE ABOUT THIS TOPIC!
I need an API, not a tool, to convert PDF to image at very high quality.
So I didn't find any direct tool, and I used: HTML to PDF and PDF to image.
I tried:
PDFRenderer
PDFBOX
PDFONE
HTML2IMAGE
FLYING-SAUCER
ITEXT
JPEDAL
PDFCrown
Only the commercial ones (PDFCrown and PDFBox) came out with good results.
I thought that Java is for open source projects!
Am I missing a library that renders high-quality images from HTML? (It could also be from PDF; I can get halfway there myself.)
I used the wonderful tool:
WKHTMLTOPDF.
It's very easy - just a command line.
Installation:
Download the version for your platform from here. On Windows I used the wkhtmltox-0.11.0_rc1-installer.exe file.
Run the installer
Save your HTML to a file on your disk.
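In Java, call Runtime.getRuntime().exec("wkhtmltoimage myhomepage.html myhomepage.jpg"). The wkhtmltoimage executable is installed alongside wkhtmltopdf by the same wkhtmltox installer.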
That's it! so easy to use:)
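The last step can also be written with ProcessBuilder, which avoids having to split a single command string. This is only a minimal sketch: it assumes the `wkhtmltoimage` executable (shipped in the same wkhtmltox package as `wkhtmltopdf`) is on the PATH, and the class name `HtmlToImage` and the file names are placeholders.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Illustrative wrapper around the wkhtmltoimage command line.
class HtmlToImage {
    // Builds: wkhtmltoimage <input html> <output image>
    static List<String> command(String htmlFile, String imageFile) {
        return Arrays.asList("wkhtmltoimage", htmlFile, imageFile);
    }

    // Runs the converter, forwards its console output, and returns the exit code.
    static int convert(String htmlFile, String imageFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(htmlFile, imageFile))
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

Calling `HtmlToImage.convert("myhomepage.html", "myhomepage.jpg")` should return 0 on success; a non-zero value usually means the tool could not be found or could not read the input.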
I am trying to implement OCR in my java project.
I need to read text from an image.
While browsing, I came across a lot of examples of using MODI in .NET projects.
Can this be done using Java?
If that is not possible, can you suggest an alternative tool?
There are various Java-.NET interoperability tools, but I'd suggest referring to the following resources:
Java OCR implementation
https://stackoverflow.com/questions/971344/java-based-ocr-sdk-api
There are a couple of Java wrappers for the Tesseract OCR engine:
tesjeract
Tess4J
Or you can simply use ProcessBuilder/Process to invoke the Tesseract executable and read the output text file, as VietOCR does.
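A minimal sketch of the ProcessBuilder approach, assuming the `tesseract` executable is on the PATH. It is not VietOCR's actual code; the class name `TesseractCli` and the method names are invented for this example. Tesseract appends `.txt` to the output base name it is given, so the result is read back from that file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Illustrative wrapper around the tesseract command line.
class TesseractCli {
    // Builds: tesseract <image> <output base> -l <language>
    static List<String> command(String imageFile, String outputBase, String language) {
        return Arrays.asList("tesseract", imageFile, outputBase, "-l", language);
    }

    // Runs tesseract and returns the recognized text from <output base>.txt.
    static String recognize(String imageFile, String language)
            throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr");
        Path outputBase = workDir.resolve("result");
        Process process = new ProcessBuilder(
                command(imageFile, outputBase.toString(), language))
                .inheritIO()   // forward tesseract's console messages
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        byte[] bytes = Files.readAllBytes(workDir.resolve("result.txt"));
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

For example, `TesseractCli.recognize("scan.png", "eng")` would return the text of a hypothetical `scan.png` using the English language data.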