Are there any Java APIs or tools that can convert handwritten scanned documents to text files?
I have tried Google's Tesseract and a few other tools, but I am not getting satisfactory results for handwritten scanned documents.
Strange that other answers here point to OCR tools when the question clearly asks about handwriting recognition.
Handwriting recognition is an even more difficult area than OCR, and very few technologies are available. I don't think you will be able to find any open source tool for it, but there are a few commercial vendors:
http://www.a2ia.com
http://www.parascript.com/
I don't know whether they have a Java API, but the best way to start your research is to contact them.
You can try the Java OCR Project. You will probably have to write the recognized text to a text file yourself, though.
Also, handwriting varies from one individual to another, so I guess you will need to select good training data to get good results.
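Writing the output file yourself is the easy part. Here is a minimal sketch using only the JDK; the class name `RecognizedTextWriter`, the method `writeNextToImage` and the file names are hypothetical, and the string passed in stands for whatever text your OCR engine returns:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: not part of Java OCR or any other OCR library.
final class RecognizedTextWriter {

    // Stores the recognized text as UTF-8 in a .txt file next to the scanned image,
    // e.g. scan-001.png -> scan-001.txt.
    static Path writeNextToImage(Path image, String recognizedText) throws IOException {
        String name = image.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        Path target = image.resolveSibling(base + ".txt");
        Files.write(target, recognizedText.getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        // The second argument is a placeholder for the OCR engine's output.
        Path out = writeNextToImage(Paths.get("scan-001.png"), "text returned by the OCR engine");
        System.out.println("Wrote " + out.toAbsolutePath());
    }
}
```

Writing UTF-8 explicitly avoids depending on the platform default charset, which matters once the recognized text contains non-ASCII characters.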
Have a look at these:
Java OCR
Java OCR is a suite of pure Java libraries for image processing and character recognition. It provides a modular structure for easier deployment.
GOCR
GOCR is an OCR program, developed under the GNU Public License. It converts scanned images of text back to text files.
Related
Background:
I am not able to install a third-party library such as iText or anything similar. I have to write a PDF package myself.
I am looking for a resource that covers and explains how to embed fonts and text in the PDF file format. So far, I have pretty good coverage of adding rectangles, ISO-8859-1 text and n-hedrons using Path elements.
Now, the next step is supporting the UTF-8 charset (or, more generally, different charsets) with different fonts. Reading ISO 32000-1:2008, I cannot work out how to do that (the document is very technical and I am still a junior developer).
I found PDFBox, but I am having a hard time understanding the overcomplicated principles and decisions made.
If anyone has a reference to simple code examples or cookbooks on how to handle text properly in the PDF file format, I would appreciate a link.
If the language matters:
I am using Java, but I am really looking for a general text or article covering the topic, with examples in any language.
I know we can extract text from an image using OCR. But I need to extract the text shown in a video, such as a video lecture. In other words, can the on-screen text of a video be transcribed to text? If so, please suggest how to do it in Java or any other language.
My naive, Linux-driven approach would be:
check: does the OCR work in my operating system?
extract some sample frames from the video using your normal player. Every player (for example VLC) has such a function.
check: how good is the OCR in extracting text from image files?
check: how good is the OCR in extracting text from image files with the background the video is providing?
get software to extract frames from videos in batch -> there are various programs that create contact sheets, and these should also be able to extract full-resolution images at arbitrary points in time from the video. Full resolution might be necessary for the OCR to work. Perhaps you can crop the images first, if you know that the text is positioned in fixed rectangles.
Worst case you let OCR analyse each frame of the movie.
That mostly depends on how good and how fast your OCR is. Everything else is, to me, well-proven software. The language might be a bash shell script, since the components will probably be separate Linux programs. As I mentioned, it depends on the quality, performance and runtime environment of your OCR.
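Since the question asks about Java, the batch frame-extraction step could also be driven from Java instead of a shell script. This is only a sketch under assumptions: it uses the `ffmpeg` command-line tool (one possible extractor, not one named above) with its `fps` and `crop` video filters, it assumes `ffmpeg` is on the `PATH`, and the class name `FrameExtractor`, the file names and the crop rectangle are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that shells out to ffmpeg; ffmpeg itself must be installed separately.
final class FrameExtractor {

    // Builds: ffmpeg -i <video> -vf fps=<n>[,crop=w:h:x:y] <outDir>/frame_%05d.png
    // cropOrNull is "width:height:x:y" of the rectangle holding the text, or null for full frames.
    static List<String> buildCommand(Path video, Path outDir, double framesPerSecond, String cropOrNull) {
        String filter = "fps=" + framesPerSecond;
        if (cropOrNull != null) {
            filter += ",crop=" + cropOrNull;
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("ffmpeg");
        cmd.add("-i");
        cmd.add(video.toString());
        cmd.add("-vf");
        cmd.add(filter);
        // PNG is lossless, so the OCR sees the frame without extra compression artifacts.
        cmd.add(outDir.resolve("frame_%05d.png").toString());
        return cmd;
    }

    // Runs the command and returns ffmpeg's exit code. The output directory must already exist.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        // One frame every two seconds, cropped to an assumed 1280x200 strip at the bottom.
        List<String> cmd = buildCommand(Paths.get("lecture.mp4"), Paths.get("frames"), 0.5, "1280:200:0:520");
        System.out.println(String.join(" ", cmd));
    }
}
```

Sampling one frame every few seconds is usually enough for lecture slides and keeps the number of images the OCR has to process far below the every-frame worst case.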
Yes, you can do that, and there are three ways to achieve it.
Split, Classify and train on your own.
Get a performance server,
A. Extract images from the video
B. Develop and train your machine learning model. You can use TensorFlow for this.
Note: If you prefer to train models on your own, make sure you have enough time, as developing and training can take a few months, and you need data to train on.
Use an OCR framework
Use an API (freemium model). There are many available on the market. Just google them and you will find plenty.
I want to convert Notes rich text into PDF in a server program (preferably Java). Is there any sample code showing how to do that? Converting to HTML/MIME isn't an option since the conversion process is too lossy.
I did some tests with DXL, some XSLT code and XSL-FO, rendered via FOP. It produced some PDF output. The project was abandoned due to lack of funding (read: no customer).
The basics, in a recent document: http://www.ibm.com/developerworks/xml/library/x-xslfo/
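To give an idea of the XSLT step in that pipeline, here is a small sketch using only the JDK's built-in XSLT support. It is not the abandoned project's code: the input is a made-up, heavily simplified stand-in for DXL (a `note` element containing `par` elements), and the class name `DxlToFo` and the stylesheet are assumptions for illustration. Real DXL rich text is far more involved.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: turns a simplified XML export into XSL-FO markup.
final class DxlToFo {

    // Maps each <par> of the stand-in input to one fo:block on a single page layout.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:template match='/note'>"
        + "<fo:root>"
        + "<fo:layout-master-set>"
        + "<fo:simple-page-master master-name='page'><fo:region-body/></fo:simple-page-master>"
        + "</fo:layout-master-set>"
        + "<fo:page-sequence master-reference='page'>"
        + "<fo:flow flow-name='xsl-region-body'>"
        + "<xsl:apply-templates select='par'/>"
        + "</fo:flow>"
        + "</fo:page-sequence>"
        + "</fo:root>"
        + "</xsl:template>"
        + "<xsl:template match='par'>"
        + "<fo:block><xsl:value-of select='.'/></fo:block>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet and returns the XSL-FO document as a string.
    static String toFo(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(toFo("<note><par>First paragraph</par><par>Second paragraph</par></note>"));
    }
}
```

The resulting XSL-FO document is what an FO processor such as Apache FOP would then render to PDF. That step needs the FOP library on the classpath and is left out here.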
You'll need to find some third-party software to help, as this isn't possible out of the box. Here's one that looks promising: SWING Software's Lotus Notes Export to PDF
By "not depend on automation", I mean that it should not require a Microsoft Office installation to work; let alone interact with a live instance of a Microsoft Office component. One such library is Aspose.Total for Java. Are there any more out there?
Another solution I'm considering is to use OpenOffice.org. However, I'm not sure if I'm going to run into the same problems as with Microsoft Office as detailed here.
For Office Documents: http://poi.apache.org/
I have not tried this myself, but Apache usually delivers good libraries.
For just Excel: JExcel API for Java
I use this for one application, and it works quite well. May use a fair bit of RAM for larger documents.
One designed specifically to work with the newer XML formats is docx4j: http://dev.plutext.org/trac/docx4j
There are two further answers to this question, depending on your application.
You can borrow from the OpenOffice library code that deals with opening and saving MS Office files. (See: http://www.artofsolving.com/opensource/jodconverter or jOpenDocument )
You might just use OpenOffice itself by scripting or automating it.
I faced this question a while back with a Ruby app. Because I was in control of the source document, I got the originator to save things in HTML format and used Tidy to filter out the junk. Another option is to find a tool to convert the Office files to RTF, which is more generic.
Another to consider ...
LibreOffice looks useful.
jExcelAPI if you just want Excel.
Finally, there are some options on SourceForge; try this search: http://sourceforge.net/search/?q=java+ms+office
You may find spreadsheets BIG unless you use OpenOffice or MS Office, because you need a fancy-schmancy virtual sparse matrix to do what they do well.
ODF Toolkit - http://odftoolkit.org
I have an MCA final-year project to extract data from images (jpg, gif, etc.).
I want to recognize data in an image.
I have used Java OCR, but it is not working.
Are there any open source libraries which can help me?
Have a look at zxing, http://code.google.com/p/zxing/downloads/list
MATLAB has a trainable OCR that has been used to break CAPTCHAs. Unfortunately, the group that broke the CAPTCHAs didn't release source code. However, here is example code for training MATLAB's OCR.
The MATLAB code will easily compile into your Java project.
Here is a Java-based OCR tool. The page claims that the tool can distinguish triangles and other patterns from letters, and sample images are provided. The code is open source and downloadable.
Did you try Asprise?
Tesseract is an open source OCR tool, but it's not written in Java. See Tesseract in action