I am currently using Tess4J to recognize the text in my image. It accurately reads 95% of the characters I throw at it, but I would like to get 100%. I know the font that the image uses, and I was wondering whether there is a way to get Tess4J to learn that font.
Thanks in advance!
Tess4J is just a wrapper around the Tesseract OCR engine. You can train Tesseract on your font and then use the resulting .traineddata file with Tess4J.
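As a rough sketch of the second step, assume training produced a file called `myfont.traineddata` in a `tessdata` directory (both names are hypothetical). To stay free of dependencies beyond the JDK, the snippet calls the `tesseract` command line with its `-l` and `--tessdata-dir` options rather than going through Tess4J; with Tess4J you would point the data path and language at the same file instead.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

final class CustomFontOcr {

    // Builds: tesseract <image> <outputBase> -l <language> --tessdata-dir <dir>
    // "language" is the traineddata file name without its extension.
    static List<String> buildCommand(Path image, Path outputBase,
                                     Path tessdataDir, String language) {
        return List.of(
                "tesseract",
                image.toString(),
                outputBase.toString(),
                "-l", language,
                "--tessdata-dir", tessdataDir.toString());
    }

    // Runs the command and returns the process exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // All file names below are placeholders for illustration.
        List<String> command = buildCommand(
                Path.of("scan.png"), Path.of("result"),
                Path.of("tessdata"), "myfont");
        System.out.println(String.join(" ", command));
        // Tesseract writes the recognized text to result.txt.
        System.out.println("exit code: " + run(command));
    }
}
```

This assumes the `tesseract` executable is on the PATH.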
I know we can extract text from an image using OCR. But I need to extract the text shown in a video, such as the slides in video lectures. In other words, is it possible to transcribe the on-screen text of a video? If so, please suggest how to do it in Java or any other language.
My naive, Linux-driven approach would be:
Check: does the OCR work on my operating system?
Extract some sample frames from the video using an ordinary player. Players such as VLC have a snapshot function.
Check: how good is the OCR at extracting text from image files?
Check: how good is the OCR at extracting text from image files with the kind of background the video provides?
Get software to extract frames from videos in batch. There are various programs that create contact sheets, and these should also be able to extract full-resolution images at arbitrary points in time from the video. Full resolution might be necessary for the OCR to work. You may be able to crop the images first if you know that the text is positioned in fixed rectangles.
In the worst case, you let the OCR analyse every frame of the movie.
It mostly depends on how good and how fast your OCR is. Everything else is, to my mind, well-proven software. The language might be a bash shell script, since the components will probably be separate Linux programs. As I mentioned, it depends on the quality, performance and runtime environment of your OCR.
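The batch frame extraction step could be sketched as follows, assuming ffmpeg is the extraction tool (the answer above does not name one). The sketch samples one frame every few seconds with the `fps` filter and crops to a fixed rectangle with the `crop` filter. The file names, sampling interval and crop rectangle are made-up values.

```java
import java.io.IOException;
import java.util.List;

final class FrameExtractor {

    // One frame every `seconds` seconds, e.g. "fps=1/5".
    static String sampleFilter(int seconds) {
        if (seconds < 1) {
            throw new IllegalArgumentException("seconds must be >= 1");
        }
        return "fps=1/" + seconds;
    }

    // Appends a crop to the filter chain: crop=width:height:x:y (pixels).
    static String withCrop(String filter, int width, int height, int x, int y) {
        return filter + ",crop=" + width + ":" + height + ":" + x + ":" + y;
    }

    // Builds: ffmpeg -i <video> -vf <filter> <outputPattern>
    static List<String> buildCommand(String video, String filter, String outputPattern) {
        return List.of("ffmpeg", "-i", video, "-vf", filter, outputPattern);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumed: text sits in a 1280x120 strip whose top-left corner is at (0, 600).
        String filter = withCrop(sampleFilter(5), 1280, 120, 0, 600);
        // The "frames" directory must already exist.
        List<String> command = buildCommand("lecture.mp4", filter, "frames/frame_%05d.png");
        int exit = new ProcessBuilder(command).inheritIO().start().waitFor();
        System.out.println("ffmpeg exited with " + exit);
    }
}
```

The resulting PNG files can then be passed to the OCR one by one. Dropping the crop and sampling more densely approaches the "every frame" worst case.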
Yes, you can do that, and there are three ways to achieve it.
1. Split, classify and train on your own.
Get a high-performance server, then:
A. Extract images from the video.
B. Develop and train your own machine learning model. You can use TensorFlow for this.
Note: If you prefer to train models on your own, make sure you have enough time, as developing and training can take a few months, and you need data to train on.
2. Use an OCR framework.
3. Use an API (freemium model). There are many available on the market. Just google them and you will find plenty.
For the Android app I'm working on right now, I have to test that a font file supplies all the characters I need. However, I may have a lot of TTF files, so I want to check every one of them.
But here's my problem: how do I read the characters a font contains (as Unicode) and iterate through them?
Any help would be appreciated.
You should try using the sfntly library.
It has been used in the Android Font Creator project.
I was able to print glyph ids and names using the test program on a desktop app. Should work on Android as well.
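For a desktop-side check that needs no extra library, a minimal sketch can use `java.awt.Font` instead of sfntly. This is a different technique from the one recommended above, and `java.awt` is not available on Android, so it would only work as a pre-build check on a desktop JVM. The required character set below is a made-up example.

```java
import java.awt.Font;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class FontCoverage {

    // Returns the distinct code points of `required` that the predicate
    // rejects, in order of first appearance.
    static List<Integer> missingCodePoints(String required, IntPredicate canDisplay) {
        List<Integer> missing = new ArrayList<>();
        required.codePoints().distinct().forEachOrdered(cp -> {
            if (!canDisplay.test(cp)) {
                missing.add(cp);
            }
        });
        return missing;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical set of characters the app needs.
        String required = "ABCabc0123\u0105\u0119\u20ac";
        for (String path : args) {
            // Loads the TTF file; throws if the file is not a valid font.
            Font font = Font.createFont(Font.TRUETYPE_FONT, new File(path));
            // Font.canDisplay(int) reports whether the font has a glyph for a code point.
            List<Integer> missing = missingCodePoints(required, font::canDisplay);
            System.out.print(path + ": ");
            if (missing.isEmpty()) {
                System.out.println("all required characters present");
            } else {
                for (int cp : missing) {
                    System.out.print(String.format("U+%04X ", cp));
                }
                System.out.println("missing");
            }
        }
    }
}
```

Passing every TTF path on the command line checks all fonts in one run.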
OK, what worked for me in the end was that a friend of mine prepared an SVG file from the font, and we generated the data from that file with JavaScript.
Is there a way to convert a PPTX file to an image? My aim is to eventually generate a PDF file.
I tried Aspose, but there were some issues. Apache POI doesn't support PPTX. Any other ideas?
There is a PPTX SVGExporter in docx4j that may help you, in conjunction with Batik (to go from SVG to PDF).
Note: I haven't tested it.
You may be able to write a small native library that Java calls to do the operation. I'm pretty sure you can use a .NET language or C++ to connect to PowerPoint and have it "print" to either PDF or some other usable format. Here is a small sample application which does some automation.
I've been researching how to extract images from a big (> 300 MB) PDF file. I'm using PDFBox, but for some reason that I can't figure out, some pages are not extracted correctly.
I'm using the PDFToImage class of PDFBox as the base for my code.
So, do you know of another library that may help me do this? I know that iText may be used, but I read that it can't be used for commercial products.
I've installed the xpdf and xpdf-utils packages, and the pdfimages utility works perfectly. But I need to solve this problem from Java, and the solution should be portable.
I think you're talking about two different things here: extracting images from a PDF, and converting PDF pages to images. PDFToImage will output an image for every page, while pdfimages extracts all embedded images (e.g. a text document has 0 images).
Take a look at org.apache.pdfbox.tools.ExtractImages (source code) to see if it does what you want.
The most likely reason it is hard to work with 300 MB PDFs is that you run out of memory. If it works well for smaller PDFs, I would take a closer look at why it fails on the large one.
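One quick way to test the out-of-memory theory is to compare the file size with the JVM's maximum heap before processing. The sketch below does only that. The factor of 2 is an arbitrary assumption for illustration, not a measured memory requirement of any PDF library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeapCheck {

    // True when the file size times `factor` exceeds the available heap.
    // `factor` is a rough guess at how much memory parsing needs per byte of PDF.
    static boolean heapLikelyTooSmall(long pdfBytes, long maxHeapBytes, double factor) {
        return pdfBytes * factor > maxHeapBytes;
    }

    public static void main(String[] args) throws IOException {
        long pdfBytes = Files.size(Path.of(args[0]));
        // Maximum heap the JVM will try to use (controlled by -Xmx).
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("PDF size: " + pdfBytes / (1024 * 1024) + " MB");
        System.out.println("Max heap: " + maxHeap / (1024 * 1024) + " MB");
        if (heapLikelyTooSmall(pdfBytes, maxHeap, 2.0)) {
            System.out.println("Heap may be too small; try a larger -Xmx, e.g. java -Xmx2g ...");
        }
    }
}
```

If the extraction succeeds once the heap is raised with `-Xmx`, memory was the problem. If the same pages still fail, the cause lies elsewhere.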
Have you tried icepdf or JPedal (both pure java)?
I have an MCA final-year project to extract data from images (JPG, GIF, etc.).
I want to recognize the data in an image.
I have tried Java OCR, but it is not working.
Are there any open source libraries which can help me?
Have a look at zxing, http://code.google.com/p/zxing/downloads/list
MATLAB has a trainable OCR that has been used to break CAPTCHAs. Unfortunately, the group that broke the CAPTCHAs didn't release their source code. However, here is example code for training MATLAB's OCR.
The MATLAB code can then be packaged for use from your Java project.
Here is a Java-based OCR tool. The page claims that the tool can recognize triangles and other patterns as well as letters, and it provides sample images too. The code is open source and downloadable.
Did you try Asprise?
Tesseract is an open source OCR tool, but it's not written in Java. See Tesseract in action.