PDF Text Extraction Approach Using OCR [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Has anybody attempted to extract text from a PDF using an OCR library and Java? What did you find to be the most reliable library for text extraction? Most of the approaches I've seen (Tesseract, GOCR) are C/C++ libraries that would require some JNI code to be written.
I'm familiar with PDFBox, which is now an Apache incubator project at version 0.8.x, but its text extraction isn't always accurate. I'm looking for an alternative approach that is somewhat more reliable.
I haven't tried Asprise JavaPDF yet (I'm in the process of doing so), but I wanted to know more about the OCR approach, if it's possible.
Any help would be appreciated.

If you have a text-based PDF, I'd strongly recommend PDFTextStream. It's not free, but licensing is reasonable, and it is much better than PDFBox. PDFBox chokes on many PDF files generated by newer tools and is inconsistent about which PDFs it can handle. PDFTextStream handles any PDF I throw at it, including PDFs with embedded PNG images, which PDFBox cannot do.
If you heckle the PDFTextStream folks to add OCR, they may listen up.

We use ABBYY FineReader Engine 11. It has a Java wrapper.
Pros:
It works great with all the languages (English, Russian, Uzbek, etc.) and does real OCR: even if a PDF has no text layer, it renders the pages first and then runs OCR on them.
Cons:
It costs money. You have to buy a developer license and an end-user license.
It is also EXTREMELY slow.

If you want to run OCR on a text-based PDF, you may have to convert it to an image first.

You can use a Java wrapper for Tesseract, such as tesjeract or Tess4J, to perform OCR. For a PDF, however, you'll need to convert it to an image (PNG or TIFF) before feeding it to the OCR engine.
VietOCR calls the Tesseract executable to perform the text extraction and uses Ghostscript for the PDF-to-image conversion.
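For anyone who wants to try the same two-step approach without JNI, here is a minimal sketch that shells out to both tools. This is not VietOCR's actual code. It assumes the `gs` and `tesseract` executables are on the PATH (on Windows the Ghostscript binary is usually named differently), and the class name, file names and the 300 dpi resolution are picked purely for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class PdfOcrPipeline {

    // Ghostscript command that renders every page of a PDF to a 300 dpi PNG.
    // Ghostscript expands "%03d" in the output pattern to the page number.
    static List<String> ghostscriptCommand(String pdfPath, String outputPattern) {
        return Arrays.asList(
                "gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r300",
                "-sOutputFile=" + outputPattern, pdfPath);
    }

    // Tesseract command: "tesseract <image> <outputBase> -l <lang>" writes <outputBase>.txt.
    static List<String> tesseractCommand(String imagePath, String outputBase, String language) {
        return Arrays.asList("tesseract", imagePath, outputBase, "-l", language);
    }

    // Runs a command, forwards its output to this process, and fails on a non-zero exit code.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode + ": " + command);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical file names, for illustration only.
        run(ghostscriptCommand("scan.pdf", "page-%03d.png"));
        run(tesseractCommand("page-001.png", "page-001", "eng"));
    }
}
```

In a real program you would loop over all the generated page images and read each resulting .txt file back in.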

Related

Any open source API to convert files to PDF in Java [closed]

Closed 4 years ago.
I need to convert the file formats below to PDF:
TIF, TIFF, TXT, JPG, JPEG, BMP, DOC, DOCX, XLS, XLSX, PPT, PPTX, GIF, PDF
Is there an open source API for converting these to PDF? I tried Apache POI, but it does not seem sufficient. Please let me know if any open source API is available.

Creating a PDF that contains nothing but an image is quite easy using the iText library; its web site has an example that shows how to do that.
Converting Excel files is not hard; the Apache POI library can be used for reading the Excel file, and then again the iText library can be used for creating PDFs that contain tables.
Word can be dealt with in a similar manner (POI also supports it), but it'll be quite a bit trickier, especially if the file contains tables and images, since the POI API for handling DOC/DOCX isn't as advanced as the one handling XLS/XLSX, and of course Word files have a less regular structure than Excel files.
JAI won't be of any help with this.
There are commercial packages available that can be used from Java applications; you may want to investigate those before embarking on writing your own, especially if you need to deal with complex documents - writing your own converter that handles those and generates good quality output could easily take a couple of weeks (or a month) of your time.

Java convert html with css+js to pdf [closed]

Closed 7 years ago.
I am developing a JSF project and I have a question about reporting.
The idea is to offer the users reports in both HTML and PDF formats.
This should work by developing the reports in HTML+CSS+JS and, whenever a user needs a PDF report, simply converting the HTML+CSS+JS to PDF.
Does somebody know a free Java library for converting the HTML to PDF?
The conversion should be transparent to the user.
Other proposals are accepted.
Thanks in advance.

It is better to use the wkhtmltopdf tool to convert your HTML to PDF.
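Since wkhtmltopdf is a command-line tool rather than a Java library, you would call it as an external process. A rough sketch, assuming the `wkhtmltopdf` executable is on the server's PATH; the class name and the timeout handling are my own additions, not part of the tool:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class HtmlToPdf {

    // Basic invocation: "wkhtmltopdf <input> <output>"; the input may be a file path or a URL.
    static List<String> command(String htmlSource, String pdfPath) {
        return Arrays.asList("wkhtmltopdf", htmlSource, pdfPath);
    }

    // Runs the conversion and gives up after the timeout so a stuck render cannot block a request.
    static void convert(String htmlSource, String pdfPath, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(htmlSource, pdfPath))
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("wkhtmltopdf timed out after " + timeoutSeconds + " s");
        }
        if (process.exitValue() != 0) {
            throw new IOException("wkhtmltopdf exited with code " + process.exitValue());
        }
    }
}
```

The user never sees any of this: the server renders the same HTML report to a PDF file and streams that file back.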

Free solution: wkhtmltopdf - uses WebKit under the hood.
Commercial solution: PrinceXML - uses its own Acid2-compliant HTML rendering engine.

Apache FOP, which is XSLT-based, would be one solution, although it does not support HTML5. Flying Saucer and wkhtmltopdf are free solutions that are worth a try. Commercial libraries like PrinceXML offer CSS3 support. Pdfcrowd is another commercial solution.

Would Jasper Reports be an option? From the same report file you can generate many formats, PDF and HTML (+CSS) are two of them. Plus there is a GUI report designer.

Try out Flying Saucer.
It uses iText internally and is a pretty good library.

Convert PDF to Word in Java [closed]

Closed 7 years ago.
Is it possible to convert PDF to Word in Java? I'm not talking about parsing a PDF document and then custom-rendering it to Word. I want a Java library that can convert it directly.

Reading PDF documents is a very involved process and there are no good free libraries for extracting non-text information from PDF documents in Java. Worse yet, PDF documents have a lot of layout information that is hard to reconstruct, for example a table in a Word document becomes some lines and a bunch of pieces of text in PDF.

It is almost impossible to recreate semantic information from an arbitrary PDF. If you have the same tool that wrote it you have somewhat more chance but even so there is much uncertainty. The only thing you can be sure of in a (text) PDF is the position of each character on the page. (Note that some PDFs include bitmaps in which textual information occurs and that has to rely on OCR).
There are several groups in computer science departments and elsewhere who are spending very significant effort trying to recover semantic information. We collaborate with Penn State, one of the leaders, and they are working on extracting tables. In good cases they get 90%, in bad ones 50%.
So the answer is formally that you cannot, but you may occasionally be fortunate. (We do a lot of this for chemistry and count ourselves lucky if we get 50% on a regular basis).
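To illustrate what "only the position of each character" means in practice, here is a toy sketch that rebuilds text lines from positioned glyphs. The `Glyph` type and the tolerance parameter are hypothetical and not taken from any PDF library; a real extractor would supply the coordinates.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphLines {

    // One character with the position of its origin on the page (hypothetical type).
    static final class Glyph {
        final char ch;
        final double x;
        final double y;
        Glyph(char ch, double x, double y) { this.ch = ch; this.x = x; this.y = y; }
    }

    // Starts a new line whenever a glyph lies more than yTolerance away from the first
    // glyph of the current line, then orders each line left to right. Lines come back
    // top to bottom, assuming PDF-style coordinates where y grows upwards.
    static List<String> toLines(List<Glyph> glyphs, double yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> -g.y));

        List<List<Glyph>> lines = new ArrayList<>();
        double lineY = 0;
        for (Glyph g : sorted) {
            if (lines.isEmpty() || Math.abs(g.y - lineY) > yTolerance) {
                lines.add(new ArrayList<>());
                lineY = g.y;
            }
            lines.get(lines.size() - 1).add(g);
        }

        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder text = new StringBuilder();
            for (Glyph g : line) text.append(g.ch);
            result.add(text.toString());
        }
        return result;
    }
}
```

Even this simple step needs a guessed tolerance, and it knows nothing about word spacing, columns, or table cells, which is exactly the semantic information that is hard to recover.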

You can try to do it with the iText library: read the PDF and then write it out as RTF.
This is not that simple, though, as you have to preserve the different styles that the PDF has.
Alternatively, you can use an external tool.
Install a free program like "Free PDF to Doc" and execute it from your Java program.
This works fine in most cases.
Another option is to use the Acrobat Pro SDK from your Java code.
Best of luck.

open source image processing lib in java [closed]

Closed 7 years ago.
Can anyone suggest a good open-source image processing library in Java?
I want to develop an OMR reader using it.

There are a number of options out there, each with their own features and drawbacks. If you want to discuss your needs in more detail, I can touch on the specific attributes of each library as it relates to your project:
ImageJ - http://rsbweb.nih.gov/ij/index.html -- Note that ImageJ is primarily a self-contained application. However, the underlying API is very easy to use in your own applications without having to invoke the GUI.
Fiji - http://pacific.mpi-cbg.de/wiki/index.php/Main_Page -- This is ImageJ with a number of additional features. I have no personal experience with this library, but it looks promising.
JAI - http://www.oracle.com/technetwork/articles/javaee/jai-142803.html -- This is Sun's image processing Java offering. Limited in functionality, but it can be used as a basis for more powerful libraries.
jMagick - http://www.jmagick.org/index.html -- This is just a Java wrapper around ImageMagick and uses JNI to interface with the ImageMagick API
Apache Sanselan - http://commons.apache.org/imaging/ -- This library mostly does image IO, but it has a handful of features that can facilitate image analysis.
JIU (Java Imaging Utilities) - http://sourceforge.net/projects/jiu/ -- A Java library for loading, editing, analyzing and saving pixel image files.
Endrov - http://www.endrov.net/wiki/index.php?title=Main_Page -- Endrov is a multi-purpose image analysis program. I get the impression that the underlying API is usable outside of the application, but it also seems that not everything is implemented in Java. I have no personal experience with this library and am only throwing it in because it seems to have a number of useful features.

JAI

Marvin Image Processing Framework
http://marvinproject.sourceforge.net

and the dead-simple one: imgscalr

I would suggest using JAI, as mentioned, for the imaging side, but for writing an OMR application you will need template registration. This can be achieved using OpenCV, which works with Java (as well as many other languages and platforms).
Without good image registration, regardless of image processing library, you will end up missing some of the marks on some scans, as you will find that some scans are shifted due to the way scanners work.
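To give an idea of what registration does, here is a deliberately naive sketch in plain Java that estimates how far a scan is shifted relative to the template. It is not how OpenCV does it: it handles translation only (no rotation or scaling), uses brute-force search, and the class and method names are made up for this example.

```java
class ShiftEstimator {

    // Returns {dx, dy} such that scan[y + dy][x + dx] best matches template[y][x].
    // Tries every shift with |dx|, |dy| <= maxShift and keeps the one with the lowest
    // mean absolute pixel difference over the overlapping area.
    static int[] estimateShift(int[][] template, int[][] scan, int maxShift) {
        int height = template.length;
        int width = template[0].length;
        double bestScore = Double.MAX_VALUE;
        int[] best = {0, 0};
        for (int dy = -maxShift; dy <= maxShift; dy++) {
            for (int dx = -maxShift; dx <= maxShift; dx++) {
                long sum = 0;
                int count = 0;
                for (int y = 0; y < height; y++) {
                    int sy = y + dy;
                    if (sy < 0 || sy >= scan.length) continue;
                    for (int x = 0; x < width; x++) {
                        int sx = x + dx;
                        if (sx < 0 || sx >= scan[sy].length) continue;
                        sum += Math.abs(template[y][x] - scan[sy][sx]);
                        count++;
                    }
                }
                if (count == 0) continue; // no overlap for this shift
                double score = (double) sum / count;
                if (score < bestScore) {
                    bestScore = score;
                    best = new int[] {dx, dy};
                }
            }
        }
        return best;
    }
}
```

Once the shift is known, you add it to the template's mark coordinates before sampling the scan, so each bubble is read where it actually landed on the page.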

Looking for a "Universal" Document viewer component/library [closed]

Closed 8 years ago.
I am looking for an applet with functionality similar to the Oracle/Stellent OutsideIn ActiveX control or the Autonomy KeyView technology, which act as browser plug-ins that render a large number of file formats (word processing, spreadsheet, graphics, etc.). I currently use the Stellent solution, but because of restrictions imposed by some of our clients I would prefer something that either exists as a Java applet or Silverlight control, or has a Java API that I could build an applet on top of (neither of the two I mentioned does).
At a bare minimum it would need to display at least the following formats:
MS Word, Excel, PowerPoint
MS Outlook MSG files
Adobe PDF
Standard image formats: BMP, PNG, JPEG, TIFF
WordPerfect
HTML
Any suggestions?

If a commercial product is an option, ViewOne is a nice product. It's an applet, and it can display a large variety of documents.

It's not a plugin, but Multivalent is a Java library and browser for a large number of document formats, though probably not all the ones you'd like to cover.
It does at least cover PDF, HTML, and any reasonable image format, but not any of the proprietary formats.

If you are looking for a pure Java component that supports all these formats, I'm pretty confident that it doesn't exist. If what you want is to embed a browser, MS Office, Acrobat, etc., you would need an ActiveX container.
Here are some choices:
JDIC - if you are using Swing (see the Document Viewer demo.)
SWT ActiveX container - if you are using SWT
TeamDev WinPack - if your time is more valuable than your money ;-) The product is very polished, the price is reasonable and the support is excellent.
Note that with any of these you need to have Acrobat, MS Office (or the free document viewers), and whatever other applications the file formats require installed.

Have you looked at the Adeptol AJAX Document Viewer?
It is a no-plugin, non-applet, no-install viewer that supports more than 300 file types.
See ajaxdocumentviewer.com

You may be interested in Net-it Central. It uses an ActiveX plugin or a Java applet and works with several different formats. I am currently using it for Word and Excel.
