Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
I need to convert the following file formats to PDF:
TIF, TIFF, TXT, JPG, JPEG, BMP, DOC, DOCX, XLS, XLSX, PPT, PPTX, GIF, PDF
Is there an open source API that can convert these to PDF? I tried Apache POI, but it does not look sufficient. Please let me know if any open source API is available.
Creating a PDF that contains nothing but an image is quite easy using the iText library; its web site has an example that shows how to do that.
Converting Excel files is not hard; the Apache POI library can be used for reading the Excel file, and then again the iText library can be used for creating PDFs that contain tables.
Word can be dealt with in a similar manner (POI also supports it), but it'll be quite a bit trickier, especially if the file contains tables and images, since the POI API for handling DOC/DOCX isn't as advanced as the one handling XLS/XLSX, and of course Word files have a less regular structure than Excel files.
JAI won't be of any help with this.
There are commercial packages available that can be used from Java applications; you may want to investigate those before embarking on writing your own, especially if you need to deal with complex documents - writing your own converter that handles those and generates good quality output could easily take a couple of weeks (or a month) of your time.
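The per-format advice above boils down to a dispatch on file type. Here is a minimal sketch of that routing step only, using nothing but the JDK. The class name, the enum, and the grouping of extensions are my own assumptions for illustration; the answer does not say how to handle TXT or PPT/PPTX, so those fall through to NOT_COVERED rather than being guessed at.

```java
import java.util.Locale;

/** Hypothetical sketch: pick a conversion approach from the file extension. */
final class ConversionRouterSketch {

    /** Approaches named in the answer, plus two bookkeeping values (assumed names). */
    enum Strategy { IMAGE_ON_PAGE, SPREADSHEET_TO_TABLE, WORD_DOCUMENT, ALREADY_PDF, NOT_COVERED }

    static Strategy strategyFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No dot means no extension; the empty string falls into the default branch.
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toUpperCase(Locale.ROOT);
        switch (ext) {
            case "TIF": case "TIFF": case "JPG": case "JPEG": case "BMP": case "GIF":
                return Strategy.IMAGE_ON_PAGE;        // place the image on a PDF page
            case "XLS": case "XLSX":
                return Strategy.SPREADSHEET_TO_TABLE; // read cells, then write a PDF table
            case "DOC": case "DOCX":
                return Strategy.WORD_DOCUMENT;        // possible, but trickier
            case "PDF":
                return Strategy.ALREADY_PDF;          // nothing to convert
            default:
                return Strategy.NOT_COVERED;          // TXT, PPT, PPTX, unknown types
        }
    }

    public static void main(String[] args) {
        System.out.println(strategyFor("scan.tiff"));
        System.out.println(strategyFor("report.XLSX"));
        System.out.println(strategyFor("slides.pptx"));
    }
}
```

Each branch would then call into whatever library you settle on; the sketch deliberately stops before any library-specific code.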
Closed 6 years ago.
In one of my NLP assignments I have to read PDF files and extract information from them. Using Java, I am able to read the textual content of a PDF and apply our NLP algorithms to the text, but I also need to extract information from tables in the PDF. I am trying to read them but cannot get them in a proper format. Any idea how I can read tables from a PDF document, or any hint whether a library in OpenNLP, GATE, or Stanford NLP can do this?
Unfortunately, tables as structures are not stored in PDFs. You have to apply some serious coordinate math to figure out/estimate where a table is, where the columns are and where the rows are.
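To give a feel for the coordinate math mentioned above, here is a minimal sketch that takes text fragments with known positions and estimates rows and columns by clustering their y and x coordinates. The Fragment type, the tolerance parameters, and the assumption that each fragment carries the position of its lower-left corner are all hypothetical; how you obtain the positions depends on your PDF library, and real tables (merged cells, wrapped text, right-aligned numbers) need much more than this.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: estimate table rows and columns from positioned text. */
final class TableGridSketch {

    /** A piece of text and the coordinates of its lower-left corner (assumed input shape). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups fragments into rows: a fragment joins the current row if its y is within yTolerance. */
    static List<List<Fragment>> groupRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // PDF user space has y growing upwards, so sort descending to get the top row first.
        sorted.sort((a, b) -> Double.compare(b.y, a.y));

        List<List<Fragment>> rows = new ArrayList<>();
        double rowY = 0;
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>()); // start a new row
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }
        // Order the cells of each row from left to right.
        for (List<Fragment> row : rows) {
            row.sort((a, b) -> Double.compare(a.x, b.x));
        }
        return rows;
    }

    /** Estimates column start positions by merging x coordinates that lie within xTolerance. */
    static List<Double> columnStarts(List<Fragment> fragments, double xTolerance) {
        List<Double> xs = new ArrayList<>();
        for (Fragment f : fragments) {
            xs.add(f.x);
        }
        xs.sort(Double::compare);

        List<Double> starts = new ArrayList<>();
        for (double x : xs) {
            if (starts.isEmpty() || x - starts.get(starts.size() - 1) > xTolerance) {
                starts.add(x);
            }
        }
        return starts;
    }
}
```

The tolerances are the fragile part: too small and one visual row splits in two, too large and adjacent rows merge. That is why this kind of extraction is an estimate rather than a lookup.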
For PDFs, Apache Tika doesn't have any special table handling (it does for MSWord, MSPPT and many other formats, but not PDFs).
To extract tables as tables from PDFs, you might consider tabulapdf; see also John Hewson's recommendation. There are also commercial tools that likely do a decent job with table extraction from PDFs, such as ABBYY FineReader and the Nuance *PDF products.
Closed 7 years ago.
Is it possible to convert PDF to Word in Java? I'm not talking about parsing a PDF document and then custom render it again to Word. I want a Java library that can directly convert it.
Reading PDF documents is a very involved process and there are no good free libraries for extracting non-text information from PDF documents in Java. Worse yet, PDF documents have a lot of layout information that is hard to reconstruct, for example a table in a Word document becomes some lines and a bunch of pieces of text in PDF.
It is almost impossible to recreate semantic information from an arbitrary PDF. If you have the same tool that wrote it you have somewhat more chance but even so there is much uncertainty. The only thing you can be sure of in a (text) PDF is the position of each character on the page. (Note that some PDFs include bitmaps in which textual information occurs and that has to rely on OCR).
There are several groups in computer science departments and elsewhere who are spending very significant effort trying to extract semantic information. We collaborate with Penn State, one of the leaders, and they are working on extracting tables. In good cases they get 90%, in bad ones 50%.
So the answer is formally that you cannot, but you may occasionally be fortunate. (We do a lot of this for chemistry and count ourselves lucky if we get 50% on a regular basis).
You can try to do it with the iText library. Read the PDF and then write it as an RTF.
This is not that simple though, as you have to preserve the different style that the PDF has.
You can use some external tools.
Install a free program such as "Free PDF to Doc" and execute it from your Java program.
This works fine in most cases.
Use the Acrobat Pro SDK from your Java code.
Best of luck
Closed 7 years ago.
How can I convert my JSP/HTML file to PDF?
I want to convert a particular part of my webpage to a PDF file. Is it possible?
Yes. Take a good look at both Apache FOP and iText. No matter what you use, you'll probably have to do a little fiddling.
I used HTMLDoc a couple of years ago and had pretty good luck with it.
Try wkhtmltopdf. It is a command-line utility that takes an HTML file or web address and a save location for the PDF. It is very easy to use and uses the same rendering engine as Safari (WebKit). It works much better than many of the other parsers I have used, which don't always support CSS and other advanced layout features.
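Since wkhtmltopdf is a command-line tool, calling it from Java means launching a process. A minimal sketch, assuming the wkhtmltopdf executable is installed and on the PATH; the class and method names are made up for illustration, and only the basic "input, then output" form of the command is used.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: run wkhtmltopdf as an external process. */
final class WkhtmltopdfSketch {

    /** Builds the command line: wkhtmltopdf <html file or URL> <output pdf>. */
    static List<String> buildCommand(String source, String outputPdf) {
        return Arrays.asList("wkhtmltopdf", source, outputPdf);
    }

    /** Runs the conversion and returns the process exit code (0 normally means success). */
    static int convert(String source, String outputPdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(source, outputPdf))
                .inheritIO() // forward the tool's console output to this process
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Assumes page.html exists in the working directory.
        int exitCode = convert("page.html", "page.pdf");
        System.out.println("wkhtmltopdf exited with " + exitCode);
    }
}
```

To convert only part of a page, one option is to write that fragment to its own HTML file first and pass that file as the source.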
Take a look at html2ps (Perl) or html2ps (PHP). However, neither of the two is implemented in Java.
You might also want to read this article.
The Flying Saucer library is the best one to use. It works on top of iText and makes the conversion very easy.
Closed 7 years ago.
I would like to read the text and binary attachments in a saved Outlook message (.msg file) from a Java application, without resorting to native code (JNI, Java Native Interface).
Apache POI-HSMF seems to be in the right direction, but it's in very early stages of development...
msgparser is a small open source Java library that parses Outlook .msg files and provides their content using Java objects. msgparser uses the Apache POI - POIFS library to parse the message files which use the OLE 2 Compound Document format.
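Because .msg files are OLE 2 Compound Documents, a cheap sanity check before handing a file to any parser is to look at its first eight bytes, which for that container format are D0 CF 11 E0 A1 B1 1A E1. A minimal JDK-only sketch; the class and method names are hypothetical, and a matching header only tells you the container type, not that the file is actually an Outlook message (old .doc and .xls files share the same signature).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: detect the OLE 2 Compound Document signature. */
final class Ole2Sniffer {

    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the given bytes start with the OLE 2 signature. */
    static boolean hasOle2Header(byte[] head) {
        if (head.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (head[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads the first eight bytes of a file and checks them against the signature. */
    static boolean isOle2File(Path file) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // A single read may return fewer bytes than requested, so loop until full or EOF.
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        return total == head.length && hasOle2Header(head);
    }
}
```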
You could use Apache POIFS, which seems to be a little more mature, but that would appear to duplicate the efforts of POI-HSMF.
You could use POI-HSMF and contribute changes to get the features you need working. That's often how FOSS projects like that expand.
You could use com4j, j-Interop, or some other COM-level interop feature and interact directly with the COM interfaces that provide access to the structured document. That would be much easier than trying to hit it directly through JNI.
Have you tried to use Jython with the Python win32 extensions (http://www.jython.org/Project/ + http://python.net/crew/mhammond/win32/)?
If this is for a "personal" or "internal" project Jython with Python may be a very good choice. If you are building a "shrink wrapped" software package this may not be the best option.
Apache POI-HSMF.
You can start from the example at the link below.
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/examples/src/org/apache/poi/hsmf/examples/Msg2txt.java?revision=821500&view=markup&pathrev=821500
For more detail, read the library docs.
Closed 7 years ago.
Has anybody attempted to extract text from a PDF using an OCR library and Java? What did you find to be the most reliable library for text extraction? Most of the approaches I've seen (tesseract, GOCR) are C libraries that would require some JNI code to be written.
I'm familiar with pdfbox, which is now an Apache incubator project at version 0.8.x, but its text extraction isn't always accurate. I'm looking for an alternative approach that is somewhat more reliable.
I've not tried Asprise JavaPDF yet, in the process of trying that, but wanted to know more about the OCR approach (if it's possible).
Any help would be appreciated.
If you have a text-based PDF, I'd strongly recommend PDFTextStream. It's not free, but licensing is reasonable, and it is much much better than PDFBox. PDFBox chokes on many PDF files which are generated by newer tools, and is not too consistent about PDFs it can handle. PDFTextStream handles any PDF I throw at it, including PDFs with embedded PNG images, which PDFBox can not do.
If you heckle the PDFTextStream folks to add OCR, they may listen up.
We use ABBYY FineReader Engine 11. It has a Java wrapper.
Pros:
It works great with all the languages we have tried (English, Russian, Uzbek, etc.) and does real OCR: even if a PDF has no text layer, it renders the pages first and then runs OCR on them.
Cons:
It costs money. You have to buy a developer license and an end-user license.
And it is EXTREMELY slow.
If you want to run OCR on a text-based PDF, you may have to convert it to an image first.
You can use Java wrappers of Tesseract, such as tesjeract or Tess4J, to perform OCR. However, for a PDF, you'll need to convert it to an image (PNG or TIFF) before feeding it to the OCR engine.
VietOCR calls the Tesseract executable to perform the text extraction. It uses Ghostscript to do the PDF-to-image conversion.
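The same two-step approach (rasterize with Ghostscript, then OCR with the Tesseract executable) can be driven from plain Java without JNI. A minimal sketch, assuming both `gs` and `tesseract` are installed and on the PATH (on Windows the Ghostscript console executable usually has a different name, such as gswin64c). The class, method names, and file names are hypothetical, and this is not a description of how VietOCR itself is implemented.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: PDF to PNG with Ghostscript, then PNG to text with Tesseract. */
final class OcrPipelineSketch {

    /** Ghostscript command that renders each page of the PDF to a 24-bit PNG. */
    static List<String> rasterizeCommand(String pdf, String pngPattern, int dpi) {
        return Arrays.asList(
                "gs", "-dNOPAUSE", "-dBATCH",
                "-sDEVICE=png16m",            // 24-bit colour PNG output device
                "-r" + dpi,                   // resolution in dots per inch
                "-sOutputFile=" + pngPattern, // e.g. page-%03d.png, one file per page
                pdf);
    }

    /** Tesseract command; the recognized text is written to outputBase + ".txt". */
    static List<String> ocrCommand(String image, String outputBase, String language) {
        return Arrays.asList("tesseract", image, outputBase, "-l", language);
    }

    /** Runs a command and fails loudly on a non-zero exit code. */
    static void run(List<String> command) throws IOException, InterruptedException {
        int exitCode = new ProcessBuilder(command).inheritIO().start().waitFor();
        if (exitCode != 0) {
            throw new IOException("Exit code " + exitCode + " from " + command);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumes input.pdf exists; only the first page is OCRed here.
        run(rasterizeCommand("input.pdf", "page-%03d.png", 300));
        run(ocrCommand("page-001.png", "page-001", "eng"));
    }
}
```

Launching a process per page is slow for large batches, but it sidesteps native bindings entirely, which was the concern in the question.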