Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
Is it possible to convert PDF to Word in Java? I'm not talking about parsing a PDF document and then custom render it again to Word. I want a Java library that can directly convert it.
Reading PDF documents is a very involved process and there are no good free libraries for extracting non-text information from PDF documents in Java. Worse yet, PDF documents have a lot of layout information that is hard to reconstruct, for example a table in a Word document becomes some lines and a bunch of pieces of text in PDF.
It is almost impossible to recreate semantic information from an arbitrary PDF. If you have the same tool that wrote it you have somewhat more chance but even so there is much uncertainty. The only thing you can be sure of in a (text) PDF is the position of each character on the page. (Note that some PDFs include bitmaps in which textual information occurs and that has to rely on OCR).
There are several groups in computer science departments and elsewqhere who are spending very significant effort to try and get semantic information. We collaborate with Penn State - one of the leaders - and they are working on extracting tables. In good casees they get 90% in bad ones 50%.
So the answer is formally that you cannot, but you may occasionally be fortunate. (We do a lot of this for chemistry and count ourselves lucky if we get 50% on a regular basis).
You can try to do it with the iText library. Read the PDF and then write it as an RTF.
This is not that simple though, as you have to preserve the different style that the PDF has.
You can use some external tools.
Install some free program like "Free PDF to Doc" and execute it from you java program.
This Works fine in most cases.
use the Acrobat Pro SDK from you java code.
Best of luck
Related
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
Improve this question
I Need to convert below file format to pdf format.
TIF,TIFF,TXT,JPG,JPEG,BMP,DOC,DOCX,XLS,XLSX,PPT,PPTX,GIF,PDF
Do we have any open source API to convert into PDF. I tried APACHE POI. but its not look sufficient. Let me know any open source api is available.
Creating a PDF that contains nothing but an image is quite easy using the iText library; its web site has an example that shows how to do that.
Converting Excel files is not hard; the Apache POI library can be used for reading the Excel file, and then again the iText library can be used for creating PDFs that contain tables.
Word can be dealt with in a similar manner (POI also supports it), but it'll be quite a bit tricker, especially if the file contains tables and images, since the POI API for handling DOC/DOCX isn't as advanced as the one handling XLS/XLSX, and of course Word files have a less regular structure than Excel files.
JAI won't be of any help with this.
There are commercial packages available that can be used from Java applications; you may want to investigate those before embarking on writing your own, especially if you need to deal with complex documents - writing your own converter that handles those and generates good quality output could easily take a couple of weeks (or a month) of your time.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
Could you please anyone tell me the way to create a pdf document for J2EE Application other than iText.
We are previously used the iText, but the problem is the html file (which is generated from Jsp) display is different with the generated PDF. So I need some other way to create a pdf as same as jsp display.
Any one please suggest me the libraries other than iText?
Thanks in Advance.
You are probably using iText 5 and XML Worker. Have you tried iText 7 and pdfHTML? See the HTML to PDF tutorial.
You will need:
iText 7: https://github.com/itext/itext7
the pdfHTML add-on: https://github.com/itext/i7j-pdfhtml
You claim:
the problem is the html file (which is generated from Jsp) display is different with the generated PDF.
That is certainly true when you use HTMLWorker (which you shouldn't) and it's true in many cases for XML Worker. But we rewrote iText from scratch because of the mismatch between the old iText architecture and the requirements when converting HTML to PDF.
If you have a problem with the HTML to PDF conversion, please explain the problem in a question and tag that question as an iText question. If we can improve iText 7 + pdfHTML, why wouldn't we do that?
This is always tricky because HTML and PDF have differing purposes (with a lot of overlap). That means "simply" converting between the two is sometimes not going to work well.
You can
Snapshot an image of the HTML and PDF the image. This has various downsides (can't search / extract text easily, larger, poor zoom, pagination) but is simple if it doesn't conflict with your requirements.
Use a PDF system (like iText) to construct the PDF as desired using the same data. This is obviously more work (possibly a lot more), but is the optimal result in terms of PDF quality and fitness for purpose.
Simplify/adjust your HTML so it converts better into PDF. This depends on what HTML tools/libraries you are using - you might not have much control over the HTML.
Try various other conversion libraries to see if you find a tool that works better for your HTML.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
I've been searching high and low for an up to date solution to this age old problem.
Long story short I want to take css + html -> pdf and do it in java.
I don't want to use an API as the data is sensitive. Googling provides me with countless sites/services that offer to do this but I'm looking for a stand alone tool and looking for one that will work nicely from my java server. I've found this awesome looking command line tool but it's a command line tool and spawning processes off a web server starts to get sketchy IMO (but I'm always willing to hear otherwise). Additionally flying saucer seems to be a standard choice, but I've heard mixed reviews.
Here is a 5 year old question on the subject, but I figure things have changed! Especially with all the work being done in the area of front end unit testing with dom manipulation I figure there might be some less than conventional solutions and I'm willing to hear them all!
Any help would be greatly appreciated.
You might try a combination CSSBox that converts HTML+CSS to SVG and then use for example Batik for creating your PDF as proposed for example here. FlyingSaucer could also do the job.
The choice depends on your further requirements. E.g. are you processing "street HTML" or well-formed documents? What about the pages in the resulting PDF? What about interactive elements in the HTML pages?
I mean the only way is to try at least some options practically and then you may ask more specific questions about some particular problems.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
I have been improving a document management project and one requirement is to render documents(word, pdf, etc) in web page. Pdf can be rendered with iframe, object or embed tag and servlet. But the other documents like word, excel can not be rendered in the web page. My solution is to convert these documents to pdf or html on rendering and render them like this. I've tried to convert them with JODCONVERTER and it does convert but converting a word(docx) almost with 700 pages to pdf 25-30 sec, to html 30-35 sec. It is too much.In the course of events, waiting for too much is not good for users. Documents will be stored our server, not another place. Is there another thing for faster conversion or better solution?
Thank!
You could use jodconverter + LibreOffice 3.5.* or jodconverter + OpenOffice.org 3.4.1 (I have tried both recently and they are way faster than LibreOffice 3.6+/4.0+) in combination with a lazy/parallel conversion process to improve response times.
You cannot transform 700 pages of content in a snap. Even Google Docs puts you on a cloud transformation queue for your uploaded documents. So you can implement this kind of queue which will lazily transform your documents one by one, and you can show a proper message to the user while transform operation is pending. This queue must save the transformed file to filesystem of course, so you can display it anytime you wanted. You must consider the disk space problem here.
A blind solution is to just open the file in another browser tab with correct mimetype, given that the browser is ie and microsoft office is installed, hopefully it will open the file natively in the browser. However it is not a platform-independent solution.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
Has anybody attempted to extract text from a PDF using an OCR library and Java? What did you find to be the most reliable library for text extraction. Most of the approaches I've seen (tesseract, GOCR) are C libraries that would require some JNI code to be written.
I'm familiar with pdfbox, which is now an Apache incubator project at version 0.8.x, but it's text extraction isn't always accurate. I'm looking for an alternative approach that is somewhat more reliable.
I've not tried Asprise JavaPDF yet, in the process of trying that, but wanted to know more about the OCR approach (if it's possible).
Any help would be appreciated.
If you have a text-based PDF, I'd strongly recommend PDFTextStream. It's not free, but licensing is reasonable, and it is much much better than PDFBox. PDFBox chokes on many PDF files which are generated by newer tools, and is not too consistent about PDFs it can handle. PDFTextStream handles any PDF I throw at it, including PDFs with embedded PNG images, which PDFBox can not do.
If you heckle the PDFTextStream folks to add OCR, they may listen up.
We use ABBYY FineReader Engine 11. They have java wrapper.
Pros:
It works great with all the languages (English, Russian, Uzbek etc) and doing real OCR (even if you have pdf without OCR they perform rendering at first and OCRing).
Cons:
It costs. You have to buy developer license and end-user license.
And it is EXTREMELY slow.
If you want to extract OCR from text based PDF you may have to convert it to an image first.
You can use Java wrappers of Tesseract - tesjeract or Tess4J - to perform OCR. However, for PDF, you'll need to convert to image (PNG or TIFF) first before feeding it to the OCR engine.
VietOCR calls Tesseract executable to perform the text extraction. It uses GhostScript to do PDF-to-image conversion.