Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
I have been improving a document management project, and one requirement is to render documents (Word, PDF, etc.) in a web page. PDF can be rendered with an iframe, object, or embed tag and a servlet. But other documents such as Word and Excel cannot be rendered in the web page directly. My solution is to convert these documents to PDF or HTML when rendering and display the result the same way. I've tried converting them with JODConverter, and it works, but converting a Word (docx) file of almost 700 pages takes 25-30 seconds to PDF and 30-35 seconds to HTML. That is too long to keep users waiting. Documents will be stored on our server, not anywhere else. Is there a faster way to convert, or a better solution?
Thanks!
You could use jodconverter + LibreOffice 3.5.* or jodconverter + OpenOffice.org 3.4.1 (I have tried both recently and they are way faster than LibreOffice 3.6+/4.0+) in combination with a lazy/parallel conversion process to improve response times.
You cannot transform 700 pages of content in a snap. Even Google Docs puts your uploaded documents on a cloud transformation queue. So you can implement this kind of queue, which lazily transforms your documents one by one, and show a proper message to the user while a transformation is pending. The queue must of course save the transformed file to the filesystem so you can display it whenever you want. You must consider the disk space this takes.
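A minimal sketch of such a queue, using only the JDK. `DocumentConverter`, `LazyConversionQueue` and the cache directory layout are hypothetical names made up for illustration; the real JODConverter call would sit behind the `DocumentConverter` interface, which is an assumption and not part of JODConverter's API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical hook: wrap JODConverter (or any other converter) behind this interface. */
interface DocumentConverter {
    void convert(Path source, Path target) throws IOException;
}

/** Queues conversions on one worker thread and keeps the converted files on disk. */
final class LazyConversionQueue implements AutoCloseable {
    private final DocumentConverter converter;
    private final Path cacheDir;
    // A single worker thread converts documents strictly one by one.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final ConcurrentMap<Path, CompletableFuture<Path>> jobs = new ConcurrentHashMap<>();

    LazyConversionQueue(DocumentConverter converter, Path cacheDir) {
        this.converter = converter;
        this.cacheDir = cacheDir;
    }

    /** Returns immediately; the future completes once the converted file is on disk. */
    CompletableFuture<Path> request(Path source) {
        Path key = source.toAbsolutePath().normalize();
        // Simplification: a failed job stays in the map and is not retried.
        return jobs.computeIfAbsent(key,
                k -> CompletableFuture.supplyAsync(() -> convertOnce(k), worker));
    }

    /** True once the converted file can be served; until then, show a "pending" message. */
    boolean isReady(Path source) {
        CompletableFuture<Path> job = jobs.get(source.toAbsolutePath().normalize());
        return job != null && job.isDone() && !job.isCompletedExceptionally();
    }

    private Path convertOnce(Path source) {
        try {
            Files.createDirectories(cacheDir);
            // Simplification: the cache key is the file name only, so two sources
            // with the same name in different directories would collide.
            Path target = cacheDir.resolve(source.getFileName() + ".pdf");
            if (!Files.exists(target)) {
                converter.convert(source, target);
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        worker.shutdown();
    }
}
```

A page handler could call `request(source)` when the document is uploaded or first opened, show the pending message while `isReady(source)` is false, and serve the cached file afterwards. Swapping the single worker for a small fixed thread pool would give the parallel variant, at the cost of more load on the office process.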
A blind solution is to just open the file in another browser tab with the correct MIME type. If the browser is IE and Microsoft Office is installed, it will hopefully open the file natively in the browser. However, this is not a platform-independent solution.
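For the MIME type part, a small lookup by file extension is usually enough. This is only a sketch: `OfficeMimeTypes` and `forFileName` are hypothetical names, and the table covers just the formats mentioned in this thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Maps a file name to the registered MIME type for common office formats. */
final class OfficeMimeTypes {
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();

    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("doc", "application/msword");
        BY_EXTENSION.put("docx",
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        BY_EXTENSION.put("xls", "application/vnd.ms-excel");
        BY_EXTENSION.put("xlsx",
                "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
    }

    /** Falls back to a generic binary type when the extension is unknown or missing. */
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, "application/octet-stream");
    }
}
```

In a servlet the result would typically be passed to `response.setContentType(...)` before streaming the file.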
Closed 4 years ago.
I need to convert the file formats below to PDF.
TIF,TIFF,TXT,JPG,JPEG,BMP,DOC,DOCX,XLS,XLSX,PPT,PPTX,GIF,PDF
Is there an open source API for converting these to PDF? I tried Apache POI, but it doesn't look sufficient. Let me know if any open source API is available.
Creating a PDF that contains nothing but an image is quite easy using the iText library; its web site has an example that shows how to do that.
Converting Excel files is not hard; the Apache POI library can be used for reading the Excel file, and then again the iText library can be used for creating PDFs that contain tables.
Word can be dealt with in a similar manner (POI also supports it), but it'll be quite a bit trickier, especially if the file contains tables and images, since the POI API for handling DOC/DOCX isn't as advanced as the one handling XLS/XLSX, and of course Word files have a less regular structure than Excel files.
JAI won't be of any help with this.
There are commercial packages available that can be used from Java applications; you may want to investigate those before embarking on writing your own, especially if you need to deal with complex documents - writing your own converter that handles those and generates good quality output could easily take a couple of weeks (or a month) of your time.
Closed 5 years ago.
Could anyone please tell me a way to create a PDF document in a J2EE application other than iText?
We previously used iText, but the problem is the html file (which is generated from Jsp) display is different with the generated PDF. So I need some other way to create a PDF that looks the same as the JSP display.
Can anyone please suggest libraries other than iText?
Thanks in advance.
You are probably using iText 5 and XML Worker. Have you tried iText 7 and pdfHTML? See the HTML to PDF tutorial.
You will need:
iText 7: https://github.com/itext/itext7
the pdfHTML add-on: https://github.com/itext/i7j-pdfhtml
You claim:
the problem is the html file (which is generated from Jsp) display is different with the generated PDF.
That is certainly true when you use HTMLWorker (which you shouldn't) and it's true in many cases for XML Worker. But we rewrote iText from scratch because of the mismatch between the old iText architecture and the requirements when converting HTML to PDF.
If you have a problem with the HTML to PDF conversion, please explain the problem in a question and tag that question as an iText question. If we can improve iText 7 + pdfHTML, why wouldn't we do that?
This is always tricky because HTML and PDF have differing purposes (with a lot of overlap). That means "simply" converting between the two is sometimes not going to work well.
You can:
Snapshot an image of the HTML and PDF the image. This has various downsides (can't search / extract text easily, larger, poor zoom, pagination) but is simple if it doesn't conflict with your requirements.
Use a PDF system (like iText) to construct the PDF as desired using the same data. This is obviously more work (possibly a lot more), but is the optimal result in terms of PDF quality and fitness for purpose.
Simplify/adjust your HTML so it converts better into PDF. This depends on what HTML tools/libraries you are using - you might not have much control over the HTML.
Try various other conversion libraries to see if you find a tool that works better for your HTML.
Closed 7 years ago.
I've been searching high and low for an up to date solution to this age old problem.
Long story short I want to take css + html -> pdf and do it in java.
I don't want to use an API as the data is sensitive. Googling provides me with countless sites/services that offer to do this but I'm looking for a stand alone tool and looking for one that will work nicely from my java server. I've found this awesome looking command line tool but it's a command line tool and spawning processes off a web server starts to get sketchy IMO (but I'm always willing to hear otherwise). Additionally flying saucer seems to be a standard choice, but I've heard mixed reviews.
Here is a 5 year old question on the subject, but I figure things have changed! Especially with all the work being done in the area of front end unit testing with dom manipulation I figure there might be some less than conventional solutions and I'm willing to hear them all!
Any help would be greatly appreciated.
You might try a combination: CSSBox, which converts HTML+CSS to SVG, and then, for example, Batik to create your PDF, as proposed for example here. FlyingSaucer could also do the job.
The choice depends on your further requirements. E.g. are you processing "street HTML" or well-formed documents? What about the pages in the resulting PDF? What about interactive elements in the HTML pages?
The only real way forward is to try at least some of the options in practice; then you can ask more specific questions about particular problems.
Closed 8 years ago.
I need a Java library to perform the following tasks: 1) convert PDF pages to images, 2) extract HTML text from PDF pages with their locations on the page, 3) extract images from PDF pages.
I have already tried
PDFBox - it fails with error --unsupported/disabled operation: BDC and EMC
icePDF - it works for tasks 1) and 3), but it's paid.
PDFRenderer - it fails
BFO - it's a paid library, but it is able to perform tasks 1) and 3)
Can anyone suggest a better solution?
Have you tried JOD Converter? It's a Java API to a self-booted Open Office Server.
To see whether it converts to/from the formats you want, just install Open Office, open a file, and try to "Save As" the format you need, to see if it's supported.
I followed these steps to solve the issue in an Ubuntu environment:
Step 1) Used pdftohtml library to convert pdf to html
Step 2) Used Jsoup to extract text with styling and position from html in step 1)
Step 3) Used CutyCapt to generate snapshot of HTML (if required)
We can also use the pdftoppm command to render PDF pages directly to images.
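The command-line steps above can be driven from Java with `ProcessBuilder`. The sketch below assumes the Poppler tools `pdftohtml` and `pdftoppm` are installed and on the PATH; `PopplerCommands` and its method names are hypothetical, and the chosen options (`-c -noframes` for positioned single-file HTML, `-png -r` for PNG output at a given resolution) are one reasonable choice rather than what the original poster used.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Builds and runs Poppler command lines (assumes the tools are on the PATH). */
final class PopplerCommands {

    /** pdftohtml with positioned ("complex") output in a single HTML file. */
    static List<String> pdfToHtml(Path pdf, Path outputBase) {
        return Arrays.asList("pdftohtml", "-c", "-noframes",
                pdf.toString(), outputBase.toString());
    }

    /** pdftoppm writes one PNG per page, with file names derived from the prefix. */
    static List<String> pdfToPng(Path pdf, Path outputPrefix, int dpi) {
        return Arrays.asList("pdftoppm", "-png", "-r", Integer.toString(dpi),
                pdf.toString(), outputPrefix.toString());
    }

    /** Runs the command, kills it after the timeout, and returns its exit code. */
    static int run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // tool output goes to this JVM's stdout/stderr
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("Timed out: " + String.join(" ", command));
        }
        return process.exitValue();
    }
}
```

For example, `PopplerCommands.run(PopplerCommands.pdfToPng(pdf, prefix, 150), 60)` would render the pages at 150 dpi and give up after a minute. The HTML written by the first command is what Jsoup would parse in step 2.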
You can do all those things with PDFBox, although there is no direct API for getting the position. Download the latest PDFBox and follow the links below for each task.
Convert Pdf pages to Image
Extract images from PDF pages
Extracting HTML text from PDF pages with their locations on the page is a little different. There is no ready-made API call that returns the position information, but you can still get all of it using PDFBox.
Please have a look at this link. There you can see the getTextPos() function; getTextPos().getXPosition() and getTextPos().getYPosition() will give you the X and Y coordinates.
Closed 7 years ago.
Is it possible to convert PDF to Word in Java? I'm not talking about parsing a PDF document and then custom render it again to Word. I want a Java library that can directly convert it.
Reading PDF documents is a very involved process and there are no good free libraries for extracting non-text information from PDF documents in Java. Worse yet, PDF documents have a lot of layout information that is hard to reconstruct, for example a table in a Word document becomes some lines and a bunch of pieces of text in PDF.
It is almost impossible to recreate semantic information from an arbitrary PDF. If you have the same tool that wrote it you have somewhat more chance but even so there is much uncertainty. The only thing you can be sure of in a (text) PDF is the position of each character on the page. (Note that some PDFs include bitmaps in which textual information occurs and that has to rely on OCR).
There are several groups in computer science departments and elsewhere who are spending very significant effort trying to recover semantic information. We collaborate with Penn State, one of the leaders, and they are working on extracting tables. In good cases they get 90%, in bad ones 50%.
So the answer is formally that you cannot, but you may occasionally be fortunate. (We do a lot of this for chemistry and count ourselves lucky if we get 50% on a regular basis).
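To make the difficulty concrete, here is a toy sketch of the very first step a converter has to take when all it knows is the position of each character: guessing which characters belong to the same line. `Glyph` and `LineReconstructor` are hypothetical types, not part of any PDF library, and the sketch assumes y grows downward and that a fixed tolerance separates lines.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** One character and where it sits on the page (hypothetical type). */
final class Glyph {
    final char ch;
    final double x;
    final double y;

    Glyph(char ch, double x, double y) {
        this.ch = ch;
        this.x = x;
        this.y = y;
    }
}

final class LineReconstructor {
    /** Groups glyphs with nearly equal y into lines, then orders each line left to right. */
    static List<String> toLines(List<Glyph> glyphs, double yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> lines = new ArrayList<>();
        double lineY = 0;
        for (Glyph g : sorted) {
            // Start a new line when this glyph is clearly below the current line's first glyph.
            if (lines.isEmpty() || g.y - lineY > yTolerance) {
                lines.add(new ArrayList<>());
                lineY = g.y;
            }
            lines.get(lines.size() - 1).add(g);
        }

        List<String> text = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : line) {
                sb.append(g.ch);
            }
            text.add(sb.toString());
        }
        return text;
    }
}
```

Even this step already guesses: it inserts no word spaces, and on a two-column page it would merge text from both columns that happens to share a baseline. Tables, headings and reading order are all further guesses layered on top.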
You can try to do it with the iText library. Read the PDF and then write it as an RTF.
This is not that simple though, as you have to preserve the different style that the PDF has.
You can use some external tools.
Install some free program like "Free PDF to Doc" and execute it from your Java program.
This works fine in most cases.
Use the Acrobat Pro SDK from your Java code.
Best of luck