How to load ms word document in jasper report - java

How can I load a Word document into a JasperReports report in Java?

You have the following options:
Put the Word document somewhere on a shared network drive and add a link to it in the report.
You can convert the Word document to PDF and then link to that in your report. That way, almost anyone on the planet (not only Windows users) can read it.
You can convert the Word document to PDF and then use a tool to convert that to PNG or JPEG images (one page = one image) and then include the images in your report. This will make the report very large.
You can hope that Microsoft will implement an MS Word reader in Java and, in the meantime, tell your boss that it is not possible without some drawbacks.
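If you take the PDF-to-images route, here is a minimal sketch using Apache PDFBox 2.x to render each page as a PNG that the report can then embed (file names are placeholders; it assumes the Word document has already been converted to PDF by some external tool):

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PdfToPng {
    public static void main(String[] args) throws Exception {
        // Render each page of the converted PDF to a PNG (one page = one image),
        // ready to be placed into the report as image elements.
        try (PDDocument doc = PDDocument.load(new File("converted-from-word.pdf"))) {
            PDFRenderer renderer = new PDFRenderer(doc);
            for (int page = 0; page < doc.getNumberOfPages(); page++) {
                BufferedImage image = renderer.renderImageWithDPI(page, 150);
                ImageIO.write(image, "png", new File("page-" + (page + 1) + ".png"));
            }
        }
    }
}
```

The 150 DPI value is a trade-off: higher values look better in print but make the already-large report even larger.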

Related

Programmatically combine PDF bookmarks in the final PDF with pdfbox

Another title could have been:
Combine several PDFs at runtime, creating a bookmark in the final PDF for each of them, while also keeping any bookmarks already present in the individual PDFs
Hello,
I have the following problem:
our web application "concatenates" PDFs from our content management system in response to a user request and sends the result back to the user.
We are using org.apache.pdfbox.pdmodel.PDDocument.addPage and also org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem in our code.
It works well, but we recently noticed a problem:
"when one of the PDFs already contains bookmarks, they are not carried over into the final PDF's bookmark tree"
That's expected, we haven't developed this functionality! :-(
My question: can somebody give me some hints on how to also merge the bookmarks (I mean at runtime, not using the command-line PDFMerger tool)?
Many thanks in advance
Benoît
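One hint worth checking: the class behind the command-line PDFMerger tool, PDFMergerUtility, can also be called programmatically at runtime, and it carries each source document's outline (bookmarks) over into the merged result. A minimal sketch, assuming PDFBox 2.x and placeholder file names:

```java
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class MergeWithBookmarks {
    public static void main(String[] args) throws Exception {
        // Merge PDFs at runtime; the utility also imports the bookmark tree
        // of each source PDF into the final document's outline.
        PDFMergerUtility merger = new PDFMergerUtility();
        merger.addSource("part1.pdf");  // placeholder names
        merger.addSource("part2.pdf");
        merger.setDestinationFileName("merged.pdf");
        merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
    }
}
```

If you also need a top-level bookmark per source document, you can keep your existing PDOutlineItem code for that and re-parent the imported outline entries under those per-document nodes after the merge.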

PDF to PDF of images with PDFBox

I'm using PDFBox (1.8) to handle PDFs on Windows (7 and above). I need to take an input PDF and convert it to a PDF made of the same pages rendered as images (no selectable text, etc.). With small files I have no problem, but with bigger files I get stuck due to massive memory use.
I can post some code if it helps, but the approach I'm using is simple: create a document from all the pages of the source PDF saved as images.
I'm looking for a more memory- and time-efficient way to do this (I have to handle PDFs with 1,000 to 2,000 pages).
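One way to keep memory bounded is to process a single page at a time, so that only one rendered image is ever live, and to compress each page as JPEG before embedding it. The question uses PDFBox 1.8, but here is a sketch against the 2.x API (file names and the 150 DPI / 0.7 quality values are placeholders):

```java
import java.awt.image.BufferedImage;
import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.image.JPEGFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.rendering.PDFRenderer;

public class RasterizePdf {
    public static void main(String[] args) throws Exception {
        try (PDDocument src = PDDocument.load(new File("input.pdf"));
             PDDocument out = new PDDocument()) {
            PDFRenderer renderer = new PDFRenderer(src);
            for (int i = 0; i < src.getNumberOfPages(); i++) {
                // Render one page at a time so only one BufferedImage is live.
                BufferedImage bi = renderer.renderImageWithDPI(i, 150);
                // JPEG-compress the page image before embedding it.
                PDImageXObject img = JPEGFactory.createFromImage(out, bi, 0.7f);
                PDPage page = new PDPage(new PDRectangle(img.getWidth(), img.getHeight()));
                out.addPage(page);
                try (PDPageContentStream cs = new PDPageContentStream(out, page)) {
                    cs.drawImage(img, 0, 0);
                }
            }
            out.save(new File("rasterized.pdf"));
        }
    }
}
```

For very large inputs, PDFBox 2.x can also load the source with `PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly())` so that page data is buffered on disk rather than in the heap.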

How to do photo and text extraction from an online PDF

I know that there are already PDFBox and iText, but they don't have the ability to do visual content extraction, and they need to work with the PDF offline. I want a way to do text and visual content extraction online, without downloading the PDF file first. What kind of API or library is there for the Java language?
EDIT: for those who find it unclear, I'll explain some more:
Just imagine that with any HTML parser you can parse a page online, build the DOM or SAX tree, walk through its elements, and then extract photos and text based on the content of the nodes in those trees. At a minimum, for photos you can get their corresponding HTML tags, and for text the same, plus the actual text. Now, I want to know if there is anything similar for PDFs: going through text and images without downloading.
Gnostice PDFOne (for Java) has a getPageElements() method that can parse a PDF page for text and image elements. Text in a PDF is not in a DOM like an HTML or XML document. Text just appears at various x-y coordinates and magically looks well-formatted. However, PDFOne has some PDF text-extraction methods that reconstruct those text elements into user-friendly sentences. DISCLOSURE: I work for the company that makes this library.
PDFImageStream can do that. There is a free version with only one restriction: it can only be used in single-threaded applications.
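Strictly speaking, a PDF has no DOM you can walk remotely: some bytes always have to travel. What you can avoid is saving the file to disk, by streaming it straight from the URL into the parser. A sketch with PDFBox 2.x (the URL is a placeholder):

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class RemotePdfText {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/sample.pdf"); // placeholder URL
        // Stream the PDF directly into PDFBox; nothing is written to disk.
        try (InputStream in = url.openStream();
             PDDocument doc = PDDocument.load(in)) {
            PDFTextStripper stripper = new PDFTextStripper();
            System.out.println(stripper.getText(doc));
        }
    }
}
```

The same in-memory document can then be queried for images page by page, so "online" extraction here means "no temporary file", not "no transfer".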

Google Drive: what's the limit on indexing large files?

I'm using the Google Drive API to store and retrieve PDF files. I would like to query these files using the search parameters.
But before I start implementing this, I would like to know how Google handles the indexing of large PDF files (600+ pages, 25 MB+). I would like to know this for text-based PDFs (they don't need OCR).
I've tried some searches on the Drive website and it doesn't always work.
I would like to know if there are any limitations and what they are.
According to this page for PDFs with OCR:
The maximum size for images (.jpg, .gif, .png) and PDF files (.pdf) is 2 MB. For PDF files, we only look at the first 10 pages when searching for text to extract.
And this page for PDFs with text:
You can search for text in PDF and image files by:
Typing a query in the search box in Google Drive on the web.
Opening the Google Drive viewer and using the search box in the upper right.
In theory you should be able to search the first 100 pages of any text documents or text-based PDFs that you've uploaded. You'll also be able to search for text found on the first ten pages of any image PDFs on your Drive.
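For the API side of the question, a full-text query goes through the `q` parameter of `files.list`. A sketch with the Drive v3 Java client, assuming an authenticated `Drive` client built elsewhere and a hypothetical search term:

```java
import com.google.api.services.drive.Drive;
import com.google.api.services.drive.model.File;
import com.google.api.services.drive.model.FileList;

public class DriveSearch {
    // Lists PDFs whose indexed text matches the query; subject to the
    // indexing limits quoted above (only the indexed pages are searchable).
    public static void listMatchingPdfs(Drive drive) throws Exception {
        FileList result = drive.files().list()
                .setQ("mimeType = 'application/pdf' and fullText contains 'invoice'")
                .setFields("files(id, name)")
                .execute();
        for (File f : result.getFiles()) {
            System.out.println(f.getName() + " (" + f.getId() + ")");
        }
    }
}
```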

Best API for reading a huge .pdf file from java

I have a huge PDF file (20 MB / 800 pages) which contains some information.
It has an index with hyperlinks. Also, most of the remaining information is in tabular format (in the PDF). I need to retrieve this information using Java and store it in SQL Server.
Which is the best API available to read this kind of file from Java?
It is unlikely to be in tabular format inside the PDF, as PDF does not contain structure information unless it is explicitly added at creation time. I wrote an article explaining some of the issues with text extraction from a PDF at http://www.jpedal.org/PDFblog/2009/04/pdf-text/
Have you tried iText:
iText
Download iText
iText in Action — 2nd Edition
List of the Examples
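With iText 5, page-by-page text extraction is a few lines; what lands in `text` is plain text with no table structure, so the tabular layout has to be reconstructed from the x-y positions or with your own parsing rules. A sketch (the file name is a placeholder, and the SQL Server insert is left as a comment):

```java
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("big-manual.pdf"); // placeholder name
        try {
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                // Extract the raw text of one page at a time.
                String text = PdfTextExtractor.getTextFromPage(reader, page);
                // Parse `text` here and insert the rows into SQL Server via JDBC.
                System.out.println(text);
            }
        } finally {
            reader.close();
        }
    }
}
```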
