How to extract images from pdf using Java (not using pdfbox)

I've been researching how to extract images from a large (> 300 MB) PDF file. I'm using PDFBox, but for some reason that I can't figure out, some pages are not extracted correctly.
I'm using PDFBox's PDFToImage class as the base for my code.
So, do you know of another library that could help me do this? I know that iText could be used, but I read that it can't be used in commercial products.
I've installed the xpdf and xpdf-utils packages, and the pdfimages utility works perfectly. But I need to solve this problem from Java, and the solution should be portable.

I think you're talking about two different things here: extracting images from a PDF, and converting PDF pages to images. PDFToImage will output an image for every page, while pdfimages extracts all embedded images (e.g. a text document has 0 images).
Take a look at org.apache.pdfbox.tools.ExtractImages (source code) to see if it does what you want.

The most likely reason it is hard to work with 300 MB PDFs is that you run out of memory. If it works well for smaller PDFs, I would take a closer look at why it fails on the large one.

Have you tried ICEpdf or JPedal (both pure Java)?

Related

converting ppt to html

I want to implement a feature that displays PowerPoint presentations on the web.
This could be done simply by converting the PowerPoint to images, but I think video and audio would then no longer work.
So the idea was to convert the PowerPoint to HTML and place it where I want. However, I am not able to implement the PowerPoint-to-HTML conversion myself. I have been looking for open source tools and libraries to solve this problem, but I have not found any yet.
The development environment is Java 8 + Spring Boot.

If you are OK with converting your PPT files to PDF before converting them to HTML, then pdf2htmlEX could be worth looking at. It is the best tool I could find for this kind of work, as it is capable of converting PDFs to HTML very precisely (have a look at the examples 1, 2, 3, 4). You should be able to find wrapper libraries in the Maven repository so that you can call it from your Java applications.

If you are OK with using an iframe, you can use a Microsoft solution: https://products.office.com/it-IT/office-online/view-office-documents-online
You can use this code:
<iframe src='https://view.officeapps.live.com/op/embed.aspx?src=[you_ppt_url]' width='100%' height='600px' frameborder='0'></iframe>
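Since the question mentions Java 8 and Spring Boot, here is a minimal sketch of how the iframe markup above could be generated on the server. The class and method names (OfficeEmbed, embedUrl, iframe) and the example.com address are made up for illustration. It assumes the presentation URL is reachable by the viewer service, and that the URL should be percent-encoded because it is passed as a query parameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; only the viewer address comes from the answer above.
final class OfficeEmbed {
    private static final String VIEWER =
            "https://view.officeapps.live.com/op/embed.aspx?src=";

    // Builds the viewer URL for a presentation URL.
    static String embedUrl(String presentationUrl) {
        try {
            // The document URL is a query parameter value, so percent-encode it.
            return VIEWER + URLEncoder.encode(presentationUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    // Wraps the viewer URL in the same iframe markup as shown above.
    static String iframe(String presentationUrl) {
        return "<iframe src='" + embedUrl(presentationUrl)
                + "' width='100%' height='600px' frameborder='0'></iframe>";
    }

    public static void main(String[] args) {
        // Placeholder URL, not a real presentation.
        System.out.println(iframe("https://example.com/slides/deck.pptx"));
    }
}
```

The encoded URL cannot contain a single quote, so it is safe to place inside the single-quoted src attribute.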

There's an older Node package called PPTX2HTML. It outputs a bunch of garbled code on a canvas element, but it might work. They even have a demo website to try it out. They seem to have broken the PowerPoint up into parseable XML and rendered the elements.

Preview LaTeX output with Java

Right now I'm working on displaying a LaTeX-generated document with Java.
Strictly speaking, LaTeX source can be used to directly generate two formats:
DVI using latex, the first one to be supported;
PDF using pdflatex, more recent.
However, as far as I know, there is no way to render DVI or PDF from Java.
Is there any way to handle those formats? Or maybe others that make sense?

There are not enough details regarding how you wish to "render" DVI or PDF from a LaTeX document. However, you could always just generate the PDF using pdflatex and the DVI using latex, then use ICEpdf for viewing PDFs and javaDVI for viewing DVIs.
Another neat hack to display pdf in a panel is to pass the file path to an embedded web component in the application, and the web component will use whatever pdf rendering tool is available on your machine (Acrobat, Foxit, Preview, etc.)
I remember there was a post about this a long time ago.
I don't think there's a generic way to preview the rendered output without generating the file itself. You can write your own LaTeX engine which caches the output every few seconds and displays that but regardless of the storage, you have to output it somewhere physically and then render the output separately using any of the steps mentioned above.

Another approach is to convert the DVI output to an SVG image file and render that with SVGGraphics2D. That will produce nice scalable results. DVI files can be converted to SVG on the command line (or in a script) using:
dvisvgm --no-fonts input.dvi -o output.svg
For more conversion options see this thread on how to convert pdf to clean svg.
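To call the dvisvgm command above from a Java application, one option is ProcessBuilder. The following is only a sketch: the class and method names (DviToSvg, command, convert) are made up, and it assumes dvisvgm is installed and on the PATH of the machine running the program.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the dvisvgm command line shown above.
final class DviToSvg {
    // Builds the argument list; each argument is a separate element,
    // so paths containing spaces need no shell quoting.
    static List<String> command(Path dvi, Path svg) {
        return Arrays.asList("dvisvgm", "--no-fonts",
                dvi.toString(), "-o", svg.toString());
    }

    // Runs dvisvgm and fails if it reports a non-zero exit status.
    static void convert(Path dvi, Path svg)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(dvi, svg))
                .inheritIO() // forward dvisvgm's messages to this console
                .start();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("dvisvgm exited with status " + exit);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: DviToSvg input.dvi output.svg");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Because this depends on an external program, it is not portable in the pure-Java sense. The resulting SVG file still has to be displayed by an SVG-capable component.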

extracting text AND Images from PDF file

I have been banging my head against the wall with this one. I have researched and tried pretty much every library suggested to me. I am currently trying to write a program in Java that will extract text AND images from a PDF file and allow me to write the extracted content to a Word file. I have managed to extract the content using the ICEpdf library; however, the problem is that I need to be able to write the content in the exact same order as it was read. So, to clarify, I need a library that will help me keep track of where exactly on the page the text and images are situated, so I can put them in the same place in my Word file.

A PDF to Word converter is a horribly complex proposition.
Your best bet will probably be to use OpenOffice to do it for you and not even try to handle the intermediate steps.
http://www.openoffice.org/api/

Look at this: Advanced PDF parser for Java
Off topic: to my knowledge there is also a Python parser that more or less converts the PDF to HTML (that way you can keep track of the ordering of the objects within the PDF). I know it's not Java, but you might be able to use the output.
http://www.unixuser.org/~euske/python/pdfminer/index.html

Custom PDF creation - Large images

Looking for a Java-based PDF creation library. We're currently using Apache Velocity with HTML to render PDFs on the fly.
We'd like to find a way to render large images (sometimes as big as 3000 x 1700) in a creative manner within the PDF container. For instance, a scrollable image pane within a PDF. This might not be possible within a PDF, though I might be wrong.
Open source would be ideal.

For a good PDF library you should take a look at iText: http://itextpdf.com/
I have used images of around 5000x4000 with iText without any problems.
I don't know if it is possible to create a working scrollpane inside a PDF, unless of course you were doing it through a custom PDF creator/viewer.
iText is open source, but make sure to check out the AGPL license before you use it commercially: http://itextpdf.com/terms-of-use/agpl.php

For just creating PDF files from images, iText is a little oversized. Give xsPDF a chance; it has no limits on image sizes and seems to be appropriate for your problem.

Just an FYI for anyone who may run into this in the future:
I used a library called PDFBox (http://pdfbox.apache.org/) to open a pre-existing PDF and modify it, using a custom-sized PDRectangle with the dimensions of the image. I then inserted the image and the rectangle into a new page and got the desired results.
I didn't realize you could have multiple page sizes in a single PDF.
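When sizing a page to fit an image, keep in mind that PDF page dimensions are expressed in points (1/72 inch by default), not pixels. The sketch below shows the arithmetic only. The class and method names (PageSize, toPoints) are made up, and the 150 DPI target is an assumption chosen for illustration, not something from the answer above.

```java
// Hypothetical helper for converting image pixels to PDF points.
final class PageSize {
    // Default PDF user space: 72 points per inch.
    static final float POINTS_PER_INCH = 72f;

    // Converts a pixel length to points for a chosen print resolution.
    static float toPoints(int pixels, float dpi) {
        return pixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        // Image size from the question; 150 DPI is an assumed resolution.
        float width = toPoints(3000, 150f);
        float height = toPoints(1700, 150f);
        System.out.println(width + " x " + height + " pt"); // prints "1440.0 x 816.0 pt"
    }
}
```

The resulting width and height are the kind of values that could be passed to a page rectangle such as PDFBox's PDRectangle(float width, float height). Using 72 DPI instead maps one pixel to one point.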

View PDF files in IFrame with Named Destinations

We've got an application that displays PDF files in an IFrame at specific Named Destinations. This works well on Windows systems but not Mac. In Safari, with Acrobat, the Named Destination is ignored and the document is displayed at the start.
Does anyone have any suggestions on how we might accomplish the task of displaying this information? Our initial thoughts are to:
Convert the PDF to HTML on the fly and display the HTML version in the IFrame
Convert the PDF on the page referenced to another format such as PNG etc. and display that in the IFrame
Utilize some kind of Java app that allowed us to render the PDF while honouring the Named Destination (not sure if this exists)
Any other ideas on a potential method of better displaying PDF files at Named Destination points that is a little more cross platform?
EDIT: I guess another option is to store the data in an XSL/XSLT-type format and convert it to HTML for viewing or to PDF for saving to the desktop.

Not much help, but I found that alternative ways to display PDF files (other than the Acrobat Reader client) are few and far between. As you say, the commonly accepted way to render PDFs in something that doesn't natively support them seems to be converting them to "something else" that is supported (even Acrobat.com does it this way in their Flex client, if I remember correctly).
Even converting the PDF document to other formats may be disappointing - especially if you expect a certain level of quality. It may also introduce server-side performance issues.
I realise this doesn't help anyone much but I'm interested to see if any other suggestions come up. We've dealt with this problem before in the same way, using IFrame controls (but without named destinations) but I'm very much interested in other suggestions/ideas as well.
