Extracting all images and text from pdf file - java

I need to create json from pdf to render the pdf content as HTML with all the images and text. I have tried the modules below to do that. I am able to extract only plain images now, but not able to extract the graphical images and background shadow images. Is there any module to get these?
Modules tried
-PDFMiner (python)
-Mammoth(Node)
-pdf2json(Node)
-PDFBox(Java)

Have a look at http://pythonhosted.org/PyMuPDF/. Apparently this product renders pages in various formats, including json. Although I have limited experience with it, the recipe at http://code.activestate.com/recipes/580703-extract-images-of-a-pdf-optionally-by-page-using-p/history/1/ shows how to use PyMuPDF to extract images from a PDF.

Related

How to differentiate text and images from a PDF file using java?

So i have to make an android app using Java that reads a PDF File and displays it on screen without using other programs(such as PDF Reader). How to make a distinction between text and image in that file? in other words, there is text and in between text ther is an image, how do i verify where it is text and where is an image?
PDF files don't work like that.
It is a complex format, and there is a lot more data in the files than just text and images, such as metadata and formatting.
If you want to handle PDF files in your app, you should use a PDF library, such as the ones listed here:
https://camposha.info/android-examples/android-pdf-libraries/#gsc.tab=0
How exactly to load text will depend on the specific library you choose, and you should check the relevant documentation.

Using qoppa to create pdf with table structure just like we do using itext pdf library

Sample pdf
A sample pdf is shown in image. We need to create 2 column structure which can have text/images/figures etc. Moreover, we need to change the text format like font/size, auto wrapping etc. Text content will be dynamic as we don't know the content at compile time hence, it should be able to align itself after paragraph ends and we should not need to provide hard coded value for height. Giving absolute positions of components in qoppa to create pdf is not feasible for us because of dynamic content.
We've already explored qoppa library and we couldn't figure out how to make a pdf like shown in image. If anyone has worked on qoppa, Please do share the valuable resources available online related to qoppa. And Please, let me know if it is possible to create a pdf like this using qoppa.

how to do photo and text extraction form an online pdf

I know that there is already PDFbox and iText but they don't have the ability for visual content extraction as well as need to work offline with the pdf. withal, I want a way to do some text and visual content extraction online. do not want to download the pdf file and then do stuff. what kind of API or library is there for Java language?
EDIT for those who find it not clear, I explain some more:
Just imagine when using any HTML parser you can parse a page online, make the DOM or SAX tree and going through their elements and then extracting photos and text based on the content of the nodes in those trees. at least, for photos, you can get their corresponding HTML tags and for text, the same plus you can get actual text. now, I want to know if there is anything similar for doing with PDFs? going through text and images without downloading
Gnostice PDFOne (for Java) has a getPageElements() method that can parse a PDF page for text and image elements. Text in a PDF is not in a DOM like a HTML or XML document. Text just appears in various x-y coordinates and magically looks well-formatted. However, PDFOne has some PDF text extraction methods that reconstruct those text elements to user-friendly sentences. DISCLOSURE: I work for the company that makes this library.
PDFImageStream can do that. There is a free version with only one restriction: it can only be used in single-threaded applications.

Custom PDF creation - Large images

Looking for a Java based PDF creation library. We're currently using Apache Velocity with HTML to render PDFs on the fly.
We'd like to be able to find a way to render large images (sometimes as big as 3000 x 1700) in a creative manner within the PDF container. For instance, a scrollable image pane within a PDF. This might not be possible within a PDF, I might be wrong.
Open source would ideal.
For a good PDF library you should take a look at iText: http://itextpdf.com/
I have used images of around 5000x4000 with iText without any problems.
I don't know if it is possible to create a working scrollpane inside a PDF, unless of course you were doing it through a custom PDF creator/viewer.
iText is open source but make sure to check out the AGPL license before you use it commecrially: http://itextpdf.com/terms-of-use/agpl.php
For just creating PDF files from images iText is a little overdimensioned. Give xsPDF a chance, it has no limits for images sizes and seems to be appropriate for your problem.
Just a FYI for anyone that may run into this in the future:
I used a library called PDFBox (http://pdfbox.apache.org/) to open a pre-existing PDF and modify the PDF with a custom sized PDFRectangle with the dimensions of the image. Then inserted the image and rectangle into that new page and got the desired results.
I didn't realize you could have multiple page sizes in a single PDF.

Java: Embed a HTML or PDF into a JFrame

I am open to alternative solutions, so here is my problem.
I have 111 PDFs that contain information on various degree programs. I can convert them to HTML using freeware.
The problem with the HTML is that it contains CSS, JEditorPane doesn't display the webpage, and the PDF libraries are slow and bulky.
I want to have a JCombobox where users can select a page to view, and have it appear below the box.
Any ideas on the best method?
Use iText API for PDF and DJ Project for HTML/webPage.
Most libraries that convert PDF to HTML can also convert PDF to image.
If you can convert to SVG and display it using one of the SVG libraries (E.g. Batik), that would be one way you could display it without losing any functionality like zoom in/out.
Otherwise you can convert PDF to high-res JPG/PNG and display it in your app.

Categories

Resources