How to differentiate text and images from a PDF file using java?

I have to make an Android app in Java that reads a PDF file and displays it on screen without relying on another program (such as a PDF reader). How do I distinguish between text and images in that file? In other words, when an image sits between passages of text, how do I determine which parts are text and which are images?

PDF files don't work like that.
PDF is a complex format, and a file contains much more than text and images, including metadata and formatting instructions.
If you want to handle PDF files in your app, you should use a PDF library, such as the ones listed here:
https://camposha.info/android-examples/android-pdf-libraries/#gsc.tab=0
How exactly to load text depends on the library you choose, so check its documentation.
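
To illustrate why there is no simple "text here, image there" layout: a page's content stream is a sequence of drawing operators. Text is shown with operators such as Tj and TJ inside BT ... ET blocks, while images are painted with Do, which names an external object (XObject). The sketch below is only a rough illustration, assuming the content stream has already been decompressed into a string. The class ContentStreamSketch and its "TEXT"/"XOBJECT" labels are made up for this example, the tokenizer ignores hex strings, comments and inline images, and Do may also refer to a form XObject rather than an image.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: labels the drawing operators found in a decoded PDF content stream. */
final class ContentStreamSketch {

    /** Returns "TEXT" or "XOBJECT" for each matching operator, in stream order. */
    static List<String> classify(String stream) {
        List<String> kinds = new ArrayList<>();
        int n = stream.length();
        int i = 0;
        while (i < n) {
            char c = stream.charAt(i);
            if (Character.isWhitespace(c) || c == '[' || c == ']') {
                i++; // separators and array brackets carry no operator
            } else if (c == '(') {
                i = skipLiteralString(stream, i); // operands such as (Hello) are not operators
            } else {
                int start = i;
                while (i < n && !isBreak(stream.charAt(i))) {
                    i++;
                }
                String token = stream.substring(start, i);
                if (token.equals("Tj") || token.equals("TJ")
                        || token.equals("'") || token.equals("\"")) {
                    kinds.add("TEXT");     // text-showing operators
                } else if (token.equals("Do")) {
                    kinds.add("XOBJECT");  // usually an image, possibly a form
                }
            }
        }
        return kinds;
    }

    private static boolean isBreak(char c) {
        return Character.isWhitespace(c) || c == '(' || c == '[' || c == ']';
    }

    /** Skips a literal string "( ... )", honoring nested parentheses and backslash escapes. */
    private static int skipLiteralString(String s, int open) {
        int depth = 0;
        int i = open;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++; // also skip the escaped character
            } else if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
            i++;
        }
        return s.length(); // unterminated string: stop at end of input
    }

    public static void main(String[] args) {
        String stream = "BT /F1 12 Tf 72 700 Td (Hello) Tj ET q 200 0 0 100 72 500 cm /Im1 Do Q";
        System.out.println(classify(stream)); // prints [TEXT, XOBJECT]
    }
}
```

Real files usually compress these streams and spread fonts, images and encodings across other objects, which is why a library is the practical route.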

Related

How to recognize/replace images between text in a PDF file? Using PDF Box Java Spring Boot

PDF Example
I have a PDF file in which images and text are mixed, as in the picture above.
I want to write a parser that converts such PDF files into data with a fixed format.
My question is simple: how can I replace each image with text? PDFTextStripper skips all images.
Extracting the text and turning it into data works, but the images between the text passages are not recognized.
I'm using PDFBox and extract the text with PDFTextStripper().getText().
I have also managed to extract the images individually.
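
One possible approach, sketched under assumptions: since both the text and the images can already be extracted, record where each one sits on the page and merge the two lists in reading order, emitting a placeholder wherever an image occurs. How the positions are collected depends on the library (in PDFBox, a commonly used hook is overriding writeString in a PDFTextStripper subclass, which receives TextPosition objects). The Item class, its fields, and the [IMAGE:...] placeholder format below are hypothetical, and the sample coordinates are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch: interleaves text lines and image placeholders by page position. */
final class ReadingOrderSketch {

    /** One piece of page content; "top" is the assumed distance from the top of the page. */
    static final class Item {
        final int page;
        final float top;
        final String content; // the text line, or a name for the image
        final boolean image;

        Item(int page, float top, String content, boolean image) {
            this.page = page;
            this.top = top;
            this.content = content;
            this.image = image;
        }
    }

    /** Sorts all items by page, then top to bottom, and renders images as placeholders. */
    static String merge(List<Item> textLines, List<Item> images) {
        List<Item> all = new ArrayList<>(textLines);
        all.addAll(images);
        all.sort(Comparator.comparingInt((Item it) -> it.page)
                .thenComparingDouble(it -> it.top));
        StringBuilder out = new StringBuilder();
        for (Item it : all) {
            out.append(it.image ? "[IMAGE:" + it.content + "]" : it.content).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Item> text = Arrays.asList(
                new Item(1, 100f, "Question 1", false),
                new Item(1, 300f, "Answer: 42", false));
        List<Item> images = Arrays.asList(new Item(1, 200f, "img1", true));
        System.out.print(merge(text, images));
    }
}
```

The placeholder can then be parsed like any other token in the fixed-format output. Multi-column layouts would need a smarter ordering than a plain top-to-bottom sort.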

Edit SVG xml text content after converted from PDF

I am using Inkscape to convert my PDF to an SVG file, and I would like to change the text content by editing the SVG's XML. However, the font of the changed text looks very different, and its alignment is far off from the original position.
How can I edit the text content in the SVG? Is there any other tool that can convert the PDF to SVG and let me edit the text content?

The two formats differ in ways that can cause problems when converting from PDF to SVG; have a read through this guide. The suggestion is to try pdf2svg if you don't mind getting your hands dirty.
Excerpt below:
Conversion with Inkscape
Download Inkscape from www.inkscape.org (version 0.46 and above)
Download the PDF you want to convert
Run Inkscape
Open the PDF file you want to convert in Inkscape (not Acrobat)
Uncheck Embed images on the box that comes up and click OK
Wait a little while as Inkscape converts it
Click File > Save As...
Under Save as type:, choose "Plain SVG (*.svg)"
Click Save in the bottom right corner
Done! You now have an SVG file with the same name as the PDF, but with the .svg extension
Before uploading, you may want to check its W3C validity with the SVG-check tool
For checking that it displays properly, upload it first to Test.svg
Upload the SVG to Wikimedia Commons and tag it with {{Extracted with Inkscape|v}}
Conversion with PDF2SVG
Some versions of Inkscape do not have PDF support compiled in; also, text importing does not always produce satisfactory results in Inkscape. In that case, you might try performing the conversion with the PDF2SVG command line tool. (It requires that Poppler, Cairo, and X are installed on your system.)
Get PDF2SVG from http://www.cityinthesky.co.uk/opensource/pdf2svg/ and compile it. If you are using Linux or FreeBSD or MacPorts, PDF2SVG might also be installable via the package installer.
Convert the PDF with pdf2svg file.pdf file.svg
If necessary use Inkscape to edit the resulting SVG.
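
Once an SVG exists, its text can also be changed programmatically instead of by hand. The following is a minimal sketch, assuming the converter wrote the text into tspan elements (Inkscape commonly does, but the structure of a given file should be checked). It uses only the XML classes bundled with Java; the class name SvgTextEditSketch and the method replaceTspanText are made up for this example. Only the character data is replaced, so font and position attributes stay as they were.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative sketch: swaps the text inside matching tspan elements of an SVG string. */
final class SvgTextEditSketch {

    static String replaceTspanText(String svg, String oldText, String newText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(svg)));

            // Touch only the character data; attributes such as x, y and style are left alone.
            NodeList spans = doc.getElementsByTagName("tspan");
            for (int i = 0; i < spans.getLength(); i++) {
                Node span = spans.item(i);
                if (oldText.equals(span.getTextContent())) {
                    span.setTextContent(newText);
                }
            }

            // Serialize the modified DOM back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not edit SVG", e);
        }
    }

    public static void main(String[] args) {
        String svg = "<svg xmlns=\"http://www.w3.org/2000/svg\">"
                + "<text style=\"font-family:Arial\"><tspan x=\"10\" y=\"20\">Old title</tspan></text>"
                + "</svg>";
        System.out.println(replaceTspanText(svg, "Old title", "New title"));
    }
}
```

If the converter stored per-glyph positions on the element, a replacement string of a different length may still look misaligned, and the font named in the style attribute has to be available on the viewing machine.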

Extracting all images and text from pdf file

I need to create JSON from a PDF in order to render the PDF content as HTML, with all the images and text. I have tried the modules below. So far I can extract only plain images, not the graphical images and background shadow images. Is there any module that can extract these?
Modules tried:
- PDFMiner (Python)
- Mammoth (Node)
- pdf2json (Node)
- PDFBox (Java)

Have a look at http://pythonhosted.org/PyMuPDF/. Apparently this product renders pages in various formats, including json. Although I have limited experience with it, the recipe at http://code.activestate.com/recipes/580703-extract-images-of-a-pdf-optionally-by-page-using-p/history/1/ shows how to use PyMuPDF to extract images from a PDF.

Generate PDF or Similar to it

I'm creating a Java application, and I want to create, display, and print a PDF file.
Like this example:
http://img11.hostingpics.net/pics/331702Sanstitre.jpg
Can you tell me the right way to do it?
I mean, is this a PDF file displayed in a JPanel, or something else?
Thanks a lot.

For working with PDF files, I would recommend a library such as Apache PDFBox, which can read, write, and print PDF files (printing is handled by org.apache.pdfbox.PrintPDF).
The API can be found here.
As for displaying it in the JFrame, you can simply read the text and show it in a Swing text area (JTextArea).

To generate PDF files, you can use the JasperReports library. It is a popular API for creating PDF files from a template into which specific data is inserted. Template files have the ".jrxml" extension and can be created and edited with Jaspersoft Studio. The templates look like forms with variable fields, which is very useful for generating different kinds of reports.
The API for the JasperReports library can be found here.

Java DOCX file Viewer

Currently I'm developing an application that allows users to create a template and generate it into a DOCX file. The application needs to be able to display to users the changes in the template as the user is creating it.
The approach I tried uses the docx4j library (which allows manipulation of DOCX files) together with ICEPDF, which displays the document in a Swing component after it has first been converted to a PDF file. The problem with this approach is that it loads quite slowly, and some of the changes made in the DOCX file are not reflected in the PDF conversion (for example, dashed underlines and font changes). When I open the DOCX output in MS Word, the file is displayed correctly, so I know the changes are applied; it seems that ICEPDF just can't show them properly.
So I was wondering if anyone knows a Java library that allows DOCX files to be viewed directly in a Swing component, without converting them to PDF first.

You can try docx4all or DocxEditorKit. Both of these are built around docx4j.
