Convert PDF to XML - Java

I want to convert a PDF file that contains a few images into XML using Java.
Is there an API that can do this, so that all the images and text in the PDF are converted into an XML file?
Please help.

Use pdftohtml.
With Homebrew, it can be installed with brew install pdftohtml, which adds pdftohtml to your path.
To convert a PDF to XML, run pdftohtml -xml your_file.pdf your_file.xml
Then use Java or any other language to execute this command.
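Calling the command from Java could look roughly like the sketch below. It assumes pdftohtml is already on the PATH; the class and method names (PdfToXmlRunner, buildCommand, convert) are made up for illustration, and only the standard ProcessBuilder API is used.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

final class PdfToXmlRunner {

    // Builds the argument list for: pdftohtml -xml <input> <output>
    static List<String> buildCommand(Path pdf, Path xml) {
        return List.of("pdftohtml", "-xml", pdf.toString(), xml.toString());
    }

    // Runs pdftohtml and returns its exit code (assumed here: 0 means success).
    static int convert(Path pdf, Path xml) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, xml))
                .inheritIO() // forward the tool's stdout/stderr to this process
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToXmlRunner <input.pdf> <output.xml>");
            return;
        }
        int exit = convert(Path.of(args[0]), Path.of(args[1]));
        System.out.println("pdftohtml exited with code " + exit);
    }
}
```

Passing the arguments as a list instead of one command string avoids quoting problems when file names contain spaces.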

PDF is one of the worst formats to work with. It is designed for rendering 2D graphics and text documents. There are libraries that let you manipulate the objects in a PDF document, but they cannot tell you which paragraph an image relates to. You will not be able to extract the semantics easily.
XML, on the other hand, is designed to store text data in a well-structured manner, which means it carries implicit semantics. To convert from a format that has no semantics to one that does, you need to add your own logic to the conversion process; otherwise you will end up with a mess in your XML, which defeats the whole purpose of using XML.
Since PDF documents differ so much from one another, it is almost impossible to automate this without human help.
If you are really determined to do it, I suggest you use a library to read the PDF into objects and write a converter from there. You will have to handle new pages, new lines, page numbers, headers, images, graphics, tables, and much more yourself. Since XML is made mainly for text data, you will also have to deal with graphics somehow if you want to store them in XML, e.g. by converting them into Base64 strings.
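As a rough sketch of the output side of such a converter, the snippet below writes one page with its text and one Base64-encoded image using only the JDK's DOM and Transformer APIs. The element names (document, page, text, image) are a made-up schema, and the text and image bytes are assumed to come from whatever PDF library you use to read the file; that extraction step is not shown.

```java
import java.io.StringWriter;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class PageXmlWriter {

    // Hypothetical schema: document > page[number] > text, image[encoding=base64]
    static String toXml(int pageNumber, String text, byte[] imageBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("document");
            doc.appendChild(root);

            Element page = doc.createElement("page");
            page.setAttribute("number", Integer.toString(pageNumber));
            root.appendChild(page);

            Element textEl = doc.createElement("text");
            textEl.setTextContent(text); // the serializer escapes < and & for us
            page.appendChild(textEl);

            Element image = doc.createElement("image");
            image.setAttribute("encoding", "base64");
            image.setTextContent(Base64.getEncoder().encodeToString(imageBytes));
            page.appendChild(image);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("could not build XML", e);
        }
    }
}
```

Deciding which image belongs to which paragraph, and where headers or tables start and end, is exactly the logic you would still have to write yourself.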

iText is a library that allows you to create and manipulate PDF documents. It enables developers looking to enhance web- and other applications with dynamic PDF document generation and/or manipulation.
Developers can use iText to:
* Serve PDF to a browser
* Generate dynamic documents from XML files or databases
* Use PDF's many interactive features
* Add bookmarks, page numbers, watermarks, etc.
* Split, concatenate, and manipulate PDF pages
* Automate filling out of PDF forms
* Add digital signatures to a PDF file
iText is available in Java as well as in C#.

You could Base64 encode the entire PDF file's byte stream and serialize it into an XML document like "<pdf><![CDATA[BASE64ENCODEDPDFFILECONTENTS...]]></pdf>". =)
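For completeness, a minimal sketch of that idea, using only java.util.Base64. The class name PdfEnvelope and its methods are hypothetical. Note that the Base64 alphabet contains no characters that are special in XML, so the CDATA wrapper is not strictly needed; it is kept here only to match the format above.

```java
import java.util.Base64;

final class PdfEnvelope {

    private static final String OPEN = "<pdf><![CDATA[";
    private static final String CLOSE = "]]></pdf>";

    // Wraps the raw PDF bytes in a single <pdf> element.
    static String wrap(byte[] pdfBytes) {
        return OPEN + Base64.getEncoder().encodeToString(pdfBytes) + CLOSE;
    }

    // Recovers the original bytes from a string produced by wrap().
    static byte[] unwrap(String xml) {
        int open = xml.indexOf(OPEN);
        int close = xml.lastIndexOf(CLOSE);
        if (open < 0 || close < open + OPEN.length()) {
            throw new IllegalArgumentException("not a <pdf> envelope");
        }
        return Base64.getDecoder().decode(xml.substring(open + OPEN.length(), close));
    }
}
```

This preserves the file byte for byte, but none of the text or images become accessible as XML.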

Related

Creating PDFs from predefined templates in Java

My Spring application needs to create different types of PDF documents, such as invoices and certificates, each with dynamic data. I would like to have some predefined templates (HTML/text files) from which I can generate the full PDF content. A predefined template holds the full content of the PDF document, including the font size and the alignment of each section, and also contains keys that need to be replaced with actual values from the database. I know it is easy to create PDFs from HTML. Does anyone have an idea how to accept an HTML template as input in a Java program, replace the keys defined in it with actual values, and finally create the PDF?

Make existing PDFs into templates - iText

I am trying to make some existing PDFs into templates.
Because these documents hold real data, I am replacing data such as names and addresses with dummy placeholders.
Examples:
[[Name]]
[[Address1]]
When I replace the text programmatically with the iText 5 library, I can use the template.
To speed things up, I tried using Adobe DC instead.
When I use that method, the template stops working.
Any ideas?
From what I understand of your question, you want to:
* have a template document
* fill in the template with data from a program
* turn the result back into a PDF
You can easily achieve some of your goals with iText.
I suggest you look into http://developers.itextpdf.com/examples/form-examples/clone-filling-out-forms

How to extract data from a PDF file using Tika or any other library and store it in CSV/Excel format

I want to extract the data inside a PDF file and present it as a CSV/Excel sheet. I learned that this can be done with the Tika library in Java. I found out how to extract the data as plain text, but I want to know how to store it in an Excel sheet.
If someone has done this kind of work before, please help me.
The first part (and the hard one) is to parse the original data and interpret it as a table. Apache Tika will give you an XHTML representation (or call your own handler with SAX events), but it usually won't construct a table for you. Not from a PDF file, I mean, since PDF isn't a tabular format by itself.
So you'll have to take the paragraphs Tika produces, split them, and pass the resulting cells to some CSV/XLS/XLSX writer.
It might work if you have a regular table in your PDF (one line per table row, clean logical separation of cells, etc.). But it will look like parsing plain text, of course.
In case it doesn't work, you'll have to take a PDF parser (like Apache PDFBox) and try to interpret its output.
The second part (output) is simple. If CSV/SSV/TSV is suitable for you, use your preferred library to produce it (I can recommend Apache commons-csv).
But take into account that MS Excel requires a BOM in UTF-8 and UTF-16 CSV files to understand that the file isn't in a one-byte encoding (like CP-1252).
If you want the Excel XLS or XLSX format, just use Apache POI to write it.
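To illustrate the BOM point, here is a small sketch that produces UTF-8 CSV bytes starting with a BOM. It is hand-rolled with the JDK only so that it stays self-contained; it is not the commons-csv API, and the class name ExcelCsv and its methods are invented for this example. The rows are assumed to be the cells you already split out of the extracted text.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

final class ExcelCsv {

    // Quotes a cell if it contains a comma, quote or line break; doubles embedded quotes.
    static String escape(String cell) {
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Returns UTF-8 CSV bytes that start with the BOM (EF BB BF).
    static byte[] toCsvBytes(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("\uFEFF"); // U+FEFF encodes to the UTF-8 BOM
        for (List<String> row : rows) {
            sb.append(row.stream().map(ExcelCsv::escape).collect(Collectors.joining(",")));
            sb.append("\r\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

The resulting bytes can be written to disk with Files.write(path, bytes). In real code a CSV library is probably the better choice, since it handles more edge cases than this sketch.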

Replacing placeholders using iText in Java

I have a PDF that contains placeholders like <%DATE_OF_BIRTH%>. I want to be able to read in the PDF and replace the placeholders with actual text using iText.
So: read in the PDF, use something like a replaceString() method to change the placeholders, then generate the new PDF.
Is this possible?
Thanks.
The use of placeholders in PDF is very, very limited. Theoretically it can be done, and there are some cases where what you describe would be feasible, but because PDF knows very little about structure, it's hard:
* Simply extracting words is difficult, so recognising your placeholders in the PDF would already be difficult in many cases.
* Replacing text in a PDF is a nightmare because PDF files generally have no concept of words, lines and paragraphs. Hence there is no nice reflow of text, for example.
Like I said, it could theoretically work under special conditions, but it's not a very good solution.
What would be a better approach depends on your use case:
1) For some forms it may be acceptable to have the complete form as a background image or PDF file and then generate your text as an overlay on that background (filling in the blanks, so to speak). As pointed out by Bruno and mlk in the comments, in this case you can also look into using form fields, which can be filled dynamically.
2) For other forms it may be better to have your template in a structured format such as XML or HTML, do the text replacement in that format and then convert it into PDF.
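The text replacement step of option 2 could be sketched like this, reusing the <%NAME%> placeholder syntax from the question. The class TemplateFiller is hypothetical, only java.util.regex is used, and converting the filled template into a PDF is left to whatever HTML-to-PDF tool you pick.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateFiller {

    // Matches placeholders such as <%DATE_OF_BIRTH%>; group 1 is the key.
    private static final Pattern PLACEHOLDER = Pattern.compile("<%([A-Z0-9_]+)%>");

    // Replaces every known placeholder; unknown ones are left untouched.
    static String fill(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            String replacement = value != null ? value : matcher.group();
            // quoteReplacement keeps $ and \ in the value from being read as group references
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

If the values come from user input and the template is HTML or XML, they should also be escaped before insertion; that is not done here.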

How to detect different types of PDF

A PDF file can be identified by its magic signature: 25 50 44 46
However, I want to detect whether a PDF contains text or images (i.e. whether it contains text that can be searched with Ctrl+F or whether it consists of scanned documents).
Is there a way to do this?
Well technically, you could parse the PDF document structure and look for elements that contain text. I imagine this would require a big effort to implement.
So you may want to use a ready-made PDF package to do the parsing for you (PDFBox, BfoPDF or something similar). Still, I think it will require some effort to implement.
The simplest way that I know of would be to use a package that can extract the plain text for you. Apache Tika can do this. Just feed it the document and see if you get something back.
In any case, it will be hard to classify PDFs that contain both images and text.
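A small sketch of the two checks discussed here, using only the JDK: the signature check from the question (25 50 44 46 is "%PDF" in ASCII) and the "see if you get something back" test on text that has already been extracted. The class PdfSignature is hypothetical, and the extraction itself (for example with Tika) is assumed to happen elsewhere.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class PdfSignature {

    private static final byte[] MAGIC = {0x25, 0x50, 0x44, 0x46}; // "%PDF"

    // True if the given bytes start with the PDF magic signature.
    static boolean hasPdfSignature(byte[] header) {
        if (header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads only the first four bytes of the file and checks them.
    static boolean isPdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasPdfSignature(in.readNBytes(MAGIC.length));
        }
    }

    // Crude heuristic: any non-whitespace extracted text counts as "searchable".
    static boolean hasExtractableText(String extractedText) {
        return extractedText != null && !extractedText.isBlank();
    }
}
```

As noted above, this heuristic cannot tell a pure text PDF from a mixed one; it only separates "some text came back" from "nothing came back".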
