Replacing placeholders using iText in Java

I have a PDF that contains placeholders like <%DATE_OF_BIRTH%>. I want to be able to read in the PDF and replace the placeholder values with text using iText.
So: read in the PDF, use maybe a replaceString() method to change the placeholders, then generate the new PDF.
Is this possible?
Thanks.

The use of placeholders in PDF is very, very limited. Theoretically it can be done, and there are some instances where it would be feasible to do what you say, but because PDF knows very little about structure, it's hard:
* Simply extracting words is difficult, so recognising your placeholders in the PDF would already be difficult in many cases.
* Replacing text in PDF is a nightmare because PDF files generally don't have a concept of words, lines and paragraphs. Hence no nice reflow of text, for example.
Like I said, it could theoretically work under special conditions, but it's not a very good solution.
What would be a better approach depends on your use case:
1) For some forms it may be acceptable to have the complete form as a background image or PDF file and then generate your text as an overlay to that background (filling in the blanks, so to speak). As pointed out by Bruno and mlk in the comments, in this case you can also look into using form fields, which can be filled dynamically (see the sketch after this list).
2) For other forms it may be better to have your template in a structured format such as XML or HTML, do the text replacement in that format and then convert it into PDF.
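If the form-field route fits your use case, filling an AcroForm with iText 5 is straightforward. A minimal sketch, assuming the template already contains a text field named DATE_OF_BIRTH (file and field names here are just placeholders):

import java.io.FileOutputStream;

import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

public class FillTemplate {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("template.pdf");               // template containing form fields
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("filled.pdf"));
        AcroFields form = stamper.getAcroFields();
        form.setField("DATE_OF_BIRTH", "1980-05-12");                   // field name is an assumption
        stamper.setFormFlattening(true);                                 // make the filled form read-only
        stamper.close();
        reader.close();
    }
}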

Related

Extract paragraph sample from file

I have an unknown file type uploaded. It can be doc, pdf, xls, etc.
My ultimate goal is to:
1) Determine whether there are paragraphs of text in the file (as opposed to, say, a bunch of picture captions or text from a chart or table).
2) If (1) is true and there are paragraphs of text, extract a few sample paragraphs from the file.
I know that I can use a program like Apache Tika to extract the file to a String.
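For reference, the extraction step I have in mind is just the Tika facade (a rough sketch; the file name is a placeholder):

import java.io.File;

import org.apache.tika.Tika;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        // Tika auto-detects the format (doc, pdf, xls, ...) and returns the plain text
        String text = new Tika().parseToString(new File("upload.bin"));
        System.out.println(text);
    }
}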
However, I would like to also get the format of the extracted text and determine where there are paragraphs of full, written text (as opposed to captions, etc.).
So I also would like a way to analyze the extracted text. Specifically, I would like a library that can identify full, written paragraphs, as opposed to text that was simply taken from things like photo captions, charts, etc.
While Tika is a rather large library, I would be willing to add it if it can perform the tasks that I need.
However, I can not find anything in Tika that would allow me to analyze the structure of the text in such a way.
Is there something I missed?
Other than Tika, I am aware of some APIs for analyzing text, specifically Comprehend or Textract, but I still couldn't find anything that can ensure the extraction of full, written paragraphs as I require.
I am looking for any suggestion using the libraries I listed above or others. Again, I'd like to avoid things like photo captions and such and only get text that was part of full, written paragraphs.
Is there any library that can help me with this or will I have to code the logic myself (for detecting paragraphs as well as detecting the difference between full paragraphs and text that was extracted from charts and captions)?

Use flexmark-java to clean markdown

Within a Java application, I need to convert Markdown text into simple plain text instead of HTML (for example, dropping all link addresses and bold and italic markers).
What is the best way to do this? I was thinking of using a markdown library like flexmark, but I can't find this feature at first sight. Is it there? Are there other, better alternatives?
Edit
Commonmark supports rendering to text, by using org.commonmark.renderer.text.TextContentRenderer instead of the default HTML renderer. Not sure what it does with newlines, but worth a try.
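A minimal sketch of that commonmark route (default builder options assumed):

import org.commonmark.node.Node;
import org.commonmark.parser.Parser;
import org.commonmark.renderer.text.TextContentRenderer;

public class MarkdownToPlainText {
    public static String toPlainText(String markdown) {
        Parser parser = Parser.builder().build();
        Node document = parser.parse(markdown);
        // Render the parsed node tree as plain text instead of HTML
        return TextContentRenderer.builder().build().render(document);
    }
}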
Original answer, using flexmark HTML + JSoup
The ideal solution would be to implement a custom Renderer for flexmark, but this would force you to write a model-to-string for all language features in markdown. Unless it supports this out of the box, but I'm not aware of this feature...
A simpler solution may be to use flexmark (or any other lightweight markdown renderer) and let it create the HTML. After that, just run the generated HTML through https://jsoup.org/ and let it extract the text:
Jsoup.parse(html).text();   // html = the HTML String produced by flexmark
String org.jsoup.nodes.Element.text()
Gets the combined text of this element and all its children. Whitespace is normalized and trimmed.
For example, given HTML <p>Hello <b>there</b> now! </p>, p.text() returns Hello there now!
We use this approach to get a "preview" of the text entered in a rich content editor (summernote), after being sanitized with org.owasp.html.HtmlSanitizer.
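Putting the two steps together, a sketch with flexmark + jsoup could look like this (default builder options; exact flexmark import paths vary a little between versions):

import org.jsoup.Jsoup;

import com.vladsch.flexmark.html.HtmlRenderer;
import com.vladsch.flexmark.parser.Parser;

public class MarkdownPreview {
    public static String toPlainText(String markdown) {
        Parser parser = Parser.builder().build();
        HtmlRenderer renderer = HtmlRenderer.builder().build();
        String html = renderer.render(parser.parse(markdown));   // markdown -> HTML
        return Jsoup.parse(html).text();                         // HTML -> normalized plain text
    }
}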
flexmark also has a markdown-to-text feature; check that out.

Parsing html text to obtain input fields

So I currently have a big blob of html text, and I want to generate an input form based on what is contained in that text. For example, if the text contains '[%Name%]', I want to be able to read that in and recognize 'Name' is there, and so in turn enable a form field for name. There will be multiple tags ([%age%], [%height%], etc.)
I was thinking about using Regex, but after doing some research it seems that Regex is a horrible idea to parse html with. I came across parsing html pages with groovy, but it is not strictly applicable to my implementation. I am storing the html formatted text (which I am creating using ckeditor) in a database.
Is there an efficient way to do this in Java/Groovy? Or should I just create an algorithm similar to the examples shown here? (I'm not too sure how effective the given algorithms would be, as they seem to be constructed around relatively small strings, whereas my string to parse may end up being quite large: a 15-20 page document.)
Thanks in advance
Instead of reinventing the wheel, I think it's better to use jsoup. It is an excellent tool for your task, and it would be easy to obtain anything in an HTML page using its selector syntax. Check out the examples of usage in their cookbook.
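A minimal sketch of that idea, assuming the [%...%] placeholders appear as plain text inside the markup (class and method names are made up):

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;

public class PlaceholderScanner {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[%\\s*(\\w+)\\s*%\\]");

    // Returns the distinct placeholder names found in the HTML, e.g. "Name", "age", "height"
    public static Set<String> findPlaceholders(String html) {
        String text = Jsoup.parse(html).text();   // strip the markup, keep the visible text
        Set<String> names = new LinkedHashSet<>();
        Matcher m = PLACEHOLDER.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }
}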

How to detect different types of PDF

A PDF file can be identified by its magic signature: 25 50 44 46 ("%PDF").
However, I want to detect whether a PDF contains text or image (i.e. whether the PDF contains text that can be searched with ctrl+f OR whether it contains scanned documents)
Is there a way to do this?
Well technically, you could parse the PDF document structure and look for elements that contain text. I imagine this would require a big effort to implement.
So you may want to use a premade PDF package to do the parsing for you (PDFBox, BfoPDF or something similar). Still, I think it will require some effort to implement.
The simplest way that I know of would be to use a package that can extract the plain text for you. Apache TIKA can do this. Just feed it the document and see if you get something back.
In any case it will be hard to classify PDFs that contain both images and text.
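A rough sketch of that kind of check, here with PDFBox 2.x (the length threshold is an arbitrary assumption; the same idea works by feeding the file to Tika and inspecting the returned string):

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextCheck {
    // Returns true if the PDF yields a non-trivial amount of extractable (searchable) text
    public static boolean hasSearchableText(File pdf) throws Exception {
        try (PDDocument doc = PDDocument.load(pdf)) {
            String text = new PDFTextStripper().getText(doc);
            return text.trim().length() > 50;   // threshold chosen arbitrarily
        }
    }
}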

convert pdf to xml

I want to convert a PDF file containing a few images into XML using Java.
Is there any API through which this can be done, so that all the images and text of the PDF are converted into an XML file?
Please help.
Use pdftohtml.
It can be installed with brew install pdftohtml. This adds pdftohtml to your path.
So, to convert pdf to xml, you can run pdftohtml -xml your_file.pdf your_file.xml
Then, just use Java or any other language to execute this command.
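A minimal sketch of running it from Java (assumes pdftohtml is on the PATH):

import java.io.IOException;

public class PdfToXml {
    public static void convert(String pdfPath, String xmlPath) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("pdftohtml", "-xml", pdfPath, xmlPath);
        pb.inheritIO();   // forward the tool's output to this process's console
        int exit = pb.start().waitFor();
        if (exit != 0) {
            throw new IOException("pdftohtml failed with exit code " + exit);
        }
    }
}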
PDF is one of the worst formats to work with. It is designed for rendering 2D graphics and text documents. There are libraries which allow you to manipulate the objects in a PDF document, but they will not be able to tell you which paragraph an image belongs to. You will not be able to extract its semantics easily.
On the other hand, XML is designed to store text data in a well-structured manner. This means it carries implicit semantics. In order to convert from a format that has no semantics to one that does, you will need to add your own logic to the conversion process; otherwise you will just end up with a mess in your XML, which contradicts the whole purpose of using XML.
Since each PDF document is very different, it is almost impossible to automate this without human aid.
If you are really determined to do it, I suggest you use a library to read the PDF into objects and start writing a converter from there. You will have to take care of page breaks, line breaks, page numbers, headers, images, graphics, tables, and much more by yourself. Since XML is made mainly for text data, you will have to deal with graphics somehow if you want to store them in XML, e.g. convert graphics into Base64 strings.
iText is a library that allows you to create and manipulate PDF documents. It enables developers looking to enhance web- and other applications with dynamic PDF document generation and/or manipulation.
Developers can use iText to:
* Serve PDF to a browser
* Generate dynamic documents from XML files or databases
* Use PDF's many interactive features
* Add bookmarks, page numbers, watermarks, etc.
* Split, concatenate, and manipulate PDF pages
* Automate filling out of PDF forms
* Add digital signatures to a PDF file
iText is available in Java as well as in C#.
You could Base64 encode the entire PDF file's byte stream and serialize it into an XML document like "<pdf><![CDATA[BASE64ENCODEDPDFFILECONTENTS...]]></pdf>". =)
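Tongue-in-cheek, but if that really is all you need, a sketch (file names are placeholders; Base64 output never contains "]]>" so the CDATA section stays valid):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class PdfToXmlBlob {
    public static void wrap(Path pdf, Path xml) throws Exception {
        String b64 = Base64.getEncoder().encodeToString(Files.readAllBytes(pdf));
        // The whole PDF ends up as one Base64 blob inside a single XML element
        Files.writeString(xml, "<pdf><![CDATA[" + b64 + "]]></pdf>");
    }
}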
