RTF Java Parser

RTF Java Parser - java

here is my issue.
I need to read an RTF document and render to a webpage (some sort of google docs) but these documents are templates, the idea, is that user can only edit certain text and not the text that is marked to be "template logic".
So far I've seen a bunch of RTF libraries that performs only rendering but wont let you access an object that can be iterated dynamically to go over the structure of the RTF document.
My idea is to determine what can be editable and what can not, put all that info (images, text, tables, headers, footers) into a json and send it to my JS client.
Maybe this is a crazy idea, any suggestions?

When I read "template", I think "Velocity". I wonder if you can solve this by separating template from dynamic data. I wonder if you can solve this by letting users modify dynamic data and only marry it with the static, unedited template at the last minute.

It's possible that Docmosis can help because it lets you use documents and templates and you can extract from Docmosis an "analysis" of the template (eg a list of fields). It's hard to be sure if it will fit your purpose though from your description. Please note I work for the company that produces Docmosis.

Related

How to fill XFA (PDF) form automatically?

I'm looking for a free option to fill a XFA PDF form. I know that iText is an option but their commercial prices are too expensive for me, I'd prefer something fully open-source. There is PDFBox but it doesn't seem to allow for inserting data into XFA forms, or at least there's very little explaining how to.
I simply need to fill in certain fields in my form with certain data in a text file. Can you recommend or direct me to a solution? Thank you greatly

XFA is an enterprise product, with enterprise pricing…
XFA consists of an XML part and a PDF part which represents kind of what is in the XML part.
You may try to fill data into the XML part, and then see whether Acrobat/Reader can render it properly. No guarantee that it works, but worth a try. If that does work, you'd have shifted the problem to something maybe a bit more managable.

How to manipulate a Check box in Word with Java and save as PDF?

I need to edit some Check-Boxes in a big Wordfile (docx) and save this then as PDF. This file contains many images and is about 19MB big.
Maybe there will be the need of adding some Checkbox and text.
My idea was to use docx4j, but before to learn the ropes I want to ask if this is possible and which is the best way.
May it be better to save the document as a PDF and then use this as base for processing?

Yes, you can manipulate checkboxes using docx4j.
Be aware that there are several different kinds of checkboxes:
legacy checkbox
content control checkbox
checkbox character
and the details depend on which type are present.
For more, you should post a snippet of the relevant OpenXML (and as they say here on SO, code showing what you've tried).

Is it necessary to use only docx4j?
Recently i tried a solution that helps me manage a Word document with checkboxes and save it as a PDF file. I used Plumsail Documents. The case is about how to populate a Word template using a form with checkboxes. You can connect your app via Zapier or Power Automate to activate checkboxes depending on value from your app. You can set the resulting file as a PDF and deliver it by email and across any system using Zapier and Power Automate.
The great is that Plumsail Documents has a templating engine that allows it to operate pictures.
Your case may be like this:
Create a form in Plumsail Form. It will allow you to activate checkboxes depending on your needs, or your users' needs.
Create a process in Plumsail Documents, upload your Word document and set it as a template. Just put placeholders where you want to change or fill a document with some values or data. Set the resulting document in PDF format.
Set the delivery method. Save across apps or deliver by email.
I recommend you to read the article. That solution is not free, but there is a free 30-day trial, so you will have enough time to try it.

How to create templates from html page automatically?

I have a use case in which I need to render an unformatted text in the format of a given web page programmatically in Java. i.e. The text should automatically be formatted like the web page with styles, paragraphs, bullet points etc.
As I see first I will have to analyze the piece of unformatted text to find out the candidates for paragraphs, bullet points, headings etc. I intend to use Lucene analyzers/tokenizers for this task. Are there any alternatives?
The second problem is to convert the formatted web page into some kind of template (e.g. velocity template) with place holders for various entities like titles, bullet points etc.
Is there any text analysis/templating library in Java that can help me do this? Preferably open source.
Are there any other suggestions for doing this sort of task in a better way in Java?
Thanks for your help.

There are a lot of hard parts to what you're doing.
The user input
If you don't ask your user to provide any context, you're never going to guess the structure of the text. At least, you should ask them to provide a title, and a series of paragraph in your GUI.
Ideally, you could ask them to follow a well-know markup language (Markdown, Textile, etc...) and use the open source parser to extract the structure.
The external page
If any page is used, the only things you can rely on are the "structural markup". So assuming you know the title of the page should be "Hello World", and there is a "h1" element somewhere in the page, you can maybe assume that this is where the header could go.
But if the pages is a div tag-soup, and only CSS is used to differentiate the rendering of the header as opposed to the bulk of the text, you're going to have to guess how the styling is done : that's plain impossible if you don't know how the page is made.
I don't think Lucene would help fo this (as far as I know Lucene is made to create an index of the words used in a bulk of text ; I don't think it can help you guessing which part of the text is meant to be a title, a subtitle, etc...)
Generating templates from external page
Assuming you have "guessed" right, you could generate the content by
copy pasting the page
replacing the parts to change with tags of your template language of choice
storing the template somewhere the templating system can access it
configure your template / view system (viewResolver for velocity) to use the right template for the rigth person
That would of course pose terrible legal questions, since your templates would incorporate works by the original website author (most probably copyrighted material)
A more realistic solution
I would suggest you constrain your problem to :
using input that has some structure information available (use a GUI to enter it, use a markup language, whatever)
using templates that you provide, know the structure of (and can reuse very easily)
Note that none of those points are related to the template system.
Otherwise, I'm afraid you're heading to an unreasonnable amount of work...

How to detect different data types inside HTML page?

What is the best way to detect data types inside html page using Java facilities DOM API, regexp, etc?
I'd like to detect types like skype plugin does for the phone/skype numbers, similar for addresses, emails, time, etc.

'Types' is an inappropriate term for the kind of information you are referring to. Choice of DOM API or regex depends upon the structure of information within the page.
If you know the structure, (for example tables being used for displaying information, you already know from which cell you can find phone number and which cell you can find email address), it makes sense to go with a DOM API.
Otherwise, you should use regex on plain HTML text without parsing it.

I'd use regexes in the following order:
Extract only the BODY content
Remove all tags to leave just plain text
Match relevant patterns in text
Of course, this assumes that markup isn't providing hints, and that you're purely extracting data, not modifying page context.
Hope this helps,
Phil Lello

What technologies are there for formatted, structured data input and output?

I am working on a project here that ingests internal resumes from people at my company, strips out the skills and relevant content from them and stores it in a database. This was all done using docx4j and Grails. This required the resumes to first be submitted via a template that formatted everything just right so that the ingest tool knew what to look for to strip the data.
The 2nd portion of this, is what if we want to get out a "reduced" resume from the database. In other words, I want to search the uploaded content I now have, and only print out new resumes for people who have Java programming experience lets say. So I can go into my database, find the people who originally had java as a skill, and output a new set of resumes that are also still in a nice templated format, and only have the relevant info in them, instead of ALL the content.
I have been writing some software to do this in Java that will basically use a docx template, overwriting the items in customXML which are bound to the content controls in the doc, so the new data shows up and can eb saved as a new docx with that custom data.
This seems really cumbersome to me, and has some limitations. For one, lets say my template has a place for 3 Skills, and the particular person has 8 skills. There seems to be no good way to add those 5 additional skills to the docx other than painstakingly inserting the data with all of the formatting XML tags and such. This is a real pain, because if the template changes, I dont want to have to go back into my software and edit source code to change that additional data input XML tag to bold instead of italic.
I was doing some reading up on using Infopath to create a form that I could use to get the input, connecting to some sharepoint data source or something to store the stripped out data. However, I can't seem to find out if it is possible using sharepoint to get the data back out, in a nice formatted way. What would the general steps for this be? It seems like I couldnt find very much about this topic with any quick googling.
Thanks

You could set up the skills:
<skills>
<skill>..</skill>
<skill>..</skill>
and use a "repeat" content control pointing to the container. This would handle any number of <skill> entries.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.