Use flexmark-java to clean markdown

Within a Java application, I need to convert Markdown text into plain text instead of HTML (for example, dropping all link addresses and the bold and italic markers).
What is the best way to do this? I was thinking of using a Markdown library like flexmark, but I can't find this feature at first sight. Is it there? Are there better alternatives?

Edit
Commonmark supports rendering to text, by using org.commonmark.renderer.text.TextContentRenderer instead of the default HTML renderer. Not sure what it does with newlines, but worth a try.
Original answer, using flexmark HTML + JSoup
The ideal solution would be to implement a custom Renderer for flexmark, but this would force you to write a model-to-string conversion for every language feature in Markdown, unless flexmark supports this out of the box (I'm not aware of such a feature).
A simpler solution may be to use flexmark (or any other lightweight Markdown renderer) to generate the HTML, then pass that HTML string to jsoup (https://jsoup.org/) and let it extract the text:
Jsoup.parse(html).text();
String org.jsoup.nodes.Element.text()
Gets the combined text of this element and all its children. Whitespace is normalized and trimmed.
For example, given HTML <p>Hello <b>there</b> now! </p>, p.text() returns Hello there now!
We use this approach to get a "preview" of the text entered in a rich content editor (summernote), after being sanitized with org.owasp.html.HtmlSanitizer.

flexmark also has a Markdown-to-text feature.

Related

Limiting Markdown to underline/bold/italic in Java converter

I am searching for a way to allow a user to format their text. The formatting is limited to:
underline
italic
bold
enumeration
I would like to use Markdown and convert the Markdown to HTML on the server side.
My problem is that Markdown supports a lot more formatting than I want to allow (headings, tables, ...).
Do you know a Markdown library where I can whitelist underline/italic/bold/...?
If there is no whitelisting, I thought about cleaning up the resulting HTML with jsoup. Is that a preferred way?
Thank you.

There are a few different ways this could be accomplished. Which one you choose depends on which libraries you use (suggesting specific tools is off-topic on StackOverflow) and exactly what behavior you are looking for. You can find a summary of each approach below.
Modify a Markdown parser.
Some parsers provide an API to allow you to modify their behavior. You could perhaps remove the bits and pieces which parse tables, headers, etc. and leave the rest in place. Your final output would then leave in any Markdown syntax for those features. For example, if the author types a header, they would get a paragraph which begins with hashes.
Create a custom renderer.
Some Markdown parsers work in two steps. In step 1, the parser takes the Markdown text and outputs an Abstract Syntax Tree (AST), and in step 2, the renderer accepts the AST and outputs HTML. You could either modify the default renderer or build a custom renderer which handles each element the way you want. For example, you can tell the "header" renderer method to output a paragraph (rather than a header), and you can choose whether that paragraph includes the original hashes or not.
Use an HTML Sanitizer.
Use your Markdown parser of choice, passing the text in and taking the output without modification. Then pass the HTML output into an HTML sanitizer, which will strip out any tags not in a whitelist. In this scenario there will be no clue that a header used to be a header. In the final output it will simply look like a regular paragraph.
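To make the "custom renderer" approach more concrete, here is a minimal sketch in plain Java. The node classes (Text, Emphasis, Strong, Paragraph, Heading) and the RestrictedRenderer class are hypothetical stand-ins invented for illustration; a real Markdown library supplies its own AST types and renderer hooks. The point is only that the renderer decides what each node becomes, so a heading can be emitted as an ordinary paragraph.

```java
import java.util.Arrays;
import java.util.List;

final class RestrictedRenderer {
    // Hypothetical toy AST; a real parser provides its own node types.
    static abstract class Node {
        final List<Node> children;
        Node(Node... children) { this.children = Arrays.asList(children); }
    }
    static final class Text extends Node {
        final String literal;
        Text(String literal) { super(); this.literal = literal; }
    }
    static final class Emphasis extends Node { Emphasis(Node... c) { super(c); } }
    static final class Strong extends Node { Strong(Node... c) { super(c); } }
    static final class Paragraph extends Node { Paragraph(Node... c) { super(c); } }
    static final class Heading extends Node { Heading(Node... c) { super(c); } }

    /** Renders only the allowed formatting; headings are downgraded to paragraphs. */
    static String render(Node node) {
        if (node instanceof Text) {
            return escape(((Text) node).literal);
        }
        StringBuilder inner = new StringBuilder();
        for (Node child : node.children) {
            inner.append(render(child));
        }
        if (node instanceof Emphasis) return "<em>" + inner + "</em>";
        if (node instanceof Strong) return "<strong>" + inner + "</strong>";
        if (node instanceof Paragraph || node instanceof Heading) return "<p>" + inner + "</p>";
        // Any other node type: keep the content, drop the formatting.
        return inner.toString();
    }

    // Escapes the characters that would otherwise be read as markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Node doc = new Heading(new Text("Title "), new Strong(new Text("in bold")));
        System.out.println(render(doc));
    }
}
```

Whether the original hashes are kept is then a one-line decision inside the Heading branch.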

Parsing html text to obtain input fields

So I currently have a big blob of html text, and I want to generate an input form based on what is contained in that text. For example, if the text contains '[%Name%]', I want to be able to read that in and recognize 'Name' is there, and so in turn enable a form field for name. There will be multiple tags ([%age%], [%height%], etc.)
I was thinking about using Regex, but after doing some research it seems that Regex is a horrible idea to parse html with. I came across parsing html pages with groovy, but it is not strictly applicable to my implementation. I am storing the html formatted text (which I am creating using ckeditor) in a database.
Is there an efficient way to do this in Java/Groovy? Or should I just create an algorithm similar to the examples shown here? I'm not too sure how effective the given algorithms would be, as they seem to be built around relatively small strings, whereas the string I need to parse may end up being quite large (a 15-20 page document).
Thanks in advance

Instead of reinventing the wheel, I think it's better to use jsoup. It is an excellent tool for your task, and it makes it easy to obtain anything in an HTML page using its selector syntax. Check out the usage examples in their cookbook.
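Note that the [%Name%] markers are plain text tokens rather than HTML structure, so once you have the text (for example, extracted with jsoup), finding them does not require parsing HTML at all. A minimal sketch, assuming the placeholder names consist of word characters and that the editor has not split a token with inline tags; the PlaceholderScanner class name is made up for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderScanner {
    // Matches tokens such as [%Name%]; group 1 captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[%(\\w+)%\\]");

    /** Returns the distinct placeholder names in order of first appearance. */
    static List<String> findPlaceholders(String text) {
        Set<String> names = new LinkedHashSet<>();
        Matcher m = PLACEHOLDER.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        String html = "<p>Dear [%Name%], you are [%age%] years old.</p><p>Bye, [%Name%]</p>";
        System.out.println(findPlaceholders(html));
    }
}
```

A single pass over the string is linear in its length, so a 15-20 page document should not be a problem for this step.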

Replacing placeholders using iText in Java

I have a PDF that contains placeholders like <%DATE_OF_BIRTH%>. I want to be able to read in the PDF and change the placeholder values to text using iText.
So read in PDF, use maybe a replaceString() method and change the placeholders then generate the new PDF.
Is this possible?
Thanks.

The use of placeholders in PDF is very, very limited. Theoretically it can be done and there are some instances where it would be feasible to do what you say, but because PDF doesn't know about structure very much, it's hard:
Simply extracting words is difficult, so recognising your placeholders in the PDF would already be difficult in many cases.
Replacing text in PDF is a nightmare because PDF files generally don't have a concept of words, lines and paragraphs. Hence no nice reflow of text for example.
Like I said, it could theoretically work under special conditions, but it's not a very good solution.
What would be a better approach depends on your use case:
1) For some forms it may be acceptable to have the complete form as a background image or PDF file and then generate your text as an overlay on that background (filling in the blanks, so to speak). As pointed out by Bruno and mlk in the comments, in this case you can also look into using form fields, which can be filled dynamically.
2) For other forms it may be better to have your template in a structured format such as XML or HTML, do the text replacement in that format and then convert it into PDF.
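For option 2, the text replacement itself is straightforward once the template is a plain string. A minimal sketch, assuming an HTML template with the same <%NAME%> placeholder syntax as in the question; the TemplateFiller class and its fill method are hypothetical names, and the HTML-to-PDF conversion step is left out entirely:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateFiller {
    // Matches placeholders such as <%DATE_OF_BIRTH%>; group 1 captures the key.
    private static final Pattern PLACEHOLDER = Pattern.compile("<%(\\w+)%>");

    /** Replaces each known placeholder with its value; unknown ones are left as-is. */
    static String fill(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in values from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "<p>Date of birth: <%DATE_OF_BIRTH%></p>";
        System.out.println(fill(template, Map.of("DATE_OF_BIRTH", "1990-01-01")));
    }
}
```

If the values come from user input, they should be HTML-escaped before being inserted into an HTML template.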

A library in Java that can transform an HTML text into plain text?

The problem is simple: I want to transform HTML text into plain text, with things like line breaks where there are <br> or title tags, numbers or markers on lists, etc.
I'm using BoilerPipe at the moment to do this, but this is not the main purpose of that library. Is there another one that can do this?

I really like the java library for selenium. Use getBodyText() to get the plain body text with the html tags stripped out and properly formatted.
see...
Selenium java API

How about using an XML parser? That way, you have control over the spacing and line breaks.
I doubt a full-fledged HTML parser and formatter would be available, since that runs into issues such as CSS parsing and all that stuff.

Generate HTML from plain text using Java

I have to convert a .log file into a nice and pretty HTML file with tables. Right now I just want to get the HTML header down. My current method is to println every single line of the HTML file to the output file. For example:
p.println("<html>");
p.println("<script>");
etc. There has to be a simpler way, right?

How about using a JSP scriptlet and JSTL? You could create a custom object which holds all the important information and display it formatted using the Expression Language.

Printing raw HTML text as strings is probably the "easiest" (most straightforward) way to do what you're asking but it has its drawbacks (e.g. properly escaping the content text).
You could use the DOM (e.g. Document et al) interface provided by Java but that would hardly be "easy". Perhaps there are "DOM builder" type tools/libraries for Java that would simplify this task for you; I suggest looking at dom4j.
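To illustrate the escaping drawback: if you do build the HTML from strings, every piece of log text has to be escaped before it is placed between tags. A minimal sketch using only the standard library; LogHtmlWriter and its methods are names made up for this example, and the one-cell-per-line table layout is just an assumption about what the output should look like:

```java
import java.util.List;

final class LogHtmlWriter {
    /** Escapes the characters that would otherwise be interpreted as HTML markup. */
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds a one-column table with one row per log line. */
    static String toHtmlTable(List<String> logLines) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html><body><table>\n");
        for (String line : logLines) {
            sb.append("<tr><td>").append(escapeHtml(line)).append("</td></tr>\n");
        }
        sb.append("</table></body></html>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toHtmlTable(List.of("INFO started", "WARN value < 0 & retry")));
    }
}
```

Building the whole page in a StringBuilder and writing it once also avoids the long run of println calls.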

Look at this Java HTML Generator library (easy to use). It should make generating the actual HTML muuuch clearer. There are complications when creating HTML with Java Strings (what happens if you want to change something like a rowspan?) that can be avoided with this library. Especially when dealing with tables.

There are many templating engines available. Have a look at https://stackoverflow.com/questions/174204/suggestions-for-a-java-based-templating-engine
This way you can define a template in a text file and have the Java code fill in the variables.
