I am searching for a way to allow an user to format his text. The formatting is limited to:
underline
italic
bold
enumeration
I would like to use Markdown and convert the Markdown to HTML on serverside.
My problem is that Markdown supports a lot of more formatting than I want to allow (headings, tables, ..).
Do you know a Markdown library where I can whitelist underline/italic/bold/..?
If there is no whitelisting, I thought about cleaning up the resulting HTML with JSOUP. Is that a preferred way?
Thank you.
There are a few different ways this could be accomplished. Which you chose depends on which Libraries you use (suggesting specific tools is off-topic on StackOverflow) and exactly what behavior you are looking for. You can find a summary of each approach below.
Modify a Markdown parser.
Some parsers provide an API to allow you to modify their behavior. You could perhaps remove the bits and pieces which parse tables, headers, etc. and leave the rest in place. Your final output would then leave in any Markdown syntax for those features. For example, if the author types a header, they would get a paragraph which begins with hashes.
Create a custom renderer.
Some Markdown parsers work in two steps. In step 1, the parser takes the Markdown text and outputs an Abstract Syntax Tree (AST) and in step 2, the renderer accepts an AST and outputs HTML. You could either modify the default renderer or build a custom renderer which handles each element as you wanted. For example, you can tell the "header" renderer method to output a paragraph (rather than a header) and you can choose whether that paragraph includes the original hashes or not.
Use an HTML Sanitizer.
Use your Markdown parser of choice, passing the text in and taking the output without modification. Then pass the HTML output into an HTML sanitizer, which will strip out any tags not in a whitelist. In this scenario there will be no clue that a header used to be a header. In the final output it will simply look like a regular paragraph.
Related
Within a Java application, I would need to convert marked-down text into simple plain text instead of html (for example dropping all links addresses, bold and italic markers).
Which is the best way to do this? I was thinking using a markdown library like fleaxmark. But I cant find this feature at first sight. Is it there? Are there other better alternatives?
Edit
Commonmark supports rendering to text, by using org.commonmark.renderer.text.TextContentRenderer instead of the default HTML renderer. Not sure what it does with newlines, but worth a try.
Original answer, using flexmark HTML + JSoup
The ideal solution would be to implement a custom Renderer for flexmark, but this would force you to write a model-to-string for all language features in markdown. Unless it supports this out of the box, but I'm not aware of this feature...
A simpler solution may be to use flexmark (or any other lightweight markdown renderer) and let it create the HTML. After that, just run the generated HTML through https://jsoup.org/ and let it extract the text:
Jsoup.parse(htmlInputStream).text();
String org.jsoup.nodes.Element.text()
Gets the combined text of this element and all its children. Whitespace is normalized and trimmed.
For example, given HTML <p>Hello <b>there</b> now! </p>, p.text() returns Hello there now!
We use this approach to get a "preview" of the text entered in a rich content editor (summernote), after being sanitized with org.owasp.html.HtmlSanitizer.
flexmark also have mark down to text feature.
checkout this
So I currently have a big blob of html text, and I want to generate an input form based on what is contained in that text. For example, if the text contains '[%Name%]', I want to be able to read that in and recognize 'Name' is there, and so in turn enable a form field for name. There will be multiple tags ([%age%], [%height%], etc.)
I was thinking about using Regex, but after doing some research it seems that Regex is a horrible idea to parse html with. I came across parsing html pages with groovy, but it is not strictly applicable to my implementation. I am storing the html formatted text (which I am creating using ckeditor) in a database.
Is there a efficient way to do this in java/groovy? Or should I just create an algorithm similar to examples shown here (I'm not too sure how effective the given algorithms would be, as they seem to be constructed around relatively small strings, whereas my string to parse through may end up being quite large (a 15-20 page document)).
Thanks in advance
Instead of reimplementing the wheel I think it's better to use jsoup. It is an excellent tool for your task and would be easy to obtain anything in a html page using it's selector syntax. Check out examples of usage in their cookbook.
I have to convert a .log file into a nice and pretty HTML file with tables. Right now I just want to get the HTML header down. My current method is to println to file every single line of the HTML file. for example
p.println("<html>");
p.println("<script>");
etc. there has to be a simpler way right?
How about using a JSP scriplet and JSTL?, you could create some custom object which holds all the important information and display it formatted using the Expression Language.
Printing raw HTML text as strings is probably the "easiest" (most straightforward) way to do what you're asking but it has its drawbacks (e.g. properly escaping the content text).
You could use the DOM (e.g. Document et al) interface provided by Java but that would hardly be "easy". Perhaps there are "DOM builder" type tools/libraries for Java that would simplify this task for you; I suggest looking at dom4j.
Look at this Java HTML Generator library (easy to use). It should make generating the actual HTML muuuch clearer. There are complications when creating HTML with Java Strings (what happens if you want to change something like a rowspan?) that can be avoided with this library. Especially when dealing with tables.
There are many templating engines available. Have a look at https://stackoverflow.com/questions/174204/suggestions-for-a-java-based-templating-engine
This way you can define a template in a txt file and have the java code fill in the variables.
What is the best way to detect data types inside html page using Java facilities DOM API, regexp, etc?
I'd like to detect types like skype plugin does for the phone/skype numbers, similar for addresses, emails, time, etc.
'Types' is an inappropriate term for the kind of information you are referring to. Choice of DOM API or regex depends upon the structure of information within the page.
If you know the structure, (for example tables being used for displaying information, you already know from which cell you can find phone number and which cell you can find email address), it makes sense to go with a DOM API.
Otherwise, you should use regex on plain HTML text without parsing it.
I'd use regexes in the following order:
Extract only the BODY content
Remove all tags to leave just plain text
Match relevant patterns in text
Of course, this assumes that markup isn't providing hints, and that you're purely extracting data, not modifying page context.
Hope this helps,
Phil Lello
I want to parse a document that is not pure xml. For example
my name is <j> <b> mike</b> </j>
example 2
my name is <mytag1 attribute="val" >mike</mytag1> and yours is <mytag2> john</mytag2>
Means my input is not pure xml. ITs simliar to html but the tags are not html.
How can i parse it in java?
Your examples are valid XML, except for the lack of a document element. If you know this to always be the case, then you could just wrap a set of dummy tags around the whole thing and use a standard parser (SAX, DOM...)
On the other hand if you get something uglier (e.g. tags don't match up, or are spaced out in an overlapping fashion), you'll have to do something custom which will involve a number of rules that you have to decide on that will be unique to your application. (e.g. How do I handle an opening tag that has no close? What do I do if the closing tag is outside the parent?)
There are few parsers that take not well formed html and turn it into well formed xml, here is some comparison with examples, that includes the most popular ones, except maybe HTMLParser. Probably that's what you need.