Does using "JSP Document" / "JSP in XML notation" imply outputting XHTML? - java

I'm really not sure about this: does using "JSP Document" / "JSP in XML notation" imply outputting XHTML?
If so, is there anything special to look after as to produce a "valid" XHTML page?
More specifically: can I have a valid "JSP Document" (JSP in XML) that is producing an invalid XHTML page?

I'm really not sure about this: does using "JSP Document" / "JSP in XML notation" imply outputting XHTML?
It at least implies consuming and producing well-formed XML. If you write invalid XML, parsing will fail with an error. If it produces well-formed XML, then the output cannot be valid HTML4, because HTML4 disallows the self-closing syntax for void elements like br, hr, meta and link.
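To make the well-formedness point concrete, here is a minimal sketch using only the JDK's SAX parser. The class and method names (WellFormedCheck, isWellFormed) are made up for illustration; it only checks well-formedness, not validity against any XHTML DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a markup fragment is well-formed XML.
class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            // DefaultHandler ignores content and rethrows fatal parse errors.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Unclosed tags, bare ampersands and similar problems end up here.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML4-style void element: not well-formed XML.
        System.out.println(isWellFormed("<p>line one<br>line two</p>"));
        // Self-closed void element: well-formed XML.
        System.out.println(isWellFormed("<p>line one<br/>line two</p>"));
    }
}
```

The first fragment is the kind of markup a JSP Document would reject at parse time; the second is what it expects.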
What would you recommend to serve when using JSP Document? transitional? strict? HTML5 XML? HTML5 HTML? (HTML5 allows closing tags like <br/>)
Since it's well-formed XML, you should choose either XHTML or HTML5. Although the HTML5 specification is still in draft, it allows the self-closing syntax for void elements. Also see the end of chapter 3.2.2 Elements:
Some elements, however, are forbidden from containing any content at all. These are known as void elements. In HTML, the above syntax cannot be used for void elements. For such elements, the end tag must be omitted because the element is automatically closed by the parser. Such elements include, among others, br, hr, link and meta
HTML Example:
<link type="text/css" rel="stylesheet" href="style.css">
In XHTML, the XML syntactic requirements dictate that this must be made explicit using either an explicit end tag, as above, or the empty element syntax. This is achieved by inserting a slash at the end of the start tag immediately before the right angle bracket.
Example:
<link type="text/css" href="style.css"/>
Authors may optionally choose to use this same syntax for void elements in the HTML syntax as well. Some authors also choose to include whitespace before the slash, however this is not necessary. (Using whitespace in that fashion is a convention inherited from the compatibility guidelines in XHTML 1.0, Appendix C.)
Then, the choice between transitional and strict depends on the degree of web standards you'd like to support. For that, the table at the bottom of this website gives an excellent overview.
To start, you'd like to avoid Quirks Mode as much as possible, since it triggers the box model bug in MSIE, which causes inconsistencies in the margins, paddings and dimensions of elements styled with CSS. A missing or incorrect doctype will trigger this mode.
I strongly recommend picking a Strict doctype, since the box model and behaviour will then be as consistent as possible across browsers. Any of the following doctypes is okay, depending on which elements/attributes you'd like to support/validate.
XHTML 1.0 strict:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
or the newer XHTML 1.1 (strict, module-based):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
or the (still in draft mode) HTML5 doctype:
<!DOCTYPE html>
Note that when going for XHTML you need to ensure that the HTTP Content-Type header is set to text/html, not application/xml or application/xhtml+xml, or else MSIE may still go mad since it doesn't support those. Also see the aforementioned doctype website for more detail. The same article indeed mentions that serving XHTML as text/html is considered harmful, but that only applies when it gets rendered with the <?xml?> declaration and/or contains inline JavaScript not wrapped in CDATA blocks.

It depends on your definition of XHTML. For most people, XHTML simply means HTML in well-formed XML. In that sense, JSP Document implies XHTML because JSP Document itself is well-formed XML.
However, JSP Document itself doesn't enforce any XHTML rules. For example, you can still generate an XHTML 1.0 Strict document that contains deprecated tags like <center>.
It's also possible to use custom tags in a JSP Document that generate non-XML output, which renders the whole document non-XML.

Related

Should we remove HTML attributes while using Thymeleaf?

I'm studying Thymeleaf and have found out that in almost all examples there are Thymeleaf's tag values as well as standard HTML values like:
<title th:text="#{product.page.title}">Page Title</title>
<link href="../static/css/bootstrap-3.3.7-dist/bootstrap.min.css" rel="stylesheet"
th:href="@{/css/bootstrap-3.3.7-dist/bootstrap.css}"/>
<script src="../static/js/jquery-3.1.1.js"
th:src="@{/js/jquery-3.1.1.js}"></script>
These standard tag values like Page Title or href="../static/css/bootstrap-3.3.7-dist/bootstrap.min.css" etc. are ignored when the template is processed and are not rendered on the page.
I'm wondering: is it just good practice to leave them in to improve code readability, or is it better to remove them to clean up the code?
Because for the template engine they are useless and have no effect on the rendering result.
This depends entirely on your development process.
You could keep the HTML attributes around in the early phases, while you are still trying to lay out the page using just your browser.
But, once you get to a point where you are using automated unit / web testing, you can safely remove the HTML attributes because this testing should always be using a prod-like environment (which would include thymeleaf).

Embedding HTML within XHTML

I have a JSF page which is outputting XHTML (from a facelet). One of the fields has user-generated HTML which is causing parsing errors in my web browser (Safari).
I understand that this is because XHTML is strict and follows the rules of XML, unlike HTML. What is the best way of embedding this HTML while avoiding fatal parsing errors?
One thing I've thought of is replacing all instances of say <br> with <br />, but there's got to be a better solution than that.
Here's another example of something I need to embed:
This is my sample text.<br>The address is Wind & Fire.
Notice here that the line break tag needs to be self-closing, and the ampersand should probably be &amp;
Use an HTML parser which returns well-formed HTML syntax. I can recommend Jsoup for this.
Kickoff example:
String userHtml = "foo<br>bar&baz";
String wellFormedHtml = Jsoup.parse(userHtml).body().html();
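System.out.println(wellFormedHtml); // foo<br />bar&amp;baz
// Note: newer jsoup versions print <br> by default; select Document.OutputSettings.Syntax.xml in the document's output settings to get <br />.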
Just apply this once when you're about to process submitted user input.
Jsoup offers more advantages as well, such as a Whitelist (renamed Safelist in newer jsoup versions) which you can use to strip out potentially malicious HTML/JS code that could open XSS attack holes.

How to implement a possibility for user to post some html-formatted data in a safe way?

I have a textarea and I want to support some very simple formatting for posted data (at least whitespace and line breaks).
How can I achieve this? If I don't escape the response and let some HTML tags through, it will be a big security hole. But I don't see any other solution which would allow text formatting in the browser.
So I should probably filter the user's input. But how can I do this? Are there any ready-to-use solutions? I'm using JSF, so is there any smart component which filters out everything except a set of allowed HTML tags?
Use an HTML parser which supports HTML filtering against a whitelist, such as Jsoup. Here's a relevant extract from its site.
Sanitize untrusted HTML
Problem
You want to allow untrusted users to supply HTML for output on your website (e.g. as comment submission). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.
Solution
Use the jsoup HTML Cleaner with a configuration specified by a Whitelist.
String unsafe =
"<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
And then to display it with whitespace preserved, apply CSS white-space: pre-wrap; on the HTML element where you're displaying it.
No all-in-one JSF component comes to mind.
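If only whitespace and line breaks need to survive (the simplest case in the question), a different route from whitelisting is to escape all markup and rely on white-space: pre-wrap for the layout. The sketch below is a hand-rolled escaper using only the JDK; the class name HtmlEscaper is made up for illustration, and in JSF the default escaping of h:outputText already does this job.

```java
// Hypothetical helper: escapes the characters that are significant in HTML/XML text and attributes.
final class HtmlEscaper {

    private HtmlEscaper() {
    }

    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                // Whitespace and newlines pass through untouched; CSS pre-wrap displays them.
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<b>Wind & Fire</b>\n  next line"));
    }
}
```

This allows no formatting tags at all, so it is not a substitute for the Jsoup cleaner when some HTML must be accepted.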
Is there some reason why you need to accept HTML instead of some other markup language, such as markdown (which is what StackOverflow uses)?
http://daringfireball.net/projects/markdown/
Not sure what kind of tags you'd want to accept that wouldn't be covered by md or a similar formatting language...

Getting elements by type in malformed HTML

What's the easiest way in Java to retrieve all elements with a certain type in a malformed HTML page? So I want to do something like this:
public static void main(String[] args) {
// Read in an HTML file from disk
// Retrieve all INPUT elements regardless of whether the HTML is well-formed
// Loop through all elements and retrieve their ids if they exist for the element
}
HtmlCleaner is arguably one of the best HTML parsers out there when it comes to dealing with (somewhat) malformed HTML.
Documentation is here with some code samples; you're basically looking for the getElementsByName() method.
Take a look at Comparison of Java HTML parsers if you're considering other libraries.
I've had success using TagSoup. Here's a short description from their home page:
This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Check JTidy.
JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
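If adding a third-party library is not an option, the lenient HTML parser that ships with the JDK (in javax.swing.text.html) may be enough for this task. This is a different technique from the libraries recommended above, and its parser is modelled on an old HTML DTD, so treat the following as a sketch. The class and method names (InputIdCollector, inputIds) are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the ids of INPUT elements from possibly malformed HTML.
class InputIdCollector {

    static List<String> inputIds(Reader html) throws IOException {
        List<String> ids = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // INPUT has no end tag, so the parser reports it as a "simple" tag.
                if (tag == HTML.Tag.INPUT) {
                    Object id = attrs.getAttribute(HTML.Attribute.ID);
                    if (id != null) {
                        ids.add(id.toString());
                    }
                }
            }
        };
        // true = ignore any charset declared in the markup; the Reader already decodes.
        new ParserDelegator().parse(html, callback, true);
        return ids;
    }

    public static void main(String[] args) throws IOException {
        // Unquoted attribute, unclosed <p> and <b>, missing </body> and </html>.
        String malformed = "<html><body><form><input id=first><p>unclosed <b>bold"
                + "<input type=\"text\"><input id=\"second\"></form>";
        System.out.println(inputIds(new StringReader(malformed)));
    }
}
```

To read from disk as in the question, pass a reader such as Files.newBufferedReader(path) instead of the StringReader.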

parse meta tags in Java

I have a collection of HTML documents for which I need to parse the contents of the <meta> tags in the <head> section. These are the only HTML tags whose values I'm interested in, i.e. I don't need to parse anything in the <body> section.
I've attempted to parse these values using the XPath support provided by JDom. However, this isn't working out too well because a lot of the HTML in the <body> section is not valid XML.
Does anyone have any suggestions for how I might go about parsing these tag values in a manner that can deal with malformed HTML?
Cheers,
Don
You can likely use the Jericho HTML Parser. In particular, have a look at this to see how you can go about finding specific tags.
If it suits your application you can use Tidy to convert HTML to valid XML, and then use as much XPath as you like!
JTidy should provide a good starting point for this.
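Once the HTML has been converted to well-formed XML, the XPath step needs nothing beyond the JDK. The sketch below assumes the conversion has already happened: the input string stands in for the tidied output, and it has no DOCTYPE and no XHTML namespace (either would need extra handling). The class and method names (MetaTagXPath, metaTags) are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads name/content pairs of <meta> tags from well-formed XHTML.
class MetaTagXPath {

    static Map<String, String> metaTags(String wellFormedXhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedXhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Only <meta> elements in <head> that carry a name attribute.
        NodeList metas = (NodeList) xpath.evaluate(
                "/html/head/meta[@name]", doc, XPathConstants.NODESET);
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            result.put(meta.getAttribute("name"), meta.getAttribute("content"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String tidied = "<html><head>"
                + "<meta name=\"author\" content=\"Don\"/>"
                + "<meta name=\"keywords\" content=\"java,html\"/>"
                + "</head><body><p>ignored</p></body></html>";
        System.out.println(metaTags(tidied));
    }
}
```

The LinkedHashMap keeps the tags in document order; a later tag with the same name overwrites an earlier one.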
