Embedding HTML within XHTML

Embedding HTML within XHTML - java

I have a JSF page which is outputting XHTML (from a facelet). One of the fields has user-generated HTML which is causing parsing errors in my web browser (Safari).
I understand that this is because XHTML is strict and follows the rules of XML, unlike HTML. What is the best way of embedding this HTML while avoiding fatal parsing errors?
One thing I've thought of is replacing all instances of say <br> with <br />, but there's got to be a better solution than that.
Here's another example of something I need to embed:
This is my sample text.<br>The address is Wind & Fire.
Notice here that the line break tag needs to be self-enclosing, and the ampersand should probably be &aamp;

Use a HTML parser which returns well formed HTML syntax. I can recommend Jsoup for this.
Kickoff example:
String userHtml = "foo<br>bar&baz";
String wellFormedHtml = Jsoup.parse(userHtml).body().html();
System.out.println(wellFormedHtml); // foo<br />bar&baz
Just apply this once when you're about to process submitted user input.
Jsoup offers more advantages as well, such a Whitelist which you can use to strip out potential malicious HTML/JS code which can open XSS attack holes.

Related

filter out encoded javascript content from request

I have a problem where I am trying to cleanse the request content to strip out HTML and javascript if included in the input parameters.
This is basically to protect against XSS attacks and the ideal mechanism would be to validate input and encode the output but due to some restrictions I cannot work on the output end.
All I can do at this time is to try to cleanse the input through a filter. I am using ESAPI to canonicalize the input parameters and also using jsoup with the most restrictive Whitelist.none() option to strip all HTML.
This works as long as the malicious javascript is within some HTML tags but fails for a URL with javascript code without any HTML surrounding it, eg:
http://example.com/index.html?a=40&b=10&c='-prompt``-'
ends up showing an alert on the page. This is kind of what I am doing right now:
param = encoder.canonicalize(param, false, false);
param = Jsoup.clean(param, Whitelist.none());
So the question is:
Is there some way through which I can make sure that my input is stripped of all HTML and javascript code at the filter?
Should I throw in some regex validations but is there any regex that will take care of the cases that are getting past the check I have right now?

DISCLAIMER:
If output-escaping is not allowed in your internet-facing solution, you are in a NO-WIN SCENARIO. It's like antivirus on Windows: You'll be able to detect specific and known attacks, but you will be unable to detect or defend against unknown attacks. If your employer insists on this path, your due diligence is to make management aware of this fact and get their acceptance of the risks in writing. Every time I've confronted management with this, they've opted for the correct solution--output escaping.
================================================================
First off... watch out when using JSoup in any kind of a cleaning/filtering/input validation situation.
Upon receiving invalid HTML, like
<script>alert(1);
Jsoup will add in the missing </script> tag.
This means that if you're using Jsoup to "cleanse" HTML, it first transforms INVALID HTML into VALID HTML, before it begins processing.
So the question is: Is there some way through which I can make sure
that my input is stripped of all HTML and javascript code at the
filter? Should I throw in some regex validations but is there any
regex that will take care of the cases that are getting past the check
I have right now?
No. ESAPI and ESAPI's input validation is not appropriate for your use case because HTML is not a regular language and ESAPI's input for its validation are Regular Expressions. The fact is you cannot do what you ask:
Is there some way through which I can make sure that my input is
stripped of all HTML and javascript code at the filter?
And still have a functioning web application that requires user-defined HTML/JavaScript.
You can stack the deck in your favor a little bit: I would choose something like OWASP's HTML Sanitizer. and test your implementation against the XSS inputs listed here.
Many of those inputs are taken from OWASP's XSS Filter evasion cheat sheet, and will at least exercise your application against known attempts. But you will never be secure without output escaping.
===================UPDATE FROM COMMENTS==================
SO the use case is to try and block all html and javascript. My recommendation is to implement caja since it encapsulates HTML, CSS, and Javascript.
Javascript though is also difficult to manage from input validation, because like HTML, JavaScript is a non-regular language. Additionally, each browser has its own implementation that deviates in different ways from the ECMAScript spec. If you want to protect your input from being interpreted, this means you'd ideally have to have a parser for each browser family attempting to interpret user input in order to block it.
When all you've really got to do is make sure that the output is escaped. Sorry to beat a dead horse, but I have to stress that output escaping is 100x more important than rejecting user input. You want both, but if forced to choose one or the other, output escaping is less work overall.

java regex replace all html tags except br

I need a regular expression that can be used with replaceall to replace all the html tags with empty string except any variations of br to maintain the line breaks.
I found the following to replace all html tags
<\s*br\s*\[^>]

You might get some answers that claim to work.
Those answers might even work for the particular cases you try them against.
But know that regular expressions (which I'm fond of in general) are the wrong tool for the job in this case.
And as your project evolves and needs to cover more complex HTML inputs, the regular expression will get more and more convoluted, and there may well come a time when it simply cannot solve your problem anymore, period.
Do it the right way from the beginning. Use an HTML parser, not a regex.
For reference, here are some related SO posts:
Regex to match all HTML tags except <p> and </p>
Regex to replace all \n in a String, but no those inside [code] [/code] tag
RegEx match open tags except XHTML self-contained tags - bobince says it much more thoroughly than I do (:

If the HTML is known to be valid, then you can use this regex (case-insensitive):
<(?!br\b)/?[a-z]([^"'>]|"[^"]*"|'[^']*')*>
but it can fail in interesting ways if you give it invalid HTML. Also, I took "HTML tags" pretty literally; the above won't cover <!-- HTML comments --> and <!DOCTYPE declarations>, and won't convert <![CDATA[ blocks ]]> and &entity;s to plain text.
It's probably better to take a step back, think about why you want to strip out these HTML tags — that is, what you're actually trying to achieve — and then find an HTML-handling library that offers a better way to achieve that goal. HTML cleaning is really a solved problem; you shouldn't need to reinvent it.
UPDATE: I've just realized that, even for valid HTML, the above has some major limitations. For example, it will mishandle something like <!--<yes--> (converting it to just <!--), and also something like <script><foo></script> (since HTML proper has a small number of tags with CDATA content, that is, everything after the start-tag until the first </ is taken to be character data, not containing HTML tags; fortunately, XHTML was forced to get rid of this concept due to XML's lack of support for it). Both of these limitations can be addressed, of course — using more regexes! — but they should help reinforce the point that you should use a well-tested HTML-handling library rather than trying to roll your own regexes. If you have a lot of guarantees about the nature of the HTML you're trying to handle, then regexes can be useful; but if what you're trying to do is strip out arbitrary tags, then that's a good sign that you don't have these sorts of guarantees.

Struts 2 encode input parameters to avoid XSS

I have an application built with Struts 2. It has some issues with Cross-site scripting (XSS) attacks. I want to encode some of the actions input parameters in a similar fashion to JSP <c:out value="${somevalue}"/> Is there any easy approach to do this in Struts 2? Java API method would do fine.
EDIT I found this one - http://www.owasp.org/index.php/Talk:How_to_perform_HTML_entity_encoding_in_Java
Any experience with it?

You can use
<%# taglib uri="http://java.sun.com/jsp/jstl/functions" prefix="fn" %>
${fn:escapeXml(someValue)}
There is also a Good API JSoup
Sanitize untrusted HTML
Problem
You want to allow untrusted users to supply HTML for output on your website (e.g. as comment submission). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.
Solution
Use the jsoup HTML Cleaner with a configuration specified by a Whitelist.
String unsafe =
"<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p>Link</p>
So, all you basically need to do is the the following during processing the submitted text:
String text = request.getParameter("text");
String safe = Jsoup.clean(text, Whitelist.basic());
// Persist 'safe' in DB instead.
There is struts2securityaddons
This project contains additional configuration, interceptors, and other code used to improve the security of struts 2 applications.
See also
XSS Prevention oin Java
Prevent jsp from XSS
struts2securityaddons

Escaping input parameters as an XSS prevention mean has several disadvanteges, especially:
You can't be certain about destination of the particular input data, therefore you can't choose proper escaping scheme.
Escaping input data masks lack of output escaping. Without consistent output escaping, you can still pass unescaped data to the unescaped output accidentially.
Presence of escaping complicates data processing.
Therefor it would be better to apply consistent output escaping instead.
See also:
OWASP XSS Prevention Cheat Sheet

There is no easy, out of the box solution against XSS with struts 2 tags. The OWASP ESAPI API has some support for the escaping that is very usefull, and they have tag libraries.
My approach was to basically to extend the stuts 2 tags in following ways.
Modify s:property tag so it can take extra attributes stating what sort of escaping is required (escapeHtmlAttribute="true" etc.). This involves creating a new Property and PropertyTag classes. The Property class uses OWASP ESAPI api for the escaping.
Change freemarker templates to use the new version of s:property and set the escaping.
If you didn't want to modify the classes in step 1, another approach would be to import the ESAPI tags into the freemarker templates and escape as needed. Then if you need to use a s:property tag in your JSP, wrap it with and ESAPI tag.
I have written a more detailed explanation here.
http://www.nutshellsoftware.org/software/securing-struts-2-using-esapi-part-1-securing-outputs/
I agree escaping inputs is not ideal.

How to implement a possibility for user to post some html-formatted data in a safe way?

I have a textarea and I want to support some simplest formatting for posted data (at least, whitespaces and line breaks).
How can I achieve this? If I will not escape the response and keep some html tags then it'll be a great security hole. But I don't see any other solution which will allow text formatting in browser.
So, I probably should filter user's input. But how can I do this? Are there any ready to use solutions? I'm using JSF so are there any smart component which filters everything except html tags?

Use a HTML parser which supports HTML filtering against a whitelist like Jsoup. Here's an extract of relevance from its site.
Sanitize untrusted HTML
Problem
You want to allow untrusted users to supply HTML for output on your website (e.g. as comment submission). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.
Solution
Use the jsoup HTML Cleaner with a configuration specified by a Whitelist.
String unsafe =
"<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p>Link</p>
And then to display it with whitespace preserved, apply CSS white-space: pre-wrap; on the HTML element where you're displaying it.
No all-in-one JSF component comes to mind.

Is there some reason why you need to accept HTML instead of some other markup language, such as markdown (which is what StackOverflow uses)?
http://daringfireball.net/projects/markdown/
Not sure what kind of tags you'd want to accept that wouldn't be covered by md or a similar formatting language...

Getting elements by type in malformed HTML

What's the easiest way in Java to retrieve all elements with a certain type in a malformed HTML page? So I want to do something like this:
public static void main(String[] args) {
// Read in an HTML file from disk
// Retrieve all INPUT elements regardless of whether the HTML is well-formed
// Loop through all elements and retrieve their ids if they exist for the element
}

HtmlCleaner is arguably one of the best HTML parsers out there when it comes to dealing with (somewhat) malformed HTML.
Documentation is here with some code samples; you're basically looking for getElementsByName() method.
Take a look at Comparison of Java HTML parsers if you're considering other libraries.

I've had success using tagsoup. Heres a short description from their home page:
This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

Check Jtidy.
JTidy is a Java port of HTML Tidy, a
HTML syntax checker and pretty
printer. Like its non-Java cousin,
JTidy can be used as a tool for
cleaning up malformed and faulty HTML.
In addition, JTidy provides a DOM
interface to the document that is
being processed, which effectively
makes you able to use JTidy as a DOM
parser for real-world HTML.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Embedding HTML within XHTML - java

Related

filter out encoded javascript content from request

java regex replace all html tags except br

Struts 2 encode input parameters to avoid XSS

How to implement a possibility for user to post some html-formatted data in a safe way?

Getting elements by type in malformed HTML

Categories

Resources