How would you use SafeHtml in combination with links?
Scenario: Our users can enter unformatted text which may contain links, e.g. I like&love http://www.stackoverflow.com. We want to safely render this text in GWT but make the links clickable, e.g. I like&love <a href="http://www.stackoverflow.com">stackoverflow.com</a>. Aside from rendering the text in the GWT frontend, we also want to send it via email, where the links should be clickable as well.
So far, we considered the following options:
Store the complete text as HTML in the backend and let the frontend assume it's correctly encoded (I like&love <a href="http://www.stackoverflow.com">stackoverflow.com</a>) -> Introduces XSS vulnerabilities
Store plain text but the links as HTML (I like&love <a href="http://www.stackoverflow.com">stackoverflow.com</a>) in the backend and use HtmlSanitizer in the frontend
Store plain text and special encoding for the links (I like&love [stackoverflow.com|http://www.stackoverflow.com]) in the backend and use a custom SafeHtml generator in the frontend
To us, the third option looks the cleanest but it seems to require the most custom code since we can't leverage GWT's SafeHtml infrastructure.
Could anybody share how to best solve the problem? Is there another option that we didn't consider so far?
Why not store the text exactly as it was entered by the user, and perform any special treatment when transforming it for the output (e.g. for sending emails, creating PDFs, ...)? This is the most natural approach, and you won't have to undo any special treatment, e.g. when you let the user edit the string.
As a general rule, I would always perform encoding/escaping/transformation only for the immediate transport/storage/output target. There are very few reasons to deviate from this rule, one of them may be performance, e.g. caching a transformed value in the DB. (In these cases, I think it's best to give the DB field a specific name like 'text_htmltransformed' - this avoids 'overescaping', which can be just as harmful as no escaping.)
Note: Escaping/encoding is no replacement for input validation.
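For the GWT rendering side, here is a minimal sketch of that output-time transformation. It is only an illustration: the Linkifier class, its whitespace-based URL detection, and the Templates interface are my own assumptions rather than GWT features, and it assumes a GWT version where SafeHtmlBuilder, SafeHtmlTemplates, SafeUri and UriUtils are available.

import com.google.gwt.core.client.GWT;
import com.google.gwt.safehtml.client.SafeHtmlTemplates;
import com.google.gwt.safehtml.shared.SafeHtml;
import com.google.gwt.safehtml.shared.SafeHtmlBuilder;
import com.google.gwt.safehtml.shared.SafeUri;
import com.google.gwt.safehtml.shared.UriUtils;

/** Hypothetical helper that escapes plain text and turns bare URLs into anchors. */
public class Linkifier {

    /** SafeHtmlTemplates escapes each interpolated value according to its context. */
    interface Templates extends SafeHtmlTemplates {
        @Template("<a href=\"{0}\">{1}</a>")
        SafeHtml link(SafeUri href, String label);
    }

    private static final Templates TEMPLATES = GWT.create(Templates.class);

    public static SafeHtml linkify(String plainText) {
        SafeHtmlBuilder builder = new SafeHtmlBuilder();
        String[] tokens = plainText.split(" ");   // naive tokenization, for illustration only
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                builder.appendEscaped(" ");
            }
            String token = tokens[i];
            if (token.startsWith("http://") || token.startsWith("https://")) {
                // UriUtils.fromString treats the value as untrusted and sanitizes the URI.
                builder.append(TEMPLATES.link(UriUtils.fromString(token), token));
            } else {
                // Everything else is treated as plain text and HTML-escaped.
                builder.appendEscaped(token);
            }
        }
        return builder.toSafeHtml();
    }
}

The same plain-text-to-HTML step could be reused (via a server-side equivalent) when generating the email body, so the stored text stays raw.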
Related
I have a problem where I am trying to cleanse the request content to strip out HTML and javascript if included in the input parameters.
This is basically to protect against XSS attacks and the ideal mechanism would be to validate input and encode the output but due to some restrictions I cannot work on the output end.
All I can do at this time is to try to cleanse the input through a filter. I am using ESAPI to canonicalize the input parameters and also using jsoup with the most restrictive Whitelist.none() option to strip all HTML.
This works as long as the malicious javascript is within some HTML tags, but it fails for a URL with javascript code without any HTML surrounding it, e.g.:
http://example.com/index.html?a=40&b=10&c='-prompt``-'
ends up showing an alert on the page. This is kind of what I am doing right now:
param = encoder.canonicalize(param, false, false);
param = Jsoup.clean(param, Whitelist.none());
So the question is:
Is there some way through which I can make sure that my input is stripped of all HTML and javascript code at the filter?
Should I throw in some regex validations but is there any regex that will take care of the cases that are getting past the check I have right now?
DISCLAIMER:
If output-escaping is not allowed in your internet-facing solution, you are in a NO-WIN SCENARIO. It's like antivirus on Windows: You'll be able to detect specific and known attacks, but you will be unable to detect or defend against unknown attacks. If your employer insists on this path, your due diligence is to make management aware of this fact and get their acceptance of the risks in writing. Every time I've confronted management with this, they've opted for the correct solution--output escaping.
================================================================
First off... watch out when using JSoup in any kind of a cleaning/filtering/input validation situation.
Upon receiving invalid HTML, like
<script>alert(1);
Jsoup will add in the missing </script> tag.
This means that if you're using Jsoup to "cleanse" HTML, it first transforms INVALID HTML into VALID HTML, before it begins processing.
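A tiny hedged example of that normalization (the class name is mine; the exact output formatting may vary across Jsoup versions):

import org.jsoup.Jsoup;

public class JsoupNormalizationDemo {
    public static void main(String[] args) {
        // Jsoup parses the fragment into a valid DOM, adding the missing </script> tag.
        String invalid = "<script>alert(1);";
        System.out.println(Jsoup.parseBodyFragment(invalid).body().html());
        // Typically prints: <script>alert(1);</script>
    }
}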
So the question is: Is there some way through which I can make sure that my input is stripped of all HTML and javascript code at the filter? Should I throw in some regex validations, but is there any regex that will take care of the cases that are getting past the check I have right now?
No. ESAPI's input validation is not appropriate for your use case, because HTML is not a regular language while ESAPI's validation rules are regular expressions. The fact is you cannot do what you ask:
Is there some way through which I can make sure that my input is stripped of all HTML and javascript code at the filter?
And still have a functioning web application that requires user-defined HTML/JavaScript.
You can stack the deck in your favor a little bit: I would choose something like OWASP's HTML Sanitizer, and test your implementation against the XSS inputs listed here.
Many of those inputs are taken from OWASP's XSS Filter evasion cheat sheet, and will at least exercise your application against known attempts. But you will never be secure without output escaping.
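As a hedged illustration of the sanitizer approach (the class name and the exact policy are my assumptions; the builder calls come from the owasp-java-html-sanitizer library):

import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

public class SanitizerExample {

    // Restrictive policy: only <a href> over http/https survives; everything else is dropped.
    private static final PolicyFactory POLICY = new HtmlPolicyBuilder()
            .allowElements("a")
            .allowUrlProtocols("http", "https")
            .allowAttributes("href").onElements("a")
            .toFactory();

    public static String clean(String untrustedHtml) {
        return POLICY.sanitize(untrustedHtml);
    }

    public static void main(String[] args) {
        // The javascript: URL and the script element are stripped according to the policy.
        System.out.println(clean("<a href=\"javascript:alert(1)\">click</a><script>alert(1)</script>"));
    }
}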
===================UPDATE FROM COMMENTS==================
So the use case is to try and block all HTML and JavaScript. My recommendation is to implement Caja, since it encapsulates HTML, CSS, and JavaScript.
JavaScript, though, is also difficult to manage through input validation because, like HTML, JavaScript is a non-regular language. Additionally, each browser has its own implementation that deviates in different ways from the ECMAScript spec. If you want to protect your input from being interpreted, this means you'd ideally have to have a parser for each browser family attempting to interpret user input in order to block it.
And all you really have to do is make sure that the output is escaped. Sorry to beat a dead horse, but I have to stress that output escaping is 100x more important than rejecting user input. You want both, but if forced to choose one or the other, output escaping is less work overall.
I am using PlayFramework2 and I can't find a way to properly handle HTML escaping.
In the template system, HTML entities are filtered by default.
But when I use REST requests with Backbone.js, my JSON objects are not filtered.
I use play.libs.Json.toJson(myModel) to transform an Object into a String.
So, in my controller, I use return ok(Json.toJson(myModel)); to send the response ... but here, the attributes of my model are not secured.
I can't find a way to handle it ...
Second question:
The template engine filters HTML entities by default, which means we have to store the raw user input in our database.
Is that a safe behaviour?
Third question:
Is there a function in the Play Framework to manually escape strings? All those I can find require adding new dependencies.
Thanks!
Edit: I found a way at the Backbone.js templating level:
- Use myBackboneModel.escape('attr'); instead of myBackboneModel.get('attr');
Underscore.js's templating system also includes that option: <%= attr %> renders without escaping, but <%- attr %> renders with escaping!
Just be careful about efficiency: strings are re-escaped on each rendering. That's why the Backbone .create() should be preferred.
Best practices for XSS-attack prevention usually recommend reasoning about your output rather than your input. There are a number of reasons for that. In my opinion the most important are:
It doesn't make any sense to reason about escaping until you know exactly how you are going to output/render your data, because different ways of rendering require different escaping strategies; for example, a properly HTML-escaped string is not good enough to use in a JavaScript block. Requirements and technologies change constantly: today you render your data one way, tomorrow you might use another (say, a mobile client that doesn't use HTML at all to render data and therefore doesn't need HTML-escaping). You can only be sure about the proper escaping strategy while rendering your data. This is why modern frameworks delegate escaping to templating engines. I'd recommend reviewing the following article: XSS (Cross Site Scripting) Prevention Cheat Sheet
Escaping the user's input is actually a destructive/lossy operation: if you escape the input before persisting it to storage, you will never find out what the original input was. There's no deterministic way to 'unescape' an HTML-escaped string; consider the mobile client example above.
That is why I believe the right way to go is to delegate escaping to your templating engines (i.e. Play and the JS templating engine you're using with Backbone). There's no need to HTML-escape a string you serialize to JSON. Notice that behind the scenes the JSON serializer will JSON-escape your strings; e.g. if you have a quote in your string, it will be properly escaped to ensure the resulting JSON is correct. It's a JSON serializer, after all, so it only cares about proper JSON rendering; it knows nothing about HTML (and it shouldn't). However, when you render your JSON data on the client side, you should properly HTML-escape it using the functionality provided by the JS templating engine you're using with Backbone.
Answering another question: you can use play.api.templates.HtmlFormat to escape a raw HTML string manually:
import play.api.templates.HtmlFormat
...
HtmlFormat.escape("<b>hello</b>").toString()
// -> &lt;b&gt;hello&lt;/b&gt;
If you really need to make the JSON encoder escape certain HTML strings, a good idea might be to create a wrapper for them, say RawString, and provide a custom Format[RawString] which also HTML-escapes the string in its writes method. For details see the play.api.libs.json API documentation.
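Since the question uses Play's Java API (play.libs.Json, which is backed by Jackson), here is a hedged Java translation of that wrapper idea. The RawString name, the serializer, and the manual replace chain are illustrative assumptions rather than a Play or Jackson feature; the only library behaviour relied on is that Jackson honours @JsonSerialize on the wrapper type.

import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonSerializer;
import com.fasterxml.jackson.databind.SerializerProvider;
import com.fasterxml.jackson.databind.annotation.JsonSerialize;

import java.io.IOException;

/** Hypothetical wrapper for strings that must be HTML-escaped when serialized to JSON. */
@JsonSerialize(using = RawString.HtmlEscapingSerializer.class)
public class RawString {

    private final String value;

    public RawString(String value) {
        this.value = value;
    }

    public String value() {
        return value;
    }

    /** Escapes HTML special characters while Jackson writes the JSON string value. */
    public static class HtmlEscapingSerializer extends JsonSerializer<RawString> {
        @Override
        public void serialize(RawString raw, JsonGenerator gen, SerializerProvider provider)
                throws IOException {
            String escaped = raw.value()
                    .replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
            gen.writeString(escaped);
        }
    }
}

A model field declared as RawString would then come out of Json.toJson(myModel) already HTML-escaped, while ordinary String fields keep the default behaviour.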
I have a use case in which I need to render an unformatted text in the format of a given web page programmatically in Java. i.e. The text should automatically be formatted like the web page with styles, paragraphs, bullet points etc.
As I see it, I will first have to analyze the piece of unformatted text to find candidates for paragraphs, bullet points, headings, etc. I intend to use Lucene analyzers/tokenizers for this task. Are there any alternatives?
The second problem is to convert the formatted web page into some kind of template (e.g. a Velocity template) with placeholders for various entities like titles, bullet points, etc.
Is there any text analysis/templating library in Java that can help me do this? Preferably open source.
Are there any other suggestions for doing this sort of task in a better way in Java?
Thanks for your help.
There are a lot of hard parts to what you're doing.
The user input
If you don't ask your user to provide any context, you're never going to guess the structure of the text. At the least, you should ask them to provide a title and a series of paragraphs in your GUI.
Ideally, you could ask them to follow a well-known markup language (Markdown, Textile, etc.) and use an open source parser to extract the structure.
The external page
If an arbitrary page is used, the only thing you can rely on is the structural markup. So assuming you know the title of the page should be "Hello World", and there is an "h1" element somewhere in the page, you can maybe assume that this is where the header could go.
But if the page is a div tag-soup, and only CSS is used to differentiate the rendering of the header from the bulk of the text, you're going to have to guess how the styling is done: that's plain impossible if you don't know how the page is made.
I don't think Lucene would help for this (as far as I know, Lucene is made to create an index of the words used in a body of text; I don't think it can help you guess which part of the text is meant to be a title, a subtitle, etc.).
Generating templates from external page
Assuming you have "guessed" right, you could generate the content by
copy pasting the page
replacing the parts to change with tags of your template language of choice
storing the template somewhere the templating system can access it
configuring your template / view system (viewResolver for Velocity) to use the right template for the right person
That would of course pose terrible legal questions, since your templates would incorporate works by the original website author (most probably copyrighted material)
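For the rendering side of those steps, here is a minimal Velocity sketch; the template name page.vm and its $title/$paragraphs placeholders are hypothetical, and the property keys assume the classic Velocity 1.x configuration names.

import java.io.StringWriter;
import java.util.Arrays;

import org.apache.velocity.Template;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class PageTemplateDemo {
    public static void main(String[] args) {
        // Load templates from the classpath.
        VelocityEngine engine = new VelocityEngine();
        engine.setProperty("resource.loader", "class");
        engine.setProperty("class.resource.loader.class",
                "org.apache.velocity.runtime.resource.loader.ClasspathResourceLoader");
        engine.init();

        // "page.vm" stands for a template derived from the copied page,
        // with the variable parts replaced by $title and $paragraphs.
        Template template = engine.getTemplate("page.vm");

        VelocityContext context = new VelocityContext();
        context.put("title", "Hello World");
        context.put("paragraphs", Arrays.asList("First paragraph.", "Second paragraph."));

        StringWriter out = new StringWriter();
        template.merge(context, out);
        System.out.println(out);
    }
}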
A more realistic solution
I would suggest you constrain your problem to:
using input that has some structure information available (use a GUI to enter it, use a markup language, whatever)
using templates that you provide, know the structure of (and can reuse very easily)
Note that none of those points are related to the template system.
Otherwise, I'm afraid you're heading for an unreasonable amount of work...
What is the best way to detect data types inside an HTML page using Java facilities such as the DOM API, regexps, etc.?
I'd like to detect types the way the Skype plugin does for phone/Skype numbers, and similarly for addresses, emails, times, etc.
'Types' is an inappropriate term for the kind of information you are referring to. The choice between the DOM API and regexes depends on the structure of the information within the page.
If you know the structure (for example, tables are used for displaying the information, and you already know in which cell you can find the phone number and in which cell you can find the email address), it makes sense to go with a DOM API.
Otherwise, you should use regex on plain HTML text without parsing it.
I'd use regexes in the following order:
Extract only the BODY content
Remove all tags to leave just plain text
Match relevant patterns in text
Of course, this assumes that markup isn't providing hints, and that you're purely extracting data, not modifying page context.
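A hedged sketch of those three steps in Java; the class name and the deliberately simplified e-mail pattern are mine, and Jsoup is used here only as one convenient way to reduce the page to plain text:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;

public class EmailExtractorDemo {

    // Deliberately simplified e-mail pattern, for illustration only.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    public static void main(String[] args) {
        String html = "<html><body><p>Contact: jane.doe@example.com</p></body></html>";

        // Steps 1-2: reduce the page to plain body text.
        String text = Jsoup.parse(html).body().text();

        // Step 3: match the relevant patterns in the text.
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            System.out.println("Found e-mail: " + m.group());
        }
    }
}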
Hope this helps,
Phil Lello
I'm developing a web application, and facing some security problems.
In my app users can send messages and see others' messages (a bulletin-board-like app). I'm validating all the form fields that users can send to my app.
There are some very easy fields, like "nick name", which can be 6-10 alphabetical characters, or the message sending time, which is sent to the users as a string; then (when users ask for messages that are "younger" or "older" than a date) I parse it with SimpleDateFormat (I'm developing in Java, but my question is not related only to Java).
The big problem is the message field. I can't restrict it to only alphabetical characters (upper or lowercase), because I have to deal with some often-used characters like ", ', /, {, } etc. (users would not be satisfied if the system didn't allow them to use this stuff).
According to this http://ha.ckers.org/xss.html, there are a lot of ways people can "hack" my site. But I'm wondering, is there anything I can do to prevent that? Not everything, because there is no 100% protection, but I'd like a solution that can protect my site.
I'm using servlets on the server side, and jQuery, on the client side. My app is "full" AJAX, so users open 1 JSP, then all the data is downloaded and rendered by jQuery using JSON. (yeah, I know it's not "users-without-javascript" friendly, but it's 2010, right? :-) )
I know front end validation is not enough. I'd like to use 3 layer validation:
- 1. front end: JavaScript validates the data, then sends it to the server
- 2. server side: the same validation; if there is anything that shouldn't be there (because of the client-side JavaScript), I BAN the user
- 3. if there is anything that I wasn't able to catch earlier, the rendering process handles it and renders it appropriately
Is there any "out of the box" solution, especially for java? Or other solution that I can use?
To minimize XSS attacks, the important thing is to encode any field data before putting it back on the page, e.g. change > to &gt; and so on. This prevents malicious code from being executed when it is added to the page.
I think you are doing a lot of right things by whitelisting the data you expect for different fields. Beyond that, for fields which allow other, potentially problematic characters, encoding will fix the issue for you.
Further, since you are using Ajax, you get some protection, as people cannot simply override values in URL parameters etc.
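A minimal sketch of that encode-before-output step, assuming the OWASP Java Encoder as the utility (the library choice and the class name are mine; any HTML-encoding helper would do):

import org.owasp.encoder.Encode;

public class MessageRenderer {

    /** Encodes user-supplied text for an HTML body context before echoing it to the page. */
    public static String toHtmlText(String userMessage) {
        return Encode.forHtml(userMessage);
    }

    public static void main(String[] args) {
        // "<script>alert(1)</script>" becomes "&lt;script&gt;alert(1)&lt;/script&gt;"
        System.out.println(toHtmlText("<script>alert(1)</script>"));
    }
}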
Look at the AntiSamy library. It allows you to define rulesets for your application, then run your user input through AntiSamy to clean it per your rules.
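A short hedged sketch of that flow (the class name and the policy file name are placeholders; Policy, AntiSamy, and CleanResults are the library's own types):

import org.owasp.validator.html.AntiSamy;
import org.owasp.validator.html.CleanResults;
import org.owasp.validator.html.Policy;
import org.owasp.validator.html.PolicyException;
import org.owasp.validator.html.ScanException;

public class AntiSamyExample {

    public static String clean(String dirtyInput) throws PolicyException, ScanException {
        // "antisamy-policy.xml" is a placeholder for whichever ruleset file you define.
        Policy policy = Policy.getInstance("antisamy-policy.xml");
        AntiSamy antiSamy = new AntiSamy();
        CleanResults results = antiSamy.scan(dirtyInput, policy);
        return results.getCleanHTML();
    }
}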
The easiest way is to do a simple replacement for the following
< with &lt;
> with &gt;
' with \'
That will solve most database vulnerabilities.