How to transform an HTML fragment to another HTML fragment? - java

I have a browser-based editor (a contentEditable element) where users can paste or insert HTML fragments.
These fragments can be any kind of HTML, so we must sanitize the content so that it contains no dangerous tags (such as <script>).
I know of some sanitizer libraries that support a whitelist policy (like JSoup on the JVM), but their rules are generally very simple: they state which tags and attributes are whitelisted and nothing else.
We want more advanced rules like:
Define which inline styles to keep
Transform relative links to absolute links
Blacklist or whitelist some tags according to their className
Allow some URI attributes according to the URI pattern (for example, allowing only links to a certain domain)
In some cases, "replace" forbidden DOM nodes by their children (to remove formatting and HTML layout elements without losing the text nodes that were inside the blacklisted tags)
So far we have written some code to handle this, but I find it very hacky. Is there a known library, standard, or algorithm for handling such things? I'm not an XML parsing/transformation expert. Is there anything like XSLT, SAX, or something else that could help me solve this problem?
I'm looking for solutions on both the browser (JS) and the JVM (Java or Scala). Any idea on how to achieve this?

Maybe Showdown.js can help you? https://github.com/showdownjs/showdown
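For the JVM side, here is a minimal sketch of two of the rules from the question: replacing forbidden nodes by their children, and turning relative links into absolute ones. It uses only the JDK's built-in DOM and assumes the fragment is already well-formed XML with a single root element (real-world HTML would first need a cleaner such as JTidy or a lenient parser). The class name FragmentRewriter, the UNWRAP set, and the base URI are hypothetical names chosen for illustration, not part of any library.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.net.URI;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FragmentRewriter {
    // Hypothetical policy: tags that are dropped while their content is kept.
    static final Set<String> UNWRAP = Set.of("font", "center", "span");

    static String rewrite(String xhtmlFragment, String baseUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtmlFragment)));
            // The root element itself is never unwrapped in this sketch.
            visit(doc.getDocumentElement(), URI.create(baseUri));

            StringWriter out = new StringWriter();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process fragment", e);
        }
    }

    private static void visit(Element parent, URI base) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before mutating the tree
            if (child instanceof Element) {
                Element el = (Element) child;
                visit(el, base); // handle descendants first
                String tag = el.getTagName().toLowerCase(Locale.ROOT);
                if (UNWRAP.contains(tag)) {
                    // Move every child in front of the forbidden element, then drop it.
                    while (el.getFirstChild() != null) {
                        parent.insertBefore(el.getFirstChild(), el);
                    }
                    parent.removeChild(el);
                } else if (tag.equals("a") && el.hasAttribute("href")) {
                    // URI.resolve throws IllegalArgumentException for syntactically invalid hrefs.
                    el.setAttribute("href", base.resolve(el.getAttribute("href")).toString());
                }
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        String in = "<div><font color=\"red\">Hello <b>world</b></font> "
                + "<a href=\"/docs/page.html\">link</a></div>";
        System.out.println(rewrite(in, "https://example.com/app/"));
    }
}
```

The same traversal could carry the other rules (class-name checks, URI pattern checks, inline style filtering) as extra branches. In the browser, the equivalent DOM calls (insertBefore, removeChild) exist in JavaScript as well.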

Related

Parse javascript generated content using Java

http://support.xbox.com/en-us/contact-us uses javascript to create some lists. I want to be able to parse these lists for their text. So for the above page I want to return the following:
Billing and Subscriptions
Xbox 360
Xbox LIVE
Kinect
Apps
Games
I was trying to use JSoup for a while before noticing it was generated using javascript. I have no idea how to go about parsing a page for its javascript generated content.
Where do I begin?
You'll want to use an HTML+JavaScript library like Cobra. It'll parse the DOM elements in the HTML as well as apply any DOM changes caused by JavaScript.
You could always download the whole page, split it into lines, look for the line containing the information, and then pull the pieces you want out of that string. That is the dirty way of doing it; I'm not sure there is a clean way.
I don't think that text is generated by javascript... If I disable javascript those options can be found inside the html at this location (a jquery selector just because it was easier to hand-write than figuring out the xpath without javascript enabled :))
'div#ShellNavigationBar ul.NavigationElements li ul li a'
Regardless, in direct answer to your query: you'd have to evaluate the javascript within the scope of the document, which I expect would be rather complex in Java. You'd have more luck identifying the javascript file that generates the relevant content and parsing that directly.

How to detect different data types inside HTML page?

What is the best way to detect data types inside an HTML page using Java facilities (DOM API, regexp, etc.)?
I'd like to detect types the way the Skype plugin does for phone/Skype numbers, and similarly for addresses, emails, times, etc.
'Types' is an inappropriate term for the kind of information you are referring to. Choice of DOM API or regex depends upon the structure of information within the page.
If you know the structure, (for example tables being used for displaying information, you already know from which cell you can find phone number and which cell you can find email address), it makes sense to go with a DOM API.
Otherwise, you should use regex on plain HTML text without parsing it.
I'd use regexes in the following order:
Extract only the BODY content
Remove all tags to leave just plain text
Match relevant patterns in text
Of course, this assumes that markup isn't providing hints, and that you're purely extracting data, not modifying page context.
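The three steps above can be sketched with java.util.regex as follows. This is a rough illustration, not a robust extractor: the class name PlainTextExtractor is hypothetical, the e-mail pattern is deliberately simple (it does not cover everything the e-mail RFCs allow), and the content of script and style elements inside the body would survive as plain text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainTextExtractor {
    // Step 1: capture everything between <body ...> and </body>.
    private static final Pattern BODY =
            Pattern.compile("<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Step 2: anything that looks like a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Step 3: a simplified e-mail pattern, used here as the "relevant pattern".
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    static String bodyText(String html) {
        Matcher m = BODY.matcher(html);
        String body = m.find() ? m.group(1) : html; // fall back to the whole input
        // Replace tags with a space so adjacent words do not run together, then tidy whitespace.
        return TAG.matcher(body).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    static List<String> findEmails(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(bodyText(html));
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>x@head.example</title></head>"
                + "<body><p>Mail <b>phil@example.com</b> or call us.</p></body></html>";
        System.out.println(bodyText(html));   // Mail phil@example.com or call us.
        System.out.println(findEmails(html)); // [phil@example.com]
    }
}
```

Phone numbers, times, and addresses would each need their own pattern in step 3; the address in the title is skipped because only the body is searched.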
Hope this helps,
Phil Lello

Sanitize HTML data

I'm fetching data from different RSS / ATOM feeds, and sometimes the HTML I receive has tags without closing tags or other problems, which breaks the page layout / styling.
Sometimes there is also a class name / id clash. Is there any way to sanitize it?
Can anybody point me to a reliable Javascript / Java implementation?
You can give JTidy a try.
JTidy can be used as a tool for cleaning up malformed and faulty HTML.
Another option is HTML Cleaner
HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.
I have used NekoHTML with great success. It's just a thin layer over the Apache Xerces parser that puts it into error-correcting mode, which is a great architecture: every time the Apache parser gets better, so does Neko. And there's no huge amount of extra code.

Java: Best way to remove Javascript from HTML

What's the best library/approach for removing Javascript from HTML that will be displayed?
For example, take:
<html><body><span onmousemove='doBadXss()'>test</span></body></html>
and leave:
<html><body><span>test</span></body></html>
I see the DeXSS project. But is that the best way to go?
JSoup has a simple method for sanitizing HTML based on a whitelist.
Check http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer
It uses a whitelist, which is safer than the blacklist approach DeXSS uses. From the DeXSS page:
There are still a number of known XSS attacks that DeXSS does not yet detect.
A blacklist only disallows known unsafe constructions, while a whitelist only allows known safe constructions. So only a whitelist protects against unknown, possibly unsafe constructions.
The easiest way would be to not have those in the first place... It probably would make sense to allow only very simple tags to be used in free-form fields and to disallow any kind of attributes.
Probably not the answer you're going for, but in many cases you only want to provide markup capabilities, not a full editing suite.
Similarly, another even easier approach would be to provide a text-based syntax, like Markdown, for editing. There are not that many ways you can exploit the SO edit area, for instance (Markdown syntax + limited tag list without attributes).
You could try dom4j http://dom4j.sourceforge.net/dom4j-1.6.1/ This is a DOM parser (as opposed to SAX) and allows you to easily traverse and manipulate the DOM, removing node attributes like onmouseover for example (or entire elements like <script>), before writing back out or streaming somewhere. Depending on how wild your html is, you may need to clean it up first - jtidy http://jtidy.sourceforge.net/ is good.
But obviously doing all this involves some overhead if you're doing this at page render time.
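As a rough sketch of the traversal described above, the following uses the JDK's built-in DOM instead of dom4j (a substitution made here so the example has no dependencies). It drops script elements and keeps only attributes on an allow-list, so event handlers such as onmousemove disappear without having to be listed. It assumes the input is already well-formed XML, for example after a JTidy pass. The class name ScriptStripper and the ALLOWED_ATTRS set are hypothetical, and this is not a complete XSS defence: URL-valued attributes, for instance, would also need their scheme checked before being allowed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ScriptStripper {
    // Hypothetical allow-list: every attribute not named here is removed.
    static final Set<String> ALLOWED_ATTRS = Set.of("title", "alt", "lang");

    static String strip(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            clean(doc.getDocumentElement());

            StringWriter out = new StringWriter();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process input", e);
        }
    }

    private static void clean(Element el) {
        // Walk the live attribute map backwards so removals do not shift unvisited entries.
        NamedNodeMap attrs = el.getAttributes();
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            String name = attrs.item(i).getNodeName();
            if (!ALLOWED_ATTRS.contains(name.toLowerCase(Locale.ROOT))) {
                el.removeAttribute(name);
            }
        }
        Node child = el.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before mutating the tree
            if (child instanceof Element) {
                Element c = (Element) child;
                if (c.getTagName().equalsIgnoreCase("script")) {
                    el.removeChild(c); // drop the element together with its content
                } else {
                    clean(c);
                }
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(strip(
                "<html><body><span onmousemove='doBadXss()'>test</span></body></html>"));
    }
}
```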

How to implement a possibility for user to post some html-formatted data in a safe way?

I have a textarea and I want to support the simplest formatting for posted data (at least whitespace and line breaks).
How can I achieve this? If I don't escape the response and keep some HTML tags, it will be a big security hole. But I don't see any other solution that allows text formatting in the browser.
So I should probably filter the user's input. But how can I do this? Are there any ready-to-use solutions? I'm using JSF, so is there any smart component that filters out everything except certain HTML tags?
Use an HTML parser which supports HTML filtering against a whitelist, like Jsoup. Here's an extract of relevance from its site.
Sanitize untrusted HTML
Problem
You want to allow untrusted users to supply HTML for output on your website (e.g. as comment submission). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.
Solution
Use the jsoup HTML Cleaner with a configuration specified by a Whitelist.
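import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // newer jsoup releases name this class Safelist

String unsafe =
"<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>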
And then to display it with whitespace preserved, apply CSS white-space: pre-wrap; on the HTML element where you're displaying it.
No all-in-one JSF component comes to mind.
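If whitespace and line breaks are all the formatting that is needed, one alternative is to allow no HTML at all: escape every HTML metacharacter on output and rely on the white-space: pre-wrap rule mentioned above. This is a suggestion added here rather than something the answer proposes, and the class name HtmlEscaper is hypothetical. A minimal sketch:

```java
class HtmlEscaper {
    // Replaces the five characters that have special meaning in HTML text and attributes.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (ch) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Line breaks pass through unchanged; pre-wrap makes the browser honour them.
        System.out.println(escape("<b>a & b</b>\nline 2"));
    }
}
```

JSF's h:outputText already escapes its value by default, so in a JSF page the CSS rule alone may be enough.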
Is there some reason why you need to accept HTML instead of some other markup language, such as markdown (which is what StackOverflow uses)?
http://daringfireball.net/projects/markdown/
Not sure what kind of tags you'd want to accept that wouldn't be covered by md or a similar formatting language...
