Java: Best way to remove Javascript from HTML

What's the best library/approach for removing Javascript from HTML that will be displayed?
For example, take:
<html><body><span onmousemove='doBadXss()'>test</span></body></html>
and leave:
<html><body><span>test</span></body></html>
I see the DeXSS project. But is that the best way to go?

JSoup has a simple method for sanitizing HTML based on a whitelist.
Check http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer
It uses a whitelist, which is safer than the blacklist approach DeXSS uses. From the DeXSS page:
There are still a number of known XSS attacks that DeXSS does not yet detect.
A blacklist only blocks known unsafe constructs, while a whitelist only allows known safe ones, so unknown and potentially unsafe constructs are only caught by the whitelist approach.
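As an illustration, here is a minimal sketch of Jsoup's clean method with a whitelist, applied to the question's example. The addTags("span") call is an assumption made so the span element itself survives; pick whatever whitelist fits your content (newer Jsoup versions rename Whitelist to Safelist).

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class SanitizeExample {
    public static void main(String[] args) {
        String unsafe = "<html><body><span onmousemove='doBadXss()'>test</span></body></html>";
        // Start from the relaxed whitelist and also allow <span>; event-handler
        // attributes such as onmousemove are not whitelisted, so they are dropped.
        String safe = Jsoup.clean(unsafe, Whitelist.relaxed().addTags("span"));
        // Note: clean() works on a body fragment, so it returns
        // "<span>test</span>" rather than the full <html><body> wrapper.
        System.out.println(safe);
    }
}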

The easiest way would be to not have those in the first place... It probably makes sense to allow only very simple tags in free-form fields and to disallow attributes entirely.
Probably not the answer you're going for, but in many cases you only want to provide markup capabilities, not a full editing suite.
Similarly, an even easier approach would be to provide a text-based syntax such as Markdown for editing (there aren't many ways to exploit the SO edit area, for instance: Markdown syntax plus a limited tag list without attributes).
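If you do go the "simple tags only, no attributes" route with a sanitizer, a minimal sketch with Jsoup (an assumption; this answer doesn't name a library) could look like this:

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class SimpleTagsOnly {
    // Allows only a handful of formatting tags and no attributes at all.
    private static final Whitelist SIMPLE = Whitelist.none()
            .addTags("b", "i", "em", "strong", "p", "br");

    public static String clean(String userInput) {
        return Jsoup.clean(userInput, SIMPLE);
    }

    public static void main(String[] args) {
        // <a href=...> is stripped (tag not allowed); <b> is kept but its onclick is not.
        System.out.println(clean("<b onclick='x()'>bold</b> <a href='evil'>link</a>"));
    }
}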

You could try dom4j (http://dom4j.sourceforge.net/dom4j-1.6.1/). It is a DOM parser (as opposed to SAX) and lets you easily traverse and manipulate the DOM, removing node attributes like onmouseover, for example (or entire elements like <script>), before writing back out or streaming somewhere. Depending on how wild your HTML is, you may need to clean it up first; jtidy (http://jtidy.sourceforge.net/) is good for that.
But obviously doing all this involves some overhead if you're doing this at page render time.
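A rough dom4j sketch of that idea; the input here is already well-formed XML (run wild HTML through jtidy first, as noted above), and the names being stripped are illustrative assumptions rather than a complete sanitizer.

import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

import java.util.ArrayList;

public class StripActiveContent {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentHelper.parseText(
                "<html><body><span onmousemove='doBadXss()'>test</span><script>bad()</script></body></html>");
        strip(doc.getRootElement());
        System.out.println(doc.asXML());
    }

    @SuppressWarnings("unchecked")
    private static void strip(Element element) {
        // Copy the lists first so nodes can be removed while iterating.
        for (Attribute attr : new ArrayList<Attribute>(element.attributes())) {
            if (attr.getName().toLowerCase().startsWith("on")) {
                element.remove(attr);      // drop event-handler attributes (onmousemove, onclick, ...)
            }
        }
        for (Element child : new ArrayList<Element>(element.elements())) {
            if ("script".equalsIgnoreCase(child.getName())) {
                element.remove(child);     // drop <script> elements entirely
            } else {
                strip(child);
            }
        }
    }
}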

Related

Is it bad practice to create XML files directly without using a class to store the structure? [duplicate]

In the thread What’s your favorite “programmer ignorance” pet peeve?, the following answer appears, with a large amount of upvotes:
Programmers who build XML using string concatenation.
My question is, why is building XML via string concatenation (such as a StringBuilder in C#) bad?
I've done this several times in the past, as it's sometimes the quickest way for me to get from point A to point B when it comes to the data structures/objects I'm working with. So far, I have come up with a few reasons why this isn't the greatest approach, but is there something I'm overlooking? Why should this be avoided?
Probably the biggest reason I can think of is you need to escape your strings manually, and most new programmers (and even some experienced programmers) will forget this. It will work great for them when they test it, but then "randomly" their apps will fail when someone throws an & symbol in their input somewhere. Ok, I'll buy this, but it's really easy to prevent the problem (SecurityElement.Escape to name one).
When I do this, I usually omit the XML declaration (i.e. <?xml version="1.0"?>). Is this harmful?
Performance penalties? If you stick with proper string concatenation (i.e. StringBuilder), is this anything to be concerned about? Presumably, a class like XmlWriter will also need to do a bit of string manipulation...
There are more elegant ways of generating XML, such as using XmlSerializer to automatically serialize/deserialize your classes. Ok sure, I agree. C# has a ton of useful classes for this, but sometimes I don't want to make a class for something really quick, like writing out a log file or something. Is this just me being lazy? If I am doing something "real" this is my preferred approach for dealing w/ XML.
You can end up with invalid XML, but you will not find out until you parse it again - and then it is too late. I learned this the hard way.
I think readability, flexibility and scalability are important factors. Consider the following piece of Linq-to-Xml:
XDocument doc = new XDocument(new XDeclaration("1.0", "UTF-8", "yes"),
    new XElement("products", from p in collection
        select new XElement("product",
            new XAttribute("guid", p.ProductId),
            new XAttribute("title", p.Title),
            new XAttribute("version", p.Version))));
Can you find a way to do it easier than this? I can output it to a browser, save it to a document, add attributes/elements in seconds, and so on ... just by adding a couple of lines of code. I can do practically everything with it without much effort.
Actually, I find the biggest problem with string concatenation is not getting it right the first time, but rather keeping it right during code maintenance. All too often, a perfectly-written piece of XML using string concat is updated to meet a new requirement, and string concat code is just too brittle.
As long as the alternatives were XML serialization and XmlDocument, I could see the simplicity argument in favor of string concat. However, ever since XDocument et al., there is just no reason to use string concat to build XML anymore. See Sander's answer for the best way to write XML.
Another benefit of XDocument is that XML is actually a rather complex standard, and most programmers simply do not understand it. I'm currently dealing with a person who sends me "XML", complete with unquoted attribute values, missing end tags, improper case sensitivity, and incorrect escaping. But because IE accepts it (as HTML), it must be right! Sigh... Anyway, the point is that string concatenation lets you write anything, but XDocument will force standards-complying XML.
I wrote a blog entry back in 2006 moaning about XML generated by string concatenation; the simple point is that if an XML document fails to validate (encoding issues, namespace issues and so on) it is not XML and cannot be treated as such.
I have seen multiple problems with XML documents that can be directly attributed to generating XML documents by hand using string concatenation, and nearly always around the correct use of encoding.
Ask yourself this: what character set am I currently encoding my document with ('ascii7', 'ibm850', 'iso-8859-1', etc.)? What will happen if I write a UTF-16 string value into an XML document that has been manually declared as 'ibm850'?
Given the richness of the XML support in .NET with XmlDocument and now especially with XDocument, there would have to be a seriously compelling argument for not using these libraries over basic string concatenation IMHO.
I think the problem is that you aren't treating the XML file as a logical data store, but as a simple text file into which you write strings.
Obviously those libraries do string manipulation for you under the hood, but reading/writing XML should be treated more like saving data to a database, or something logically similar.
If you need trivial XML then it's fine. It's just that the maintainability of string concatenation breaks down when the XML becomes larger or more complex. You pay either at development time or at maintenance time. The choice is always yours, but history suggests maintenance is always more costly, so anything that makes it easier is generally worthwhile.
You need to escape your strings manually. That's right. But is that all? Sure, you can put the XML spec on your desk and double-check every time that you've considered every possible corner-case when you're building an XML string. Or you can use a library that encapsulates this knowledge...
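To make the "library encapsulates this knowledge" point concrete: the thread's examples are C#, but the same idea can be sketched with Java's javax.xml.stream (XmlWriter plays the equivalent role in .NET). The writer escapes markup characters for you; the product/title names are made up for illustration.

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import java.io.StringWriter;

public class EscapeDemo {
    public static void main(String[] args) throws Exception {
        String title = "Ben & Jerry's <finest>";

        // Naive string concatenation: produces invalid XML, because & and < are not escaped.
        String concatenated = "<product title=\"" + title + "\"/>";
        System.out.println(concatenated);

        // A streaming writer escapes the value automatically.
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("product");
        xml.writeAttribute("title", title);   // the & and < in the value are escaped here
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        System.out.println(out.toString());
    }
}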
Another point against using string concatenation is that the hierarchical structure of the data is not clear when reading the code. In Sander's Linq-to-XML example, for instance, it's clear which parent element the "product" element belongs to, which element the "title" attribute applies to, etc.
As you said, it's just awkward to build XML correctly using string concatenation, especially now that you have LINQ to XML, which allows simple construction of an XML graph and gets namespaces etc. correct.
Obviously context and usage matter; in the logging example, string.Format can be perfectly acceptable.
But too often people ignore these alternatives when working with complex XML graphs and just use a StringBuilder.
The main reason is DRY: Don't Repeat Yourself.
If you use string concat to do XML, you will constantly be repeating the functions that keep your string as a valid XML document. All the validation would be repeated, or not present. Better to rely on a class that is written with XML validation included.
I've always found creating an XML to be more of a chore than reading in one. I've never gotten the hang of serialization - it never seems to work for my classes - and instead of spending a week trying to get it to work, I can create an XML file using strings in a mere fraction of the time and write it out.
And then I load it in using an XMLReader tree. And if the XML file doesn't read as valid, I go back, find the problem within my saving routines, and correct it. But until I get a working save/load system, I refuse to perform mission-critical work, because I need to know my tools are solid.
I guess it comes down to programmer preference. There are different ways of doing things, for sure, but for developing/testing/researching/debugging this would be fine. However, I would also clean up my code and comment it before handing it off to another programmer.
Because regardless of whether you're using StringBuilder or XMLNodes to save/read your file, if it is all a gibberish mess, nobody is going to understand how it works.
Maybe it won't ever happen, but what if your environment switches to XML 2.0 someday? Your string-concatenated XML may or may not be valid in the new environment, but XDocument will almost certainly do the right thing.
Okay, that's a reach, but especially if your not-quite-standards-compliant XML doesn't specify an XML version declaration... just saying.

Get nodes in an HTML document that contain a word

I want to write a script that checks an HTML document for keywords and identifies the document nodes in which they are contained (possibly assigning each a unique identifier).
I am not a professional programmer and do not know the finer points of low-level languages and related concepts; I'm afraid of doing something very bad and unsupportable.
How can I isolate the desired nodes?
My experience is JS and PHP (PHP only for very simple things). I also do not want to rely on JS's ability to work with nodes. My thoughts:
make a string of the HTML
verify that the keywords exist on the page
if a keyword exists on the page: for each node in the body element, get its first and last positions (for example, when we see an opening tag we already know the position of each character, so we can calculate the first position where the tag opens and the last where it closes, and so on for all nodes).
We know the position of the word (e.g. 192 to 199) and check which range it falls into (in this case, those ranges are the nodes of the HTML document).
I need ideas from experienced programmers. It does not matter what language you program in (except web-oriented ones); every opinion is important to me. It is likely that there are libraries that solve such problems. I very much hope that you will understand me; English is not my native language.
I always recommend Beautiful Soup for this kind of thing. It is a Python library that allows you to parse XML/HTML documents really quickly. I would have thought you could quite quickly get something running that extracts the text from each div element. Then, using Python's built-in string manipulation tools, I'm sure searching for particular words would be fairly simple.
You need to use an HTML parser. Refer to:
Which HTML Parser is the best?
After that, you can use XPath to extract whichever nodes you need.
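Since Jsoup is already mentioned elsewhere on this page, here is a rough Java sketch of the parser approach; the data-keyword-id attribute name is just an illustrative choice for the "unique identifier" idea.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class KeywordNodes {
    public static void main(String[] args) {
        String html = "<html><body><p>foo bar</p><div>keyword here</div></body></html>";
        Document doc = Jsoup.parse(html);
        int id = 0;
        // :containsOwn matches elements whose own text contains the keyword.
        for (Element el : doc.select(":containsOwn(keyword)")) {
            el.attr("data-keyword-id", String.valueOf(id++));  // tag the node with an identifier
        }
        System.out.println(doc);
    }
}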

Escape HTML in JSON with PlayFramework2

I am using PlayFramework2 and I can't find a way to properly handle HTML escaping.
In the template system, HTML entities are filtered by default.
But when I use REST requests with Backbone.js, my JSON objects are not filtered.
I use play.libs.Json.toJson(myModel) to transform an Object into a String.
So, in my controller, I use return ok(Json.toJson(myModel)); to send the response ... but here, the attributes of my model are not secured.
I can't find a way to handle it ...
Second question:
The template engine filters HTML entities by default, which means we have to store the raw user input in our database.
Is that a safe behaviour?
Third question:
Is there a function in the Play Framework to manually escape strings? All those I can find require adding new dependencies.
Thanks !
Edit: I found a way at the Backbone.js templating level:
- Use myBackboneModel.escape('attr'); instead of myBackboneModel.get('attr');
The Underscore.js templating system also includes that option: <%= attr %> renders without escaping, but <%- attr %> renders with escaping!
Just be careful about efficiency: strings are re-escaped at each rendering. That's why the Backbone .create() should be preferred.
Best practices for XSS-attack prevention usually recommend reasoning about your output rather than your input. There are a number of reasons for that. In my opinion the most important are:
It doesn't make sense to reason about escaping something unless you know exactly how you are going to output/render your data, because different ways of rendering require different escaping strategies; e.g. a properly HTML-escaped string is not good enough to use in a Javascript block. Requirements and technologies change constantly: today you render your data one way, tomorrow you might be using another (say you end up with a mobile client which doesn't require HTML-escaping, because it doesn't use HTML to render data at all). You can only be sure about the proper escaping strategy when rendering your data. This is why modern frameworks delegate escaping to templating engines. I'd recommend reviewing the following article: XSS (Cross Site Scripting) Prevention Cheat Sheet
Escaping the user's input is actually a destructive/lossy operation: if you escape the user's input before persisting it to storage, you will never find out what the original input was. There's no deterministic way to 'unescape' an HTML-escaped string; consider my mobile-client example above.
That is why I believe the right way to go is to delegate escaping to your templating engines (i.e. Play and the JS templating engine you're using with Backbone). There's no need to HTML-escape strings you serialize to JSON. Note that behind the scenes the JSON serializer will JSON-escape your strings; e.g. if you have a quote in your string, it will be properly escaped to ensure the resulting JSON is correct. It's a JSON serializer, after all, so it only cares about proper JSON rendering; it knows nothing about HTML (and it shouldn't). However, when you render your JSON data on the client side, you should properly HTML-escape it using the functionality provided by the JS templating engine you're using with Backbone.
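As a tiny illustration of that distinction, using the Java API the question already uses (play.libs.Json); the exact textual output depends on the underlying Jackson version.

import play.libs.Json;

public class JsonEscapingDemo {
    public static void main(String[] args) {
        // The inner quotes are JSON-escaped so the output stays valid JSON,
        // but the HTML markup itself is passed through untouched.
        System.out.println(Json.toJson("<span onmousemove=\"doBadXss()\">test</span>"));
        // HTML-escaping, if needed, belongs in the client-side template that renders this value.
    }
}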
Answering another question: you can use play.api.templates.HtmlFormat to escape a raw HTML string manually:
import play.api.templates.HtmlFormat
...
HtmlFormat.escape("<b>hello</b>").toString()
// -> &lt;b&gt;hello&lt;/b&gt;
If you really need to make the JSON encoder escape certain HTML strings, a good idea might be to create a wrapper for them, say RawString, and provide a custom Format[RawString] which also HTML-escapes the string in its writes method. For details see the play.api.libs.json API documentation.

How to detect different data types inside HTML page?

What is the best way to detect data types inside an HTML page using Java facilities (DOM API, regexes, etc.)?
I'd like to detect types the way the Skype plugin does for phone/Skype numbers, and similarly for addresses, emails, times, etc.
'Types' is an inappropriate term for the kind of information you are referring to. The choice between the DOM API and regexes depends on the structure of the information within the page.
If you know the structure (for example, tables are used for displaying the information and you already know which cell contains the phone number and which cell contains the email address), it makes sense to go with the DOM API.
Otherwise, you should use regexes on the plain HTML text without parsing it.
I'd use regexes in the following order:
Extract only the BODY content
Remove all tags to leave just plain text
Match relevant patterns in text
Of course, this assumes that markup isn't providing hints, and that you're purely extracting data, not modifying page context; a rough sketch of this pipeline follows after this answer.
Hope this helps,
Phil Lello
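A rough Java sketch of the regex pipeline described in the answer above; the email/phone patterns are deliberately simplistic placeholders, not production-grade detectors.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveDetector {
    public static void main(String[] args) {
        String html = "<html><body><p>Mail me at jane@example.com or call +1 555 0100.</p></body></html>";

        // 1. Keep only the BODY content.
        Matcher body = Pattern.compile("(?is)<body[^>]*>(.*?)</body>").matcher(html);
        String text = body.find() ? body.group(1) : html;

        // 2. Strip remaining tags to leave plain text.
        text = text.replaceAll("(?s)<[^>]+>", " ");

        // 3. Match the patterns of interest in the plain text.
        Matcher email = Pattern.compile("[\\w.+-]+@[\\w.-]+\\.[A-Za-z]{2,}").matcher(text);
        while (email.find()) {
            System.out.println("email: " + email.group());
        }
        Matcher phone = Pattern.compile("\\+?\\d[\\d ()-]{6,}\\d").matcher(text);
        while (phone.find()) {
            System.out.println("phone: " + phone.group());
        }
    }
}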

Error-tolerant XML parsing in Scala

I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.
Update:
What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML and deal with it the same way; otherwise I'd have to have two completely different sets of functions for dealing with documents.
Obviously, because the input is not well-formed and I'm trying to create a well-formed tree, there would have to be some heuristics involved (such as: when you see <parent><child></parent>, you close the <child> first, and when you then see the stray </child> you ignore it). But of course this isn't a proper grammar, and so there's no correct way of doing it.
What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:
<parent>
<child>
</parent>
</child>
Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.
Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).
Try the parser on the XHtml object. It is much more lenient than the one on XML.
Take a look at htmlcleaner. I have used it successfully to convert "HTML from the wild" to valid XML.
Try Tag Soup.
JTidy does something similar but only for HTML.
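For the link-extraction case mentioned in the question's update, here is a minimal Java sketch of driving TagSoup as a SAX parser; the same XMLReader can also be driven from Scala code, and the badly nested input string is just an example.

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import java.io.StringReader;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        XMLReader reader = new Parser();   // TagSoup's lenient, SAX-style HTML parser
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("a".equalsIgnoreCase(local)) {
                    String href = atts.getValue("href");
                    if (href != null) {
                        System.out.println(href);
                    }
                }
            }
        });
        // Badly nested input; TagSoup still produces a usable event stream.
        reader.parse(new InputSource(new StringReader(
                "<p><a href='http://example.com'>first<div></a></p>")));
    }
}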
I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".
While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Being an event-based API, I am assuming it will be more fault-tolerant than a DOM parser.)
There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.
I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.
I agree with the answers that turning invalid XML into "correct" XML is impossible.
Why don't you just do a regular text search for the hrefs if that's all you're interested in? One issue would be commented out links, but if the XML is invalid, it might not be possible to tell what is intended to be commented out!
Caucho has a JAXP compliant XML parser that is a little bit more tolerant than what you would usually expect. (Including support for dealing with escaped character entity references, AFAIK.)
Find JavaDoc for the parsers here
A related topic (with my solution) is listed below:
Scala and html parsing
