How to configure jsoup whitelist to allow internal anchor - java

How do I configure a jsoup Whitelist to allow internal anchor references, without allowing any arbitrary value?
Example html:
<a href="#section1" target="_self">Jump To Section 1</a>
<!-- ... -->
<a name="section1">Section 1</a>
If I attempt to clean the code with the relaxed Whitelist, the href is removed.
Jsoup.clean(html, Whitelist.relaxed().addAttributes("a", "name", "target"));
returns the following:
<a target="_self">Jump To Section 1</a>
<!-- ... -->
<a name="section1">Section 1</a>
If I manually build a Whitelist and add the tags and attributes that I want, but don't call addProtocols(...), I can get jsoup to leave the href in place. That doesn't seem like a good solution, though, because it doesn't filter out hrefs that contain JavaScript. For example, I want the a tag (or at least the href) removed from the following:
Jump To Section 1
<a name="section1">Section 1</a>
Is this possible with jsoup?
I did see the following patch submission to jsoup, but it doesn't look like it made it into the jsoup code base: https://github.com/jhy/jsoup/pull/77

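Whitelist whitelist = new Whitelist();
// Configure the whitelist before handing it to the Cleaner.
whitelist.addAttributes("a", "accesskey", "dir", "lang", "style", "tabindex", "title", "href");
Cleaner cleaner = new Cleaner(whitelist);
// clean() returns a new Document; doc is the already-parsed input Document.
Document cleaned = cleaner.clean(doc);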

If no protocols are whitelisted for an attribute, then all of them are implicitly allowed (see isSafeAttribute). Unfortunately, that means that if you want to allow internal anchors, you must never call addProtocols for the anchor tag's href. It looks like there was a pull request to add support, but it was never merged.
Be aware that if you allow all protocols, a malicious user can run JavaScript when a link is clicked:
Some text
so be cautious of that if you do not trust your HTML.
If you want to only allow say, http, https, and anchor tags, then I believe you are out of luck.

The reply with 3 upvotes doesn't answer the question at all.
The GitHub pull request mentioned in the OP has since been merged. For others who are looking for the answer:
Whitelist.relaxed().addProtocols("a", "href", "#")
Reference: Jsoup API Document
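For readers who want the policy itself spelled out, here is a minimal sketch in plain Java of an allow-list check that accepts only http, https, and internal anchor (#...) hrefs. It is not jsoup's implementation and does not replace the Whitelist configuration above. The class and method names are hypothetical, and relative URLs are deliberately rejected to keep the rule simple.

```java
import java.util.Locale;

// Hypothetical helper, not part of jsoup: a strict allow-list for href values.
final class HrefPolicy {
    private HrefPolicy() {}

    /** Returns true only for internal anchors and absolute http/https URLs. */
    static boolean isAllowedHref(String href) {
        if (href == null) {
            return false;
        }
        String value = href.trim();
        // Internal anchors such as "#section1" are allowed.
        if (value.startsWith("#")) {
            return true;
        }
        // Everything else must start with an explicitly allowed scheme.
        String lower = value.toLowerCase(Locale.ROOT);
        return lower.startsWith("http://") || lower.startsWith("https://");
    }
}
```

Because the check is an allow-list rather than a block-list, a value such as javascript:alert(1) is rejected simply because it matches none of the permitted prefixes.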

Related

HTML ignores ajax call [duplicate]

What are the possible reasons for document.getElementById, $("#id") or any other DOM method / jQuery selector not finding the elements?
Example problems include:
jQuery silently failing to bind an event handler
jQuery "getter" methods (.val(), .html(), .text()) returning undefined
A standard DOM method returning null resulting in any of several errors:
Uncaught TypeError: Cannot set property '...' of null
Uncaught TypeError: Cannot set properties of null (setting '...')
Uncaught TypeError: Cannot read property '...' of null
Uncaught TypeError: Cannot read properties of null (reading '...')
The most common forms are:
Uncaught TypeError: Cannot set property 'onclick' of null
Uncaught TypeError: Cannot read property 'addEventListener' of null
Uncaught TypeError: Cannot read property 'style' of null
The element you were trying to find wasn’t in the DOM when your script ran.
The position of your DOM-reliant script can have a profound effect on its behavior. Browsers parse HTML documents from top to bottom. Elements are added to the DOM and scripts are (generally) executed as they're encountered. This means that order matters. Typically, scripts can't find elements that appear later in the markup because those elements have yet to be added to the DOM.
Consider the following markup; script #1 fails to find the <div> while script #2 succeeds:
<script>
console.log("script #1:", document.getElementById("test")); // null
</script>
<div id="test">test div</div>
<script>
console.log("script #2:", document.getElementById("test")); // <div id="test" ...
</script>
So, what should you do? You've got a few options:
Option 1: Move your script
Given what we've seen in the example above, an intuitive solution might be to simply move your script down the markup, past the elements you'd like to access. In fact, for a long time, placing scripts at the bottom of the page was considered a best practice for a variety of reasons. Organized in this fashion, the rest of the document would be parsed before executing your script:
<body>
<button id="test">click me</button>
<script>
document.getElementById("test").addEventListener("click", function() {
console.log("clicked:", this);
});
</script>
</body><!-- closing body tag -->
While this makes sense and is a solid option for legacy browsers, it's limited and there are more flexible, modern approaches available.
Option 2: The defer attribute
While we did say that scripts are, "(generally) executed as they're encountered," modern browsers allow you to specify a different behavior. If you're linking an external script, you can make use of the defer attribute.
[defer, a Boolean attribute,] is set to indicate to a browser that the script is meant to be executed after the document has been parsed, but before firing DOMContentLoaded.
This means that you can place a script tagged with defer anywhere, even the <head>, and it should have access to the fully realized DOM.
<script src="https://gh-canon.github.io/misc-demos/log-test-click.js" defer></script>
<button id="test">click me</button>
Just keep in mind...
defer can only be used for external scripts, i.e.: those having a src attribute.
be aware of browser support, i.e.: buggy implementation in IE < 10
Option 3: Modules
Depending upon your requirements, you may be able to utilize JavaScript modules. Among other important distinctions from standard scripts (noted here), modules are deferred automatically and are not limited to external sources.
Set your script's type to module, e.g.:
<script type="module">
document.getElementById("test").addEventListener("click", function(e) {
console.log("clicked: ", this);
});
</script>
<button id="test">click me</button>
Option 4: Defer with event handling
Add a listener to an event that fires after your document has been parsed.
DOMContentLoaded event
DOMContentLoaded fires after the DOM has been completely constructed from the initial parse, without waiting for things like stylesheets or images to load.
<script>
document.addEventListener("DOMContentLoaded", function(e){
document.getElementById("test").addEventListener("click", function(e) {
console.log("clicked:", this);
});
});
</script>
<button id="test">click me</button>
Window: load event
The load event fires after DOMContentLoaded and additional resources like stylesheets and images have been loaded. For that reason, it fires later than desired for our purposes. Still, if you're considering older browsers like IE8, the support is nearly universal. Granted, you may want a polyfill for addEventListener().
<script>
window.addEventListener("load", function(e){
document.getElementById("test").addEventListener("click", function(e) {
console.log("clicked:", this);
});
});
</script>
<button id="test">click me</button>
jQuery's ready()
DOMContentLoaded and window:load each have their caveats. jQuery's ready() delivers a hybrid solution, using DOMContentLoaded when possible, failing over to window:load when necessary, and firing its callback immediately if the DOM is already complete.
You can pass your ready handler directly to jQuery as $(handler), e.g.:
<script src="https://code.jquery.com/jquery-3.6.0.js" integrity="sha256-H+K7U5CnXl1h5ywQfKtSj8PCmoN9aaq30gDh27Xc0jk=" crossorigin="anonymous"></script>
<script>
$(function() {
$("#test").click(function() {
console.log("clicked:", this);
});
});
</script>
<button id="test">click me</button>
Option 5: Event Delegation
Delegate the event handling to an ancestor of the target element.
When an element raises an event (provided that it's a bubbling event and nothing stops its propagation), each parent in that element's ancestry, all the way up to window, receives the event as well. That allows us to attach a handler to an existing element and sample events as they bubble up from its descendants... even from descendants added after the handler was attached. All we have to do is check the event to see whether it was raised by the desired element and, if so, run our code.
Typically, this pattern is reserved for elements that don't exist at load time or to avoid attaching a large number of duplicate handlers. For efficiency, select the nearest reliable ancestor of the target element rather than attaching it to the document.
Native JavaScript
<div id="ancestor"><!-- nearest ancestor available to our script -->
<script>
document.getElementById("ancestor").addEventListener("click", function(e) {
if (e.target.id === "descendant") {
console.log("clicked:", e.target);
}
});
</script>
<button id="descendant">click me</button>
</div>
jQuery's on()
jQuery makes this functionality available through on(). Given an event name, a selector for the desired descendant, and an event handler, it will resolve your delegated event handling and manage your this context:
<script src="https://code.jquery.com/jquery-3.6.0.js" integrity="sha256-H+K7U5CnXl1h5ywQfKtSj8PCmoN9aaq30gDh27Xc0jk=" crossorigin="anonymous"></script>
<div id="ancestor"><!-- nearest ancestor available to our script -->
<script>
$("#ancestor").on("click", "#descendant", function(e) {
console.log("clicked:", this);
});
</script>
<button id="descendant">click me</button>
</div>
Short and simple: Because the elements you are looking for do not exist in the document (yet).
For the remainder of this answer I will use getElementById for examples, but the same applies to getElementsByTagName, querySelector, and any other DOM method that selects elements.
Possible Reasons
There are three reasons why an element might not exist:
An element with the passed ID really does not exist in the document. You should double check that the ID you pass to getElementById really matches an ID of an existing element in the (generated) HTML and that you have not misspelled the ID (IDs are case-sensitive!).
If you're using getElementById, be sure you're only giving the ID of the element (e.g., document.getElementById("the-id")). If you're using a method that accepts a CSS selector (like querySelector), be sure you're including the # before the ID to indicate you're looking for an ID (e.g., document.querySelector("#the-id")). You must not use the # with getElementById, and must use it with querySelector and similar. Also note that if the ID has characters in it that aren't valid in CSS identifiers (such as a .; id attributes containing . characters are poor practice, but valid), you have to escape those when using querySelector (document.querySelector("#the\\.id")) but not when using getElementById (document.getElementById("the.id")).
The element does not exist at the moment you call getElementById.
The element isn't in the document you're querying even though you can see it on the page, because it's in an iframe (which is its own document). Elements in iframes aren't searched when you search the document that contains them.
If the problem is reason 3 (it's in an iframe or similar), you need to look through the document in the iframe, not the parent document, perhaps by getting the iframe element and using its contentDocument property to access its document (same-origin only). The rest of this answer addresses the first two reasons.
The second reason — it's not there yet — is quite common. Browsers parse and process the HTML from top to bottom. That means that any call to a DOM element which occurs before that DOM element appears in the HTML, will fail.
Consider the following example:
<script>
var element = document.getElementById('my_element');
</script>
<div id="my_element"></div>
The div appears after the script. At the moment the script is executed, the element does not exist yet and getElementById will return null.
jQuery
The same applies to all selectors with jQuery. jQuery won't find elements if you misspelled your selector or you are trying to select them before they actually exist.
An added twist is when jQuery is not found because you have loaded the script without protocol and are running from file system:
<script src="//somecdn.somewhere.com/jquery.min.js"></script>
This syntax is used to allow the script to load via HTTPS on a page with protocol https:// and to load the HTTP version on a page with protocol http://.
It has the unfortunate side effect of attempting and failing to load file://somecdn.somewhere.com... when the page is opened from the file system.
Solutions
Before you make a call to getElementById (or any DOM method for that matter), make sure the elements you want to access exist, i.e. the DOM is loaded.
This can be ensured by simply putting your JavaScript after the corresponding DOM element
<div id="my_element"></div>
<script>
var element = document.getElementById('my_element');
</script>
in which case you can also put the code just before the closing body tag (</body>) (all DOM elements will be available at the time the script is executed).
Other solutions include listening to the load [MDN] or DOMContentLoaded [MDN] events. In these cases it does not matter where in the document you place the JavaScript code, you just have to remember to put all DOM processing code in the event handlers.
Example:
window.onload = function() {
// process DOM elements here
};
// or
// does not work in IE 8 and below
document.addEventListener('DOMContentLoaded', function() {
// process DOM elements here
});
Please see the articles at quirksmode.org for more information regarding event handling and browser differences.
jQuery
First make sure that jQuery is loaded properly. Use the browser's developer tools to find out whether the jQuery file was found and correct the URL if it wasn't (e.g. add the http: or https: scheme at the beginning, adjust the path, etc.)
Listening to the load/DOMContentLoaded events is exactly what jQuery is doing with .ready() [docs]. All your jQuery code that affects DOM elements should be inside that event handler.
In fact, the jQuery tutorial explicitly states:
As almost everything we do when using jQuery reads or manipulates the document object model (DOM), we need to make sure that we start adding events etc. as soon as the DOM is ready.
To do this, we register a ready event for the document.
$(document).ready(function() {
// do stuff when DOM is ready
});
Alternatively you can also use the shorthand syntax:
$(function() {
// do stuff when DOM is ready
});
Both are equivalent.
Reasons why id based selectors don't work
The element/DOM with id specified doesn't exist yet.
The element exists, but it is not registered in DOM [in case of HTML nodes appended dynamically from Ajax responses].
More than one element with the same id is present which is causing a conflict.
Solutions
Try to access the element after its declaration or alternatively use stuff like $(document).ready();
For elements coming from Ajax responses, use delegated event handling with jQuery's .on() method, since .bind() only attaches to elements that already exist. Older versions of jQuery had .live() for the same.
Use tools [for example, webdeveloper plugin for browsers] to find duplicate ids and remove them.
If the element you are trying to access is inside an iframe and you try to access it outside the context of the iframe this will also cause it to fail.
If you want to get an element in an iframe you can find out how here.
As @FelixKling pointed out, the most likely scenario is that the nodes you are looking for do not exist (yet).
However, modern development practices can often manipulate document elements outside of the document tree either with DocumentFragments or simply detaching/reattaching current elements directly. Such techniques may be used as part of JavaScript templating or to avoid excessive repaint/reflow operations while the elements in question are being heavily altered.
Similarly, the new "Shadow DOM" functionality being rolled out across modern browsers allows elements to be part of the document, but not query-able by document.getElementById and all of its sibling methods (querySelector, etc.). This is done to encapsulate functionality and specifically hide it.
Again, though, it is most likely that the element you are looking for simply is not (yet) in the document, and you should do as Felix suggests. However, you should also be aware that that is increasingly not the only reason that an element might be unfindable (either temporarily or permanently).
If script execution order is not the issue, another possible cause of the problem is that the element is not being selected properly:
getElementById requires the passed string to be the ID verbatim, and nothing else. If you prefix the passed string with a #, and the ID does not start with a #, nothing will be selected:
<div id="foo"></div>
// Error, selected element will be null:
document.getElementById('#foo')
// Fix:
document.getElementById('foo')
Similarly, for getElementsByClassName, don't prefix the passed string with a .:
<div class="bar"></div>
// Error, selected element will be undefined:
document.getElementsByClassName('.bar')[0]
// Fix:
document.getElementsByClassName('bar')[0]
With querySelector, querySelectorAll, and jQuery, to match an element with a particular class name, put a . directly before the class. Similarly, to match an element with a particular ID, put a # directly before the ID:
<div class="baz"></div>
// Error, selected element will be null:
document.querySelector('baz')
$('baz')
// Fix:
document.querySelector('.baz')
$('.baz')
The rules here are, in most cases, identical to those for CSS selectors, and can be seen in detail here.
To match an element which has two or more attributes (like two class names, or a class name and a data- attribute), put the selectors for each attribute next to each other in the selector string, without a space separating them (because a space indicates the descendant selector). For example, to select:
<div class="foo bar"></div>
use the query string .foo.bar. To select
<div class="foo" data-bar="someData"></div>
use the query string .foo[data-bar="someData"]. To select the <span> below:
<div class="parent">
<span data-username="bob"></span>
</div>
use div.parent > span[data-username="bob"].
Capitalization and spelling do matter for all of the above. If the capitalization is different, or the spelling is different, the element will not be selected:
<div class="result"></div>
// Error, selected element will be null:
document.querySelector('.results')
$('.Result')
// Fix:
document.querySelector('.result')
$('.result')
You also need to make sure the methods have the proper capitalization and spelling. Use one of:
$(selector)
document.querySelector
document.querySelectorAll
document.getElementsByClassName
document.getElementsByTagName
document.getElementById
Any other spelling or capitalization will not work. For example, document.getElementByClassName will throw an error.
Make sure you pass a string to these selector methods. If you pass something that isn't a string to querySelector, getElementById, etc, it almost certainly won't work.
If the HTML attributes on elements you want to select are surrounded by quotes, they must be plain straight quotes (either single or double); curly quotes like ‘ or ” will not work if you're trying to select by ID, class, or attribute.

Java / Android HTML custom tag parser

I'm trying to figure out a way to parse a html file with custom tags in the form:
[custom tag="id"]
Here's an example of a file I'm working with:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds <a href="http://youtu.be/F5nLu232KRo"> bro
What I would like (in an ideal world) is to get back a list of elements:
List foundElements = [text, custom tag, text, link, text]
where the elements in the above list contain:
Text:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds
Custom tag:
[custom tag="amaze"]
Link:
<a href="http://youtu.be/F5nLu232KRo">
Text:
appears.</p>We need maor embeds
What I've tried:
Jsoup
Jsoup is great, it works perfectly for HTML. The issue is I can't define custom tags with opening "[" and closing "]". Correct me if I'm wrong?
Jericho
Again like Jsoup, Jericho works great..except for defining custom tags. You're required to use "<".
Java Regex
This is the option I really don't want to go for. It's not reliable and there's a lot of string manipulation that is brittle, especially when you're matching against a lot of regexes.
Last but not least, I'm looking for a performance orientated solution as this is done on an Android client.
All suggestions welcome!
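One possible approach that avoids regexes, sketched here under assumptions: scan for the literal marker with indexOf and split the input into plain segments and custom-tag segments in a single pass. The marker format is taken from the question; the class, field, and method names are hypothetical. The plain segments could then be handed to an HTML parser such as Jsoup to pick out the links.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits input into plain segments and [custom tag="id"] markers.
final class CustomTagSplitter {
    private static final String OPEN = "[custom tag=\"";
    private static final String CLOSE = "\"]";

    /** One piece of the input: plain markup/text, or the id of a custom tag. */
    static final class Segment {
        final boolean isCustomTag;
        final String content; // the tag id for custom tags, raw text otherwise

        Segment(boolean isCustomTag, String content) {
            this.isCustomTag = isCustomTag;
            this.content = content;
        }
    }

    private CustomTagSplitter() {}

    static List<Segment> split(String input) {
        List<Segment> segments = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int start = input.indexOf(OPEN, pos);
            if (start < 0) {
                break; // no more markers
            }
            int end = input.indexOf(CLOSE, start + OPEN.length());
            if (end < 0) {
                break; // unterminated marker: the remainder is treated as plain text
            }
            if (start > pos) {
                segments.add(new Segment(false, input.substring(pos, start)));
            }
            segments.add(new Segment(true, input.substring(start + OPEN.length(), end)));
            pos = end + CLOSE.length();
        }
        if (pos < input.length()) {
            segments.add(new Segment(false, input.substring(pos)));
        }
        return segments;
    }
}
```

The scan visits each character a bounded number of times and allocates only the resulting substrings, which may matter on an Android client. It does not understand HTML, so a marker inside an attribute value or comment would still be split out.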

thymeleaf - combined th:each with th:href

I'm new to Thymeleaf (and webdev) and I'm trying to combine Thymeleaf iteration (th:each) with URL re-writing (th:href).
<a th:each="lid : ${lists}" th:text="${lid}" th:href="@{/list?l=${lid}}">
hello
</a>
This produces the following (where lid=45):
45
So, it did the substitution on the th:text, but not on the th:href.
I'm not trying to do any sort of URL re-writing, I'm just using the '@' syntax because I want Thymeleaf to substitute the 'lid' attribute.
I'm using the current version of Thymeleaf (2.1.2) with Google App Engine.
If you don't want to do any URL rewriting, you shouldn't use the @ syntax.
You can use the pipeline (|) syntax to do some literal substitutions:
th:href="|/list?l=${lid}|"
Source: Thymeleaf documentation
You can also try this way:
<a th:href="@{'/list?l=' + ${lid}}" th:text="${lid}">element</a>
I don't have enough reputation to add a comment on a previous post but the Thymeleaf Source documentation link from a previous post is broken. Documentation can now be found at the following link:
Thymeleaf Standard URL Syntax
Section 9 Using Expressions in URLs in this documentation explains how you can use expressions within other expressions when generating URLs with the @ syntax. The following is an example:
<a th:href="@{/order/details(id=${modelattribute})}"/>
Will produce a link similar to:
http://domain.org/context/order/details?id=1
if modelattribute had a value of 1 in the current context.

ESAPI implementation for spring form tags

How can we implement ESAPI output encoding in an application using Java and Spring MVC?
I read many posts and saw this:
<%@ page import="org.owasp.esapi.*" %>
<input type="hidden" name="hidden" value="<%out.print(ESAPI.encoder().encodeForHTML(content));%>"/>
But, in my application all the jsps use spring form tags like the following,
<td>Number:
<form:input path="someNo" size="20" maxlength="18" id="firstfield" onkeypress="return PressAButton('submithidden');"/></td>
How can I implement ESAPI for the above code? Is there any other way of implementing output encoding, like creating a filter or something? Any suggestions are greatly appreciated!
After researching spring tags a bit, it appears that the data-binding happens in framework code thus preventing you from applying any escaping in the jsp.
One, semi-quick win could be defaulting all output to escape HTML. Add this entry in web.xml:
<context-param>
<param-name>defaultHtmlEscape</param-name>
<param-value>true</param-value>
</context-param>
The only problem here is that output-escaping is a BIG pain... the rules for html escaping are different when your value is going to be passed as data to an HTML attribute or a Javascript function. And there could be some parts of your application where you DO NOT want to html escape, but you should be able to override those with the form tag attribute htmlEscape="false" when you need to.
What you need is to be able to hook the part of Spring tags where it is binding the HTML to the form, but you need to be able to do it so you can escape based on where it's being placed. Escaping rules for an HTML attribute differ from those for plain HTML, and differ again if the value is going to be passed as data to a JavaScript function. So Spring's solution only defends against one category of attack.
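To make the point about context concrete, here is a minimal sketch in plain Java of two context-specific encoders: one for an HTML attribute value and one for a JavaScript string literal. It follows the general approach of encoding every non-alphanumeric character, but it is an illustration only, not ESAPI's implementation; the class and method names are hypothetical.

```java
// Hypothetical illustration of context-dependent output encoding (not ESAPI code).
final class ContextEncoders {
    private ContextEncoders() {}

    private static boolean isAlphanumeric(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    /** Encodes a value for use inside a quoted HTML attribute. */
    static String forHtmlAttribute(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isAlphanumeric(c) || c >= 256) {
                out.append(c);
            } else {
                // Non-alphanumeric characters below 256 become &#xHH; entities.
                out.append(String.format("&#x%02X;", (int) c));
            }
        }
        return out.toString();
    }

    /** Encodes a value for use inside a quoted JavaScript string literal. */
    static String forJavaScriptString(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isAlphanumeric(c)) {
                out.append(c);
            } else if (c < 256) {
                out.append(String.format("\\x%02X", (int) c));
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }
}
```

The same input yields different output in each context: a single quote becomes &#x27; for an attribute but \x27 for a JavaScript string, which is why one global HTML-escaping switch cannot cover both.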
These are the only ways out I see, all of them will require work:
Use JSTL tags instead of Spring tags so you can write your variables with ${thisSyntax} and wrap them in esapi tags like this:
<c:out value="<esapi:encodeForHTML>${variable}</esapi:encodeForHTML>"/>
Follow a solution like what @A. Paul put forward, where you do your context escaping back on the controller side. I'm aware you feel that this isn't an option, but the next solution I'm putting forward is untested.
Implement your own tag library that subclasses org.springframework.web.servlet.tags.form.InputTag, specifically the method writeValue. While ESAPI prevents a lot, I would recommend looking at OWASP's new Encoder project to show you exactly how tricky output encoding is. Ideally your tag library will allow you to utilize either ESAPI's Encoder or this new API.
Just a thought; I'm not sure if this is what you are looking for.
Could you use the code below in Java to change the data in the bean itself, and then send it to the user interface?
if ( ESAPI.securityConfiguration().getLogEncodingRequired() ) {
data = ESAPI.encoder().encodeForHTML(message);
}
You can check the below url.
http://www.jtmelton.com/tag/esapi/

XSS vulnerability in java

How to fix the below XSS vulnerability issue?
How to secure my website from XSS vulnerability?
When JavaScript is added to the URL of the website, all the cookie values are displayed.
Below is a similar example of a URL that contains JavaScript:
https://www.example.com/>< script>alert(document.cookie)< / script >&UserTarget=https://www.example.com/homepageredirect.jsp
To overcome this, I added the filter below to the obj.conf file in webserver 7.0:
Input fn="insert-filter"
method="POST"
filter="sed-request"
sed="s/(<|%3c)/\\&lt;/gi"
sed="s/(>|%3e)/\\&gt;/gi"
Even after making these changes in obj.conf, the issue is still not fixed. Please suggest something.
When you print your HTML, just escape the special characters on the client side (or server side, depending on where the output is rendered). You can then pass any input through without needing awkward regexes or other kinds of filters.
Example:
Let's say I have a variable that can receive a <script>alert( document.cookie )</script>, when I print I would do something like <div> <%= escapeHTML( dangerousVariable ) %> </div>.
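The escapeHTML call above is not a built-in. A minimal sketch of such a helper in Java might look like the following; the class and method names are hypothetical, and it covers only the HTML body and quoted-attribute cases. A maintained encoder library is generally a safer choice than a hand-rolled one.

```java
// Hypothetical helper: escapes the five characters that are significant in HTML text.
final class HtmlEscaper {
    private HtmlEscaper() {}

    static String escapeHtml(String input) {
        if (input == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this in place, the script tag from the example is rendered as visible text instead of being executed by the browser.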
The XSS Prevention Rules at this URL describe rules you can apply according to your requirements.
