How to replace smilies in text such as :) with an image? - java

I would like to show smilies as images in my JSF/PrimeFaces web application. For this I would need to replace text like :) with an image. How can I achieve this?

JSF doesn't offer any facilities for this.
At its simplest, you could just use the available methods of the String class to perform manipulations on a String instance, such as replace().
text = text.replace(":)", "<img src=\"smile.png\" />");
(You might want to apply finer-grained matching, perhaps with a regex or a lexer, to prevent legitimate character sequences such as "... a semicolon ; (or a colon, :) ..." from being incorrectly replaced.)
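As a rough illustration of such finer-grained matching, here is a minimal lexer-style sketch. It treats ":)" as a smiley only when its ")" does not close an earlier "(". The class name SmileyReplacer is hypothetical, smile.png is taken from the example above, and the parenthesis-counting rule is only a heuristic assumption: it would, for instance, miss a genuine smiley typed inside parentheses.

```java
final class SmileyReplacer {

    private static final String SMILE_IMG = "<img src=\"smile.png\" />";

    // Replaces ":)" with an image tag unless the ")" closes an earlier "(".
    static String replaceSmilies(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int openParens = 0; // number of "(" seen so far that are still unclosed
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                openParens++;
                out.append(c);
            } else if (c == ':' && i + 1 < text.length() && text.charAt(i + 1) == ')') {
                if (openParens > 0) {
                    // The ")" closes an earlier "(", so keep ":)" as plain text.
                    openParens--;
                    out.append(":)");
                } else {
                    out.append(SMILE_IMG);
                }
                i++; // skip the ")" that was just handled
            } else if (c == ')') {
                if (openParens > 0) {
                    openParens--;
                }
                out.append(c);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, SmileyReplacer.replaceSmilies("Hi :)") yields the text with the image tag, while the "(or a colon, :)" sentence above is left untouched.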
Then, to present the manipulated String instance with HTML images in it in JSF, you'd need to use <h:outputText> with the escape attribute set to false to disable the built-in HTML escaping, which is there to prevent XSS attack holes.
<h:outputText value="#{bean.text}" escape="false" />
This way the HTML <img> element will be interpreted literally by the web browser instead of being displayed as plain text to the end user due to the escaping.
But, as you might already have guessed, this of course opens up possible XSS attack holes if you don't sanitize the end user's input beforehand. The end user would be able to do bad things with input such as adding a <script>stealCookies()</script> to the text, which would be interpreted literally by the web browser as well. To sanitize the end user's input beforehand, you can use, among others, Jsoup, which offers a clean() method for this:
text = Jsoup.clean(text, Whitelist.basic());
(Do this before replacing the smilies, or it might strip off those <img> tags as well! Note that newer Jsoup releases name this class Safelist instead of Whitelist.)

Related

Is URLEncoder.encode(string, "UTF-8") a poor validation?

In a portion of my J2EE/java code, I do a URLEncoding on the output of getRequestURI() to sanitize it to prevent XSS attacks, but Fortify SCA considers that poor validation.
Why?
The key point is that you need to convert HTML special characters to HTML entities. This is also called "HTML escaping" or "XML escaping". Basically, the characters <, >, ", & and ' need to be replaced by &lt;, &gt;, &quot;, &amp; and &#39;.
URL encoding does not do that. URL encoding converts URL special characters to percent-encoded values. This is not HTML escaping.
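To make the difference concrete, here is a small sketch of what java.net.URLEncoder produces for markup-like input. The class name UrlEncodingDemo is hypothetical; URLEncoder.encode(String, String) is the standard JDK method.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlEncodingDemo {

    // Percent-encodes a value for use in a URL query string.
    // The result is meant for URLs, not for printing into an HTML page.
    static String urlEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

For the input <b>a & b</b> this returns %3Cb%3Ea+%26+b%3C%2Fb%3E, whereas the HTML-escaped form of the same input would be &lt;b&gt;a &amp; b&lt;/b&gt;. Printed into a page, the URL-encoded value shows up as percent sequences rather than the original text.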
In the case of web applications, HTML escaping is normally done in the view side, exactly where you're redisplaying user-controlled input. In the case of a Java EE web application, that depends on the view technology you're using.
If the webapp is using modern Facelets view technology, then you don't need to escape it yourself. Facelets will already implicitly do that.
If the webapp is using legacy JSP view technology, then you need to ensure that you're using JSTL <c:out> tag or fn:escapeXml() function to redisplay user-controlled input.
<c:out value="${bean.foo}" />
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />
If the webapp is very old or badly designed and uses servlets or scriptlets to print HTML, then you have a bigger problem. There are no built-in tags or functions, let alone Java methods, which can escape HTML entities. You should either write an escape() method yourself or use Apache Commons Lang StringEscapeUtils#escapeHtml() (escapeHtml4() in Commons Lang 3) for this. Then you need to ensure that you're using it everywhere you're printing user-controlled input.
out.print("<p>" + StringEscapeUtils.escapeHtml(request.getParameter("foo")) + "</p>");
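If you go the "write it yourself" route, a hand-written escape() method could look roughly like the following sketch. The class name HtmlEscaper is hypothetical, and returning an empty string for null is an assumption made here for convenience when printing request parameters.

```java
final class HtmlEscaper {

    // Replaces the five HTML special characters with entities;
    // every other character is copied unchanged.
    static String escape(String input) {
        if (input == null) {
            return ""; // assumption: a missing parameter is printed as nothing
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

It would be used the same way as the Commons Lang call above, for example HtmlEscaper.escape(request.getParameter("foo")).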
Much better would be to redesign that legacy webapp to use JSP with JSTL.
URL encoding does not affect certain significant characters including single quote (') and parentheses, so URL encoding will pass through unchanged certain payloads.
For example,
onload'alert(String.fromCharCode(120))'
will be treated by some browsers as a valid attribute that can result in code execution when injected inside a tag.
The best way to avoid XSS is to treat all untrusted inputs as plain text, and then when composing your output, properly encode all plain text to the appropriate type on output.
If you want to filter inputs as an additional layer of security, make sure your filter treats all quotes (including the back-tick) and parentheses as possible code, and disallow them unless they make sense for that input.

How to abbreviate HTML with Java?

A user enters text as HTML in a form, for example:
<p>this is my <strong>blog</strong> post,
very <i>long</i> and written in <b>HTML</b></p>
I want to be able to output only a part of the string (for example the first 20 characters) without breaking the HTML structure of the user's input. In this case:
<p>this is my <strong>blog</strong> post, very <i>l</i>...</p>
which renders as
this is my <strong>blog</strong> post, very <i>lo</i>...
Is there a Java library able to do this, or a simple method to use?
MyLibrary.abbreviateHTML(string,20) ?
Since it's not very easy to do this correctly I usually strip all tags and truncate. This gives great control on the text size and appearance which usually needs to be placed in places where you do need control.
Note that you may find my proposal very conservative, and it is actually not a proper answer to your question. But most of the time the alternatives are:
strip all tags and truncate
provide an alternate content manageable rich text which will serve as the truncated text. This of course only works in the case of CMSes etc
The reason that truncating HTML is hard is that you don't know how truncating would affect the structure of the HTML. How would you truncate in the middle of a <ul> or, even worse, in the middle of a complex <table>?
So the problem here is that HTML can contain not only content and styling (bold, italics) but also structure (lists, tables, divs etc). So a good and safe implementation would be to strip everything out apart from inline "styling" tags (bold, italics etc) and truncate while keeping track of unclosed tags.
I don't know any library but it should not be so complicated (for 80%).
You only need a simple "parser" that understands 4 types of tokens:
opening tags - everything that starts with < but not </ and ends with > but not />
closing tags - everything that starts with </ and ends with >
self-closing tags (like <br/>) - everything that starts with < but not </ and ends with />
normal character - everything that is none of the other types
Then you must walk through your input string and count the "normal characters". While you walk along the string and count, you copy every token to the output as long as the number of counted normal characters is less than or equal to the amount you want to have.
You also need to build a stack of currently open tags while you walk through the input. Every time you walk through an "opening tag" you push its name onto the stack; every time you find a closing tag, you remove the topmost tag name from the stack (hopefully the input is correct XHTML).
When you reach the required amount of normal characters, you only need to write closing HTML tags for the tag names remaining on the stack.
But be careful, this works only if the input is well-formed XML.
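The token-and-stack approach described above can be sketched as follows. This is only an illustration under the same assumption of well-formed XHTML input: the class name HtmlAbbreviator is hypothetical, attribute values must not contain ">", entities such as &amp; are counted as several characters, and the "..." marker is appended right at the cut point, before the remaining tags are closed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlAbbreviator {

    // One token is either a tag (group 1: "/" if closing, group 2: tag name,
    // group 3: "/" if self-closing) or a single normal character.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>|[\\s\\S]");

    static String abbreviate(String html, int maxChars) {
        StringBuilder out = new StringBuilder();
        Deque<String> openTags = new ArrayDeque<>(); // names of currently open tags
        int count = 0;                               // normal characters copied so far
        boolean truncated = false;
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            boolean isTag = m.group(2) != null;
            boolean isClosing = isTag && !m.group(1).isEmpty();
            if (count >= maxChars && !isClosing) {
                // Limit reached: only closing tags may still be copied.
                truncated = true;
                break;
            }
            out.append(m.group());
            if (!isTag) {
                count++;
            } else if (isClosing) {
                if (!openTags.isEmpty()) {
                    openTags.pop();
                }
            } else if (m.group(3).isEmpty()) {
                openTags.push(m.group(2)); // opening tag, not self-closing
            }
        }
        if (truncated) {
            out.append("...");
        }
        // Close whatever is still open, innermost first.
        while (!openTags.isEmpty()) {
            out.append("</").append(openTags.pop()).append('>');
        }
        return out.toString();
    }
}
```

For example, HtmlAbbreviator.abbreviate("<p>this is my <strong>blog</strong> post</p>", 12) gives <p>this is my <strong>b...</strong></p>.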
I don't know what you want to do with this piece of code, but you should pay attention to HTML/JavaScript injection attacks.
If you really want to abbreviate HTML then just do it (cut the text at desired length), pass the abbreviated result through http://jtidy.sourceforge.net/ and hope for the best.
It seems that there are a lot of libs and tools for this common task:
truncateNicely from Jakarta Taglibs String (Jakarta Taglibs has been retired)
org.displaytag.util.HtmlTagUtil#abbreviateHtmlString from Display tag library 1.2 (already mentioned by Marnix van Bochove in his comment.)

Regex remove only certain tags from html

I want to remove only a set of html tags (b,i,p, end of tags) from a given html.
Pattern p = Pattern.compile("<[^bip/](.*?)>");
However, this also removes the img tag because of the .*. What should I change to prevent removal of img?
EDIT: I'm doing this in an Android app. I know regex is the worst way, but the built-in spannable classes are not working as expected and I can't import a library just for HTML parsing. My purpose is just to detect whether other tags exist or not. Also, the HTML is pretty small (up to 10 lines max), so performance shouldn't be a problem.
This has been said a million times on stackoverflow.
Don't process HTML, XHTML or XML with regexes. They aren't regular languages, they are context free languages and can't be correctly processed with regular expressions.
Trying to work on XML (or HTML) with regexes is a bad idea: you definitely want to use a parser.
In your case, you want to match:
<\s*/?\s*[bip]\s*>
This matches the single-letter tags (and the corresponding closing tags) and takes into account that some whitespace inside a tag is valid; you also need to run your regex as multiline.
It might work, but it's dangerous and you might have unexpected side effects.
EDIT:
I understood you just want to remove the tags, not the actual content inside the tag
EDIT2:
current pattern matches the 3 tags, not their content. In a substitution regexp (replacing by nothing), it would remove these formatting tags, not the embedded content.
If you want to remove only <b>,<p>,<i> and </b>,</p>,</i> tags then you can use following regex :
(</?b>|</?p>|</?i>)
I am not sure I understand your regex, seems very different from what you say you want. Use something like below:
<([bip])>.*?</\1>
And if possible, don't use the above or any other regexes. There are various other better ways to do this. Search here or on google.
Most of the sample regexes only check that a tag starts with a certain name. For instance, you may want to remove <b>, but not <br>. With most of the sample regexes, adding b to the tag list automatically removes <br> as well. I use /<\/?(font|div|b)(\/|>|\s.*?>)/g. This regex avoids the starts-with issue: it finds only font, div and b, and does not match br.
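For the Java/Android case in the question, the same idea might be sketched like this. The class and method names (TagStripper, stripBip, hasOtherTags) are hypothetical, and the usual caveat from the answers above applies: this is a regex heuristic for small, simple input, not a real HTML parser.

```java
import java.util.regex.Pattern;

final class TagStripper {

    // Matches <b>, <i>, <p> and their closing tags, with optional attributes,
    // but not longer tag names such as <br>, <img> or <pre>.
    private static final Pattern BIP_TAG =
            Pattern.compile("</?(?:b|i|p)(?:\\s[^>]*)?/?>", Pattern.CASE_INSENSITIVE);

    // Matches any tag-like sequence: "<name ...>" or "</name>".
    private static final Pattern ANY_TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    // Removes only the b, i and p tags; their content is kept.
    static String stripBip(String html) {
        return BIP_TAG.matcher(html).replaceAll("");
    }

    // Returns true if tags other than b, i and p are present in the input.
    static boolean hasOtherTags(String html) {
        return ANY_TAG.matcher(stripBip(html)).find();
    }
}
```

Here stripBip leaves <br> and <img> in place because the tag name must be followed by whitespace, "/" or ">" to match, and hasOtherTags covers the "detect whether other tags exist" purpose from the question.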

How to implement a possibility for user to post some html-formatted data in a safe way?

I have a textarea and I want to support some simplest formatting for posted data (at least, whitespaces and line breaks).
How can I achieve this? If I don't escape the response and keep some HTML tags, then it'll be a great security hole. But I don't see any other solution which will allow text formatting in the browser.
So, I probably should filter the user's input. But how can I do this? Are there any ready-to-use solutions? I'm using JSF, so is there any smart component which filters out everything except certain HTML tags?
Use an HTML parser which supports HTML filtering against a whitelist, like Jsoup. Here's a relevant extract from its site.
Sanitize untrusted HTML
Problem
You want to allow untrusted users to supply HTML for output on your website (e.g. as comment submission). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.
Solution
Use the jsoup HTML Cleaner with a configuration specified by a Whitelist.
String unsafe =
"<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
And then to display it with whitespace preserved, apply CSS white-space: pre-wrap; on the HTML element where you're displaying it.
No all-in-one JSF component comes to mind.
Is there some reason why you need to accept HTML instead of some other markup language, such as markdown (which is what StackOverflow uses)?
http://daringfireball.net/projects/markdown/
Not sure what kind of tags you'd want to accept that wouldn't be covered by md or a similar formatting language...

HttpServletRequest - Quick way to encode url and hidden field parameters

In my Java app I'm preventing XSS attacks. I want to encode URL and hidden field parameters in the HttpServletRequest objects I have a handle on.
How would I go about doing this?
Don't do that. You're making it unnecessarily more complicated. Just escape it during display only. See my answer in your other topic: Java 5 HTML escaping To Prevent XSS
To properly display user-entered data on an HTML page, you simply need to ensure that any special HTML characters are properly encoded as entities, via String#replace or similar. The good news is that there is very little you need to encode (for this purpose):
str = str.replace("&", "&amp;").replace("<", "&lt;");
You can also replace > if you like, but there's no need to.
This isn't only because of XSS, but also just so that characters show up properly. You may also want to handle ensuring that characters outside the common latin set are turned into appropriate entities, to protect against charset issues (UTF-8 vs. Windows-1252, etc.).
You can use StringEscapeUtils from the library Apache Jakarta Commons Lang
http://www.jdocs.com/lang/2.1/org/apache/commons/lang/StringEscapeUtils.html
