The application I support is going through security review and there are some questions regarding escaping special characters. I have not been supporting this application for a long time and I'm not very knowledgeable about escaping special characters. The question I was asked is "Why are you JavaScript encoding the value and then HTML encoding it? Is that value written out in a context that requires the value to be encoded for both contexts?"
What is the difference between JavaScript encoding and HTML encoding? Why would I need both in my code?
Any information regarding this will be greatly appreciated!
public class HTMLEncodedResultSet extends ResultSetWrapper {
public HTMLEncodedResultSet(ResultSet resultSet) {
super(resultSet);
}
public String getString(int columnIndex) throws SQLException {
return StringEscapeUtils.escapeHtml(StringEscapeUtils.escapeJavaScript(super.getString(columnIndex)));
}
public String getString(String columnName) throws SQLException {
return StringEscapeUtils.escapeHtml(StringEscapeUtils.escapeJavaScript(super.getString(columnName)));
}
}
From the official documentation:
escapeHtml
Escapes the characters in a String using HTML entities.
For example:
"bread" & "butter"
becomes: &quot;bread&quot; &amp; &quot;butter&quot;.
escapeJavaScript
Escapes the characters in a String using JavaScript String rules.
Escapes any values it finds into their JavaScript String form. Deals
correctly with quotes and control-chars (tab, backslash, cr, ff, etc.)
So a tab becomes the characters '\' and 't'.
Example:
input string: He didn't say, "Stop!"
output string: He didn\'t say, \"Stop!\"
So, given that JS and HTML reserved characters are not the same, in your case if the input has HTML and JS code it may be necessary to invoke both methods.
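To make the difference concrete, here is a minimal sketch with two simplified stand-ins for the library methods. The class EncodingContexts and the helpers htmlEncode and jsEncode are hypothetical names written for this illustration; they cover only a handful of characters and are not the Commons Lang implementation.

```java
// Simplified stand-ins for escapeHtml and escapeJavaScript, written only to
// show that the two encodings target different characters. These helpers are
// hypothetical; real code should use a maintained encoder library.
class EncodingContexts {

    // Escapes the characters that are significant in HTML text and attributes.
    static String htmlEncode(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Escapes the characters that would end or corrupt a JavaScript string literal.
    static String jsEncode(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '\'': out.append("\\'"); break;
                case '"': out.append("\\\""); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String value = "He didn't say, \"Stop!\" <b>";
        // HTML context only: He didn't say, &quot;Stop!&quot; &lt;b&gt;
        System.out.println(htmlEncode(value));
        // JavaScript string context only: He didn\'t say, \"Stop!\" <b>
        System.out.println(jsEncode(value));
        // Both, in the order used by HTMLEncodedResultSet:
        // He didn\'t say, \&quot;Stop!\&quot; &lt;b&gt;
        System.out.println(htmlEncode(jsEncode(value)));
    }
}
```

The combined form is only appropriate where a JavaScript string literal sits inside HTML markup, for example in an onclick attribute. If the value is written out as plain HTML text, the backslashes added by the JavaScript step are displayed literally, which is probably what the reviewer is asking about.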
It looks like your application has JavaScript snippets stored in the database. These snippets might create or contain HTML parts (e.g. for generating dynamic HTML based on interaction). When loading these snippets from the DB as a string in Java, both JavaScript and HTML encoding are required.
Here is an example of a value that could be stored in the DB.
var obj = $('#fire');
var fps = 200;
var letters = obj.html().split('');
obj.empty();
$.each(letters,function(el){
obj.append($('<span>'+this+'</span>'));
});
var animateLetters = obj.find('span');
setInterval(function(){
animateLetters.each(function(){
$(this).css('fontSize', 80+(Math.floor(Math.random()*50)));
});
},fps);
Referring to the documentation:
escapeHTML: Escapes the characters in a String using HTML entities.
For example:
"bread" & "butter"
becomes: &quot;bread&quot; &amp; &quot;butter&quot;.
and
escapeJavaScript: Escapes any values it finds into their JavaScript
String form. Deals correctly with quotes and control-chars (tab,
backslash, cr, ff, etc.)
So a tab becomes the characters '\' and 't'.
The only difference between Java strings and JavaScript strings is
that in JavaScript, a single quote must be escaped.
Example:
input string: He didn't say, "Stop!"
output string: He didn\'t say, \"Stop!\"
Related
I have an error in printing special characters in Java.
The system reads a product name from a mysql database, and checking the database from the command line, the data displays the Registered Symbol ® correctly.
A java program then runs a database read to get information of orders to print out as a PDF, but when the print is produced the ® symbol becomes 'fi'.
Is there a way of retaining the MySQL character encoding when handling the data in Java?
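Before printing to PDF, you can replace the special characters with the corresponding Unicode characters, as below.
public static String specialCharactersConversion( String charString ) {
// isNotEmpty is assumed to be a null/empty check such as StringUtils.isNotEmpty
if( isNotEmpty( charString ) ){
// replace the literal text "(R)" with the registered sign U+00AE
charString = charString.replaceAll( "\\(R\\)", "\u00AE" );
}
return charString;
}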
There is an open-source Java library, MgntUtils, that has a utility for converting Strings to Unicode sequences and vice versa:
String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
So, before converting your text to PDF, you can convert the special characters, or the entire text, to Unicode sequences. The answer is copied with modifications from this question: Convert International String to \u Codes in java
The library can be found at Maven Central or at GitHub. It comes as a Maven artifact, with sources and Javadoc.
Here is javadoc for the class StringUnicodeEncoderDecoder
I am trying to parse an HTML document. In the document, there is the
span-data-personalization = '{"one":["two"]}' which converts to
span-data-personalization = "{&quot;one&quot;:[&quot;two&quot;]}" while parsing. The double quotes convert to &quot; and the single quotes to double quotes. I have also used doc.outputSettings().prettyPrint(false); with no success. I also made the changes suggested in "jsoup - stop jsoup from making quotes into &quot;", but it still did not work. I have also tried updating the jsoup version. Nothing seems to work. Does anybody have any suggestions?
Thank you.
The JSoup Parser class has a built in unescapeEntities method. From the JSoup documentation:
public static String unescapeEntities(String string,
boolean inAttribute)
Utility method to unescape HTML entities from a string
Parameters:
string - HTML escaped string
inAttribute - if the string is to be escaped in strict mode (as attributes are)
Returns:
an unescaped string
I am using encodeURIComponent in javascript(assuming this does UTF-8 encoding) to encode a variable which could contain characters like =, +, etc. This is sent as POST to my servlet where I decode it.
This works well with English but when used with Japanese string - "バスケット", this converts to some special character sequence like this - "ãÂÂã¹ã±ãÂÂãÂÂ"
I am using following java 1.6 code to decode it but it doesn't work -
String ID = java.net.URLDecoder.decode(assignedID,"UTF-8");
where assignedID contains special character sequence. The above code returns me - "ãÂÂã¹ã±ãÂÂãÂÂ"
Is the string being sent as part of the URL or as part of the POST body? If it is part of the POST body, which is most likely, try adding this to the JSP:
<% request.setCharacterEncoding("UTF-8"); %>
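A likely cause of the garbling is that the container decodes the UTF-8 bytes of the request body as ISO-8859-1, the servlet default when no request encoding is set, before URLDecoder is ever called. The sketch below reproduces that effect in isolation; the class FormEncodingDemo and its method names are hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class FormEncodingDemo {

    // Simulates a container that reads the UTF-8 bytes of a value as ISO-8859-1.
    static String misreadAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the misreading: recover the raw bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decodes a percent-encoded value as UTF-8, as the servlet code in the question does.
    static String decodeUtf8(String percentEncoded) {
        try {
            return URLDecoder.decode(percentEncoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // The Japanese word from the question, written with Unicode escapes.
        String original = "\u30D0\u30B9\u30B1\u30C3\u30C8";

        String garbled = misreadAsLatin1(original);
        System.out.println(garbled.length());                  // 15: one char per UTF-8 byte
        System.out.println(repair(garbled).equals(original));  // true

        // URLDecoder works when it receives the still percent-encoded text.
        String sent = "%E3%83%90%E3%82%B9%E3%82%B1%E3%83%83%E3%83%88";
        System.out.println(decodeUtf8(sent).equals(original)); // true
    }
}
```

If the value has already been misread by the time the servlet sees it, calling URLDecoder.decode on it cannot help, which is why the request encoding has to be set before the first parameter is read.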
Pages with spaces in the URL don't get correctly translated:
i.e.
http://www.streetinsider.com/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html
or
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
Both give a 404. Please note that "Press Releases" is encoded as "Press%20Releases" in the second one.
However, the following two versions work fine, where "Press Releases" is encoded as "Press+Releases".
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
The article parses fine with plus signs or HEX spaces %20.
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
Both + and %20 represent spaces, so why this behavior?
Also, what could I use in Java to get the correctly encoded URL?
Both + and %20 represent spaces
Only in query strings. Elsewhere in a URL a plus is a plus, not a space. In this case the web server gives you the same content for the two different URLs
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
and
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
but the two URLs are distinct, they're not alternative representations of the same URL.
Officially, + may only be used to represent a space in the query string (after ?).
This is what URLEncoder is for:
"?x=" + URLEncoder.encode("Hello World", "UTF-8");
"?x=" + URLEncoder.encode("ŝi estas ĉarma", "UTF-8");
?x=Hello+World
?x=%C5%9Di+estas+%C4%89arma
The more general class URI follows the specification and replaces spaces with %20. Note that the last argument of the four-argument constructor is a fragment, so a query has to be passed through the five-argument form (scheme, authority, path, query, fragment):
URI uri = new URI("http", "www.streetinsider.com",
"/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html",
"x=ŝi estas ĉarma", null);
String u = uri.toString();
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html?x=ŝi%20estas%20ĉarma
One sometimes encounters URI as a generalisation for File and other resources, and must then be careful not to introduce %20 into file names.
So streetinsider probably remaps + (and, it seems, sometimes even %20) in part of the path so that both forms reach the same code.
Your statement
Both + and %20 represent spaces.
is not exactly true in all cases.
Space characters may only be encoded as "+" in one context: application/x-www-form-urlencoded key-value pairs.
The RFC-1866 (HTML 2.0 specification), paragraph 8.2.1. subparagraph 1. says: "The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped").
Here is an example of such a string in URL where RFC-1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after "?", spaces can be replaced by pluses (in other cases, spaces should be encoded to %20). This way of encoding form data is also given in later HTML specifications, for example, look for relevant paragraphs about application/x-www-form-urlencoded in HTML 4.01 Specification, and so on.
The URL that you have provided is not form data containing key/value pairs; it is just a path to a 9778767.html file:
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
So, pluses cannot be used to represent spaces here. The correct URL in this case should have been the following:
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
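The distinction between form encoding and path encoding can be seen with the standard library alone. This is a minimal sketch; the class SpaceEncodingDemo and its helper names are hypothetical, and the short path is a cut-down version of the one in the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class SpaceEncodingDemo {

    // Form-style encoding (application/x-www-form-urlencoded): spaces become '+'.
    static String encodeFormValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Form-style decoding: '+' is turned back into a space.
    static String decodeFormValue(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Path encoding via java.net.URI: spaces become %20 and '+' is left as a plus.
    static String encodePath(String path) {
        try {
            // Arguments are scheme, host, path, fragment; only the path is set here.
            return new URI(null, null, path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("not a valid path: " + path, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeFormValue("Press Releases"));          // Press+Releases
        System.out.println(decodeFormValue("Press+Releases"));          // Press Releases
        System.out.println(encodePath("/Press Releases/9778767.html")); // /Press%20Releases/9778767.html
        System.out.println(encodePath("/Press+Releases"));              // /Press+Releases
    }
}
```

URLEncoder is therefore suited to query parameter names and values, while a path is better built through URI (or a framework's URI builder) so that spaces come out as %20.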
I have an XML file which contains many special symbols like ® (HTML number &#174;) etc.
and HTML names like &atilde; (HTML number &#227;) etc.
I am trying to replace these HTML symbols and HTML names with corresponding HTML number using Java. For this, I first converted XML file to string and then used replaceAll method as:
File fn = new File("myxmlfile.xml");
String content = FileUtils.readFileToString(fn);
content = content.replaceAll("®", "&\#174");
FileUtils.writeStringToFile(fn, content);
But this is not working.
Can anyone please tell how to do it.
Thanks !!!
The signature for the replaceAll method is:
public String replaceAll(String regex, String replacement)
You have to be careful that your first parameter is a valid regular expression. The Java Pattern class describes the constructs used in a Java regular expression.
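Based on what I see in the Pattern class description, "®" is a valid pattern that simply matches itself, so the regular expression is not the problem in:
content = content.replaceAll("®", "&\#174");
Since no regular expression is needed here, you could try the literal-matching replace method instead:
content = content.replace("®", "&#174;");
and see if that works better.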
I don't think that \# is a valid escape sequence.
BTW, what's wrong with "&reg;"?
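For the general case, one option is to walk the string and emit a decimal numeric reference for every non-ASCII code point, which avoids regular expressions altogether. The sketch below is an illustration under that assumption; NumericEntityDemo and toNumericReferences are hypothetical names, and the method does not translate named entities such as &atilde; that are already present in the text.

```java
class NumericEntityDemo {

    // Replaces every non-ASCII code point with a decimal numeric character reference.
    static String toNumericReferences(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars so surrogate pairs are handled as a unit.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00AE is the registered sign, \u00E3 is the letter a with tilde.
        System.out.println(toNumericReferences("Acme\u00AE S\u00E3o")); // Acme&#174; S&#227;o

        // For a single known symbol, String.replace treats both arguments literally.
        System.out.println("Acme\u00AE".replace("\u00AE", "&#174;"));   // Acme&#174;
    }
}
```

When reading and writing the file, it is also worth passing an explicit encoding to the file utility methods so that the symbols are not corrupted before the replacement runs.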
If you want HTML numbers, try escaping for XML first.
Use StringEscapeUtils from Apache Commons Lang.
Java may have trouble dealing with the raw characters, so I prefer to escape for Java first, and after that for XML or HTML.
String escapedStr = StringEscapeUtils.escapeJava(yourString);
// then escape the result for XML ...
escapedStr = StringEscapeUtils.escapeXml(escapedStr);
// ... or, alternatively, for HTML
// escapedStr = StringEscapeUtils.escapeHtml(escapedStr);