I understand what ESAPI is used for, but I see calls like the following repeated in a lot of ESAPI examples. Can someone please explain what exactly this does?
ESAPI.encoder().canonicalize(inputUrl, false, false);
See the docs:
Canonicalization is simply the operation of reducing a possibly
encoded string down to its simplest form. This is important, because
attackers frequently use encoding to change their input in a way that
will bypass validation filters, but still be interpreted properly by
the target of the attack. Note that data encoded more than once is not
something that a normal user would generate and should be regarded as
an attack.
The two additional parameters, which are set to false in your example, indicate whether or not to restrict multiple encoding and mixed encoding, respectively (see the docs for their exact meaning).
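For illustration, here is a rough sketch of how those two flags change the behaviour (assuming ESAPI 2.x; the double-encoded sample input is made up, and whether the strict call throws or merely logs an intrusion depends on your ESAPI configuration):

import org.owasp.esapi.ESAPI;
import org.owasp.esapi.Encoder;
import org.owasp.esapi.errors.IntrusionException;

public class CanonicalizeDemo {
    public static void main(String[] args) {
        Encoder encoder = ESAPI.encoder();

        // "%253C" is "%3C" percent-encoded a second time, i.e. a double-encoded "<"
        String doubleEncoded = "%253Cscript%253E";

        // Lenient call: multiple/mixed encoding is tolerated and the input is
        // simply reduced to its simplest form ("<script>" here).
        String lenient = encoder.canonicalize(doubleEncoded, false, false);
        System.out.println("lenient: " + lenient);

        // Strict call: multiple or mixed encoding is treated as an attack.
        try {
            String strict = encoder.canonicalize(doubleEncoded, true, true);
            System.out.println("strict: " + strict);
        } catch (IntrusionException e) {
            System.out.println("strict call rejected the input: " + e.getMessage());
        }
    }
}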
Using owasp.esapi to filter incoming request parameters and headers, I'm stumbling on an issue where apparently the Referer header contains a value that is considered as using "multiple encoding".
An example:
http://123.abc.xx/xyz/input.xhtml?server=http%3A%2F%2F123.abc.xx%3A7016%2Fxyz&o=1&language=en&t=a074faf3
To me, though, that URL seems to be correctly encoded, and decoding it results in a perfectly readable and correct URL.
So, can anyone explain the issue here, and how to handle this?
ESAPI reports the error when running this method on the header value:
value = ESAPI.encoder().canonicalize(value);
Output:
SEVERE: [SECURITY FAILURE] INTRUSION - Mixed encoding (2x) detected
As a matter of fact, yes. I fixed this bug in the upcoming release of ESAPI, but it will require an API change, perhaps one that might have a bug based on your data here.
In short, prior to my fix, ESAPI just ran a regex against the URI. The problem, and the slew of bug reports about it, is that URIs are not a regular language. They are a language of their own. So what would happen is that the URI in question would have parameters that contained HTML entities, only some random data variants would align with known HTML entities: &param=foo, for example, would be interpreted as containing the entity &para; (¶, the paragraph sign). There were also some issues in regard to ASCII vs. Unicode (non-BMP encodings).
At any rate there will be a new method to use in the release candidate for our next library, Encoder.getCanonicalizedURI();
This will be safe to regex against as it will be broken down and checked for mixed/multiple encoding. The method you’re currently using is now deprecated.
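Once that release is available, usage would presumably look something like this (a sketch only; the exact signature may differ in the final API):

import java.net.URI;
import org.owasp.esapi.ESAPI;

public class UriCanonicalizeDemo {
    public static void main(String[] args) throws Exception {
        // The query parameters stay percent-encoded in the raw URI ...
        URI dirty = new URI("http://123.abc.xx/xyz/input.xhtml"
                + "?server=http%3A%2F%2F123.abc.xx%3A7016%2Fxyz&o=1&language=en");

        // ... and the new method breaks the URI into its components and
        // canonicalizes each of them, checking for mixed/multiple encoding.
        String canonical = ESAPI.encoder().getCanonicalizedURI(dirty);

        // The result is then safe to run validation regexes against.
        System.out.println(canonical);
    }
}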
To protect against Cross-Site Scripting (XSS) attacks I have to sanitize/validate a Java object coming from the request body. Can I make use of the Encoder (from OWASP) to encode the entire Java object? It seems that the Encoder will only encode strings and can't accept objects. I have the same issue in many places where I need to handle this.
Is there any way to sanitize a whole object to avoid cross-site scripting issues?
As you noticed, sanitization of input to prevent XSS (Cross Site Scripting) is only relevant for strings. Encoding other types is either impossible or meaningless.
To understand it better, you need to actually understand the mechanism and attack vector of an XSS. I suggest starting here: OWASP XSS
To solve your problem, it would make sense to create a custom method that after getting the object from the request, sanitizes it by going over all its strings (don't forget strings in lists and other data structures) and encode them using the OWASP encoder.
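For example, a minimal reflection-based sketch (the class name XssSanitizer is made up; it uses the OWASP Java Encoder, only handles top-level String fields and String elements of List fields, and nested objects would need a recursive call):

import java.lang.reflect.Field;
import java.util.List;
import java.util.ListIterator;
import org.owasp.encoder.Encode;   // OWASP Java Encoder

public final class XssSanitizer {

    // Encodes every String field (and every String element of List fields)
    // of the given object for safe HTML output.
    @SuppressWarnings("unchecked")
    public static void sanitize(Object bean) throws IllegalAccessException {
        for (Field field : bean.getClass().getDeclaredFields()) {
            field.setAccessible(true);
            Object value = field.get(bean);

            if (value instanceof String) {
                field.set(bean, Encode.forHtml((String) value));
            } else if (value instanceof List) {
                ListIterator<Object> it = ((List<Object>) value).listIterator();
                while (it.hasNext()) {
                    Object element = it.next();
                    if (element instanceof String) {
                        it.set(Encode.forHtml((String) element));
                    }
                }
            }
        }
    }
}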
Good luck!
In the past I have seen different JSON-encoded values for the same string depending on the language used. Since the APIs were used in a closed environment (no 3rd parties allowed), we made a compromise and all our Java applications manually encode Unicode characters. LinkedIn's API is returning "corrupted" values, basically the same as our Java applications. I've already posted a question on their forum; the reason I am asking it here as well is quite simple: sharing is caring :) This question is therefore partially connected with LinkedIn, but mostly trying to find an answer to the general encoding problem described below.
As you can see, my last name contains the letter ž, which should be \u017e, but Java (or LinkedIn's API for that matter) returns \u009e in the JSON response and nothing in the XML response. PHP's json_decode() ignores it and my last name becomes Kurida.
After some investigation, I've found that ž apparently has two representations, U+009E and U+017E. What exactly is going on here? Is there a solution to this problem?
U+009E is a usually-invisible control character and not an acceptable alternative representation for ž.
The byte 0x9E represents the character ž in Windows code page 1252. That byte, if decoded using ISO-8859-1, would turn into U+009E.
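You can see the difference directly in Java by decoding that byte with both charsets (a small self-contained sketch):

import java.nio.charset.Charset;

public class CharsetDemo {
    public static void main(String[] args) {
        byte[] data = { (byte) 0x9E };

        String cp1252 = new String(data, Charset.forName("windows-1252"));
        String latin1 = new String(data, Charset.forName("ISO-8859-1"));

        // Prints 17e -> ž (LATIN SMALL LETTER Z WITH CARON)
        System.out.printf("windows-1252: %x%n", (int) cp1252.charAt(0));
        // Prints 9e  -> U+009E, an invisible control character
        System.out.printf("ISO-8859-1:   %x%n", (int) latin1.charAt(0));
    }
}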
(The confusion comes from the fact that if you write the character reference &#x9E; in an HTML page, the browser doesn't actually give you character U+009E, as you might expect, but converts it to U+017E. The same is true of all the character references 0080–009F: they get changed as if the numbers referred to cp1252 bytes instead of Unicode characters. This is utterly bizarre and wrong behaviour, but all the major browsers do it, so we're stuck with it now. Except in proper XHTML served as XML, since that has to follow the more sensible XML rules.)
Looking at the forum page, the JSON reading is clearly not wrong: your name is registered in their system as “David Kurid[U+009E]a”. How that data got into their system is what needs looking at.
I know that a UTF file can have a BOM for determining its encoding, but what about encodings that leave no such clue?
I am a new Java programmer. I have written code that detects UTF encodings using the UTF BOM, but I have a problem with other encodings: how do I guess them?
Can anybody help me? Thanks in advance.
This question is a duplicate of several previous ones. There are at least two libraries for Java that attempt to guess the encoding (although keep in mind that there is no way to guess right 100% of the time).
GuessEncoding
jchardet (a Java port of the algorithm used by Mozilla Firefox)
Of course, if you know the encoding will only be one of three or four options, you might be able to write a more accurate guessing algorithm.
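For instance, if you know the data must be in one of a few encodings, you can try each candidate with a strict decoder and keep the first one that decodes cleanly. A sketch using only JDK classes (the candidate list and sample text are made up):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class CharsetGuesser {

    // Returns the first candidate charset that decodes the bytes without errors,
    // or null if none of them does. Order matters: single-byte charsets such as
    // ISO-8859-1 accept any byte sequence, so list them last.
    public static Charset guess(byte[] data, String... candidates) {
        for (String name : candidates) {
            Charset charset = Charset.forName(name);
            try {
                charset.newDecoder()
                       .onMalformedInput(CodingErrorAction.REPORT)
                       .onUnmappableCharacter(CodingErrorAction.REPORT)
                       .decode(ByteBuffer.wrap(data));
                return charset; // decoded cleanly
            } catch (CharacterCodingException e) {
                // try the next candidate
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "žluťoučký kůň".getBytes("UTF-8");
        System.out.println(guess(bytes, "US-ASCII", "UTF-8", "windows-1252"));
    }
}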
Short answer is: you cannot.
Even in UTF-8, the BOM is entirely optional, and it is often recommended not to use it since many apps do not handle it properly and just display it as if it were a printable character. The original purpose of byte order marks was to indicate the endianness of UTF-16 files.
This said, most apps that handle Unicode implement some sort of guessing algorithm. Read the beginning of the file and look for certain signatures.
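For example, a minimal BOM check along those lines, using only JDK classes (UTF-32 BOMs are left out for brevity):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BomSniffer {

    // Reports the encoding implied by a BOM at the start of the file,
    // or null if no BOM is present (which is the common case for UTF-8).
    public static String detectBom(String path) throws IOException {
        byte[] head = new byte[4];
        try (InputStream in = new FileInputStream(path)) {
            int n = in.read(head);
            if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
                return "UTF-8";
            }
            if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
                return "UTF-16BE";
            }
            if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
                return "UTF-16LE";
            }
            return null; // no BOM: fall back to heuristics or ask the user
        }
    }
}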
If you don't know the encoding and don't have any indicators (like a BOM), it's not always possible to accurately "guess" the encoding. Some pointers exist that can give you hints.
For example, an ISO-8859-1 file will (usually) not contain any 0x00 bytes, whereas a UTF-16 file will have loads of them.
The most common solution is to let the user select the encoding if you cannot detect it.
In my Java app I'm preventing XSS attacks. I want to encode URL and hidden field parameters in the HttpServletRequest objects I have a handle on.
How would I go about doing this?
Don't do that. You're making it unnecessarily more complicated. Just escape it during display only. See my answer in your other topic: Java 5 HTML escaping To Prevent XSS
To properly display user-entered data on an HTML page, you simply need to ensure that any special HTML characters are properly encoded as entities, via String#replace or similar. The good news is that there is very little you need to encode (for this purpose):
str = str.replace("&", "&amp;").replace("<", "&lt;");
You can also replace > with &gt; if you like, but there's no need to.
This isn't only because of XSS, but also simply so that characters show up properly. You may also want to ensure that characters outside the common Latin set are turned into appropriate entities, to protect against charset issues (UTF-8 vs. Windows-1252, etc.).
You can use StringEscapeUtils from the Apache Commons Lang library (formerly Jakarta Commons Lang):
http://www.jdocs.com/lang/2.1/org/apache/commons/lang/StringEscapeUtils.html
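For example (assuming Commons Lang 2.x on the classpath; in Commons Lang 3 the equivalent method is StringEscapeUtils.escapeHtml4 in the org.apache.commons.lang3 package):

import org.apache.commons.lang.StringEscapeUtils;

public class EscapeDemo {
    public static void main(String[] args) {
        String unsafe = "<script>alert('xss')</script>";
        // Escapes &, <, >, " and non-ASCII characters. Note that it does not
        // escape single quotes, so always quote HTML attribute values with ".
        String safe = StringEscapeUtils.escapeHtml(unsafe);
        System.out.println(safe); // &lt;script&gt;alert('xss')&lt;/script&gt;
    }
}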