First I would like to say thank you for the help in advance.
I am currently writing a web crawler that parses HTML content, strips HTML tags, and then spell checks the text which is retrieved from the parsing.
Stripping HTML tags and spell checking have not caused any problems, using JSoup and the Google Spell Check API.
I am able to pull down content from a URL, read it into a byte[], and ultimately convert it to a String so that it can be stripped and spell checked. However, I am running into a problem with character encoding.
For example when parsing http://www.testwareinc.com/...
Original Text: We’ve expanded our Mobile Web and Mobile App testing services.
... the page is using ISO-8859-1 according to meta tag...
ISO-8859-1 Parse: Weve expanded our Mobile Web and Mobile App testing services.
... then trying using UTF-8...
UTF-8 Parse: We�ve expanded our Mobile Web and Mobile App testing services.
Question
Is it possible that HTML of a webpage can include a mix of encodings? And how can that be detected?
It looks like the apostrophe is coded as a 0x92 byte, which according to Wikipedia is an unassigned/private code point.
From there, it looks like the browser falls back by treating it as the single-byte Unicode code point U+0092 (Private Use Two), which appears to be rendered as an apostrophe. On second thought, since it is one byte, it is more probably cp1252, where 0x92 is the right single quotation mark (U+2019): browsers must have a fallback strategy keyed on the advertised code page, such as ISO-8859-1 -> CP1252.
So there is no mix of encodings here but, as others said, a broken document, combined with a fallback heuristic that will sometimes help and sometimes not.
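The three readings of that byte can be compared directly. The following is a minimal sketch using only the JDK; the class and method names are made up for illustration, and it assumes the JVM ships the windows-1252 charset (standard JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ApostropheByteDemo {
    /** Decodes a single byte with the given charset. */
    static String decode(byte b, Charset cs) {
        return new String(new byte[] { b }, cs);
    }

    public static void main(String[] args) {
        byte b = (byte) 0x92;
        Charset cp1252 = Charset.forName("windows-1252");

        // ISO-8859-1 maps 0x92 to the invisible C1 control character U+0092.
        System.out.printf("ISO-8859-1:   U+%04X%n",
                (int) decode(b, StandardCharsets.ISO_8859_1).charAt(0));
        // windows-1252 maps 0x92 to U+2019, the right single quotation mark.
        System.out.printf("windows-1252: U+%04X%n",
                (int) decode(b, cp1252).charAt(0));
        // A lone 0x92 is malformed UTF-8, so it becomes U+FFFD.
        System.out.printf("UTF-8:        U+%04X%n",
                (int) decode(b, StandardCharsets.UTF_8).charAt(0));
    }
}
```

This matches the three results in the question: a seemingly missing character, a proper apostrophe in the browser, and a replacement character.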
If you're curious enough, you may want to dive into FF or Chrome's source code to see exactly what they do in such a case.
A document with more than one encoding isn't a mixed document, it is a broken document.
Unfortunately there are a lot of web pages that use an encoding that doesn't match the document's declaration, or that contain some data that is valid in the declared encoding and some that is invalid.
There is no good way to handle this. It is possible to try and guess the encoding of a document, but it is difficult and not 100% reliable. In cases like yours, the simplest solution is just to ignore parts of the document that can't be decoded.
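As a rough illustration of both options (guessing via a fallback, and ignoring undecodable bytes), here is a sketch using the JDK's CharsetDecoder. The class and method names are hypothetical, and the choice of fallback charset is an assumption, not a recommendation. Note that strict decoding can never fail for ISO-8859-1, because every byte value maps to some character there, so this kind of check is only useful for encodings such as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class LenientDecoder {
    /** Decodes strictly: any invalid byte sequence raises an exception. */
    static String decodeStrict(byte[] bytes, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Tries the declared charset first; on failure, decodes with the fallback. */
    static String decodeWithFallback(byte[] bytes, Charset declared, Charset fallback) {
        try {
            return decodeStrict(bytes, declared);
        } catch (CharacterCodingException e) {
            return new String(bytes, fallback);
        }
    }

    /** Silently drops the parts of the input that cannot be decoded. */
    static String decodeIgnoringErrors(byte[] bytes, Charset cs) {
        try {
            return cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.IGNORE)
                    .onUnmappableCharacter(CodingErrorAction.IGNORE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE, but decode() declares the exception.
            throw new IllegalStateException(e);
        }
    }
}
```

For the bytes of "We", 0x92, "ve", decodeWithFallback with UTF-8 and a windows-1252 fallback yields the text with a proper apostrophe, while decodeIgnoringErrors with UTF-8 yields "Weve".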
Apache Tika has an encoding detector. There are also commercial alternatives if you need, say, something in C++ and are in a position to spend money.
I can pretty much guarantee that each web page is in one encoding, but it's easy to be mistaken about which one.
This seems like an issue with special characters. Check whether StringEscapeUtils.escapeHtml, or any other method in that class, helps.
Edited: added this code as he was not able to get it working.
import org.apache.commons.lang.StringEscapeUtils; // assumes Apache Commons Lang 2.x

public class EscapeDemo {
    public static void main(String[] args) {
        String asd = "’";
        System.out.println(StringEscapeUtils.escapeXml(asd));  // output - &#8217;
        System.out.println(StringEscapeUtils.escapeHtml(asd)); // output - &rsquo;
    }
}
Related
I have a problem where I am trying to cleanse the request content to strip out HTML and javascript if included in the input parameters.
This is basically to protect against XSS attacks and the ideal mechanism would be to validate input and encode the output but due to some restrictions I cannot work on the output end.
All I can do at this time is to try to cleanse the input through a filter. I am using ESAPI to canonicalize the input parameters and also using jsoup with the most restrictive Whitelist.none() option to strip all HTML.
This works as long as the malicious javascript is within some HTML tags but fails for a URL with javascript code without any HTML surrounding it, eg:
http://example.com/index.html?a=40&b=10&c='-prompt``-'
ends up showing an alert on the page. This is kind of what I am doing right now:
param = encoder.canonicalize(param, false, false);
param = Jsoup.clean(param, Whitelist.none());
So the question is:
Is there some way through which I can make sure that my input is stripped of all HTML and javascript code at the filter?
Should I throw in some regex validations but is there any regex that will take care of the cases that are getting past the check I have right now?
DISCLAIMER:
If output-escaping is not allowed in your internet-facing solution, you are in a NO-WIN SCENARIO. It's like antivirus on Windows: You'll be able to detect specific and known attacks, but you will be unable to detect or defend against unknown attacks. If your employer insists on this path, your due diligence is to make management aware of this fact and get their acceptance of the risks in writing. Every time I've confronted management with this, they've opted for the correct solution--output escaping.
================================================================
First off... watch out when using JSoup in any kind of a cleaning/filtering/input validation situation.
Upon receiving invalid HTML, like
<script>alert(1);
Jsoup will add in the missing </script> tag.
This means that if you're using Jsoup to "cleanse" HTML, it first transforms INVALID HTML into VALID HTML, before it begins processing.
So the question is: Is there some way through which I can make sure
that my input is stripped of all HTML and javascript code at the
filter? Should I throw in some regex validations but is there any
regex that will take care of the cases that are getting past the check
I have right now?
No. ESAPI's input validation is not appropriate for your use case, because HTML is not a regular language and ESAPI's validation is driven by regular expressions. The fact is you cannot do what you ask:
Is there some way through which I can make sure that my input is
stripped of all HTML and javascript code at the filter?
And still have a functioning web application that requires user-defined HTML/JavaScript.
You can stack the deck in your favor a little bit: I would choose something like OWASP's HTML Sanitizer, and test your implementation against the XSS inputs listed here.
Many of those inputs are taken from OWASP's XSS Filter evasion cheat sheet, and will at least exercise your application against known attempts. But you will never be secure without output escaping.
===================UPDATE FROM COMMENTS==================
So the use case is to try and block all HTML and JavaScript. My recommendation is to implement Caja, since it encapsulates HTML, CSS, and JavaScript.
JavaScript, though, is also difficult to manage through input validation because, like HTML, it is a non-regular language. Additionally, each browser has its own implementation that deviates in different ways from the ECMAScript spec. If you want to prevent your input from being interpreted, you would ideally need a parser for each browser family attempting to interpret the user input in order to block it.
All of that, when all you really have to do is make sure that the output is escaped. Sorry to beat a dead horse, but I have to stress that output escaping is 100x more important than rejecting user input. You want both, but if forced to choose one or the other, output escaping is less work overall.
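To make "output escaping" concrete, here is a minimal hand-rolled sketch for the HTML body and attribute context only. The class and method names are hypothetical; in a real application a maintained encoder library would be the safer choice, and values written into a JavaScript or URL context need a different encoding than the one shown here.

```java
class OutputEscapeSketch {
    /** Escapes the characters that are significant in HTML text and quoted attributes. */
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The unterminated tag from the Jsoup example is rendered as inert text.
        System.out.println(escapeHtml("<script>alert(1);"));
        // The quotes in the query-string payload can no longer close an attribute.
        System.out.println(escapeHtml("'-prompt``-'"));
    }
}
```

The point of the design is that escaping happens at the moment a value is written into the page, so it does not matter what got past the input filter.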
This is the beginning -- I have a file on disk which is an HTML page. When I open it in a regular web browser it displays as it should, i.e. no matter what encoding is used, I see the correct national characters.
Then I come in -- my task is to load the same file, parse it, and print out some pieces on the screen (console), let's say the text of all <hX> elements. Of course I would like to see only correct characters, not some mumbo-jumbo. The last step is changing some of the text and saving the file.
So the parser has to handle encoding in both directions as well. So far I am unaware of a parser which is even capable of loading the data correctly.
Question
What parser would you recommend?
Details
An HTML page in general has its encoding given in the header (in a meta tag), so the parser should use it. A scenario where I have to look at the file in advance, check the encoding, and then manually set the encoding in code is a no-go. For example, this is taken from the JSoup tutorials:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
I cannot do that; the parser has to handle encoding detection by itself.
In C# I faced a similar problem with loading HTML. I used HTMLAgilityPack: I first ran the encoding detection, then used the result to decode the data stream, and after that I parsed the data. So I did both steps explicitly, but since the library provides both methods that is fine with me.
Such an explicit separation might even be better, because when the header is missing it would be possible to use a probabilistic encoding detection method instead.
The Jsoup API reference says for that parse method that if you provide null as the second argument (the encoding one), it'll use the http-equiv meta-tag to determine the encoding. So it looks like it already does the "parse a bit, determine encoding, re-parse with proper encoding" routine. Normally such parsers should be capable of resolving the encoding themselves using any means available to them. I know that SAX parsers in Java are supposed to use byte-order marks and the XML declaration to try and establish an encoding.
Apparently Jsoup will default to UTF-8 if no proper meta-tag is found. As they say in the documentation, this is "usually safe" since UTF-8 is compatible with a host of common encodings for the lower code points. But I take it that "usually safe" might not really be good enough in this case.
If you don't sufficiently trust Jsoup to detect the encoding, I see two alternatives:
If you can somehow be certain that the HTML is always in fact XHTML, then an XML parser might prove a better fit. But that would only work if the input is definitely XML compliant.
Do a heuristic encoding detection yourself: check for byte-order marks, parse a portion using common encodings and look for a meta tag, detect the encoding from byte patterns you'd expect in header tags and finally, all else failing, use a default.
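The second alternative could be sketched roughly as follows. This is an illustration, not Jsoup's actual algorithm: the class name, the 1024-byte prefix, and the regular expression are all assumptions. It checks for a byte-order mark, then scans an ASCII-compatible view of the start of the file for a charset declaration, and finally returns a caller-supplied default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Matches both <meta charset="..."> and the http-equiv form "...; charset=...".
    private static final Pattern META_CHARSET =
            Pattern.compile("(?i)<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:+-]+)");

    static Charset sniff(byte[] html, Charset fallback) {
        // Step 1: byte-order marks.
        if (html.length >= 3 && (html[0] & 0xFF) == 0xEF
                && (html[1] & 0xFF) == 0xBB && (html[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (html.length >= 2) {
            int b0 = html[0] & 0xFF;
            int b1 = html[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }
        // Step 2: read a prefix as ISO-8859-1 (never fails) and look for a meta tag.
        int n = Math.min(html.length, 1024);
        String head = new String(html, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default.
            }
        }
        // Step 3: all else failing, use the default.
        return fallback;
    }
}
```

The returned charset can then be passed as the encoding argument of the parser, which keeps detection and parsing as two explicit steps, as in the C# approach described in the question.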
I am trying to encode Arabic text from a Web service. Currently the values come as question marks (???).
I have read many blogs (and even Stack Overflow answers/links) but nothing seems to have worked.
Any idea of how I can resolve this issue?
Thanks
If you paste your Arabic text into Dreamweaver's design view, you will get ASCII characters in Dreamweaver's code view, which will work in any web browser.
First, an important aside: check that the web service you are consuming sends you actual Arabic characters and not actual question marks. Check a network dump if you are not sure, and use wget/curl to perform a simple transaction; check the results.
If the raw data as sent by the WS is question marks, you have an uphill battle: try again and fiddle with the Accept/Accept-Charset headers. If all else fails, it may be that the server itself isn't coded properly, and there isn't much you can do after that.
Also, a note on terminology: you are trying to decode the text, not encode it, that is, to convert it from a byte representation to abstract characters.
This was a problem I had when sending UTF-8 data from Android. Your code should work fine if you encode your String as Base64 before sending it; on the PHP server you then just decode the Base64 string back. It worked for me. I can share the code if you need it.
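A small sketch of where literal question marks typically come from, assuming the service really does send UTF-8 bytes. The class and method names are made up for illustration. Decoding the raw bytes as UTF-8 preserves the Arabic text, whereas passing the text through a charset that has no Arabic letters replaces each one with '?'.

```java
import java.nio.charset.StandardCharsets;

class ArabicDecodeDemo {
    /** Decodes raw response bytes as UTF-8. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Simulates a lossy hop: ISO-8859-1 cannot represent Arabic letters. */
    static String roundTripLatin1(String text) {
        byte[] lossy = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(lossy, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Five Arabic letters, written as escapes so the source file encoding is irrelevant.
        String original = "\u0645\u0631\u062D\u0628\u0627";
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeUtf8(wire).equals(original)); // prints true
        System.out.println(roundTripLatin1(original));         // prints ?????
    }
}
```

If the question marks are already present in the bytes on the wire, the loss happened on the server side and no amount of client-side decoding can recover the text.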
In the past I have seen different JSON-encoded values for the same string depending on the language used. Since the APIs were used in a closed environment (no 3rd parties allowed), we made a compromise and all our Java applications encode Unicode characters manually. LinkedIn's API is returning "corrupted" values, basically the same as our Java applications. I've already posted a question on their forum; the reason I am asking it here as well is quite simple: sharing is caring :) This question is therefore partially connected with LinkedIn, but mostly tries to find an answer to the general encoding problem described below.
As you can see, my last name contains a letter ž, which should be \u017e but Java (or LinkedIn's API for that matter) returns \u009e with JSON and nothing with XML response. PHP's json_decode() ignores it and my last name becomes Kurida.
After an investigation, I've found ž apparently has two representations, 9e and 17e. What exactly is going on here? Is there a solution for this problem?
U+009E is a usually-invisible control character and not an acceptable alternative representation for ž.
The byte 0x9E represents the character ž in Windows code page 1252. That byte, if decoded using ISO-8859-1, would turn into U+009E.
(The confusion comes from the fact that if you write &#x9E; in an HTML page, the browser doesn't actually give you character U+009E, as you might expect, but converts it to U+017E. The same is true of all the character references 0080–009F: they get changed as if the numbers referred to cp1252 bytes instead of Unicode characters. This is utterly bizarre and wrong behaviour, but all the major browsers do it so we're stuck with it now. Except in proper XHTML served as XML, since that has to follow the more sensible XML rules.)
Looking at the forum page, the JSON-reading is clearly not wrong: your name is registered as being “David Kurid[U+009E]a”. How that data got into their system needs looking at.
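If the stored data cannot be fixed at the source, a client-side workaround is to undo the wrong decoding. The sketch below is an assumption about how one might do that, not something the API provides: it re-encodes the string as ISO-8859-1 to recover the original bytes and decodes them as windows-1252. It is only safe when every character in the string is at or below U+00FF; anything above that would be turned into '?'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class C1Repair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Reinterprets a string that was wrongly decoded as ISO-8859-1 as windows-1252. */
    static String reinterpretAsCp1252(String wronglyDecoded) {
        // ISO-8859-1 maps U+0000..U+00FF one-to-one back to the original bytes.
        byte[] raw = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, CP1252);
    }

    public static void main(String[] args) {
        String repaired = reinterpretAsCp1252("Kurid\u009Ea");
        // The control character U+009E has become U+017E.
        System.out.printf("U+%04X%n", (int) repaired.charAt(5)); // prints U+017E
    }
}
```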
I'm making a small project in Google AppEngine but I'm having problems with international chars. My program takes data from the user through the URL "page.html?data1&data2..." and stores it for displaying later.
But when the user enters international characters like åäö, they get coded as %F4, %F5 and %F6. I assume it is because only the first 128(?) chars in the ASCII table are allowed in HTTP requests.
Does anyone have a good solution for this? Is there a simple way to decode the text? And is it better to decode it before I store the data, or should I decode it when displaying it to the user?
URLs can carry any data, but it has to be encoded. In Java you can use URLEncoder and URLDecoder to encode and decode URL components with the desired character encoding.
Keep in mind that these classes are actually meant for HTML form encoding. They can be applied to the query string (the parameters) of a URL, so do not use them on whole URLs, only on the parameters.
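A minimal sketch of that round trip for a single parameter value. The wrapper class is hypothetical; the URLEncoder and URLDecoder calls are the standard JDK ones that take a charset name. The same text produces different percent sequences depending on the charset, so both sides have to agree on it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class QueryParamCodec {
    /** Encodes one parameter value (not a whole URL). */
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    /** Decodes one parameter value using the charset the sender used. */
    static String decode(String value, String charsetName) {
        try {
            return URLDecoder.decode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00E5\u00E4\u00F6"; // the letters å, ä, ö
        System.out.println(encode(text, "UTF-8"));      // prints %C3%A5%C3%A4%C3%B6
        System.out.println(encode(text, "ISO-8859-1")); // prints %E5%E4%F6
        System.out.println(decode("%E5%E4%F6", "ISO-8859-1").equals(text)); // prints true
    }
}
```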
The URI spec (RFC 3986) restricts the characters that can be used in URIs (see the ABNF) and defines a percent-encoding scheme for transmitting "unsafe" characters. As Bozho says, the query part of the URL is usually encoded as per the HTML spec (application/x-www-form-urlencoded).
The doc for App Engine says:
App Engine uses the Java Servlet standard for web applications.
So, you should probably let the Servlet API decode the parameters for you. See the parameter methods on HttpServletRequest. This sort of encoding should generally be kept to the view layer, so data would be stored unencoded.
If you do things manually, have a look at this blog post on character handling in URIs.