I'm making a small project in Google AppEngine but I'm having problems with international chars. My program takes data from the user through the url "page.html?data1&data2..." and stores it for displaying later.
But when the user enters international characters like åäö, they get encoded as %F4, %F5 and %F6. I assume this is because only the first 128(?) characters of the ASCII table are allowed in HTTP requests.
Is there anyone who has a good solution for this? Any simple way to decode the text? And is it better to decode it before I store the data or should I decode it when displaying it to the user.
URLs can contain anything, but it should be encoded. In Java you can use URLEncoder and URLDecoder to encode and decode urls with the desired character encoding.
Keep in mind that these classes are actually meant for HTML form encoding, but they can be applied to the query string (the parameters) of a URL. So do not use them on the whole URL, only on the parameters.
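A minimal sketch of the advice above, assuming UTF-8 is the desired encoding and Java 10 or later (for the Charset overloads of encode and decode). The class and method names are made up for illustration; only the parameter value is encoded, never the whole URL.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class QueryParamDemo {
    // Encode a single parameter value (not the whole URL) as UTF-8 form data.
    static String encodeParam(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Reverse the encoding; the charset must match the one used to encode.
    static String decodeParam(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = "\u00e5\u00e4\u00f6 & more"; // "åäö & more"
        String url = "page.html?data1=" + encodeParam(value);
        System.out.println(url);
        // Round trip: decoding the encoded value gives back the original text.
        System.out.println(decodeParam(encodeParam(value)).equals(value));
    }
}
```

Inside a servlet you normally would not call decodeParam yourself, since the container decodes parameters for you.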
The URI spec (RFC 3986) restricts the characters that can be used in URIs (see the ABNF) and defines a percent-encoding scheme for transmitting "unsafe" characters. As Bozho says, the query part of the URL is usually encoded as per the HTML spec (application/x-www-form-urlencoded).
The doc for App Engine says:
App Engine uses the Java Servlet standard for web applications.
So, you should probably let the Servlet API decode the parameters for you. See the parameter methods on HttpServletRequest. This sort of encoding should generally be kept to the view layer, so data would be stored unencoded.
If you do things manually, have a look at this blog post on character handling in URIs.
Let's take a look at this scenario: you have a textbox that allows the user to paste any kind of text (UTF-8, Chinese or Arabic characters), then a Submit button to insert that text into a MySQL DB.
Normally, I use URLEncoder.encode(text,"UTF-8") & my App runs really stably; I never worried if the users inserted any special characters since the text was encoded so when I read the text, I just decoded it & the text came out exactly the way it was before.
But some guys said that we can set UTF8 in MySQL and Tomcat server or something so we don't need to encode, but this solution requires configuration and I hate configuration as it is not a very sound solution.
Besides, users can enter junk code to hack the DB.
So, In Java & MYSQL, is it good practice to encode text when it is inserted into the DB?
Some people in other forums said it is very bad to store encoded text in the DB, but they didn't say why it is bad.
So this question is for people who have a lot of experience in Java and MySQL to answer!
The problem with putting URL or XML encoded text into the database is that it makes life difficult for querying and doing other processing of that text.
The other problem is that there are different types of escaping that are required in different contexts.
... but this solution requires configuration & I hate configuration as it is not a very sound solution.
Ermm, asserting that configuration is "not a very sound solution" is not a rational argument. The vast majority of applications with a database component require some kind of database configuration.
Besides, users can enter junk code to hack the DB.
The real solution to SQL injection is to use PreparedStatement and fixed SQL query, insert, update, etc strings. Use placeholders for all of the query parameters and use the PreparedStatement set parameter methods to supply their values. This will properly quote the text in the parameters to remove the possibility of SQL injection attacks.
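As a rough sketch of what that looks like with plain JDBC: the "comments" table, its columns, and the class name are hypothetical, and the Connection is assumed to be obtained elsewhere (e.g. from a DataSource).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class CommentDao {
    // Fixed SQL string; the "comments" table and its columns are hypothetical.
    static final String INSERT_SQL =
            "INSERT INTO comments (author, body) VALUES (?, ?)";

    // User input is bound as parameters, never concatenated into the SQL.
    static int insertComment(Connection conn, String author, String body)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, author);
            ps.setString(2, body);
            return ps.executeUpdate();
        }
    }
}
```

The text is passed through unmodified, so nothing has to be URL-encoded or otherwise pre-processed before it is stored.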
The other thing you need to worry about is people using unescaped XML / HTML metacharacters (like <, > and quotes) to effect XSS attacks against other users. The way to defeat that is to escape the text at the point you are creating the HTML. For instance, in a JSP you can use the <c:out> tag to escape the text.
Finally, HTML URL encoded text can't be inserted directly into an HTML page. The URL encoding scheme (using %'s and +'s) is not the correct encoding scheme for text in an HTML page. There you need to use &...; character entities to encode things. A %xx in text will appear as exactly that when you display your web page in a browser. Try it and see!
Answering the questions in the comments:
iamthepiguy said "encode everything before putting it into Db", but u said "No". Suppose i put Html text into DB, there a lot of special characters & many other stuffs, how can we let Db to handle all of them, for example, if mysql doesn't recognize a char, it will turn to "?" & it means the text got corrupted, it mean the users lost that text. How Mysql handle all kind of special characters?
If you use a PreparedStatement with SQL that has placeholders for all of the text parameters, then the JDBC driver takes care of the escaping automatically.
Also, since there is a very diversity of UTF & special chars, so how many other things we need to worry if we do not encode text to make sure the system run stably?
Same answer.
Encoded text make the system run a bit slower, but we are headache-free.
There are no headaches if you use prepared statements and <c:out> (or the equivalent).
you sid "The way to defeat that is to escape the text at the point you are creating the HTML." so we have to use Java to encode right?
Yes, but you ONLY HTML encode the text when you output it for inclusion in a web page. If you output it as JSON, you encode using JSON escaping ... or more likely, you let the JSON serializer do it for you. If you send the text in other formats, or include it in other things, you encode it as required ... or not at all.
But the point is that you don't store it in the database in encoded form. If you do, then in nearly all cases (including HTML!!) you'd need to decode the HTML URL-encoded text before encoding it in the correct way.
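To illustrate the extra step, here is a small sketch comparing the two storage choices. The class and method names are invented, the hand-rolled escaper only stands in for <c:out> or a library routine, and Java 10+ is assumed for the Charset overload of URLDecoder.decode.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class StoredTextDemo {
    // Minimal HTML text escaper for illustration only; a real application
    // would normally rely on <c:out> or a library routine instead.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // If the database holds URL-encoded text, it has to be decoded first...
    static String renderUrlEncoded(String stored) {
        return escapeHtml(URLDecoder.decode(stored, StandardCharsets.UTF_8));
    }

    // ...whereas plain stored text only needs escaping for the output format.
    static String renderPlain(String stored) {
        return escapeHtml(stored);
    }

    public static void main(String[] args) {
        System.out.println(renderUrlEncoded("a+%3C+b+%26+c"));
        System.out.println(renderPlain("a < b & c"));
    }
}
```

Both calls produce the same HTML, but the first one pays for a decode on every read and the stored value is useless for searching.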
It is somewhat better in terms of stability and configuration, as well as safety from XSS attacks, to encode everything before putting it in the database. The disadvantages are that it takes slightly longer and uses slightly more space in the DB. You could instead escape everything again each time it is output, but it's easier to escape everything once.
I have a website, and need to store data from a text field into a mysql database.
The frontend is Perl. I used utf8::encode to encode the data into UTF-8.
The request is made to the Java backend which connects to the mysql db and inserts this text.
For the table the default charset is set to utf8.
This works in many cases, but it fails in some cases.
If I use テスト, the data stored in the database shows question marks: ã??ã?¹ã??.
If I try to insert the utf8 encoded string directly from the sql browser, everything works fine.
Update events set summary = ãã¹ã where event_id = 11657;
While inserting I noticed there are some blank characters that show up in the mysql query browser, something like: ã ã¹ ã.
After inserting from here, data in the database shows some boxes in the database instead of these spaces, and テスト displays correctly on the website after utf8 decoding it.
The problem is only when I insert directly from the website, these special characters come up as question marks in the database.
Can someone please help me with these special characters? Do I need to handle them differently?
We faced a similar issue in one of our projects, so we had to write a small routine to convert those UTF-8 characters into HTML entities before storing them in the database.
Use StringEscapeUtils from Apache Commons Lang:
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);
If the database really stored テスト, that's what you should see in the sql browser instead of mojibake.
It sounds like the Java backend is interpreting what Perl sends as ISO-8859-1 rather than UTF-8. This explains how テ gets converted into \u00E3\u0083\u0086. Then the backend tries to send the data to the database in Windows-1252 - the MySQL default encoding. Unfortunately Windows-1252 cannot represent the Unicode characters in the range \u0080-\u009F, so the last two characters are replaced by question marks.
So you have two problems:
You should make the Java backend read the request in UTF-8 rather than in ISO-8859-1.
The backend should use UTF-8 when talking with the database. The easiest way to do this is adding characterEncoding=utf8 to the connection parameters.
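The suspected chain of conversions can be simulated in a few lines. This is only a sketch of the diagnosis above, not the actual backend code; the class and method names are invented, and it assumes the JDK ships the windows-1252 charset (standard builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulates the suspected pipeline: UTF-8 bytes misread as ISO-8859-1,
    // then written out as Windows-1252, which has no mapping for some of the
    // resulting characters (String.getBytes substitutes '?' for those).
    static String mangle(String original) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        byte[] stored = misread.getBytes(cp1252);
        return new String(stored, cp1252);
    }

    public static void main(String[] args) {
        String original = "\u30c6\u30b9\u30c8"; // テスト
        System.out.println(mangle(original));
    }
}
```

The result has the same shape as the ã??ã?¹ã?? value reported in the question, which supports the idea that both conversions are happening.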
I'm assuming that you are sending POST parameters.
I think that the most likely cause of your initial problem is one of the following:
If the parameters are being sent in the HTTP request body, your Perl front-end is probably not setting the encoding in the content type header of the request. The web server is probably assuming ISO-8859-1. The solution to this is to set the request content type properly.
If the parameters are sent in the HTTP request URL, your web server is using the wrong character set when decoding the request parameters. The solution to this is going to be web-server specific ...
It sounds like there might also be a character set problem in talking to the database, but that might just be a consequence of earlier mangling.
First I would like to say thank you for the help in advance.
I am currently writing a web crawler that parses HTML content, strips HTML tags, and then spell checks the text which is retrieved from the parsing.
Stripping HTML tags and spell checking has not caused any problems, using JSoup and Google Spell Check API.
I am able to pull down content from a URL and pass this information into a byte[] and then ultimately a String so that it can be stripped and spell checked. I am running into a problem with character encoding.
For example when parsing http://www.testwareinc.com/...
Original Text: We’ve expanded our Mobile Web and Mobile App testing services.
... the page is using ISO-8859-1 according to meta tag...
ISO-8859-1 Parse: Weve expanded our Mobile Web and Mobile App testing services.
... then trying UTF-8...
UTF-8 Parse: We�ve expanded our Mobile Web and Mobile App testing services.
Question
Is it possible that HTML of a webpage can include a mix of encodings? And how can that be detected?
It looks like the apostrophe is coded as a 0x92 byte, which according to Wikipedia is an unassigned/private code point.
From there on, it looks like the browser falls back by assuming it's a non-encoded 1-byte Unicode code point: U+0092 (Private Use Two), which appears to be represented as an apostrophe. No wait, if it's one byte, it's more probably cp1252: browsers must have a fallback strategy according to the advertised code page, such as ISO-8859-1 -> CP1252.
So no mix of encoding here but as others said a broken document. But with a fallback heuristic that will sometimes help, sometimes not.
If you're curious enough, you may want to dive into FF or Chrome's source code to see exactly what they do in such a case.
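The three interpretations of that 0x92 byte can be checked directly. This is just a sketch with made-up class and method names; it assumes the JDK provides the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ApostropheByteDemo {
    // Decode a single byte under the given charset and return the code point.
    static int decodeByte(int b, Charset cs) {
        return new String(new byte[] {(byte) b}, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.printf("ISO-8859-1:   U+%04X%n", decodeByte(0x92, StandardCharsets.ISO_8859_1));
        System.out.printf("windows-1252: U+%04X%n", decodeByte(0x92, cp1252));
        System.out.printf("UTF-8:        U+%04X%n", decodeByte(0x92, StandardCharsets.UTF_8));
    }
}
```

Only the windows-1252 reading yields U+2019 (the right single quotation mark); ISO-8859-1 gives the invisible control character U+0092, and UTF-8 gives the replacement character U+FFFD, matching the two failed parses in the question.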
Having more than 1 encoding in a document isn't a mixed document, it is a broken document.
Unfortunately there are a lot of web pages that use an encoding that doesn't match the document definition, or that contain some data that is valid in the given encoding and some content that is invalid.
There is no good way to handle this. It is possible to try and guess the encoding of a document, but it is difficult and not 100% reliable. In cases like yours, the simplest solution is just to ignore parts of the document that can't be decoded.
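One possible heuristic, sketched below, is to try a strict UTF-8 decode and fall back to windows-1252 only if that fails. This is an assumption about what might work for pages like the one in the question, not a reliable detector, and the class name is invented.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LenientDecoder {
    // Try strict UTF-8 first; if the bytes are not valid UTF-8, fall back to
    // windows-1252. This is a heuristic, not a reliable detector.
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        // "We", a stray 0x92 byte, then "ve": not valid UTF-8.
        byte[] broken = {'W', 'e', (byte) 0x92, 'v', 'e'};
        System.out.println(decode(broken));
    }
}
```

If you would rather drop undecodable bytes, as suggested above, CodingErrorAction.IGNORE on the decoder does that instead of throwing.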
Apache Tika has an encoding detector. There are also commercial alternatives if you need, say, something in C++ and are in a position to spend money.
I can pretty much guarantee that each web page is in one encoding, but it's easy to be mistaken about which one.
This seems like an issue with special characters. Check whether StringEscapeUtils.escapeHtml, or any other method in that class, helps.
edited: added this logic as he was not able to get code working
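import org.apache.commons.lang.StringEscapeUtils;

public class EscapeDemo {
    public static void main(String[] args) {
        String asd = "’";
        // Commons Lang 2.x: escapeXml writes non-ASCII characters as numeric references
        System.out.println(StringEscapeUtils.escapeXml(asd)); //output - &#8217;
        System.out.println(StringEscapeUtils.escapeHtml(asd)); //output - &rsquo;
    }
}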
In my Java app I'm preventing XSS attacks. I want to encode URL and hidden field parameters in the HttpServletRequest objects I have a handle on.
How would I go about doing this?
Don't do that. You're making it unnecessarily more complicated. Just escape it during display only. See my answer in your other topic: Java 5 HTML escaping To Prevent XSS
To properly display user-entered data on an HTML page, you simply need to ensure that any special HTML characters are properly encoded as entities, via String#replace or similar. The good news is that there is very little you need to encode (for this purpose):
str = str.replace("&", "&amp;").replace("<", "&lt;");
You can also replace > if you like, but there's no need to.
This isn't only because of XSS, but also just so that characters show up properly. You may also want to handle ensuring that characters outside the common latin set are turned into appropriate entities, to protect against charset issues (UTF-8 vs. Windows-1252, etc.).
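Putting both ideas together, a hand-rolled escaper might look like the sketch below. The class name is invented; the choice to also escape > and the double quote, and to emit decimal numeric character references for everything above 0x7F, is one reasonable option rather than a requirement.

```java
class HtmlText {
    // Escape text for inclusion in HTML element content or a quoted attribute.
    // Code points above 0x7F become numeric character references, so the
    // output is pure ASCII and immune to charset mix-ups.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        s.codePoints().forEach(cp -> {
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp > 0x7F) {
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<b>caf\u00e9 & \"tea\"</b>"));
    }
}
```

Iterating over code points rather than chars keeps characters outside the Basic Multilingual Plane intact as a single reference.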
You can use StringEscapeUtils from the library Apache Jakarta Commons Lang
http://www.jdocs.com/lang/2.1/org/apache/commons/lang/StringEscapeUtils.html
Is there a better way to shorten (Use fewer characters) a String in java besides converting the chars to int's and running them through base36?
For example, say if I wanted to shorten a URL.
Short URL services (like 'tinyurl') work by storing a big database table that maps from short URLs to their full form.
When you request a tinyurl, the service allocates a random-looking short url (that is not currently in use) and creates an entry in its table that maps from the short url to your supplied longer one.
When you try to load the short url in a browser, the request first goes to the tinyURL service, which looks up the full URL and then sends an HTTP redirect response to the browser telling it to go to the real URL.
You can implement your own URL shortening service by doing the same thing, though if you are shortening your own URLs you can maybe do the redirection internally to your web server; e.g. using a servlet request filter.
I described the above in the context of shortening URLs in a way that still allows the URLs to be resolved [1]. But this approach can also be used more generally, i.e. by creating a pair of Map<String,String> objects and populating them with bidirectional mappings between sequentially generated short strings and the original (probably longer) strings. It is possible to prove that this will give a smaller average size of short string than any algorithmic compression or encoding scheme over the same set of long strings.
The downside is the space needed to store the mappings, and the fact that you need the mappings any place (e.g. on any computer) where you need to do the short-to-long or long-to-short conversions.
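The general mapping approach could be sketched like this. It is an in-memory illustration only: the class name is invented, the base-36 counter is just one way to generate sequential short strings, and a real service would need persistence and thread safety.

```java
import java.util.HashMap;
import java.util.Map;

class StringShortener {
    private final Map<String, String> longToShort = new HashMap<>();
    private final Map<String, String> shortToLong = new HashMap<>();
    private long nextId = 0;

    // Return the existing short key, or allocate the next sequential one.
    String shorten(String original) {
        String existing = longToShort.get(original);
        if (existing != null) {
            return existing;
        }
        String key = Long.toString(nextId++, 36); // base-36 counter: 0, 1, ..., z, 10, ...
        longToShort.put(original, key);
        shortToLong.put(key, original);
        return key;
    }

    // Look up the original string; returns null for an unknown key.
    String expand(String key) {
        return shortToLong.get(key);
    }

    public static void main(String[] args) {
        StringShortener s = new StringShortener();
        String key = s.shorten("http://example.com/a/very/long/path?with=parameters");
        System.out.println(key + " -> " + s.expand(key));
    }
}
```

The short key carries no information about the original string, which is why both maps (or the database table behind them) must be available wherever the conversion happens.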
1 - When you think about it, that is essential. If you shorten a URL string and the result is no longer resolvable, it is not a useful URL for most purposes.
Since URLs are UTF-8, and since the characters are therefore base 256, encoding the same characters as integer code-points in base 32 can only make them longer. Or are you not asking what it sounds like you are asking?
Further, in Java Strings are base 65536 UTF-16, so encoding their code points as base 32 will make Java strings even longer.
Just as encoding binary data in base 64 makes it longer by a factor of 4/3: every 3 bytes require 4 base-64 characters to encode.
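The 4/3 growth is easy to confirm with the JDK's own Base64 encoder (Java 8+). The class and method names here are just for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Growth {
    // Length of the padded Base64 encoding of the given bytes.
    static int encodedLength(byte[] data) {
        return Base64.getEncoder().encodeToString(data).length();
    }

    public static void main(String[] args) {
        byte[] url = "http://example.com/some/page".getBytes(StandardCharsets.US_ASCII);
        System.out.println(url.length + " bytes -> " + encodedLength(url) + " Base64 characters");
    }
}
```

Because of padding, the encoded length is always rounded up to a multiple of 4, so short inputs grow by even more than 4/3.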
Put the full URLs in a database and give the id as the redirect URL.