In the past I have seen different JSON-encoded values for the same string depending on the language used. Since the APIs were used in a closed environment (no third parties allowed), we made a compromise: all our Java applications manually encode Unicode characters. LinkedIn's API is returning "corrupted" values, essentially the same as our Java applications. I have already posted a question on their forum; the reason I am asking here as well is simple: sharing is caring :) This question is therefore partially connected with LinkedIn, but mostly an attempt to find an answer to the general encoding problem described below.
As you can see, my last name contains the letter ž, which should be \u017e, but Java (or LinkedIn's API, for that matter) returns \u009e in the JSON response and nothing in the XML response. PHP's json_decode() ignores it and my last name becomes Kurida.
After some investigation, I found that ž apparently has two representations, 009e and 017e. What exactly is going on here? Is there a solution to this problem?
U+009E is a usually-invisible control character and not an acceptable alternative representation for ž.
The byte 0x9E represents the character ž in Windows code page 1252. That byte, if decoded using ISO-8859-1, would turn into U+009E.
(The confusion comes from the fact that if you write the character reference &#x9E; in an HTML page, the browser doesn't actually give you character U+009E, as you might expect, but converts it to U+017E. The same is true of all the character references 0080–009F: they get changed as if the numbers referred to cp1252 bytes instead of Unicode characters. This is utterly bizarre and wrong behaviour, but all the major browsers do it, so we're stuck with it now. Except in proper XHTML served as XML, since that has to follow the more sensible XML rules.)
Looking at the forum page, the JSON reading is clearly not wrong: your name is registered as being "David Kurid[U+009E]a". How that data got into their system is what needs looking at.
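The two readings of that byte can be reproduced in a few lines of Java; a minimal sketch (the byte value 0x9E is taken from the question):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        byte[] raw = { (byte) 0x9E };

        // Decoded as windows-1252, the byte 0x9E is the letter ž (U+017E).
        String cp1252 = new String(raw, Charset.forName("windows-1252"));
        // Decoded as ISO-8859-1, the same byte becomes the usually-invisible
        // control character U+009E.
        String latin1 = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.printf("windows-1252: U+%04X%n", (int) cp1252.charAt(0)); // U+017E
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) latin1.charAt(0)); // U+009E
    }
}
```

So whatever wrote the data into their system most likely decoded a cp1252 byte as ISO-8859-1.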
Related
I have to read a few HTML files. If I use UTF-8 as the charset for reading and writing a file, some junk characters get displayed in the HTML page. It seems the actual file is ANSI-encoded: since I am using UTF-8 for reading and writing, a few white spaces are displayed as a black diamond with a question mark.
Is there a way to find the encoding/charset to be used to read/write a particular file?
No, that's mathematically impossible. Files are just bags of bytes, and most encodings are such that any byte has meaning. Short of an artificial-intelligence setup that analyses how likely it is that you read the file with the right encoding (looking for words that mix characters from different Unicode planes, and the like), there is no way to be sure.
Some files can be conclusively determined not to be UTF-8 (or to be corrupted), because there are certain byte sequences that cannot appear in the byte stream that results when you UTF-8-encode characters. However, this isn't very useful either: you cannot conclude "Oh! Must be UTF-8!" based on the lack of these invalid sequences.
You have some options
The right way
When you saved those HTML files is when the encoding was either chosen or known. Either the HTML was received from the web server, loaded into browser memory (decoded from bytes to chars using the charset listed in the HTTP Content-Type response header), and then you asked the browser to save it to a file, at which point the browser had to choose an encoding; or the tool used to save the HTML saved it 'raw', straight as it was sent over the HTTP connection, in which case the tool knew the encoding, because the HTTP server sent it in the Content-Type header. Either way, that was the perfect time to store this information, or to choose a well-known encoding (UTF-8 is a good idea).
So, go back to whichever software and/or process saved these files and fix it at the source: either save the encoding alongside the file, or ensure the HTML file is saved as UTF-8 no matter what encoding the HTTP server sent it in.
The hacky way
Grab a magnifying glass, put on your finest hat, and get your Sherlock Holmes on.
The usual strategy is to open a hex editor, go to the position in the file where you see diamonds or unexpected characters, and look at the byte sequence. Especially if it is a somewhat 'well known' western non-ASCII character like é or ö, a web search for the byte(s) you see there will usually find it. Look for the bytes with decimal value 128 or higher (in hex, the ones that start with an 8, 9, or a letter), because the ones below that are ASCII, and almost all encodings encode those the same way; they are therefore useless for differentiating encodings.
For example, if you search for 0xE1 0xBA 0x9E, the first hit leads you to a page which, scrolling down to 0xE1 0xBA 0x9E, says: that's the UTF-8 encoding of code point U+1E9E, the capital sharp s (the lowercase ß is common in German). If that makes sense in the text, we have figured it out. Strictly, we would need an AI to do text analysis to decide whether it makes sense; I don't have one, so we'll need an artificial artificial intelligence. In other words, your brain will have to do the job. Just look at it: if, after substituting a sharp s, the text says Last Name: Boßler, you have obviously got it. Boßler is a German last name, as well as a mountain in Germany. Web searching again to the rescue if you are not sure.
Sometimes you have to figure out what character it was supposed to be and include that in the search. For example, if you check the file, see a 0xDF, and know a ß has to be there, search for "0xDF ß" and you get to a page which shows a ton of encodings and how they store ß. Only a few store it as 0xDF: it's some ISO-8859 variant or a cp125x variant (a.k.a. windows-125x), and you have managed to exclude IBM852. There is no way to know which ISO-8859 or cp125x variant it actually is; you'll need more unusual characters, and you'll have to hope you hit one whose intended value you know and which these encodings store differently (unlikely; they are very similar).
Most likely you end up knowing that it is one of a few encodings, because usually there are multiple encodings that would all produce the exact same byte sequence. In fact, if the file is all-ASCII, there are thousands of encodings it could be.
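If you want to inspect the suspicious bytes programmatically rather than with a hex editor, a small helper like this (a sketch, not part of any particular tool) prints them in a form you can paste into a web search:

```java
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Print each byte as two hex digits, space-separated,
    // so the sequence can be searched for online.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "é" encoded as UTF-8 is the two-byte sequence C3 A9.
        System.out.println(toHex("é".getBytes(StandardCharsets.UTF_8))); // C3 A9
    }
}
```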
We have some data sourced in Italy being displayed from a server in Poland. We are getting some instances of character substitution. Specifically, à (small letter a with grave) is getting substituted with ŕ (small letter r with acute). We can see that à is encoded as 0xE0 in the CP1252 Western European character set, and that ŕ has the same value in the CP1250 Eastern European character set, so we know this is a character set issue.
The page is being served by a Websphere app server using JSPs. I have an experimental page where I can reproduce the problem, and sort of fix it, but not in an acceptable manner.
If I set this in my JSP:
response.setContentType("text/html;charset=windows-1250");
The problem is reproduced and the R with acute is displayed.
To sort of fix the problem, on the browser, I change the encoding to "Western European" in IE or "Western Windows-1252" in Chrome.
So this would naturally lead me to believe that if I set "windows-1252" in the content type, it would fix the problem, but it does not. When I do that, the character is then displayed as a question mark.
I have played with all kinds of combinations of response.setContentType, response.setCharacterEncoding, response.setLocale, <meta http-equiv>, <meta charset> and most everything results in the ? showing. Only setting 1250 on the content type and then changing the encoding on the browser itself seems to fix the problem.
Any suggestions?
Thanks
First of all, each source must come with the character set it was encoded with (i.e. you must know it); otherwise you won't know which character set to use when presenting that source, and the same problem will arise with the next data source.
Secondly, if you can, you should ask your sources to move to UTF-8 and have those providers re-encode their content.
Having a common character set for all your sources is the best solution (and UTF-8 is the most compatible, standards-oriented choice today). If you can't get the providers to do the conversion, then, knowing the source encoding, you can convert the data from the source charset to your charset yourself.
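As a sketch of that conversion idea in Java (the byte value 0xE0 is taken from the question; the charset names are the standard JDK ones):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    public static void main(String[] args) {
        // The Italian source encodes "à" as the single byte 0xE0 (cp1252).
        byte[] fromItaly = { (byte) 0xE0 };

        // Decoded with the wrong charset (cp1250), 0xE0 turns into "ŕ",
        // which is exactly the substitution seen on the Polish server.
        String wrong = new String(fromItaly, Charset.forName("windows-1250"));
        // Decoded with the charset it was actually written in, it is "à".
        String right = new String(fromItaly, Charset.forName("windows-1252"));

        // Once decoded correctly, re-encode everything as UTF-8 for output.
        byte[] utf8 = right.getBytes(StandardCharsets.UTF_8);

        System.out.println(wrong);       // ŕ
        System.out.println(right);       // à
        System.out.println(utf8.length); // 2 (UTF-8 needs two bytes for à)
    }
}
```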
Finally, two notes:
1) there's no way to show two pieces of content that use different character sets in a single web application (nor in a single web page), since, as you already found, you can only use one encoding at a time;
2) if your data content is strictly web-oriented, you can ask your sources to use HTML entities (but keep in mind that this could be a problem if you later present that content in, e.g., PDF form).
Let's say someone uses this letter: ë. They input it in an EditText box and it is correctly stored in the MySQL database (via a PHP script). But grabbing that database field with that special character produces an output of "null" in Java/Android.
It appears my database is set up and storing correctly, but retrieving is the issue. Do I have to fix this on the PHP side or handle it in Java/Android? EDIT: I no longer believe this has anything to do with the PHP side, so I am more interested in the Java side.
Sounds similar to: android, UTF8 - How do I ensure UTF8 is used for a shared preference
I suspect that the problem occurs over the web interface between the web service and the Android App. One side is sending UTF-16 or ISO 8859-1 characters, and the other is interpreting it as UTF-8 (or vice versa). Make sure:
That the web request from Android is using UTF-8
That the web service replies using UTF-8.
As in the other answer, use an HTTP debugging proxy to check that the characters being sent between the Android app and the web service are what you expect.
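A minimal sketch of reading a response body with an explicit charset on the Java side (the response bytes here are simulated with a ByteArrayInputStream, since the actual service URL isn't known; the same reader wrapping applies to a real connection's input stream):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadUtf8 {
    // Decode a response body as UTF-8 explicitly instead of relying on
    // the platform default charset (which varies between devices).
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulated response body: "ë" as the UTF-8 bytes C3 AB.
        byte[] body = { (byte) 0xC3, (byte) 0xAB };
        System.out.println(readAll(new ByteArrayInputStream(body))); // ë
    }
}
```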
I suggest extracting your database access code to a standard Java environment, then compiling and testing it there. This will help you isolate the problem.
Usually you won't get null even if there is an encoding problem. Check for other problems, and check whether any other exception is thrown.
It is definitely not a problem with PHP if you are sure the string is correctly inserted.
Probably a confusion between UTF-8 and UTF-16, or whatever other character set you might be using for storing these international characters. In UTF-16, the character ë is stored as two bytes, with the first byte being the null byte (0x00). If this byte pair is incorrectly transmitted back as, say, UTF-8, the null byte will be seen as the end-of-string terminator, resulting in the output of a null value instead of the international character.
First, you need to be 100% sure that your international characters are stored correctly in the database. Seeing the correct result on a PHP page on a web site is no guarantee of that, as two wrongs can give you a right. In the past I have often seen incorrectly stored characters in a database that were still displayed correctly on a web page or some other system. This looks OK until you need to access your database from another system, and at that point everything breaks loose, because you cannot repeat the same kind of errors on the second system.
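The null-byte effect described above can be seen directly in Java; a small sketch:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    public static void main(String[] args) {
        // "ë" (U+00EB) in UTF-16BE: high byte 0x00, then low byte 0xEB.
        // A consumer treating this as a C-style string stops at the 0x00.
        byte[] utf16 = "ë".getBytes(StandardCharsets.UTF_16BE);
        System.out.printf("UTF-16BE: %02X %02X%n", utf16[0], utf16[1]); // 00 EB

        // The same character in UTF-8 contains no null byte: C3 AB.
        byte[] utf8 = "ë".getBytes(StandardCharsets.UTF_8);
        System.out.printf("UTF-8:    %02X %02X%n", utf8[0], utf8[1]); // C3 AB
    }
}
```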
First I would like to say thank you for the help in advance.
I am currently writing a web crawler that parses HTML content, strips HTML tags, and then spell checks the text which is retrieved from the parsing.
Stripping HTML tags and spell checking has not caused any problems, using JSoup and Google Spell Check API.
I am able to pull down content from a URL, pass it into a byte[], and then ultimately a String so that it can be stripped and spell checked. I am running into a problem with character encoding.
For example when parsing http://www.testwareinc.com/...
Original Text: We’ve expanded our Mobile Web and Mobile App testing services.
... the page is using ISO-8859-1 according to meta tag...
ISO-8859-1 Parse: Weve expanded our Mobile Web and Mobile App testing services.
... then trying using UTF-8...
UTF-8 Parse: We�ve expanded our Mobile Web and Mobile App testing services.
Question
Is it possible that the HTML of a web page can include a mix of encodings? And how can that be detected?
It looks like the apostrophe is encoded as the byte 0x92, which according to Wikipedia corresponds to an unassigned/private-use code point.
From there, it looks like the browser falls back by assuming it's a raw one-byte Unicode code point: U+0092 (Private Use Two), which appears to be rendered as an apostrophe. No wait, if it's one byte, it's more probably cp1252: browsers have a fallback strategy for the advertised charset, such as ISO-8859-1 -> cp1252.
So there is no mix of encodings here, but, as others said, a broken document, plus a fallback heuristic that will sometimes help and sometimes not.
If you're curious enough, you may want to dive into FF or Chrome's source code to see exactly what they do in such a case.
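That fallback can be reproduced in Java; a short sketch using the byte from this answer:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SmartQuote {
    public static void main(String[] args) {
        byte[] raw = { (byte) 0x92 };

        // Decoded as windows-1252 (the browsers' fallback for ISO-8859-1),
        // 0x92 is the right single quotation mark ' (U+2019).
        String cp1252 = new String(raw, Charset.forName("windows-1252"));
        // Decoded strictly as ISO-8859-1 it is the control character U+0092.
        String latin1 = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.printf("windows-1252: U+%04X%n", (int) cp1252.charAt(0)); // U+2019
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) latin1.charAt(0)); // U+0092
    }
}
```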
Having more than one encoding in a document doesn't make it a mixed document; it makes it a broken document.
Unfortunately there are a lot of web pages that use an encoding that doesn't match the document definition, or contains some data that is valid in the given encoding and some content that is invalid.
There is no good way to handle this. It is possible to try and guess the encoding of a document, but it is difficult and not 100% reliable. In cases like yours, the simplest solution is just to ignore parts of the document that can't be decoded.
Apache Tika has an encoding detector. There are also commercial alternatives if you need, say, something in C++ and are in a position to spend money.
I can pretty much guarantee that each web page is in one encoding, but it's easy to be mistaken about which one.
Seems like an issue with special characters. Check whether StringEscapeUtils.escapeHtml helps, or any other method in that class.
Edited: added this logic as he was not able to get the code working
public static void main(String[] args) throws FileNotFoundException {
    String asd = "’";
    System.out.println(StringEscapeUtils.escapeXml(asd));  // output - &#8217;
    System.out.println(StringEscapeUtils.escapeHtml(asd)); // output - &rsquo;
}
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Java : How to determine the correct charset encoding of a stream
A user will upload a CSV file to the server, and the server needs to check whether the CSV file is encoded as UTF-8. If so, it needs to inform the user that they uploaded a file with the wrong encoding. The problem is how to detect whether the uploaded file is UTF-8 encoded. The back end is written in Java. Does anyone have a suggestion?
At least in the general case, there's no way to be certain what encoding is used for a file; the best you can do is a reasonable guess based on heuristics. You can eliminate some possibilities, but at best you're narrowing down the options without confirming any single one. For example, most of the ISO 8859 variants allow any byte value (or pattern of byte values), so almost any content could be encoded in almost any ISO 8859 variant (and I'm only saying "almost" out of caution, not out of any certainty that you could eliminate any of the possibilities).
You can, however, make some reasonable guesses. For example, if a file starts with the three bytes of a UTF-8 encoded BOM (EF BB BF), it's probably safe to assume it's really UTF-8. Likewise, if you see sequences like 110xxxxx 10xxxxxx, it's a pretty fair guess that what you're seeing is encoded with UTF-8. You can eliminate the possibility that something is (correctly) UTF-8 encoded if you ever see a sequence like 110xxxxx 110xxxxx (110xxxxx is a lead byte, which must be followed by a continuation byte, not another lead byte, in properly encoded UTF-8).
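These checks can be automated with the JDK's strict UTF-8 decoder; a sketch (note that passing the check only means the bytes are valid UTF-8, not that the file was intended as UTF-8, since plain ASCII also passes):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the bytes form a valid UTF-8 stream. A strict
    // decoder throws on any malformed sequence instead of substituting
    // a replacement character.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // C3 A9 is "é" in UTF-8: a lead byte followed by a continuation byte.
        System.out.println(isValidUtf8(new byte[]{ (byte) 0xC3, (byte) 0xA9 })); // true
        // Two lead bytes in a row can never occur in well-formed UTF-8.
        System.out.println(isValidUtf8(new byte[]{ (byte) 0xC3, (byte) 0xC3 })); // false
    }
}
```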
You can try to guess the encoding using a third-party library, for example: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
Well, you can't. You could show a kind of "preview" (or should I say review?) with some sample data from the file so the user can check whether it looks okay, perhaps with the option of selecting different encodings to help determine the correct one.