I'm trying to find out what has happened in an integration project. We just can't get the encoding right at the end.
A Lithuanian file was imported to the AS/400, where text is stored in EBCDIC. The data is then exported to an ANSI file and read as windows-1257. ASCII characters work fine, and so do some Lithuanian ones, but the rest look like crap, with chars like ~, ¶ and ].
Example string going through the pipe:
Start file
Tuskulënö
as400
Tuskulënö
EAA9A9596
34224335A
exported file (after conversion to windows-1257)
Tuskulėnö
expected result for exported file
Tuskulėnų
Any ideas?
Regards,
Karl
EBCDIC isn't a single encoding, it's a family of encodings (in this case called codepages), similar to how ISO-8859-* is a family of encodings: the encodings within the families share about half the codes for "basic" letters (roughly what is present in ASCII) and differ on the other half.
So if you say that it's stored in EBCDIC, you need to tell us which codepage is used.
A similar problem exists with ANSI: when used for an encoding it refers to a Windows default encoding. Unfortunately the default encoding of a Windows installation can vary based on the locale configured.
So again: you need to find out which actual encoding is used here (these are usually from the Windows-* family; the "normal" English one is Windows-1252).
Once you actually know what encoding you have and want at each point, you can go towards the second step: fixing it.
My personal preference for this kind of problem is this: have only one step where encodings are converted. Take whatever the initial tool produces and convert it to UTF-8 in the first step. From then on, always use UTF-8 to handle that data. If necessary, convert UTF-8 to some other encoding in the last step (but avoid this if possible).
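The "convert to UTF-8 once, as the first step" idea can be sketched in Java as follows. This is a minimal illustration, not the poster's actual pipeline: the class and method names are made up, and the source charset name is something you have to find out yourself. On a Baltic AS/400 it might be an EBCDIC codepage such as Cp1112, but that is only a guess until the system's actual codepage is confirmed, and the JVM must support whatever name you pass.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: one single place where a known source encoding becomes UTF-8.
final class ToUtf8 {
    // Decode the bytes with the charset they were actually written in,
    // then re-encode the resulting text as UTF-8.
    static byte[] toUtf8(byte[] input, Charset sourceCharset) {
        String text = new String(input, sourceCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Convenience overload taking a charset name. Charset.forName throws
    // if the name is illegal or not supported by this JVM, so a wrong
    // assumption fails loudly instead of silently mangling data.
    static byte[] toUtf8(byte[] input, String sourceCharsetName) {
        return toUtf8(input, Charset.forName(sourceCharsetName));
    }
}
```

Calling `ToUtf8.toUtf8(bytes, "ISO-8859-1")` on the single byte 0xE9 (é) yields the two UTF-8 bytes 0xC3 0xA9. The important design point is that the source charset is an explicit argument rather than the platform default.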
I have to read a few HTML files. If I use UTF-8 as the charset for reading and writing a file, some junk characters are displayed in the HTML page. It seems the actual file is ANSI encoded; since I am using UTF-8 for reading and writing the file, a few white spaces are displayed as a black diamond with a question mark.
Is there a way to find the encoding/charset to be used to read/write a particular file?
No, that's mathematically impossible. Files are just bags of bytes, and most encodings are such that any byte has meaning. Short of using an artificial intelligence getup that analyses how likely it is (looking for words that mix characters from different unicode planes and the like) that you read it using the right encoding, there is therefore no way to be sure.
Some files can be conclusively determined to definitely not be UTF-8 (or to be corrupted), because there are certain byte sequences that cannot appear in the bytestream that results when you UTF-8 encode some characters. However, this isn't very useful either: you cannot conclude "Oh! Must be UTF-8!" based on the lack of these invalid sequences.
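As a rough sketch of that one-directional check in Java: a strict `CharsetDecoder` can prove that some bytes are not valid UTF-8, but a `true` result only means "could be UTF-8". The class and method names here are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns false if the bytes contain a sequence that can never occur
    // in UTF-8. Returns true if they decode cleanly, which does NOT prove
    // the data was meant to be UTF-8 (pure ASCII passes, for instance).
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The explicit `REPORT` actions matter: the convenience constructors such as `new String(bytes, charset)` silently replace bad input, which is exactly what produces the black diamonds.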
You have some options
The right way
When you saved those HTML files, that is when encoding was either chosen (the HTML was received from the webserver and loaded into browser memory, and has been decoded from bytes to chars using the charset listed in the HTTP response header 'Content-Type', then you asked the browser to save it to a file, at which point the browser needs to choose an encoding), or it was known (the tool used to save the HTML saves the HTML 'raw', straight as it was sent over the HTTP connection, but as part of doing this, this tool knows the encoding, as the HTTP server sent it in the 'Content-Type' header), and therefore that was the perfect time to store this information, or to choose a well known encoding (UTF-8 is a good idea).
So, go back to whichever software and/or process managed to save these files and fix it at the source: Either also save the encoding, or, ensure that the HTML file is saved in UTF-8 no matter what the HTTP server you got this HTML from sent it as.
The hacky way
Grab a magnifying glass, put on your finest hat, and get your Sherlock Holmes on.
The usual strategy is to open a hex editor, go to the position in the file where you see diamonds or unexpected characters, and look at the byte sequence. Especially if it is a somewhat 'well known' western non-ASCII character like é or ö, odds are that a web search for the byte(s) you see there will turn it up. Look for bytes with decimal value 128 or higher (in hex, the ones that start with 8, 9, or a letter), because the ones below that are ASCII and almost all encodings encode those the same way, so they are not useful for telling encodings apart.
For example, if you search for 0xE1 0xBA 0x9E the first hit leads you to this page, scrolling down to 0xe1 0xBA 0x9e it says: That's the UTF-8 version of codepoint 1E9E, the capital sharp s (ẞ, the uppercase form of the ß common in German). If that makes sense in the text, we figured it out. We will need an AI to do text analysis to figure out if it makes sense. I don't have one, so we'll need an artificial artificial intelligence. In other words, your brain will have to do the job. Just look at it: if, after substituting an ẞ, the text says Last Name: BOẞLER, you obviously got it - Boßler is a German last name, as well as a mountain in Germany. Web searching again to the rescue if you are not sure.
Sometimes you have to figure out what character it was supposed to be, and include this in the search. For example, if you check the file and you see a 0xDF and you know a ß has to be there, search for 0xDF ß and you get to this page which shows a ton of encodings and how they store ß. Only a few store it as 0xDF: it's some ISO-8859 variant, or a Cp-125x variant (a.k.a. windows-125x), and you've managed to exclude IBM852. There's no way to know which ISO-8859 or Cp-125x variant it actually is; you'll need more weird characters and hope you hit one where you know what it is supposed to be and these chars are encoded differently between them (unlikely; they are very similar).
Most likely in the end you end up knowing that it is one of a few encodings, because usually there are multiple encodings that would all result in the exact same byte sequence. In fact, if you have all-ASCII characters, there are thousands of encodings that it could be.
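Instead of web searching, the same detective work can be done locally by decoding the suspicious bytes with several candidate charsets and eyeballing the results. The following is a small sketch under that assumption; the class name and the choice of candidates are made up for illustration, and charsets the JVM does not know are simply skipped.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class CandidateDecoder {
    // Decodes the same bytes with every supported candidate charset and
    // returns charset name -> decoded text, in the order given.
    // A human still has to judge which result "makes sense".
    static Map<String, String> decodeWith(byte[] bytes, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (Charset.isSupported(name)) {
                results.put(name, new String(bytes, Charset.forName(name)));
            }
        }
        return results;
    }
}
```

For the byte 0xDF, `decodeWith(new byte[]{(byte) 0xDF}, "ISO-8859-1", "windows-1252", "IBM852")` should show ß for the first two names and a different character for IBM852 (where the JVM supports those charsets), which matches the "several candidates remain" outcome described above.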
We have some data sourced in Italy and being displayed from a server in Poland. We are getting some instances of character substitution. Specifically, the à (small letter A with a grave) is getting substituted with an ŕ (small letter R with an acute). We can see that the à is a 00E0 in the CP1252 Western European character set, and the ŕ is the same value in the CP1250 Eastern European character set, so we know this is a character set issue.
The page is being served by a Websphere app server using JSPs. I have an experimental page where I can reproduce the problem, and sort of fix it, but not in an acceptable manner.
If I set this in my JSP:
response.setContentType("text/html;charset=windows-1250");
The problem is reproduced and the R with acute is displayed.
To sort of fix the problem, on the browser, I change the encoding to "Western European" in IE or "Western Windows-1252" in Chrome.
So this would naturally lead me to believe that if I set "windows-1252" in the content type, it would fix the problem, but it does not. When I do that, the character is then displayed as a question mark.
I have played with all kinds of combinations of response.setContentType, response.setCharacterEncoding, response.setLocale, <meta http-equiv>, <meta charset> and most everything results in the ? showing. Only setting 1250 on the content type and then changing the encoding on the browser itself seems to fix the problem.
Any suggestions?
Thanks
First of all, each source must come with the character set it has been encoded with (i.e. you must know it), otherwise you won't know what character set to use when presenting that source, and your problem will arise with the next data source.
Secondly, if you can, you should ask your sources to move to utf-8, and have those providers re-write their content.
Having a common character set for all your sources is the best solution (and using utf-8 is the most compatible / standard-oriented way of doing it as of today). If you can't make them do the conversion, then, knowing the source encoding, you may try to convert the data content from the source charset to your charset using a converter (I haven't used any, so I can't give you any advice on this).
Finally, two notes:
1) there's no way to show two contents that use different character sets in a single web application (nor in a single web page), since - like you already found - you may only use one encoding at a time;
2) if your data content is strictly web-oriented, you may ask your sources to use html entities (but keep in mind that this could be a problem if then you'll present that content in e.g. PDF form).
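The à / ŕ / ? symptoms from the question can be reproduced in a few lines of Java. This is only one possible reading of what happens (the thread does not confirm it): if the Italian bytes were decoded as windows-1250 somewhere upstream, the application holds the text ŕ, and that character cannot be encoded in windows-1252, so it is written as a question mark. The class name below is made up for illustration.

```java
import java.nio.charset.Charset;

final class CharsetMismatch {
    static final Charset CP1250 = Charset.forName("windows-1250");
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Interpret a single byte under the given charset.
    static String decode(byte b, Charset charset) {
        return new String(new byte[]{b}, charset);
    }

    // Encode text with the given charset; characters the charset cannot
    // represent are replaced by its substitute byte (usually '?').
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }
}
```

Byte 0xE0 decodes to à (U+00E0) under windows-1252 and to ŕ (U+0155) under windows-1250. Encoding ŕ with windows-1252 gives the single byte 0x3F, the question mark. If this reading is right, the fix is to decode the source data with its real charset, in line with the advice above to know each source's encoding.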
A user will upload a CSV file to the server, and the server needs to check whether the CSV file is encoded as UTF-8. If so, it needs to inform the user that (s)he uploaded a file with the wrong encoding. The problem is how to detect whether the uploaded file is UTF-8 encoded. The back end is written in Java. Does anyone have a suggestion?
At least in the general case, there's no way to be certain what encoding is used for a file -- the best you can do is a reasonable guess based on heuristics. You can eliminate some possibilities, but at best you're narrowing down the possibilities without confirming any one. For example, most of the ISO 8859 variants allow any byte value (or pattern of byte values), so almost any content could be encoded with almost any ISO 8859 variant (and I'm only using "almost" out of caution, not any certainty that you could eliminate any of the possibilities).
You can, however, make some reasonable guesses. For example, if a file starts out with the three bytes of a UTF-8 encoded BOM (EF BB BF), it's probably safe to assume it's really UTF-8. Likewise, if you see sequences like 110xxxxx 10xxxxxx, it's a pretty fair guess that what you're seeing is encoded with UTF-8. You can eliminate the possibility that something is (correctly) UTF-8 encoded if you ever see a sequence like 110xxxxx 110xxxxx. (110xxxxx is a lead byte of a sequence, which must be followed by a non-lead byte, not another lead byte, in properly encoded UTF-8.)
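The bit patterns mentioned above translate directly into bit masks. The sketch below (names are illustrative) checks for the UTF-8 BOM and for a two-byte lead byte that is not followed by a continuation byte. It deliberately covers only the two-byte case from the text, so it is a way to rule UTF-8 out, not a full validator.

```java
final class Utf8Patterns {
    // True if the data starts with the UTF-8 encoded BOM: EF BB BF.
    static boolean hasUtf8Bom(byte[] d) {
        return d.length >= 3
                && (d[0] & 0xFF) == 0xEF
                && (d[1] & 0xFF) == 0xBB
                && (d[2] & 0xFF) == 0xBF;
    }

    // 110xxxxx: lead byte of a two-byte sequence.
    static boolean isTwoByteLead(byte b) {
        return (b & 0xE0) == 0xC0;
    }

    // 10xxxxxx: continuation (non-lead) byte.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // True if some 110xxxxx byte is not followed by a 10xxxxxx byte.
    // In that case the data cannot be correctly encoded UTF-8.
    static boolean hasBrokenTwoByteSequence(byte[] d) {
        for (int i = 0; i < d.length; i++) {
            if (isTwoByteLead(d[i]) && (i + 1 >= d.length || !isContinuation(d[i + 1]))) {
                return true;
            }
        }
        return false;
    }
}
```

A `false` from `hasBrokenTwoByteSequence` says nothing either way; it only means this particular kind of violation was not found.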
You can try and guess the encoding using a 3rd party library, for example: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
Well, you can't. You could show kind of a "preview" (or should I say review?) with some sample data from the file so the user can check if it looks okay. Perhaps with the possibility of selecting different encoding options to help determine the correct one.
I know a UTF file has a BOM for determining the encoding, but what about other encodings that give no clue? How do I guess those?
I am a new Java programmer.
I have written code for guessing UTF encodings using the UTF BOM,
but I have a problem with other encodings. How do I guess them?
Can anybody help me?
Thanks in advance.
This question is a duplicate of several previous ones. There are at least two libraries for Java that attempt to guess the encoding (although keep in mind that there is no way to guess right 100% of the time).
GuessEncoding
jchardet (Java port of the algorithm used by mozilla firefox)
Of course, if you know the encoding will only be one of three or four options, you might be able to write a more accurate guessing algorithm.
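If the encoding really is known to be one of a handful of options, a simple approach is to try them in order of strictness and take the first one that decodes without errors. This is a sketch of that idea, not what GuessEncoding or jchardet do internally; the class name is made up, and the order of candidates is the caller's responsibility (a charset such as ISO-8859-1 accepts every byte, so it only makes sense as the last fallback).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

final class CandidateGuess {
    // Returns the first candidate that decodes the data without any
    // malformed or unmappable input, or empty if none does.
    static Optional<Charset> firstStrictMatch(byte[] data, List<Charset> candidates) {
        for (Charset charset : candidates) {
            CharsetDecoder decoder = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(data));
                return Optional.of(charset);
            } catch (CharacterCodingException e) {
                // Not this one; try the next candidate.
            }
        }
        return Optional.empty();
    }
}
```

With candidates US-ASCII, UTF-8, ISO-8859-1 (in that order), pure ASCII data matches US-ASCII, the bytes C3 A9 match UTF-8, and a lone E9 falls through to ISO-8859-1.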
Short answer is: you cannot.
Even in UTF-8, the BOM is entirely optional, and it's often recommended not to use it since many apps do not handle it properly and just display it as if it were a printable char. The original purpose of the byte order mark was to indicate the endianness of UTF-16 files.
This said, most apps that handle Unicode implement some sort of guessing algorithm. Read the beginning of the file and look for certain signatures.
If you don't know the encoding and don't have any indicators (like a BOM), it's not always possible to accurately "guess" the encoding. Some pointers exist that can give you hints.
For example, an ISO-8859-1 file will (usually) not have any 0x00 chars, whereas a UTF-16 file has loads of them.
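The 0x00 hint is easy to measure. The sketch below (illustrative names; what ratio counts as "loads" is left to the caller) reports the share of NUL bytes and recognises the two UTF-16 byte order marks. Neither result is proof; both are hints to combine with other evidence.

```java
final class Utf16Hints {
    // Returns "UTF-16BE" or "UTF-16LE" if the data starts with the
    // corresponding BOM, otherwise null. Note that FF FE is also how a
    // UTF-32LE BOM (FF FE 00 00) begins, so this is a hint, not a verdict.
    static String utf16BomName(byte[] d) {
        if (d.length >= 2) {
            int b0 = d[0] & 0xFF;
            int b1 = d[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        }
        return null;
    }

    // Fraction of bytes that are 0x00 (0.0 for empty input). Mostly-Latin
    // text in UTF-16 comes out near 0.5; single-byte text is usually 0.0.
    static double nulRatio(byte[] d) {
        if (d.length == 0) return 0.0;
        int zeros = 0;
        for (byte b : d) {
            if (b == 0) zeros++;
        }
        return (double) zeros / d.length;
    }
}
```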
The most common solution is to let the user select the encoding if you cannot detect it.
I am working on a client-server networking app in Java SE. I am sending strings terminated by a newline from client to server, and the server responds with a null-terminated string.
In the output window of the Netbeans IDE I am finding some gibberish characters amongst the strings that I send and receive.
I can't figure out what these characters are. They mostly look like a rectangular box, and when I paste a line containing such a character into Notepad++, that character and all the characters following it disappear.
How can I find out what characters are appearing in the output screen of the IDE?
If the response you are getting back from the server is supposed to be human readable text, then this is probably a character encoding problem. For example, if the client and server are both written in Java, it is likely that they are using/assuming different character encodings for the text. (It is also possible that the response is not supposed to be human readable text. In that case, the client should not be trying to interpret it as text ... so this question is moot.)
You typically see boxes (splats) when a program tries to render some character code that it does not understand. This may be a real character (e.g. a Japanese character, mathematical symbol or the like) or it could be an artifact caused by a mismatch between the character sets used for encoding and decoding the text.
To try and figure out what is going on, try modifying your client-side code to read the response as bytes rather than characters, and then output the bytes to the console in hexadecimal. Then post those bytes in your question, together with the displayed characters that you currently see.
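A hex dump of the raw response takes only a few lines of Java. This is a minimal sketch with made-up names: it assumes the response bytes have already been read from the socket into a buffer, and `length` is the number of bytes actually read, not the buffer size.

```java
final class HexDump {
    // Formats the first `length` bytes as two-digit uppercase hex values
    // separated by spaces, e.g. "48 E9 00".
    static String toHex(byte[] data, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative byte values print as 80..FF.
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

Typical use would be something like `int n = in.read(buffer);` followed by `System.out.println(HexDump.toHex(buffer, n));` when `n` is positive. Any terminating null byte from the server then shows up as `00` instead of an unprintable box.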
If you understand character set naming and have some ideas what the character sets are likely to be, the UNIX / Linux iconv utility may be helpful. Emacs has extensive support for displaying / editing files in a wide range of character encodings. (Even plain old Wordpad can help, if this is just a problem with line termination sequences; e.g. "\n" versus "\r\n" versus "\n\r".)
(I'd avoid trying to diagnose this by copy and pasting. The copy and paste process may itself "mangle" the data, causing you further confusion.)
This is probably just binary data. Most of it will look like gibberish when interpreted as ASCII. Make sure you are writing the exact number of bytes to the socket, and not some nice round number like 4096. It would be best if you could post your code so we can help you find the error(s).