how to determine text encoding

how to determine text encoding - java

I know a UTF file can have a BOM that identifies its encoding, but what about other encodings that give no such clue? How can I guess those?
I am a new Java programmer.
I have written code that guesses UTF encodings from the BOM, but I am stuck on the other encodings. How do I guess them?
Can anybody help me?
Thanks in advance.

This question is a duplicate of several previous ones. There are at least two libraries for Java that attempt to guess the encoding (although keep in mind that there is no way to guess right 100% of the time).
GuessEncoding
jchardet (Java port of the algorithm used by Mozilla Firefox)
Of course, if you know the encoding will only be one of three or four options, you might be able to write a more accurate guessing algorithm.

Short answer is: you cannot.
Even in UTF-8, the BOM is entirely optional, and it's often recommended not to use it, since many apps do not handle it properly and display it as if it were a printable char. The original purpose of the byte order mark was to indicate the endianness of UTF-16 files.
That said, most apps that handle Unicode implement some sort of guessing algorithm: they read the beginning of the file and look for certain signatures.
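The signature check mentioned above can be sketched roughly as follows. This is a minimal illustration, not a complete detector: the class name BomSniffer is hypothetical, only the UTF-8 and UTF-16 byte order marks are recognized, and a UTF-32LE file (whose BOM starts with FF FE) would be misreported as UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper: maps a leading byte order mark to a charset, if one is present. */
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the charset implied by a BOM at the start of head, or empty if there is none. */
    static Optional<Charset> detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        // No BOM: the encoding is still unknown, it is NOT proven to be non-Unicode.
        return Optional.empty();
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

An empty result only means no BOM was found; the file may still be UTF-8 or UTF-16 without one.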

If you don't know the encoding and don't have any indicators (like a BOM), it's not always possible to accurately "guess" the encoding. Some pointers exist that can give you hints.
For example, an ISO-8859-1 file will (usually) not have any 0x00 chars, whereas a UTF-16 file that contains mostly Latin text has loads of them.
The most common solution is to let the user select the encoding if you cannot detect it.
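As a rough sketch of the 0x00 hint above, one could measure how many bytes in a sample are zero. The class name NulByteHeuristic and the 20% threshold are assumptions chosen for illustration; UTF-16 text in scripts outside the Latin range contains few zero bytes and would not be caught this way.

```java
/** Hypothetical heuristic: many 0x00 bytes suggest UTF-16 rather than a single-byte encoding. */
final class NulByteHeuristic {
    private NulByteHeuristic() {}

    /** Fraction of bytes in the sample that are 0x00 (0.0 for an empty sample). */
    static double nulRatio(byte[] sample) {
        if (sample.length == 0) {
            return 0.0;
        }
        int nuls = 0;
        for (byte b : sample) {
            if (b == 0) {
                nuls++;
            }
        }
        return (double) nuls / sample.length;
    }

    /** The 0.2 threshold is an arbitrary assumption, not a established constant. */
    static boolean looksLikeUtf16(byte[] sample) {
        return sample.length >= 2 && nulRatio(sample) > 0.2;
    }
}
```

A positive result is only a hint; a binary file also contains many zero bytes.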

Related

Is there a way to find file encoding type (UTF-8 or ANSI or Cp1252 or others) using java

I have to read a few HTML files. If I use UTF-8 as the charset for reading and writing a file, some junk characters are displayed in the HTML page. It seems the actual file is ANSI encoded; since I am using UTF-8 for reading and writing the file, a few white spaces are displayed as a black diamond with a question mark.
Is there a way to find the encoding/charset to be used to read/write a particular file?

No, that's mathematically impossible. Files are just bags of bytes, and most encodings are such that any byte has meaning. Short of using an artificial intelligence getup that analyses how likely it is (looking for words that mix characters from different unicode planes and the like) that you read it using the right encoding, there is therefore no way to be sure.
Some files can be conclusively determined not to be UTF-8 (or to be corrupted), because certain byte sequences can never appear in the bytestream that results from UTF-8 encoding characters. However, this isn't very useful either: you cannot conclude "Oh! Must be UTF-8!" from the absence of these invalid sequences.
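To illustrate the "definitely not UTF-8" case, here is a minimal sketch using the standard CharsetDecoder with error reporting switched on. The class name Utf8Check is hypothetical. As the text says, a true result does not prove the data is UTF-8; pure ASCII, for instance, always passes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reports whether a byte array decodes as UTF-8 without errors. */
final class Utf8Check {
    private Utf8Check() {}

    static boolean isValidUtf8(byte[] data) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A false result rules UTF-8 out (or indicates corruption); a true result only means UTF-8 is still a candidate.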
You have some options
The right way
The encoding was either chosen or known at the moment those HTML files were saved. It was chosen if the HTML was received from the webserver, decoded from bytes to chars in browser memory using the charset listed in the HTTP response header 'Content-Type', and then saved to a file by the browser, which at that point had to pick an encoding. It was known if the tool saved the HTML 'raw', exactly as it was sent over the HTTP connection, because the HTTP server sent the encoding in the 'Content-Type' header. Either way, that was the perfect time to store this information, or to choose a well-known encoding (UTF-8 is a good idea).
So, go back to whichever software and/or process managed to save these files and fix it at the source: Either also save the encoding, or, ensure that the HTML file is saved in UTF-8 no matter what the HTTP server you got this HTML from sent it as.
The hacky way
Grab a magnifying glass, put on your finest hat, and get your Sherlock Holmes on.
The usual strategy is to open a hex editor, go to the position in the file where you see diamonds or unexpected characters, and look at the byte sequence. Especially if it is a somewhat 'well known' western non-ASCII character like é or ö, odds are that a web search for the byte(s) you see there will turn it up. Look for bytes with decimal value 128 or higher (in hex, the ones that start with 8, 9, or a letter), because the ones below that are ASCII and almost all encodings encode those the same way, so they are not useful for telling encodings apart.
For example, if you search for 0xE1 0xBA 0x9E, the first hit leads you to this page; scrolling down to 0xE1 0xBA 0x9E, it says: that's the UTF-8 encoding of code point U+1E9E, the capital sharp s (ẞ, the uppercase form of the ß that is common in German). If that makes sense in the text, we figured it out. We will need an AI to do text analysis to figure out if it makes sense. I don't have one, so we'll need an artificial artificial intelligence. In other words, your brain will have to do the job. Just look at it: if, after substituting an ẞ, the text says Last Name: BOẞLER, you obviously got it: Boßler is a German last name, as well as a mountain in Germany. Web searching again to the rescue if you are not sure.
Sometimes you have to figure out what character it was supposed to be, and include this in the search. For example, if you check the file and you see a 0xDF and you know a ß has to be there, search for 0xDF ß and you get to this page, which shows a ton of encodings and how they store ß. Only a few store it as 0xDF: it's some ISO-8859 variant or a Cp-125x variant (a.k.a. windows-125x), and you've managed to exclude IBM852. There's no way to know which ISO-8859 or Cp-125x variant it actually is; you'll need more weird characters and hope you hit one where you know what it is supposed to be and which is encoded differently between them (unlikely; they are very similar).
Most likely in the end you end up knowing that it is one of a few encodings, because usually there are multiple encodings that would all result in the exact same byte sequence. In fact, if you have all-ASCII characters, there are thousands of encodings that it could be.
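The manual lookup described above can also be tried in code: decode the suspicious bytes with each candidate charset and keep the ones that produce the character you expect. This is only a sketch; the class name CharsetCandidates is hypothetical, and which charset names are available depends on the JDK in use, so unsupported names are skipped.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds which candidate charsets decode some bytes to an expected string. */
final class CharsetCandidates {
    private CharsetCandidates() {}

    static List<String> matching(byte[] bytes, String expected, String... charsetNames) {
        List<String> hits = new ArrayList<>();
        for (String name : charsetNames) {
            if (!Charset.isSupported(name)) {
                continue; // this JDK does not ship the charset; skip it
            }
            // Invalid input decodes to U+FFFD, which will not equal the expected text.
            String decoded = new String(bytes, Charset.forName(name));
            if (decoded.equals(expected)) {
                hits.add(name);
            }
        }
        return hits;
    }
}
```

For the 0xDF example, calling matching(new byte[] {(byte) 0xDF}, "\u00DF", "ISO-8859-1", "windows-1252", "IBM852", "UTF-8") would be expected to keep the ISO-8859 and windows-125x candidates and drop the other two, which narrows the field without picking a single winner.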

Set unicode into clipboard

I am trying to control several aspects of the clipboard contents when I set them. From what I understand, I need to use DataFlavors, and I have done some reading on Oracle's site about them, but I am not sure if or how it is possible to set Unicode and other such formats (XML?).

DocFlavor.CHAR_ARRAY should do. This is Unicode in the form of UTF-16, which should correspond to wide chars on Windows. The problem is probably the single-byte OEM character set that is the default.
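As an alternative sketch that stays within the java.awt.datatransfer API the question mentions (rather than the DocFlavor route suggested above), a Java String can be wrapped in a StringSelection, which offers it under DataFlavor.stringFlavor. Java strings are UTF-16 internally, so no separate encoding step is needed. The class name ClipboardText and its methods are hypothetical, and the system clipboard call appears only in a comment because it needs a desktop session.

```java
import java.awt.datatransfer.DataFlavor;
import java.awt.datatransfer.StringSelection;
import java.awt.datatransfer.Transferable;
import java.awt.datatransfer.UnsupportedFlavorException;
import java.io.IOException;

/** Hypothetical helper: wraps Unicode text for the clipboard and reads it back. */
final class ClipboardText {
    private ClipboardText() {}

    /** Wraps the text in a Transferable that exposes it as a Java String. */
    static Transferable asTransferable(String text) {
        // On a desktop session this could be published with:
        // Toolkit.getDefaultToolkit().getSystemClipboard().setContents(transferable, null);
        return new StringSelection(text);
    }

    /** Reads the text back through the string flavor. */
    static String readBack(Transferable transferable) {
        try {
            return (String) transferable.getTransferData(DataFlavor.stringFlavor);
        } catch (UnsupportedFlavorException | IOException e) {
            throw new IllegalStateException("string flavor not available", e);
        }
    }
}
```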

Encoding problems exporting file

I'm trying to find out what has happened in an integration project. We just can't get the encoding right at the end.
A Lithuanian file was imported to the as400, where text is stored in EBCDIC. The data is then exported to an ANSI file and read as windows-1257. ASCII characters work fine, and so do some Lithuanian ones, but the rest looks like crap, with chars like ~, ¶ and ].
Example string going through the pipe
Start file
Tuskulënö
as400
Tuskulënö
EAA9A9596
34224335A
exported file (after conversion to windows-1257)
Tuskulėnö
expected result for exported file
Tuskulėnų
Any ideas?
Regards,
Karl

EBCDIC isn't a single encoding, it's a family of encodings (in this case called codepages), similar to how ISO-8859-* is a family of encodings: the encodings within the families share about half the codes for "basic" letters (roughly what is present in ASCII) and differ on the other half.
So if you say that it's stored in EBCDIC, you need to tell us which codepage is used.
A similar problem exists with ANSI: when used for an encoding it refers to a Windows default encoding. Unfortunately the default encoding of a Windows installation can vary based on the locale configured.
So again: you need to find out which actual encoding is used here (these are usually from the Windows-* family; the "normal" English one is Windows-1252).
Once you actually know what encoding you have and want at each point, you can go towards the second step: fixing it.
My personal preference for this kind of problem is this: have only one step where encodings are converted. Take whatever the initial tool produces and convert it to UTF-8 in the first step. From then on, always use UTF-8 to handle that data. If necessary, convert UTF-8 to some other encoding in the last step (but avoid this if possible).
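The "convert once, at the first step" approach could look roughly like this. It is a sketch under one important assumption: the source charset is already known, for example the specific EBCDIC code page in use. The class name ToUtf8 is hypothetical, and whether a given EBCDIC charset is available depends on the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: converts bytes from a known source charset to UTF-8 in one step. */
final class ToUtf8 {
    private ToUtf8() {}

    static byte[] transcode(byte[] source, Charset sourceCharset) {
        // Decode with the charset the data was really written in...
        String text = new String(source, sourceCharset);
        // ...then encode once to UTF-8 and use only UTF-8 from here on.
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

If the wrong source charset is passed in, the output is still valid UTF-8 but contains the wrong characters, which is why identifying the code page has to come first.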

Decoding Java's JSON Unicode values with PHP

In the past I have seen different JSON-encoded values for the same string depending on the language used. Since the APIs were used in a closed environment (no 3rd parties allowed), we made a compromise and all our Java applications manually encode Unicode characters. LinkedIn's API is returning "corrupted" values, basically the same as our Java applications. I've already posted a question on their forum; the reason I am asking it here as well is quite simple: sharing is caring :) This question is therefore partially connected with LinkedIn, but mostly tries to find an answer to the general encoding problem described below.
As you can see, my last name contains a letter ž, which should be \u017e but Java (or LinkedIn's API for that matter) returns \u009e with JSON and nothing with XML response. PHP's json_decode() ignores it and my last name becomes Kurida.
After an investigation, I've found ž apparently has two representations, 9e and 17e. What exactly is going on here? Is there a solution for this problem?

U+009E is a usually-invisible control character and not an acceptable alternative representation for ž.
The byte 0x9E represents the character ž in Windows code page 1252. That byte, if decoded using ISO-8859-1, would turn into U+009E.
(The confusion comes from the fact that if you write &#x9E; in an HTML page, the browser doesn't actually give you character U+009E, as you might expect, but converts it to U+017E. The same is true of all the character references 0080–009F: they get changed as if the numbers referred to cp1252 bytes instead of Unicode characters. This is utterly bizarre and wrong behaviour, but all the major browsers do it so we're stuck with it now. Except in proper XHTML served as XML, since that has to follow the more sensible XML rules.)
Looking at the forum page, the JSON-reading is clearly not wrong: your name is registered as being “David Kurid[U+009E]a”. However that data has got into their system needs looking at.
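The mix-up described above can be reproduced in a few lines: decode the single byte 0x9E once as windows-1252 and once as ISO-8859-1 and compare the resulting code points. The class name Byte9E is hypothetical, and the sketch assumes the JDK provides the windows-1252 charset, which mainstream JDKs do.

```java
import java.nio.charset.Charset;

/** Hypothetical demo: the same byte maps to different code points in different charsets. */
final class Byte9E {
    private Byte9E() {}

    /** Decodes one byte with the given charset and returns the resulting code point. */
    static int decodeSingleByte(int unsignedByte, Charset charset) {
        return new String(new byte[] {(byte) unsignedByte}, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        int asCp1252 = decodeSingleByte(0x9E, Charset.forName("windows-1252"));
        int asLatin1 = decodeSingleByte(0x9E, Charset.forName("ISO-8859-1"));
        System.out.printf("windows-1252: U+%04X%n", asCp1252);
        System.out.printf("ISO-8859-1:   U+%04X%n", asLatin1);
    }
}
```

Data that was written as windows-1252 but read as ISO-8859-1 therefore ends up with the control character instead of the letter.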

Get file's encoding in Java [duplicate]

Possible Duplicate:
Java : How to determine the correct charset encoding of a stream
A user will upload a CSV file to the server, and the server needs to check whether the CSV file is encoded as UTF-8. If not, it needs to inform the user that (s)he uploaded a file with the wrong encoding. The problem is how to detect whether the file the user uploaded is UTF-8 encoded. The back end is written in Java. Does anyone have a suggestion?

At least in the general case, there's no way to be certain what encoding is used for a file -- the best you can do is a reasonable guess based on heuristics. You can eliminate some possibilities, but at best you're narrowing down the possibilities without confirming any one. For example, most of the ISO 8859 variants allow any byte value (or pattern of byte values), so almost any content could be encoded with almost any ISO 8859 variant (and I'm only using "almost" out of caution, not any certainty that you could eliminate any of the possibilities).
You can, however, make some reasonable guesses. For example, if a file starts out with the three bytes of a UTF-8 encoded BOM (EF BB BF), it's probably safe to assume it's really UTF-8. Likewise, if you see sequences like 110xxxxx 10xxxxxx, it's a pretty fair guess that what you're seeing is encoded with UTF-8. You can eliminate the possibility that something is (correctly) UTF-8 encoded if you ever see a sequence like 110xxxxx 110xxxxx. (110xxxxx is a lead byte of a sequence, which must be followed by a non-lead byte, not another lead byte, in properly encoded UTF-8.)
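The lead-byte and continuation-byte patterns above can be checked by hand, as in the following sketch. The class name Utf8Structure is hypothetical. It checks only the bit structure; it does not reject overlong encodings or surrogate code points, so it is looser than a full UTF-8 validator.

```java
/** Hypothetical helper: checks that lead bytes are followed by the right number of continuation bytes. */
final class Utf8Structure {
    private Utf8Structure() {}

    static boolean hasValidStructure(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            int continuationBytes;
            if ((b & 0x80) == 0x00) {
                continuationBytes = 0;          // 0xxxxxxx: ASCII
            } else if ((b & 0xE0) == 0xC0) {
                continuationBytes = 1;          // 110xxxxx: lead byte of a 2-byte sequence
            } else if ((b & 0xF0) == 0xE0) {
                continuationBytes = 2;          // 1110xxxx: lead byte of a 3-byte sequence
            } else if ((b & 0xF8) == 0xF0) {
                continuationBytes = 3;          // 11110xxx: lead byte of a 4-byte sequence
            } else {
                return false;                   // stray continuation byte or invalid lead byte
            }
            i++;
            for (int k = 0; k < continuationBytes; k++, i++) {
                // Every continuation byte must look like 10xxxxxx.
                if (i >= data.length || (data[i] & 0xC0) != 0x80) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

For the upload check in the question, a false result is a solid reason to reject the file, while a true result still leaves room for doubt if the file contains only ASCII.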

You can try and guess the encoding using a 3rd party library, for example: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

Well, you can't. You could show kind of a "preview" (or should I say review?) with some sample data from the file so the user can check if it looks okay. Perhaps with the possibility of selecting different encoding options to help determine the correct one.
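One way the preview idea might be sketched: decode the same sample bytes under each encoding the user can choose from and show the results side by side. The class name EncodingPreview is hypothetical, and the list of charsets to offer is left to the caller.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: decodes one sample under several charsets for a user-facing preview. */
final class EncodingPreview {
    private EncodingPreview() {}

    /** Maps each supported charset name to the sample decoded with it, in the order given. */
    static Map<String, String> previews(byte[] sample, String... charsetNames) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (Charset.isSupported(name)) {
                result.put(name, new String(sample, Charset.forName(name)));
            }
        }
        return result;
    }
}
```

The user then picks the row that reads correctly, which puts the final decision with a human rather than a heuristic.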
