If decoding failed with an exception, we could try to detect a file's encoding by trying candidate charsets one by one.
But I haven't found a way to get Java to throw something like Python's 'UnicodeDecodeError'. Is there a specific reason for that?
PS: the decoding process genuinely fails when some byte sequence maps to no defined character, since most encoding schemes leave some code points unassigned.
PPS: I am asking this because I think it is a design question; I am not stuck on a particular encoding problem. But when I wanted to write code to auto-detect a file's encoding, the way Vim (the text editor) does, I found that this design makes it hard.
A sequence of bytes only makes sense to you as a String if it makes sense as a character stream that is relevant to your use case.
What do you expect Java to do when the interpretation does not suit your use case?
You will see "garbage" output, but the decoding didn't technically fail, did it? So it can't really throw an exception.
The encoding you specified is probably just not a compatible one.
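If you do want a hard failure like Python's, you can bypass the lenient String constructor and use a CharsetDecoder configured to report errors instead of substituting the replacement character. A minimal sketch (the class name and sample bytes are just for illustration):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Decodes bytes as UTF-8, throwing instead of silently substituting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] bad = {(byte) 0xC3, (byte) 0x28}; // 0xC3 lead byte not followed by a continuation byte
        try {
            System.out.println(decodeStrict(bad));
        } catch (CharacterCodingException e) {
            // the closest Java equivalent of Python's UnicodeDecodeError
            System.out.println("Decoding failed: " + e);
        }
    }
}

So the capability is there; it is just not the default behaviour of new String(byte[], Charset), which silently replaces malformed input.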
A user will upload a CSV file to the server, and the server needs to check whether the CSV file is encoded as UTF-8; if so, the user needs to be informed that (s)he uploaded a file with the wrong encoding. The problem is: how can the server detect whether the uploaded file is UTF-8 encoded? The back end is written in Java. Does anyone have a suggestion?
At least in the general case, there's no way to be certain what encoding is used for a file -- the best you can do is a reasonable guess based on heuristics. You can eliminate some possibilities, but at best you're narrowing down the possibilities without confirming any one. For example, most of the ISO 8859 variants allow any byte value (or pattern of byte values), so almost any content could be encoded with almost any ISO 8859 variant (and I'm only using "almost" out of caution, not any certainty that you could eliminate any of the possibilities).
You can, however, make some reasonable guesses. For example, if a file starts out with the three bytes of a UTF-8 encoded BOM (EF BB BF), it's probably safe to assume it's really UTF-8. Likewise, if you see sequences like 110xxxxx 10xxxxxx, it's a pretty fair guess that what you're seeing is encoded with UTF-8. You can eliminate the possibility that something is (correctly) UTF-8 encoded if you ever see a sequence like 110xxxxx 110xxxxx (110xxxxx is the lead byte of a sequence, which must be followed by a continuation byte, not another lead byte, in properly encoded UTF-8).
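A rough sketch of that kind of structural check, assuming the data (or at least a decent-sized prefix of it) fits in a byte array; it deliberately ignores finer points such as overlong encodings:

public class Utf8Heuristic {
    // Returns true if the bytes follow UTF-8's lead-byte/continuation-byte structure.
    static boolean isLikelyUtf8(byte[] data) {
        int i = 0;
        // Skip a UTF-8 BOM (EF BB BF) if present.
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            i = 3;
        }
        while (i < data.length) {
            int b = data[i] & 0xFF;
            int trailing;
            if (b < 0x80) trailing = 0;                // 0xxxxxxx: plain ASCII
            else if ((b & 0xE0) == 0xC0) trailing = 1; // 110xxxxx
            else if ((b & 0xF0) == 0xE0) trailing = 2; // 1110xxxx
            else if ((b & 0xF8) == 0xF0) trailing = 3; // 11110xxx
            else return false;                         // stray continuation byte or invalid lead byte
            for (int j = 1; j <= trailing; j++) {
                if (i + j >= data.length || (data[i + j] & 0xC0) != 0x80) {
                    return false;                      // lead byte not followed by 10xxxxxx
                }
            }
            i += trailing + 1;
        }
        return true;
    }
}

A file that passes this check could still be something else (pure ASCII passes trivially), so treat it as a guess, not a proof.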
You can try and guess the encoding using a 3rd party library, for example: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
Well, you can't. You could show kind of a "preview" (or should I say review?) with some sample data from the file so the user can check if it looks okay. Perhaps with the possibility of selecting different encoding options to help determine the correct one.
I know a UTF file can have a BOM that identifies its encoding, but what about encodings that leave no such clue?
I have no idea how to guess those encodings.
I am a new Java programmer.
I have written code that detects the UTF encodings using the UTF BOM,
but I have a problem with the other encodings. How do I guess them?
Can anybody help me?
Thanks in advance.
This question is a duplicate of several previous ones. There are at least two libraries for Java that attempt to guess the encoding (although keep in mind that there is no way to guess right 100% of the time).
GuessEncoding
jchardet (a Java port of the algorithm used by Mozilla Firefox)
Of course, if you know the encoding will only be one of three or four options, you might be able to write a more accurate guessing algorithm.
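For example, one hedged way to do that with nothing but the standard library is to try each candidate with a strict decoder and keep the first one that decodes the bytes cleanly (the candidate list here is purely an assumption):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CandidateGuess {
    // Returns the first candidate charset that decodes the data without error, or null.
    static Charset guess(byte[] data, Charset... candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data));
                return cs; // decoded cleanly under this charset
            } catch (CharacterCodingException e) {
                // not this one; try the next candidate
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}; // the Euro sign encoded as UTF-8
        System.out.println(guess(data, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}

Order matters: single-byte charsets such as ISO-8859-1 accept any byte sequence, so they should go last and act as the fallback.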
The short answer is: you cannot.
Even in UTF-8 the BOM is entirely optional, and it's often recommended not to use it, since many apps do not handle it properly and just display it as if it were a printable character. The original purpose of byte order marks was to indicate the endianness of UTF-16 files.
This said, most apps that handle Unicode implement some sort of guessing algorithm. Read the beginning of the file and look for certain signatures.
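A sketch of that kind of signature check, limited to the common Unicode BOMs (anything more, such as distinguishing UTF-32, is left out on purpose):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Returns the charset implied by a leading BOM, or null if no recognizable BOM is present.
    static Charset fromBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: fall back to other heuristics or a user choice
    }
}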
If you don't know the encoding and don't have any indicators (like a BOM), it's not always possible to accurately "guess" the encoding. Some pointers exist that can give you hints.
For example, an ISO-8859-1 file will (usually) not contain any 0x00 bytes, whereas a UTF-16 file will have loads of them (a crude check along these lines is sketched below).
The most common solution is to let the user select the encoding if you cannot detect it.
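The 0x00 hint above can be turned into a quick check; a rough sketch, where the 10% threshold is an arbitrary assumption rather than anything principled:

public class ZeroByteHint {
    // Returns true if the data contains enough 0x00 bytes to suggest UTF-16 text.
    static boolean looksLikeUtf16(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        int zeros = 0;
        for (byte b : data) {
            if (b == 0) {
                zeros++;
            }
        }
        // ISO-8859-1 or UTF-8 text almost never contains NUL bytes;
        // UTF-16 text made up of mostly Latin characters is roughly half NULs.
        return zeros > data.length / 10;
    }
}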
I am working on a client-server networking app in Java SE. I am using strings terminated by a newline from client to server, and the server responds with a null-terminated string.
In the output window of the NetBeans IDE I am finding some gibberish characters among the strings that I send and receive.
I can't figure out what these characters are; they mostly look like a rectangular box, and when I paste a line containing such a character into Notepad++, all the characters from that character onwards disappear.
How can I find out what characters are appearing in the output screen of the IDE?
If the response you are getting back from the server is supposed to be human-readable text, then this is probably a character encoding problem. For example, if the client and server are both written in Java, it is likely that they are using/assuming different character encodings for the text. (It is also possible that the response is not supposed to be human-readable text. In that case, the client should not be trying to interpret it as text ... so this question is moot.)
You typically see boxes (splats) when a program tries to render some character code that it does not understand. This may be a real character (e.g. a Japanese character, mathematical symbol or the like), or it could be an artifact caused by a mismatch between the character sets used for encoding and decoding the text.
To try and figure out what is going on, try modifying your client-side code to read the response as bytes rather than characters, and then output the bytes to the console in hexadecimal. Then post those bytes in your question, together with the displayed characters that you currently see.
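A minimal way to do that kind of hex dump, assuming you have access to the raw InputStream from the socket:

import java.io.IOException;
import java.io.InputStream;

public class HexDump {
    // Reads the response as raw bytes and prints each one in hexadecimal, 16 per line.
    static void dump(InputStream in) throws IOException {
        int b;
        int count = 0;
        while ((b = in.read()) != -1) {
            System.out.printf("%02X ", b);
            if (++count % 16 == 0) {
                System.out.println();
            }
        }
        System.out.println();
    }
}

Comparing the hex values against a table for the encoding you expect usually makes it obvious whether the bytes are text in another encoding or not text at all.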
If you understand character set naming and have some ideas what the character sets are likely to be, the UNIX / Linux iconv utility may be helpful. Emacs has extensive support for displaying / editing files in a wide range of character encodings. (Even plain old Wordpad can help, if this is just a problem with line termination sequences; e.g. "\n" versus "\r\n" versus "\n\r".)
(I'd avoid trying to diagnose this by copy and pasting. The copy and paste process may itself "mangle" the data, causing you further confusion.)
This is probably just binary data. Most of it will look like gibberish when interpreted as ASCII. Make sure you are writing the exact number of bytes you read to the socket, and not some nice round number like 4096. It would be best if you could post your code so we can help you find the error(s).
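In other words, when copying through a buffer, only write the bytes you actually read. A sketch of the usual loop, assuming plain streams on both ends:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyExact {
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            // Write only the n bytes that were read, not the whole 4096-byte buffer;
            // writing the full buffer sends stale/garbage bytes after the real data.
            out.write(buf, 0, n);
        }
        out.flush();
    }
}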
I've written a little application that does some text manipulation and writes the output to a file (HTML, CSV, DOCX, XML), and this all appears to work fine on Mac OS X. On Windows, however, I seem to get character encoding problems: a lot of the '"' characters disappear or are replaced with some weird stuff, usually the closing '"' of a pair.
I use FreeMarker to create my output files, and there is a byte[] array, and in one case also a ByteArrayStream, between reading the templates and writing the output. I assume this is a character encoding problem, so could someone give me advice or point me to a 'Best Practice' resource for dealing with character encoding in Java?
Thanks
There's really only one best practice: be aware that Strings and bytes are two fundamentally different things, and that whenever you convert between them, you are using a character encoding (either implicitly or explicitly), which you need to pay attention to.
Typical problematic spots in the Java API are:
new String(byte[])
String.getBytes()
FileReader, FileWriter
All of these implicitly use the platform default encoding, which depends on the OS and the user's locale settings. Usually, it's a good idea to avoid this and explicitly declare an encoding in the above cases (which FileReader/Writer unfortunately don't allow, so you have to use an InputStreamReader/Writer).
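A sketch of the explicit-charset versions of those calls (UTF-8 and the file name are just example choices):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsets {
    public static void main(String[] args) throws IOException {
        // Instead of String.getBytes() and new String(byte[]): name the charset.
        byte[] bytes = "h\u00E9llo".getBytes(StandardCharsets.UTF_8);
        String text = new String(bytes, StandardCharsets.UTF_8);

        // Instead of FileWriter/FileReader: wrap the streams and name the charset.
        try (Writer w = new OutputStreamWriter(new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
            w.write(text);
        }
        try (Reader r = new InputStreamReader(new FileInputStream("out.txt"), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                System.out.print((char) c);
            }
        }
    }
}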
However, your problems with the quotation marks and your use of a template engine may have a much simpler explanation. What program are you using to write your templates? It sounds like it's one that inserts "smart quotes", which are part of the Windows-specific cp1252 (windows-1252) encoding but don't exist in the more widely supported ISO-8859-1 encoding.
What you probably need to do is be aware of which encoding your templates are saved in, and configure your template engine to use that encoding when reading the templates. Also be aware that some text files, specifically XML, explicitly declare the encoding in a header, and if that header disagrees with the actual encoding used by the file, you'll invariably run into problems.
You can control which encoding your JVM will run with by supplying, for example,
-Dfile.encoding=utf-8
(for UTF-8, of course) as an argument to the JVM. Then you should get predictable results on all platforms. Example:
java -Dfile.encoding=utf-8 my.MainClass
Running the JVM with a 'standard' encoding via the confusingly named -Dfile.encoding will resolve a lot of problems.
Ensuring your app doesn't make use of byte[] <-> String conversions without encoding specified is important, since sometimes you can't enforce the VM encoding (e.g. if you have an app server used by multiple applications)
If you're confused by the whole encoding issue, or want to revise your knowledge, Joel Spolsky wrote a great article on this.
I had to make sure that the OutputStreamWriter uses the correct encoding:
OutputStream out = ...
OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
template.process(model, writer);
Plus if you use a ByteArrayOutputStream also make sure to call toString with the correct encoding:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
...
baos.toString("UTF-8");
One of our providers sometimes sends XML feeds that are tagged as UTF-8 encoded documents but include byte sequences that are not valid UTF-8. This causes the parser to throw an exception and stop building the DOM object when these bytes are encountered:
DocumentBuilder.parse(ByteArrayInputStream bais)
throws the following exception:
org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.
Is there a way to "capture" these problems early and avoid the exception (i.e. finding and removing those characters from the stream)? What I'm looking for is a "best effort" type of fallback for wrongly encoded documents. The correct solution would obviously be to attack the problem at the source and make sure that only correct documents are delivered, but what is a good approach when that is not possible?
If the problem truly is the wrong encoding (as opposed to mixed encodings), you don't need to re-encode the document to parse it. Just parse it from a Reader instead of an InputStream and the DOM parser will ignore the encoding declared in the XML header:
DocumentBuilder.parse(new InputSource(new InputStreamReader(inputStream, "<real encoding>")));
You should manually take a look at the invalid documents and see what problem they have in common. It's quite probable that they are in fact in another encoding (most probably windows-1252), and the best solution would then be to take every document from the broken system and re-encode it to UTF-8 before parsing.
Another possible cause is mixed encodings (the content of some elements is in one encoding and the content of other elements is in another encoding). That would be harder to fix.
You would also need a way to know when the broken system gets fixed so you can stop using your workaround.
You should tell them to send you correct UTF-8. Failing that, any solution should re-encode the bad characters as valid UTF-8 and then pass the result to the parser. The reason is that if the bad bytes are preserved, different programs might interpret the output in different ways, which can lead to security holes.
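A hedged sketch of that kind of "best effort" clean-up using only the standard library: decode with CodingErrorAction.REPLACE so that invalid bytes become U+FFFD, then hand the resulting characters to the parser. The replacement characters do end up in the document, so this is strictly a fallback:

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LenientXmlParse {
    // Parses XML that claims to be UTF-8 but may contain invalid byte sequences;
    // malformed input is replaced with U+FFFD instead of aborting the parse.
    static Document parseLeniently(byte[] xmlBytes) throws Exception {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), decoder);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Passing a Reader means the parser uses our already-decoded characters
        // and ignores the byte-level encoding declared in the XML header.
        return builder.parse(new InputSource(reader));
    }
}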