I finally wrote my little app. It's a desktop app, but it has an embedded web server. When I launch it from NetBeans, everything is OK. When I launch the dist jar, the character encoding in the GUI is correct, but the web server output is corrupted ("?" instead of national characters).
I use NetBeans 6.7.1, jdk1.6.0_16, the HTTP server from Java SE 6, and the Rome 1.0 library.
I haven't posted any source code because I have no idea which part would be relevant.
//edit:
The data are hardcoded in Strings. Those Strings are passed to Rome as arguments to create RSS nodes, Rome's RSS feeds are written to a String, and those Strings are then passed to the HttpHandler.
Check the encoding in the source files.
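Check any point where encoding/decoding is performed (often any place where String -> byte[] or byte[] -> String). Anything that converts bytes to Strings is performing a decoding operation (myEncoding -> UTF-16), and anything that converts Strings to bytes is performing the reverse encoding operation (UTF-16 -> myEncoding).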
Check that you are passing the appropriate encoding information to 3rd party libraries that perform encoding/decoding.
If generating XML, ensure that the header encoding matches the encoding used to write the bytes (<?xml version="1.0" encoding="UTF-8"?>).
If serving content over HTTP, ensure that the content type and charset header is correct (e.g. Content-Type: text/html; charset=utf-8). A charset is usually only applicable if serving a text MIME type (it is not applicable for application/rss+xml, for example). Check your MIME documentation.
This issue probably has nothing to do with NetBeans. Usually character encoding issues are due to not defining the character encoding somewhere, in which case the actual character encoding will be determined pretty much by luck.
For instance, Java Strings are UTF-16 internally, but the encoding used by Java Readers is determined by the platform default unless explicitly specified.
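As a rough illustration of the checklist above, here is a minimal sketch of an HttpHandler for the JDK's built-in com.sun.net.httpserver that names the charset explicitly at the String -> byte[] step and repeats it in the Content-Type header. The class name Utf8FeedHandler and the text/xml content type are assumptions made for this example, not something taken from the question. StandardCharsets and try-with-resources need Java 7 or later; on Java 6 you would use getBytes("UTF-8") and close the stream in a finally block.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical handler: serves one pre-built XML string as UTF-8.
public class Utf8FeedHandler implements HttpHandler {
    private final String feedXml;

    public Utf8FeedHandler(String feedXml) {
        this.feedXml = feedXml;
    }

    // Encode with an explicit charset instead of the platform default.
    static byte[] encode(String body) {
        return body.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        byte[] bytes = encode(feedXml);
        // The header must name the same charset that was used to produce the bytes.
        exchange.getResponseHeaders().set("Content-Type", "text/xml; charset=utf-8");
        // The length is the number of bytes, not the number of characters.
        exchange.sendResponseHeaders(200, bytes.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(bytes);
        }
    }
}
```

If the feed String itself carries an XML declaration, its encoding attribute should also say UTF-8 so that all three places agree.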
Related
I'm working on a big java web application in Eclipse, whose files have different encodings: some are in UTF-8, others in Cp1252, yet others are in ISO-8859-1 (with no distinction between JSP's or java source files, or CSS) — but I know the encoding of each file.
I'm converting the project to Maven, and this is a great occasion to turn all of them to UTF-8.
Of course I don't want to lose a single character (so fully automated conversions do not apply here).
How should I go about it? Is there a tool that can help me ensure I don't lose any special character?
The webapp is in Italian, so, especially in JSP's, there could be lots of accented letters (probably not everywhere HTML entities have been used).
The project is in Eclipse, but I can use an external editor if that could make the conversion easier.
It's very easy to write code to convert encodings - although I'd expect there are tools to do it anyway. Simply:
Create one FileInputStream to the existing file, and wrap it in an InputStreamReader with the appropriate encoding
Create one FileOutputStream to the new file, and wrap it in an OutputStreamWriter with the appropriate encoding
Loop over the reader, reading characters into a buffer and writing out the contents of that buffer (just as many characters as you read) until you've read the whole file
Close all resources (automatic with a try-with-resources block)
The first two steps are simpler with Files.newBufferedReader and Files.newBufferedWriter, too.
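The steps above can be sketched roughly as follows, using Files.newBufferedReader and Files.newBufferedWriter. The class and method names are made up for this example. As far as I know, both factory methods report malformed or unmappable input with an exception rather than silently substituting a replacement character, which is the behaviour you want when no character may be lost.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: copies a text file while changing its encoding.
public class EncodingConverter {
    public static void convert(Path source, Charset sourceCharset,
                               Path target, Charset targetCharset) throws IOException {
        // try-with-resources closes both streams, even if the copy fails.
        try (Reader in = Files.newBufferedReader(source, sourceCharset);
             Writer out = Files.newBufferedWriter(target, targetCharset)) {
            char[] buffer = new char[8192];
            int read;
            // Write exactly as many characters as were read on each pass.
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}
```

Write to a new file rather than overwriting the source, so the original can be compared against the result before it is replaced.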
Converting a single file can be done with the iconv tool (I used LibIconv for Windows).
It lets you specify the source and destination encodings, and warns when characters can't be converted.
I tried it with a couple of source files and all the accented letters were correctly converted from Cp1252 to UTF-8.
I'm using Spring MVC / Message to translate my application with Java properties files. All languages render correctly except Japanese and Chinese, which both appear as '?' question marks. The resulting page has a proper UTF-8 encoding. Do I need to install a language pack to see the characters in the browser, or am I running into some other encoding issue?
I'm using this declaration for charset
They appear in my IDE / Text editors correctly on the same machine.
Any help appreciated!
Does your response have the right Content-Type header set? For example:
Content-Type: text/html; charset=utf-8
You say that the page has a proper UTF-8 encoding, but it's worth verifying. Next I would check the encoding of the properties files themselves. They might not be saved in UTF-8.
Also, no, you don't need a language pack to see Chinese/Japanese characters in a browser. As a sanity check you could google "chinese newspaper" and make sure you can see other Chinese pages.
I am zipping a file whose name contains some special characters, like Péréquation LES HOPITAUX NEUFS.xls, to a different folder, say temp.
I am able to zip the file, but the problem is that the file name changes automatically to
P+¬r+¬quation LES HOPITAUX NEUFS.xls.
How can I support unicode characters for file names inside a zip archive?
It depends a little bit on what code you're using to create the archive. The old Java compression classes are not as flexible as you need.
You may use Apache Commons Compress. Michael Simons wrote this nice piece of code:
ZipArchiveOutputStream ostream = ...; // Your initialization code here
ostream.setEncoding("Cp437"); // This should handle your "special" characters
ostream.setFallbackToUTF8(true); // For "unknown" characters!
ostream.setUseLanguageEncodingFlag(true);
ostream.setCreateUnicodeExtraFields(
ZipArchiveOutputStream.UnicodeExtraFieldPolicy.NOT_ENCODEABLE);
If you're using Java 7 or later, the ZipOutputStream constructor finally accepts a Charset parameter (which can be UTF-8).
The big problem, anyway, is that many implementations don't understand Unicode encoding because original ZIP file format is ASCII and there is not an official standard for Unicode. See this post for further details.
The Zip specification (historically) does not specify what character encoding is to be used for the embedded file names and comments; the original IBM PC character encoding set, commonly referred to as IBM Code Page 437, is supposed to be the only encoding supported. The Jar specification, meanwhile, explicitly specifies UTF-8 as the encoding to encode and decode all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefore strictly followed the Jar specification and used UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/Zip files.
Consequence? the ZIP file created by "traditional" ZIP tool is not accessible for java.util.jar/zip based tool, and vice versa, if the file name contains characters that are not compatible between Cp437 (as an alternative, tools might simply use the default platform encoding) and UTF-8
For most European, you're "lucky":-) that you only need to avoid a "handful" of characters, such as the umlauts (OK, I'm just kidding), but for Japanese and Chinese, most of the characters are simply out of luck. This is why bug 4244499 had been the No.1 on the Top 25 Java Bugs for so many years. The bug is no longer on the list:-) it has been finally "fixed" in OpenJDK 7, b57. I still keep a snapshot as the record/kudo for myself:-)
The solution (I would rather say "solution" than "fix") in JDK7 b57 is to introduce a new set of ZipInputStream, ZipOutputStream and ZipFile constructors with a specific "charset" as the parameter, as shown below.
ZipFile(File, Charset)
ZipInputStream(InputStream, Charset)
ZipOutputStream(OutputStream, Charset)
With these new constructors, applications can now access those non-UTF-8 ZIP files via ZipInputStream or ZipFile objects created with the specific encoding, or create a Zip files encoded in non-UTF-8 via the new ZipOutputStream(os, charset) constructor, if necessary.
zip is a stripped-down version of the Jar tool with a "-encoding" option to support non-UTF8 encoding for entry name and comment, it can serve as a demo for how to use the new APIs (I used it as a unit test). I'm still debating with myself if it is a good idea to officially introduce "-encoding" into the Jar tool...
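A minimal sketch of the Charset-taking constructors listed above, for Java 7 or later. The class and method names are made up for this example; it writes a one-entry archive to memory and reads the entry name back, so the same charset is used on both sides.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper showing ZipOutputStream/ZipInputStream with an explicit charset.
public class ZipNameEncoding {
    // Creates an in-memory archive with a single entry whose name is encoded with nameCharset.
    static byte[] zipSingleEntry(String entryName, byte[] content, Charset nameCharset)
            throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content);
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Reads the name of the first entry, decoding it with nameCharset.
    static String firstEntryName(byte[] zipBytes, Charset nameCharset) throws IOException {
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry = zip.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }
}
```

Which charset to pass depends on the tool that will open the archive: UTF-8 suits other Java code, while an older unzip utility may expect Cp437 or the platform encoding, as the quoted text explains.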
I am able to have my application upload files via FTP using the FTPClient Java library.
(I happen to be uploading to an Oracle XML DB repository.)
Everything uploads fine unless the xml file has curly quotes in it. In which case I get the error:
LPX-00200: could not convert from encoding UTF-8 to UCS2
I can upload what I believe to be the same file using the Windows CMD line FTP tool. I am wondering if there is some encoding setting that the windows CMD line tool uses that maybe I need to set in my Java code.
Anyone know stuff about this? Thanks!!
I don't know that application but you could try to use -Dfile.encoding=UTF-8 on your JVM command line
Not familiar with Oracle XML DB repositories—can they accept compressed uploads? Zipping or gzipping your file would save resources and frustrate any ASCII file type autodetection in use.
In binary transfer mode this problem goes away.
client.setType(FTPClient.TYPE_BINARY); // client is your ftp4j FTPClient instance
http://www.sauronsoftware.it/projects/ftp4j/manual.php#3
If your file contains curly quotes, they fall in the high-order-bit-set range (bytes 0x91 to 0x94) in the windows-1252 character set; files labelled iso-8859-1 are often really windows-1252. In UTF-8, those characters take three bytes each.
It's quite possible that you've accidentally encoded the xml file in one of these encodings instead of UTF-8. That would result in a conversion error, because the high-order bit being set is only allowed in sequences of multiple UTF-8 octets.
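To illustrate the point about lone high-bit bytes, here is a small sketch that prints how a left curly quote is encoded in each charset and checks whether a byte sequence is well-formed UTF-8. The class and method names are made up for this example, and it assumes the windows-1252 charset is available in your JDK (it is in the usual distributions).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper for checking a file's bytes before upload.
public class CurlyQuoteBytes {
    // Returns true only if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String leftQuote = "\u201C"; // left double curly quote
        byte[] utf8 = leftQuote.getBytes(StandardCharsets.UTF_8);
        byte[] cp1252 = leftQuote.getBytes(Charset.forName("windows-1252"));
        System.out.println("UTF-8:        " + hex(utf8));   // E2 80 9C
        System.out.println("windows-1252: " + hex(cp1252)); // 93
        System.out.println("windows-1252 bytes valid as UTF-8? " + isValidUtf8(cp1252)); // false
    }
}
```

Running isValidUtf8 over the bytes of the XML file (for example from Files.readAllBytes) is a quick way to tell whether it really is UTF-8 before uploading it.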
If you're on Windows, open the file in Notepad, re-save the document using Save As... with the UTF-8 encoding, and upload the changed file. On Unix, use iconv or a similar tool to convert from iso-8859-1 to UTF-8 before uploading.
If the XML document explicitly marks its encoding, make sure it's marked with the correct encoding (e.g. UTF-8). In many xml parsers, you can parse iso-8859-1 or windows-1252 character set encoded XML as long as it's marked as such.
I'm experimenting with internationalization by making a Hello World program that uses properties files + ResourceBundle to get different strings.
Specifically, I have a file "messages_en_US.properties" that stores "hello.world=Hello World!", which works fine of course.
I then have a file "messages_ja_JP.properties" which I've tried all sorts of things with, but it always appears as some type of garbled string when printed to the console or in Swing. The problem is obviously with the reading of the content into a Java string, as a Java string in Japanese typed directly into the source can print fine.
Things I've tried:
The .properties file in UTF-8 encoding with the Japanese string as-is for the value. Something I read indicates that Java expects a properties file to be in the native encoding of the system...? It didn't work either way.
The file in default encoding (ISO-8859-1) and the value stored as escaped Unicode created by the native2ascii program included with Java. Tried with a source file in various Japanese encodings... SHIFT-JIS, EUC-JP, ISO-2022-JP.
Edit:
I actually figured this out while I was typing this, but I figured I'd post it anyway and answer it in case it helps anyone.
I realized that native2ascii was assuming (surprise) that it was converting from my operating system's default encoding each time, and as such not producing the correct escaped Unicode string.
Running native2ascii with the "-encoding encoding_name" option where encoding_name was the name of the source file's encoding (SHIFT-JIS in this case) produced the correct result and everything works fine.
Ant also has a native2ascii task that runs native2ascii on a set of input files and sends output files wherever you want, so I was able to add a builder that does that in Eclipse so that my source folder has the strings in their original encoding for easy editing and building automatically puts converted files of the same name in the output folder.
As of JDK 1.6, Properties has a load() method that accepts a Reader. That means you can save all the property files as UTF-8 and read them all directly by passing an InputStreamReader to load(). I think that's the most elegant solution, but it requires your app to run on a Java 6 runtime.
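A minimal sketch of that approach, assuming the properties file is saved as UTF-8. The class and method names are made up for this example, and StandardCharsets needs Java 7 or later (on Java 6, pass the charset name "UTF-8" to InputStreamReader instead).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper: loads a UTF-8 encoded .properties stream.
public class Utf8PropertiesLoader {
    static Properties load(InputStream in) throws IOException {
        Properties props = new Properties();
        // Wrapping the stream in a Reader makes the decoding charset explicit.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }
}
```

Note that this covers a Properties object you load yourself. If you go through ResourceBundle, PropertyResourceBundle also has a constructor that takes a Reader, which can be used the same way.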
Historically, load() only accepted an InputStream, and the stream was decoded as ISO-8859-1. Not the system default encoding, always ISO-8859-1. That's important, because it makes a certain hack possible. Say your property file is stored as UTF-8. After you retrieve a property, you can re-encode it as ISO-8859-1 and decode it again as UTF-8, like this:
String realProp = new String(prop.getBytes("ISO-8859-1"), "UTF-8");
It's ugly and fragile, but it does work. But I think the best solution, at least for the next few years, is the one you found: bulk-convert the files with native2ascii using a build tool like Ant.
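For completeness, the re-decoding hack described above can be checked end to end with a sketch like the following. The class and method names are made up for this example. It assumes the file content is UTF-8 and that the key itself is plain ASCII, since only the value is repaired.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demonstration of the ISO-8859-1 -> UTF-8 re-decoding hack.
public class PropertiesRedecodeHack {
    // Turns the mis-decoded String back into its original bytes, then decodes them as UTF-8.
    static String redecode(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    // Loads UTF-8 content through the InputStream overload (which decodes as ISO-8859-1)
    // and repairs the requested value afterwards.
    static String readUtf8Property(byte[] utf8FileContent, String key) throws IOException {
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(utf8FileContent));
        String raw = props.getProperty(key);
        return raw == null ? null : redecode(raw);
    }
}
```

This only works because ISO-8859-1 maps every byte to exactly one character, so no information is lost in the first, wrong decoding step.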
An alternative way to handle the properties files is:
http://www.unipad.org/main/
This is an editor which can read and write files in the \u Unicode escape format, which is the format native2ascii creates.
I don't know how well it works with Japanese; I've used it for Hungarian.