Recently, I have been trying to internationalize an application for Chinese-speaking countries.
I realize there is a wide variety of encodings for Chinese characters: Guobiao, Big5, Unicode, HZ.
Whenever a user inputs some text, my Java application needs to know which encoding the user is using, so that it can convert the input into processable data.
I feel it is not reliable to make assumptions about the input encoding based on the OS. When someone is using an OS with a China locale, the JVM will default to a Guobiao encoding; however, the user may still use a Big5 input tool to key in Big5-encoded characters.
What reliable method do you use to detect the encoding of user input?
For actual user input, you never have to detect it. It is defined by the environment.
On Windows, for a UNICODE application, the API will deliver UTF-16. For an MBCS application, it will deliver the current code page, and there's an API to tell you what that is.
On Linux, the locale determines the encoding of input as delivered to APIs.
Since you say you are in Java, you really don't need to care. All Java UI programs will deliver either char or String values, and those are always, immutably, in Unicode.
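If you do end up handling raw bytes yourself (say, a file or a network stream), the charset only matters at that byte boundary, and that is where you should name it explicitly. A minimal sketch, where the file name and the Big5 charset are assumptions for illustration:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadBig5 {
    public static void main(String[] args) throws IOException {
        // Decode explicitly at the byte boundary; "input.txt" and Big5
        // are placeholder assumptions for this sketch.
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get("input.txt"), Charset.forName("Big5"))) {
            String line = in.readLine();
            // From here on, the String is plain Unicode; the original
            // encoding no longer matters.
            System.out.println(line);
        }
    }
}

Once decoded, the String carries no trace of the source encoding, which is the point of the answer above.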
I learned that Java allows file names to contain Unicode characters.
How can I name a file naïve.java using an English keyboard?
Is there a notation, similar to the Unicode escape notation used in Java source code, that we can use to name Java files with Unicode characters?
It seems that you are referring to JLS §7.2 “Host Support for Packages”:
A package name component or class name might contain a character that cannot correctly appear in a host file system's ordinary directory name, such as a Unicode character on a system that allows only ASCII characters in file names. As a convention, the character can be escaped by using, say, the # character followed by four hexadecimal digits giving the numeric value of the character, as in the \uxxxx escape (§3.3).
Under this convention, the package name:
children.activities.crafts.papierM\u00e2ch\u00e9
which can also be written using full Unicode as:
children.activities.crafts.papierMâché
might be mapped to the directory name:
children/activities/crafts/papierM#00e2ch#00e9
Note that this is described as a convention and only for characters not being supported by the host’s filesystem. So the first obstacle is to find a system without Unicode support in the file system API before you can check whether javac still adheres to this convention.
So on most systems, if you want to name a file naïve.java, there is not only support for naming it directly that way; you also have to name it that way, because the fallback escaping scheme is not supported by tools designed to run only on systems that don't need it.
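Just to make the quoted convention concrete, here is a sketch of the mapping it describes (an illustration only, not javac's actual implementation):

public class EscapeDemo {
    // Escape characters outside ASCII as '#' plus four hex digits,
    // per the convention quoted from JLS §7.2.
    static String escapeForAsciiFileSystem(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("#%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: papierM#00e2ch#00e9
        System.out.println(escapeForAsciiFileSystem("papierM\u00e2ch\u00e9"));
    }
}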
That leads to your other question about how to enter it via the keyboard. Well, that’s system dependent. The most portable solution is:
open your browser
navigate to this question and mark naïve.java with the mouse
press ctrl+c
use your favorite tool to create a new .java file
when asked for the new name, press ctrl+v
As a general solution, refrain from using every feature the Java programming language offers…
All String objects in Java contain UTF-16 Unicode. All Java objects that open files eventually name those files with strings.
However, your keyboard is not Java's problem, it's your operating system's problem.
So:
File foo = new File("na\u00efve.java");
You edited the question to say that what you really want is to have a Java source file with interesting characters. That's a matter for your favorite text editor and operating system.
Java is perfectly happy to compile files with arbitrary names. However, creating those files and managing those files is not its problem. How you would go about creating such a file is between you and your operating system. Windows, Linux, OSX: all have different tools for entering Unicode characters that aren't part of the obvious keyboard map.
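If you'd rather have Java create the file for you, the same \uXXXX escape notation works in the file name string, because the compiler resolves it before the name ever reaches the operating system. A sketch, with the file content as a placeholder:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MakeNaive {
    public static void main(String[] args) throws IOException {
        // "na\u00efve.java" is "naïve.java" spelled in pure ASCII.
        // Whether the file system accepts the name is the OS's business.
        Path source = Paths.get("na\u00efve.java");
        Files.write(source, "// placeholder\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("created " + source);
    }
}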
I have written a web application where users from all over the world can upload files. These file names may use different character encodings. For that reason I have defined my MySQL database as UTF-8, so it renders all special characters correctly.
However, when a user tries to access a file, the web browser says that it cannot be found, showing a weirdly encoded name. I have tried some different encoding approaches (such as URLEncoder/Decoder), but there are still some use cases where they don't work.
I want to know how Dropbox, Google, etc. solve this encoding problem.
Do they save the string in a form like test+test%C3%BC.txt?
Do they rename the file and store it with a secure name (like 123890123.file)?
Do they use some other technique?
I am also wondering whether URLEncoder is the best approach to get things working. For example, it replaces a space with a + instead of with %20. Why does it do that, if no browser can handle a plus for a space?
I need to keep the base URL the same, and encode only the filename.
For example,
www.example.com/folder/blä äh.txt
should be encoded as
www.example.com/folder/bl%..+%...txt
How can I achieve this?
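I can't say how Dropbox or Google handle this internally, but one common approach is to percent-encode only the last path segment and then rewrite URLEncoder's form-style output: URLEncoder targets application/x-www-form-urlencoded, which is where the + for a space comes from. A sketch, assuming UTF-8 for the file name bytes:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeFileName {
    // Percent-encode a single path segment. URLEncoder produces form
    // encoding ("+" for a space), so the "+" is rewritten to "%20",
    // which is what a URL path expects.
    static String encodeSegment(String segment) throws UnsupportedEncodingException {
        return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Prints: www.example.com/folder/bl%C3%A4%20%C3%A4h.txt
        System.out.println("www.example.com/folder/" + encodeSegment("bl\u00e4 \u00e4h.txt"));
    }
}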
I am writing a program in Java with the Eclipse IDE, and I want to write my comments in Greek. So I changed the encoding under Window->Preferences->General->Content Types->Text->Java Source File to UTF-8. The comments in my code are fine, but when I run my program some words contain weird characters, e.g. San Germ�n (San Germán). If I change the encoding to ISO-8859-1, everything is fine when I run the program, but the comments in my code are not (weird characters!). So, what is going wrong?
Edit: My program uses Java Swing, and the weird characters with UTF-8 appear in Strings in cells of a JTable.
EDIT (2): OK, I solved my problem. I keep the UTF-8 encoding for the Java file, but I change the encoding of the strings: String k = new String(myStringInByteArray, "ISO-8859-1");
This is most likely due to the compiler not using the correct character encoding when reading your source. This is a very common source of error when moving between systems.
The typical way to solve this is to use plain ASCII (which is identical in both Windows-1252 and UTF-8) plus the "\u1234" escape scheme (Unicode character 0x1234), but that is a bit cumbersome to handle, as Eclipse (last time I looked) did not transparently support it.
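For reference, the escape scheme looks like this in practice; the string here is purely illustrative:

public class EscapeExample {
    public static void main(String[] args) {
        // javac decodes \u00e1 before parsing, so this literal survives
        // any source-file encoding unchanged.
        String city = "San Germ\u00e1n";
        System.out.println(city); // prints: San Germán
    }
}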
The property file editor does, though, so a reasonable suggestion would be to put all your strings in a property file and load them as resources when you need to display them. This is also an excellent introduction to Locales, which are needed when you want your application to be able to speak more than one language.
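A minimal sketch of that approach, where the bundle name, key, and value are placeholder assumptions; classic .properties files are traditionally read as ISO-8859-1, which is exactly why the property file editor stores non-ASCII characters as \uXXXX escapes:

import java.util.Locale;
import java.util.ResourceBundle;

public class Labels {
    public static void main(String[] args) {
        // Assumes a classpath file labels_el.properties containing e.g.:
        //   greeting=\u0393\u03b5\u03b9\u03ac \u03c3\u03bf\u03c5
        // (Greek "hello", stored as ASCII-only escapes).
        ResourceBundle bundle =
                ResourceBundle.getBundle("labels", new Locale("el"));
        System.out.println(bundle.getString("greeting"));
    }
}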
A Unicode character from a Rails app appears as ??? in the MySQL database (i.e. when I view it through PuTTY or a Linux console), but my Rails app reads it properly and displays it as intended. I have another Java application which reads from the Rails database, stores the values in its own database, and tries to display them from its own database. But on the web page, the text appears as ??? instead of the Unicode characters.
How is it that the Rails application is able to show it properly but the Java application is not? Do I need to specify an encoding within the Java application?
You really need to find out whether it's the Java app that's wrong, the Rails app that's wrong, or both. Using PuTTY or a Linux console isn't a great way of checking this, as they may well not support the relevant Unicode characters. Ideally, you should find a GUI client that can connect to the database, and use that to check the values. Alternatively, find some MySQL functions which will return the Unicode code points directly.
It's quite possible that the Rails app is doing the wrong thing, but in a reversible way (possibly not always reversible - you may just be lucky at the moment). I've seen this before, where a developer has consistently used the same incorrect code page when both encoding and decoding text, and managed to get the right results out without actually storing the correct data. Obviously that screws up any other system trying to get at the same data.
You may want to check the connection parameters: http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
I guess your Java application may be using the wrong encoding when reading from the Rails database, the wrong encoding for its own database, or the wrong encoding on the connection to it.
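Concretely, with MySQL Connector/J the conversion is controlled by connection properties such as useUnicode and characterEncoding, documented on the page linked above. A sketch with placeholder host, database, and credentials:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class DbConnect {
    public static void main(String[] args) throws SQLException {
        // Placeholder host/database/credentials; the query parameters
        // tell the driver to convert between Java's Unicode strings
        // and UTF-8 bytes on the wire.
        String url = "jdbc:mysql://localhost:3306/mydb"
                + "?useUnicode=true&characterEncoding=UTF-8";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("connected: " + !conn.isClosed());
        }
    }
}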
I have a JavaFX/Groovy application that I'm trying to localize.
It turns out that when I use JavaFX standard execution with the Java VM arg "-Dfile.encoding=UTF-8" locally, all of my international characters (for example, ü) display correctly.
However, if I invoke the app via a JNLP file using java-vm-args="-Dfile.encoding=UTF-8", e.g.:
<resources>
<j2se version="1.6+" java-vm-args="-Dfile.encoding=UTF-8"/>
...other stuff...
</resources>
then the application shows international characters as pairs of other random characters (like √¬).
Am I specifying the file encoding incorrectly in the JNLP, or is there some difference between Standard Execution and Webstart that affects this?
Much appreciated.
EDIT: I am using a Groovy API to access the Remember The Milk RESTful web service. All text that is problematic will come from data retrieved (like task names) and is not actually stored on disk in binary or text. It's curious that "-Dfile.encoding=UTF-8" would actually fix it locally.
I would strongly advise you to explicitly specify the encoding everywhere you're going to be converting text to binary or vice versa. Relying on the JVM default - even after you've set that default - doesn't feel like a good idea to me. You haven't really said what you're doing with text, but if you explicitly set the encoding when you save or load, it should be fine.
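For instance, when reading the web service response, decoding with an explicit charset would look something like this (the URL is illustrative, and you should check which encoding the service actually declares):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class FetchTasks {
    public static void main(String[] args) throws IOException {
        // Illustrative URL; the point is the explicit charset on the
        // reader, instead of relying on -Dfile.encoding.
        URLConnection conn = new URL("https://example.com/api/tasks").openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
        }
    }
}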