Pretty desperate for help after 2 days trying to debug this issue.
I have some text that contains unicode characters, for example, the word:
korte støvler
If I run code that writes this word to a file on one of the problem machines, it works correctly. However, when I write the file exactly the same way in a storm bolt, it does not encode correctly, and the ø character is replaced with question marks.
In the storm_env.ini file I have set
STORM_JAR_JVM_OPTS:-Dfile.encoding=UTF-8
I also set the encoding as UTF-8 in the code, and in mvn when it is packaged.
I have run tests on the boxes to check JVM default encodings, and they are all UTF-8.
I have tried 3 different methods of writing the file and all cause the same issue, so it is definitely not that.
This issue was fixed by simply building another machine on EC2. It had exactly the same software versions and configuration as the boxes with the issue.
I am writing a program in Java with the Eclipse IDE, and I want to write my comments in Greek. So I changed the encoding under Window->Preferences->General->Content Types->Text->Java Source File to UTF-8. The comments in my code are fine, but when I run my program some words contain weird characters, e.g. San Germ�n (San Germán). If I change the encoding to ISO-8859-1, everything is fine when I run the program, but the comments in my code are not (weird characters!). So, what is going wrong?
Edit: My program uses Java Swing, and the weird characters with UTF-8 are Strings in cells of a JTable.
EDIT(2): OK, I solved my problem: I kept the UTF-8 encoding for the Java file but changed the encoding of the strings. String k = new String(myStringInByteArray, "ISO-8859-1");
This is most likely due to the compiler not using the correct character encoding when reading your source. This is a very common source of error when moving between systems.
The typical way to solve it is to use plain ASCII (which is identical in both Windows 1252 and UTF-8) and the "\u1234" encoding scheme (unicode character 0x1234), but it is a bit cumbersome to handle as Eclipse (last time I looked) did not transparently support this.
The property file editor does, though, so a reasonable suggestion could be that you put all your strings in a property file, and load the strings as resources when needing to display them. This is also an excellent introduction to Locales which are needed when you want to have your application be able to speak more than one language.
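A minimal sketch of the two ideas above, assuming a hypothetical property key "city": the same accented character written once as a Unicode escape in an ASCII-only source file and once in property-file syntax. The class and method names are illustrative only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class EscapedStrings {
    // String literal with a Unicode escape for U+00E1; the source file stays pure ASCII,
    // so the compiler's source encoding no longer matters.
    static String escapedLiteral() {
        return "San Germ\u00e1n";
    }

    // Loads the hypothetical property "city" from property-file text.
    // The backslash is doubled so the Properties parser, not the compiler,
    // interprets the escape sequence.
    static String fromProperties() {
        Properties props = new Properties();
        try {
            props.load(new StringReader("city=San Germ\\u00e1n\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("city");
    }

    public static void main(String[] args) {
        // Both routes should produce the same string.
        System.out.println(escapedLiteral().equals(fromProperties()));
    }
}
```

In a real application the properties would live in a .properties file on the classpath rather than in an inline string.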
I've written a program that has to deal with the Swedish letters å, ä and ö. I wrote it on a Windows computer and everything works perfectly fine there. But when I run the program on Unix, the Swedish letters don't show and the program doesn't work when dealing with them. It's in Java, by the way.
Any ideas what to do, so it works when running on Unix?
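You should open the file with an explicit encoding, something like
new InputStreamReader(new FileInputStream(file.getAbsolutePath()), fileEncoding)
where fileEncoding == "UTF-8" or another encoding. It is also useful to pass the -Dfile.encoding=UTF-8 system property when starting the JVM. Setting it programmatically with
System.setProperty("file.encoding", fileEncoding);
is unreliable, because the JVM normally reads this property only once at startup.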
You can pass the same encoding to String#getBytes(fileEncoding) or to the String constructor that takes bytes. An example that re-decodes as UTF-8 a string that was wrongly decoded as ISO-8859-1:
new String (value.getBytes("ISO-8859-1"),"UTF-8");
I also used this for reading UTF-8 bundled resources.
The same applies to FileOutputStream:
new OutputStreamWriter(new FileOutputStream(file.getAbsolutePath()), fileEncoding)
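The getBytes/String re-decoding trick mentioned above can be sketched as follows. This is only an illustration with made-up method names: it first reproduces the bug (UTF-8 bytes decoded as ISO-8859-1) and then reverses it. The repair works here because ISO-8859-1 maps every byte to a character; it is not guaranteed for other wrong encodings.

```java
import java.nio.charset.StandardCharsets;

public class RecodeDemo {
    // Simulates the bug: text encoded as UTF-8 but decoded as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the bug: recover the raw bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "korte st\u00f8vler"; // U+00F8 is the letter o with stroke
        String garbled = misdecode(original);
        // The garbled form is longer because the two UTF-8 bytes became two characters.
        System.out.println(garbled.length() - original.length());
        System.out.println(repair(garbled).equals(original));
    }
}
```

Fixing the decoding at the point where the bytes are first read is usually preferable to repairing strings afterwards.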
I'm writing an applet that's supposed to show both English and Japanese (unicode) characters on a JLabel. The Japanese characters show up fine when I run the applet on my system, but all I get is mojibake when I run it from the web page. The page can display Japanese characters if they're hard-coded into the HTML, but not in the applet. I'm pretty sure I've seen this sort of thing working before. Is there anything I can do in the Java code to fix this?
My first guess would be that the servlet container is not sending back the right character set for your webapp resources. Have a look at the response in an HTTP sniffer to see what character set is included - if the response says that the charset is e.g. CP-1252, then Japanese characters would not be decoded correctly.
You may be able to fix this in code by explicitly setting a Content-Type header with the right charset; but I'd argue it's more appropriate to fix the servlet container's config to return the correct character set for the relevant resources.
Well I'm not sure what was causing the problem, but I set EVERYTHING to read in and display out in UTF-8 and it seems to work now.
An application I am working on reads information from files to populate a database. Some of the characters in the files are non-English, for example accented French characters.
The application is working fine in Windows but on our Solaris machine it is failing to recognise the special characters and is throwing an exception. For example when it encounters the accented e in "Gérer" it says :-
Encountered: "\u0161" (353), after : "\'G\u00c3\u00a9rer les mod\u00c3"
(an exception which is thrown from our application)
I suspect that in order to stop this from happening I need to change the file.encoding property of the JVM. I tried to do this via System.setProperty() but it has not stopped the error from occurring.
Are there any suggestions for what I could do? I was thinking about setting the basic locale of the solaris platform in /etc/default/init to be UTF-8. Does anyone think this might help?
Any thoughts are much appreciated.
That looks like a file that was converted by native2ascii using the wrong parameters. To demonstrate, create a file with the contents
Gérer les modÚ
and save it as "a.txt" with the encoding UTF-8. Then run this command:
native2ascii -encoding windows-1252 a.txt b.txt
Open the new file and you should see this:
G\u00c3\u00a9rer les mod\u00c3\u0161
Now reverse the process, but specify ISO-8859-1 this time:
native2ascii -reverse -encoding ISO-8859-1 b.txt c.txt
Read the new file as UTF-8 and you should see this:
Gérer les modÀ\u0161
It recovers the "é" okay, but chokes on the "Ú", like your app did.
I don't know what all is going wrong in your app, but I'm pretty sure incorrect use of native2ascii is part of it. And that was probably the result of letting the app use the system default encoding. You should always specify the encoding when you save text, whether it's to a file or a database or what--never let it default. And if you don't have a good reason to choose something else, use UTF-8.
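The same mis-decoding can be reproduced in plain Java without native2ascii. This is a sketch under the assumption that the damage came from reading UTF-8 bytes as windows-1252; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongDecodeDemo {
    // Encodes the text as UTF-8, then decodes those bytes with a different charset.
    static String decodeUtf8BytesAs(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        // Same sample as above, written with escapes: e-acute (U+00E9) and U-acute (U+00DA).
        String text = "G\u00e9rer les mod\u00da";
        String garbled = decodeUtf8BytesAs(text, Charset.forName("windows-1252"));

        // The final character is U+0161, the one named in the exception message.
        char last = garbled.charAt(garbled.length() - 1);
        System.out.println(Integer.toHexString(last));

        // ISO-8859-1 has no U+0161, which is why the reverse step cannot undo it.
        System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode(last));
    }
}
```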
Try to use
java -Dfile.encoding=UTF-8 ...
when starting the application in both systems.
Another way to solve the problem is to change the default encoding on both systems to UTF-8, but I prefer the first option (less intrusive on the system).
EDIT:
Check this answer on Stack Overflow; it might help as well:
Changing the default encoding for String(byte[])
Instead of setting the system-wide character encoding, it might be easier and more robust, to specify the character encoding when reading and writing specific text data. How is your application reading the files? All the Java I/O package readers and writers support passing in a character encoding name to be used when reading/writing text to/from bytes. If you don't specify one, it will then use the platform default encoding, as you are likely experiencing.
Some databases are surprisingly limited in the text encodings they can accept. If your Java application reads the files as text, in the proper encoding, then it can output the text to the database in whatever form the database needs. If your database doesn't support any encoding whose character repertoire includes the non-ASCII characters you have, then you may need to encode your non-English text first, for example into UTF-8 bytes, and then Base64-encode those bytes as ASCII text.
PS: Never use String.getBytes() with no character encoding argument for exactly the reasons you are seeing.
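A minimal sketch of the UTF-8 plus Base64 fallback described above, using an explicit charset in every getBytes and String call. The helper names are hypothetical, and whether this suits a given database is an assumption to verify.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64TextDemo {
    // Text -> UTF-8 bytes -> Base64, which contains only ASCII characters.
    static String toAscii(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8); // explicit charset, never the default
        return Base64.getEncoder().encodeToString(utf8);
    }

    // Base64 -> UTF-8 bytes -> text.
    static String fromAscii(String ascii) {
        byte[] utf8 = Base64.getDecoder().decode(ascii);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "G\u00e9rer"; // contains e-acute (U+00E9)
        String stored = toAscii(original);
        System.out.println(stored);
        System.out.println(fromAscii(stored).equals(original));
    }
}
```

The stored value is no longer searchable as text, so this is a last resort compared with a database encoding that supports the characters directly.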
I managed to get past this error by running the command
export LC_ALL='en_GB.UTF-8'
This command sets the locale for the shell that I was in, which points all of the LC_ environment variables at a UTF-8 locale.
Many thanks for all of your suggestions.
You can also set the encoding at the command line, like so: java -Dfile.encoding=utf-8.
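To check whether the flag actually took effect on a given machine, a small diagnostic like the following can be run with and without it. This is only a sketch; the class name is made up.

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {
    // Reports the file.encoding property and the charset the JVM uses by default.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it as java -Dfile.encoding=UTF-8 ShowDefaultCharset on each machine makes differences between environments visible.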
I think we'll need more information to be able to help you with your problem:
What exception are you getting exactly, and which method are you calling when it occurs?
What is the encoding of the input file? UTF8? UTF16/Unicode? ISO8859-1?
It'll also be helpful if you could provide us with relevant code snippets.
Also, a few things I want to point out:
The problem isn't occurring at the 'é' but later on.
It sounds like the character encoding may be hard coded in your application somewhere.
Also, you may want to verify that operating system packages to support UTF-8 (SUNWeulux, SUNWeuluf etc) are installed.
By default, Java uses the operating system's default encoding when reading and writing files. One should never rely on that. It's always good practice to specify the encoding explicitly.
In Java you can use following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
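Putting the two snippets together, here is a minimal round-trip sketch that writes a line as UTF-8 and reads it back. It uses a temporary file and StandardCharsets.UTF_8 instead of the string "UTF-8"; both are choices made for this illustration, not part of the original answer.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8FileRoundTrip {
    // Writes one line to a temporary file as UTF-8, reads it back as UTF-8, and returns it.
    static String writeThenRead(String line) {
        try {
            File tmp = File.createTempFile("utf8-demo", ".txt");
            try {
                // try-with-resources closes (and flushes) the writer before reading.
                try (BufferedWriter w = new BufferedWriter(new OutputStreamWriter(
                        new FileOutputStream(tmp), StandardCharsets.UTF_8))) {
                    w.write(line);
                }
                try (BufferedReader r = new BufferedReader(new InputStreamReader(
                        new FileInputStream(tmp), StandardCharsets.UTF_8))) {
                    return r.readLine();
                }
            } finally {
                tmp.delete(); // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "korte st\u00f8vler";
        System.out.println(writeThenRead(text).equals(text));
    }
}
```

Because both sides name the charset, the result does not depend on the platform default encoding.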