I've written a program that has to deal with the Swedish letters å, ä and ö. I wrote it on a Windows computer and everything works perfectly fine there. But when I try to run the program on Unix, the Swedish letters don't show up and the program fails whenever it has to handle them. It's in Java, by the way.
Any ideas what to do so it works when running on Unix?
You should wrap the stream in a reader that takes an explicit encoding, something like
new InputStreamReader(new FileInputStream(file.getAbsolutePath()), fileEncoding)
where fileEncoding is "UTF-8" or another encoding. It can also be useful to add the -Dfile.encoding=UTF-8 system property on the JVM command line, or programmatically
System.setProperty("file.encoding", fileEncoding);
(note that the JVM normally determines its default charset only once at startup, so setting the property programmatically may have no effect; the command-line flag is the more reliable of the two).
You can pass the same encoding to String#getBytes(fileEncoding) or to the String constructor that takes a byte array. An example of re-decoding a string whose bytes were read with the wrong charset:
new String(value.getBytes("ISO-8859-1"), "UTF-8");
I have also used this for reading UTF-8 bundled resources.
Use the same approach for writing with FileOutputStream:
new OutputStreamWriter(new FileOutputStream(file.getAbsolutePath()), fileEncoding)
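Putting it together, here is a minimal sketch (file names and class name are just illustrative) of reading and writing a text file with an explicit charset, so the behaviour no longer depends on the platform default:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

public class SwedishText {
    public static void main(String[] args) throws Exception {
        String fileEncoding = "UTF-8";
        // Read with an explicit charset instead of the platform default
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"), fileEncoding));
        // Write with the same explicit charset
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(new FileOutputStream("output.txt"), fileEncoding));
        String line;
        while ((line = in.readLine()) != null) {
            out.println(line); // å, ä and ö survive the round trip
        }
        in.close();
        out.close();
    }
}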
Related
I have a Unix script which calls a Java JAR and passes some encrypted text (with no special characters) as input. The Java code decrypts it and then sends the decrypted message to a database.
But sometimes special characters (à, ē) are given as input. They are encrypted and sent to the JAR file. So far so good, but when we print the decrypted message, the special characters get converted to question marks. I also tried printing some special characters directly: when the script runs through crontab they come out as question marks, whereas when I run the Unix script manually the output is junk characters instead of question marks or the real special characters.
For example, when I add a log statement like
LOGGER.info("áéróspåcê");
the string comes out as ??r?sp?c? when the script runs through crontab, whereas it comes out as áéróspÃ¥cê when I trigger the script manually.
The displayed output is determined by the locale and by the fonts available in the terminal. Furthermore, while Java uses Unicode internally, depending on how this encrypted text is passed around there is no guarantee the data is actually encoded as UTF-8.
At your Unix prompt you can try the command locale, and look especially at LANG and potentially LC_ALL.
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
With UTF-8 and the correct fonts for the terminal, it will display properly.
However, there could still be problems with the way the encrypted text is being passed to the Java program, which could cause further issues.
Then, when you get to the database, the database must also support multi-byte characters.
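One thing worth knowing is that cron usually starts jobs with a minimal environment in which LANG and the LC_* variables are unset, so the JVM falls back to a POSIX/ASCII default and prints ? for anything it cannot map. As a quick diagnostic, here is a minimal sketch (class name is illustrative) that prints what the JVM actually picked up; run it once from the shell and once from crontab and compare:
import java.nio.charset.Charset;
import java.util.Locale;

public class EncodingProbe {
    public static void main(String[] args) {
        // What the JVM derived from the environment/locale at startup
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        System.out.println("default locale = " + Locale.getDefault());
        System.out.println("LANG           = " + System.getenv("LANG"));
        System.out.println("LC_ALL         = " + System.getenv("LC_ALL"));
    }
}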
I'm afraid the Windows console does not support these special characters out of the box; however, this post might be related to the issue: How to use unicode characters in Windows command line?
I have a Java EE application running on WildFly 10. I am using Terminator (a terminal emulator) to monitor what's going on, and Sublime Text 2 to open the log files.
Now I am sending XML over HTTP, and for some reason the encoding is messed up (I am German, so äüöß get screwed up). It should be UTF-8, since everything I use is UTF-8 by default; plus I double-checked anyway, and yes, it's UTF-8, but the encoding is still messed up.
But now when I check log files, terminal output or whatever ...
All I see are question marks instead of ä, ö, ü and ß
So does anyone have productive ideas that could help me?
Try this command in jboss-cli.sh:
/subsystem=undertow/servlet-container=default:write-attribute(name=default-encoding,value=UTF-8)
then
reload
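If you would rather force the encoding in code than in the server configuration, a sketch along these lines (class name is illustrative; the standard javax.servlet API shipped with Java EE 7 / WildFly 10 is assumed) sets UTF-8 on every request and response before anything reads the body or writes output:
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.annotation.WebFilter;

@WebFilter("/*")
public class Utf8Filter implements Filter {
    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        req.setCharacterEncoding("UTF-8");  // affects how parameters and the body are decoded
        res.setCharacterEncoding("UTF-8");  // affects the response writer
        chain.doFilter(req, res);
    }

    public void destroy() {}
}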
In general, it is not clear whether you have a problem displaying national characters in the OS (check the locale and the LANG environment variable, etc.) or whether there is a programming error somewhere.
Also, if you are URL-decoding the XML, be sure to specify the encoding.
For example:
URLDecoder.decode(xml, "UTF-8")
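For illustration, a self-contained sketch (the encoded fragment is made up) of what specifying the charset on URLDecoder looks like:
import java.net.URLDecoder;

public class DecodeXml {
    public static void main(String[] args) throws Exception {
        // %C3%BC is the URL-encoded form of the UTF-8 bytes for "ü"
        String encoded = "%3Cname%3EM%C3%BCller%3C%2Fname%3E";
        String xml = URLDecoder.decode(encoded, "UTF-8");
        System.out.println(xml); // prints <name>Müller</name>
    }
}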
My problem is about a little Java program written using NetBeans 7.4. There is obviously an encoding issue, since I need to handle German input containing special characters (äüöÄÜÖß).
Reading text in from files works like a charm; special characters are saved and displayed as expected:
String fileText = new Scanner(file, "UTF-8" ).useDelimiter("\\A").next();
However, I also need to read user input from the console - in this case I only care about the console in NetBeans itself, since this code will not be used outside the IDE. Entering special characters here leads to the usual symbols (box, question mark) instead of the umlauts.
Scanner scanner = new Scanner(System.in, "UTF-8");
userQuery = scanner.nextLine();
Input: könig
Output: k�nig
I have been stuck on this for quite a while now, having tried every option Google brought my way, but so far no luck. Most people seem to have fixed this by changing the standard encoding (Project Properties -> Sources -> Encoding), which is already set to UTF-8 though.
There is no issue using those characters in any other way, such as saving them in strings or printing them to the console. So the issue seems to be with the NetBeans console encoding setting.
I tried manually changing that without any luck. I'm not sure this setting even affects the NetBeans console, since trying to access the console object just returns null.
System.setProperty("console.encoding", "UTF-8");
Anybody have an idea where to look next? I have already exhausted all Google searches (not much useful on pages > 5, as always).
Thanks!
I have also been confused by I/O encoding in the Netbeans console window for years, and have finally found out why.
At least on my system (Netbeans 8.1 on Windows 10), the Netbeans console confusingly uses UTF-8 for output (that's why your output works for UTF-8 input files), but uses Windows-1252 for input. (So much for POLA :)
So if you change your scanner to use that encoding
Scanner scanner = new Scanner(System.in, "Windows-1252");
everything should work fine. Or you can tell Netbeans to use UTF-8 as console input encoding by adding
-J-Dfile.encoding=UTF-8
to the variable netbeans_default_options in etc/netbeans.conf (in Netbeans installation directory).
For maximum consistency with running the app from the system command line, I would have preferred to use Windows-1252 (or rather IBM850) as the Netbeans console encoding on Windows. But Netbeans seems to ignore the given switch for the console output; it always uses UTF-8, so that is the best we can do.
I really like Netbeans, but I'd wish they would clean up this mess...
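If you need the same code to behave both inside the IDE and on a real terminal, one possible workaround (the -Dconsole.encoding property name here is made up; it is just a switch you would pass yourself) is to use System.console() when it exists and only fall back to a Scanner with a configurable charset when it does not:
import java.io.Console;
import java.nio.charset.Charset;
import java.util.Scanner;

public class ReadLine {
    public static void main(String[] args) {
        Console console = System.console();
        String line;
        if (console != null) {
            // A real terminal: the Console class already uses the native console encoding
            line = console.readLine("Query: ");
        } else {
            // Inside the IDE System.console() is null, so pick the charset explicitly,
            // e.g. via -Dconsole.encoding=Windows-1252 (a property name invented for this sketch)
            String enc = System.getProperty("console.encoding", Charset.defaultCharset().name());
            Scanner scanner = new Scanner(System.in, enc);
            line = scanner.nextLine();
        }
        System.out.println("Read: " + line);
    }
}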
I am writing a program in Java with the Eclipse IDE, and I want to write my comments in Greek. So I changed the encoding under Window -> Preferences -> General -> Content Types -> Text -> Java Source File to UTF-8. The comments in my code are OK, but when I run my program some words contain weird characters, e.g. San Germ�n (San Germán). If I change the encoding to ISO-8859-1, everything is OK when I run the program, but the comments in my code are not (weird characters!). So, what is going wrong?
Edit: My program uses Java Swing, and the weird characters under UTF-8 are Strings in cells of a JTable.
EDIT (2): OK, I solved my problem: I keep the UTF-8 encoding for the Java file, but I change the encoding of the strings: String k = new String(myStringInByteArray, "ISO-8859-1");
This is most likely due to the compiler not using the correct character encoding when reading your source. This is a very common source of error when moving between systems.
The typical way to solve it is to use plain ASCII (which is identical in both Windows 1252 and UTF-8) and the "\u1234" encoding scheme (unicode character 0x1234), but it is a bit cumbersome to handle as Eclipse (last time I looked) did not transparently support this.
The property file editor does, though, so a reasonable suggestion could be that you put all your strings in a property file and load them as resources when you need to display them. This is also an excellent introduction to Locales, which you will need when you want your application to be able to speak more than one language.
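A sketch of what that could look like (the bundle name messages and the key city.label are made up): the strings live in messages.properties on the classpath, with non-ASCII characters escaped as \uXXXX by the property file editor, and the code loads them through ResourceBundle:
import java.util.Locale;
import java.util.ResourceBundle;

public class Labels {
    public static void main(String[] args) {
        // messages.properties (plus e.g. messages_el.properties for Greek) on the classpath;
        // accented characters are stored as \uXXXX escapes, so the file itself stays ASCII
        ResourceBundle bundle = ResourceBundle.getBundle("messages", Locale.getDefault());
        String city = bundle.getString("city.label"); // e.g. "San Germ\u00e1n" -> "San Germán"
        System.out.println(city);
    }
}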
An application I am working on reads information from files to populate a database. Some of the characters in the files are non-English, for example accented French characters.
The application is working fine in Windows, but on our Solaris machine it fails to recognise the special characters and throws an exception. For example, when it encounters the accented e in "Gérer" it says:
Encountered: "\u0161" (353), after : "\'G\u00c3\u00a9rer les mod\u00c3"
(an exception which is thrown from our application)
I suspect that in order to stop this from happening I need to change the file.encoding property of the JVM. I tried to do this via System.setProperty() but it has not stopped the error from occurring.
Are there any suggestions for what I could do? I was thinking about setting the basic locale of the Solaris platform in /etc/default/init to be UTF-8. Does anyone think this might help?
Any thoughts are much appreciated.
That looks like a file that was converted by native2ascii using the wrong parameters. To demonstrate, create a file with the contents
Gérer les modÚ
and save it as "a.txt" with the encoding UTF-8. Then run this command:
native2ascii -encoding windows-1252 a.txt b.txt
Open the new file and you should see this:
G\u00c3\u00a9rer les mod\u00c3\u0161
Now reverse the process, but specify ISO-8859-1 this time:
native2ascii -reverse -encoding ISO-8859-1 b.txt c.txt
Read the new file as UTF-8 and you should see this:
Gérer les modÀ\u0161
It recovers the "é" okay, but chokes on the "Ú", like your app did.
I don't know everything that is going wrong in your app, but I'm pretty sure incorrect use of native2ascii is part of it, and that was probably the result of letting the app use the system default encoding. You should always specify the encoding when you save text, whether it's to a file or a database or anywhere else; never let it default. And if you don't have a good reason to choose something else, use UTF-8.
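To see the same mix-up happen purely in Java, here is a small sketch (the string is the same demo text as above) that encodes as UTF-8 and then wrongly decodes as windows-1252, reproducing exactly the Ã© and the \u0161 (š) your parser stumbled over:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        String original = "Gérer les modÚ";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Decoding these bytes with the wrong charset is the same mistake
        // native2ascii makes when given the wrong -encoding parameter
        String garbled = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(garbled); // typically prints GÃ©rer les modÃš
    }
}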
Try to use
java -Dfile.encoding=UTF-8 ...
when starting the application on both systems.
Another way to solve the problem is to change the default encoding of both systems to UTF-8, but I prefer the first option (it is less intrusive on the system).
EDIT:
Check this answer on Stack Overflow; it might help as well:
Changing the default encoding for String(byte[])
Instead of setting the system-wide character encoding, it is easier and more robust to specify the character encoding when reading and writing the specific text data. How is your application reading the files? The Java I/O package readers and writers accept a character encoding name to be used when converting text to/from bytes. If you don't specify one, the platform default encoding is used, which is what you are likely experiencing.
Some databases are surprisingly limited in the text encodings they can accept. If your Java application reads the files as text, in the proper encoding, then it can output the text to the database however the database needs it. If your database doesn't support any encoding whose character repertoire includes the non-ASCII characters you have, then you may need to encode your non-English text first, for example into UTF-8 bytes, and then Base64-encode those bytes as ASCII text.
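For example, if the database column can only hold ASCII, the round trip could look like this minimal sketch (java.util.Base64 requires Java 8 or later; the sample text is arbitrary):
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AsciiSafe {
    public static void main(String[] args) {
        String text = "Gérer les fichiers";
        // UTF-8 bytes, then Base64, yields a pure-ASCII string safe for a limited database
        String stored = Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
        // Reverse the two steps when reading it back
        String restored = new String(Base64.getDecoder().decode(stored), StandardCharsets.UTF_8);
        System.out.println(stored + " -> " + restored);
    }
}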
PS: Never use String.getBytes() with no character encoding argument for exactly the reasons you are seeing.
I managed to get past this error by running the command
export LC_ALL='en_GB.UTF-8'
This command sets the locale for the shell I was in, which sets all of the LC_* environment variables to a UTF-8 locale.
Many thanks for all of your suggestions.
You can also set the encoding on the command line, like so: java -Dfile.encoding=utf-8.
I think we'll need more information to be able to help you with your problem:
What exception are you getting exactly, and which method are you calling when it occurs?
What is the encoding of the input file? UTF-8? UTF-16/Unicode? ISO-8859-1?
It'll also be helpful if you could provide us with relevant code snippets.
Also, a few things I want to point out:
The problem isn't occurring at the 'é' but later on.
It sounds like the character encoding may be hard coded in your application somewhere.
Also, you may want to verify that operating system packages to support UTF-8 (SUNWeulux, SUNWeuluf etc) are installed.
Java uses the operating system's default encoding when reading and writing files. One should never rely on that; it's always good practice to specify the encoding explicitly.
In Java you can use the following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
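On Java 7 and later, an equivalent (and slightly shorter) way is to let java.nio.file wrap the streams for you; the paths in this sketch are just placeholders:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ExplicitEncodingCopy {
    public static void main(String[] args) throws IOException {
        String inputPath = "input.txt";   // placeholder paths
        String outputPath = "output.txt";
        try (BufferedReader br = Files.newBufferedReader(Paths.get(inputPath), StandardCharsets.UTF_8);
             BufferedWriter bw = Files.newBufferedWriter(Paths.get(outputPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine();
            }
        }
    }
}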