Understanding Java Opts Locale Settings

Understanding Java Opts Locale Settings - java

Basically, what is the difference between LANG and user.language?
I've got the following java opts set:
-DLANG=de_DE.UTF-8 -Dfile.encoding=UTF-8 -Duser.language=en -Duser.country=EN
However, if my *.properties file contains an special character like äöü it's not properly formatted in my webapp.
However, if I set the proper unicode for the special character like M\u00E4ngeldaten -> \u00E4 = ä (ref: UTF-8 table) it's properly formatted.
I've also made sure my IDE uses UTF-8 as default encoding.
Why is that behaviour eventhough i've set file encoding UTF-8 for my JBoss instance AND Eclipse IDE?

Related

javac does not output unicode on command line

Context: Windows 10, cmd.exe, javac 9.0.1.
I have unicode encoded source code. If I run javac -encoding UTF-8 ... and I have an error, I just can't get it to display the source correctly.
As you can see in the picture, the cli can print unicode chars just fine.

It would appear that javac is not using your terminal's character encoding.
You can specify the character encoding for a JVM using the flag:
java -Dfile.encoding=UTF-8 ...
(Or whatever encoding)
Javac is just a thin wrapper around a Java program. You can pass arguments directly to its JVM using the -J flag. So:
javac -J-Dfile.encoding=UTF-8 ...

You can know your current(default) encoding by running
System.getProperty("file.encoding")
and you can change default encoding with this property.
For Windows it is usually - cp1252,
Long Story, queue from IBM KB:
Internally, the Java virtual machine (JVM) always operates with data
in Unicode. However, all data transferred into or out of the JVM is in
a format matching the file.encoding property. Data read into the JVM
is converted from file.encoding to Unicode and data sent out of the
JVM is converted from Unicode to file.encoding.

UTF-8 characters in a path to a jarfile

I have a folder with Unicode(UTF-8) symbols in its name, for example, Я_Папка, and the folder contains foo.jar.
Now I need to execute the foo.jar:
chcp 65001
Active code page: 65001
C:\>java -Dsun.jnu.encode=UTF-8 -jar C:\Я_Папка\foo.jar
Error: Unable to access jarfile C:\Я_Папка\foo.jar
-Dsun.jnu.encode=UTF-8 switch tells java to use UTF-8 encoding for the platform string.
-Dfile.encode=UTF-8 switch can't help - it works only with the contents of files rather than the command line
My question here - how to make -jar switch understand UTF-8 encoding?

Java doesn't support Unicode (UTF-8) for platform string on Windows by design. You can only use code page set in the System locale.

If you're not on Russian locale you should probably change "system locale" setting. I've had same problem, see the detailed answer.

Eclipse wrong Java properties UTF-8 encoding

I have a JavaEE project, in which I use message properties files. The encoding of those file is set to UTF-8. In the file I use the german umlauts like ä, ö, ü. The problem is, sometimes those characters are replaced with unicode like \uFFFD\uFFFD, but not for every character. Now, I have a case where ä and ü are both replaced with \uFFFD\uFFFD, but not for every occurring of ä and ü.
The Git diff shows me something like this:
mail.adresses=E-Mail hinzufügen:
-mail.adresses.multiple=E-Mails durch Kommata getrennt hinzufügen.
+mail.adresses.multiple=E-Mails durch Kommata getrennt hinzuf\uFFFD\uFFFDgen.
mail.title=Einladungs-E-Mail
box.preview=Vorschau
box.share.text=Sie können jetzt die ausgewählten Bilder mit Ihren Freunden teilen.
## -6880,7 +6880,7 ## browser.cancel=Abbrechen
browser.selectImage=übernehmen
browser.starImage=merken
browser.removeImage=Löschen
-browser.searchForSimilarImages=ähnliche
+browser.searchForSimilarImages=\uFFFD\uFFFDhnliche
browser.clear_drop_box=löschen
Also, there are lines changed, which I have not touched. I don't understand why I get such a behavior. What could be the cause for the above problem?
My system:
Antergos / Arch Linux
System encoding UTF-8
Python 3.5.0 (default, Sep 20 2015, 11:28:25)
[GCC 5.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
Eclipse Mars 1
Text file encoding UTF-8
Properties file encoding UTF-8
Tomcat 8
Java JDK 8
If I use another Editor like Atom to edit those message properties files, I don't ran into this problem.
I also realized in a case, if I copy the original value browser.searchForSimilarImages=ähnliche from Git diff and replace the wrong value browser.searchForSimilarImages=\uFFFD\uFFFDhnliche in Eclipse with that, then I have the correct umlauts in the message properties file.

Root cause:
By default ISO 8859-1 character encoding is used for Eclipse properties file (read here), so if the file contains any character beyond ISO 8859-1 then it will not be processed as expected.
Solution 1
If you use Eclipse then you will notice that it implicitly converts the special character into \uXXXX equivalent. Try copying
会意字 / 會意字
into a properties file opened in Eclipse.
EDIT: As per comment from OP
Update the encoding of your Eclipse as shown below. If you set encoding as UTF-32 then even you can see Chinese character, which you cannot see generally.
How to change Encoding of properties file in Eclipse: See this Eclipse Bugzilla bug for more details, which talks about several other possibilities and in the end suggest what I have highlighted below.
Chinese characters can be seen in Eclipse after encoding is set properly:
Solution 2
If above doesn't work consistently for you (it does work for me and I never see encoding issues) then try this using some Eclipse plugin which handles encoding of properties or other files. For example Eclipse ResourceBundle Editor or Extended Resource-Bundle editor
I would recommend using Eclipse ResourceBundle Editor.
Solution 3
Another possibility to change encoding of file is using Edit --> Set Encoding option. It really matters because it changes the default character set and file encoding. Play around with by changing encoding using Edit --> Set Encoding option and do following Java sysout System.out.println("Default Charset=" + Charset.defaultCharset()); and System.out.println(System.getProperty("file.encoding"));
As an aside: 1
Process the properties file to have content with ISO 8859-1 character encoding by using native2ascii - Native-to-ASCII Converter
What native2ascii does: It converts all the non-ISO 8859-1 character in their equivalent \uXXXX. This is a good tool because you need not to search the \uXXXX equivalent of special character.
Usage for UTF-8: native2ascii -encoding utf8 e:\a.txt e:\b.txt
As an aside: 2
Every computer program whether an IDE, application server, web server, browser, etc. understands only bits, so it need to know how to interpret the bits to make expected sense out of it because depending upon encoding used, same bits can represent different characters. And that's where "Encoding" comes into picture by giving a unique identifier to represent a character so that all computer programs, diverse OS etc. knows exact right way to interpret it.
So, if you have written into a file using some encoding scheme, lets say UTF-8, and then reading using any editor but running with encoding scheme as UTF-8 then you can expect to get correct display.
Please do read my this answer to get more details but from browser-server perspective.

Add the following arguments to your eclipse.ini file.
-Dclient.encoding.override=UTF-8
-Dfile.encoding=UTF-8
By default Eclipse uses the encoding format picked up by the Java Virtual Machine (JVM). Also, you can set the file encoding to utf-8.

Resolved by doing the below changes :
Modified below properties in eclipse.ini and close and start the eclipse applications
-Dclient.encoding.override=UTF-8
-Dfile.encoding=UTF-8
Set the encoding to the UTF-8 [Navigation path : Edit -> Set encoding]

Properties Files are expected to be ISO-8859-1 (Latin-1) encoded.
Most likely this what eclipse was set to by default as well.
You have to make sure that every tool which is run in the build or whatever disregards the spec and uses UTF-8 instead.

This looks like a mixture of Eclipse and git encoding or rather not-encoding.
Git uses raw bytes and doesn't care about encoding. Using git diff you might get characters like shown here. An example there is R<C3><BC>ckg<C3><A4>ngig # should be "Rückgängig".
As you can see there's two funny bracket things showing per umlaut. And in your editor, there are always two \uFFFD for each umlaut in the lines starting with +.
So I assume that your UTF-8 editor tries to interpret the git notation and fails. This in turn leads to the representation \uFFFD, which basically meands that this is character whose value is unknown or unrepresentable (see here).
Like suggested in the first link, you can try setting LESSCHARSET=UTF-8 in your environment variable (Windows). Hmm, in Linux it should be in etc/profile ?

see: a marker such as FFFD (REPLACEMENT CHARACTER) in http://unicode.org/faq/utf_bom.html
and see native2ascii --help
-encoding encoding_name
Specifies the name of the character encoding to be used by the conversion procedure. If this option is not present, then the
default character encoding (as determined by the java.nio.charset.Charset.defaultCharset method) is used. The encoding_name
string must be the name of a character encoding that is supported by the JRE. See Supported Encodings at
http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
a case
$ file yourfile.properties
yourfile.properties : ISO-8859 text, with very long lines
$ native2ascii -encoding ISO-8859-1 yourfile.properties yourfile.properties

ant: warning: unmappable character for encoding UTF8

I have seen numerous of questions like mine but they don't answer my question because I'm using ant and I'm not using eclipse. I run this code: ant clean dist and it tells me numerous times that warning: unmappable character for encoding UTF8.
I see on the Java command that there is a -encoding option, but that doesn't help me cuz I'm using the ant.
I'm on Linux and I'm trying to run the developer version of Sentrick; I haven't made no modifications to anything, I just downloaded it and followed all their instructions and it ain't makes no difference. I emailed the developper and they told me it was this problem but I suspect that it is actually something that gotta do with this error at the end:
BUILD FAILED
/home/daniel/sentricksrc/sentrick/build.xml:22: The following error occurred while executing this line:
/home/daniel/sentricksrc/sentrick/ant/common-targets.xml:83: Test de.denkselbst.sentrick.tokeniser.components.DetectedAbbreviationAnnotatorTest failed
I'm not sure what I'm gonna do now because I really need for it to work

Try to change file encoding of your source files and set the Default Java File Encoding to UTF-8 also.
For Ant:
add -Dfile.encoding=UTF8 to your ANT_OPTS environment variable
Setting the Default Java File Encoding to UTF-8:
export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
Or you can start up java with an argument -Dfile.encoding=UTF8

The problem is not eclipse or ant. The problem is that you have a build file with special characters in it. Like smart quotes or m-dashes from MS Word. Anyway, you have characters in your XML file that are not part of the UTF-8 character set. So you should fix your XML to remove those invalid characters and replace them with similar looking but valid UTF-8 versions. Look for special characters like &#64 © — ® etc. and replace them with the (c) or whatever is useful to you.
BTW, the bad character is in common-targets.xml at line 83

Changing encoding to Cp 1252 worked for my project with same error. I tried changing eclipse properties several times but it did not help me in any way. I added encoding property to my pom.xml file and the error gone. http://ctrlaltsolve.blogspot.in/2015/11/encoding-properties-in-maven.html

Java source file encoding with Chinese character

I import a Java project from Windows platform to Ubuntu.
My Ubuntu is 10.10, Gnome environment: My LANGUAGE is set to en_US:en
My terminal's character encoding is: Unicode (UTF-8)
My IDE is eclipse and text file encoding is: GBK.
In source file, there are some Chinese constant character.
The project build successful on Windows with ant,
but on Ubuntu, I get compile error:
illegal character: \65533
I don't want to use \uxxxx format as the file is already there,
And I've tried the -encoding option for javac, but still can't compile.

I think the problem lies not with Ubuntu, Ubuntu's console, Javac or Eclipse but with the way you transfer the file from windows to Ubuntu. You have to store it as utf-8 before you copy it to Ubuntu otherwise the codepoint-information that is set in your Windows your locale is already lost.

Did you specify the encoding option of the <javac> task in your build.xml?
It should look like this:
<javac encoding="GBK" ...>
If you haven't specified it, then on Windows it will use the platform default encoding (which is GBK in your setup) and on Linux it will use the platform default encoding (which is UTF-8 in your setup).
Since you want the build to work on both platforms (preferably without changing the configuration of either platform), you need to specify the encoding when you compile.

You need to convert you source codes from you windows codepage to UTF-8. Use iconv for this.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.