I'm running my Java program on Unix. To keep things simple, I'll describe only the relevant part.
public static void main(String[] args) {
    System.out.println("féminin");
}
My output is garbage. It is obviously a character-encoding problem: the French character é is not showing up correctly. I've tried the following:
public static void main(String[] args) throws UnsupportedEncodingException {
    PrintStream ps = new PrintStream(System.out, true, "ISO-8859-1");
    ps.println("féminin");
}
But my output still shows ? in place of the French character.
I ran the same file in the Windows command prompt with java -Dfile.encoding=IBM437 DSIClient féminin and it worked fine. But how can I resolve this character-encoding issue on Unix? Thanks.
The problem is most likely that your code editor and your terminal emulator use different encodings, and Java's notion of the default encoding may differ from both.
To see if your terminal and your editor agree, you could simply cat your Java source file. Does the é show up correctly? If so, you use the same encoding in your source code editor and your terminal, but it is not Java's default encoding. If, on the other hand, you can't see the é, you need to find out what encodings your terminal and your editor use and bring them into agreement.
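To check what Java itself assumes, a quick probe like the following can help (a minimal sketch; the class name is arbitrary):
import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // Print the charset the JVM treats as the default
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
    }
}
If the reported charset differs from your terminal's encoding, that mismatch would explain the garbled output.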
tl;dr: I downgraded to JDK 17 (17.0.2) and now it works...
I was watching a beginner's Java tutorial by Kody Simpson on YouTube (youtube.com/watch?v=t9LP9Nt9Nco). In that tutorial he prints Unicode symbols like "☯Ωøᚙ", but for me it just prints "?", a question mark.
char letter = '\u1699';
System.out.println(letter);
I tried pretty much every solution on Stack Overflow, such as:
Changing File Encoding to UTF-8, although mine was using UTF-8 by default.
Putting '-Dconsole.encoding=UTF-8' and '-Dfile.encoding=UTF-8' in the Edit Custom VM options.
Messing with Region Settings in control panel.
None of it worked.
Every post was also from many years ago, such as this one, which is from 12 years ago:
unicode characters appear as question marks in IntelliJ IDEA console
I ended up deleting and re-downloading IntelliJ because I thought I had messed up some settings and wanted a fresh start, but this time I made the Project SDK an older version, Oracle OpenJDK version 14.0.1, and now somehow it worked and printed the 'ᚙ' symbol.
Then I realized the problem might be the latest version of the JDK, which is version 18, so I downloaded JDK 17.0.2, and BOOM, it still works and prints out the symbol 'ᚙ', so that's nice :). But when I switch back to JDK version 18 it just prints "?" again.
Also, it's strange that I can copy-paste the ᚙ symbol into the code editor itself (on JDK version 18):
char letter = 'ᚙ';
System.out.println(letter);
But when I press RUN and try to PRINT... it STILL GIVES A QUESTION MARK.
I have no clue why this happens. I started learning to code 2 days ago, so I'm probably dumb, or the new version has a bug, but I never found a solution through Google or here, which is why I'm making my first ever Stack Overflow post.
I can replicate your problem: printing works correctly when running your code if compiled with JDK 17, and fails when running your code if compiled with JDK 18.
One of the changes implemented in Java 18 was JEP 400: UTF-8 by Default. The summary for that JEP stated:
Specify UTF-8 as the default charset of the standard Java APIs. With
this change, APIs that depend upon the default charset will behave
consistently across all implementations, operating systems, locales,
and configurations.
That sounds good, except one of the goals of that change was (with my emphasis added):
Standardize on UTF-8 throughout the standard Java APIs, except for
console I/O.
So I think your problem arose because you had ensured that the console's encoding in IntelliJ IDEA was UTF-8, but the PrintStream that you were using to write to that console (i.e. System.out) was not.
The Javadoc for PrintStream states (with my emphasis added):
All characters printed by a PrintStream are converted into bytes using
the given encoding or charset, or the default charset if not
specified.
Since your PrintStream was System.out, you had not specified any "encoding or charset", and were therefore using the "default charset", which was presumably not UTF-8. So to get your code to work on Java 18, you just need to ensure that your PrintStream is encoding with UTF-8. Here's some sample code to show the problem and the solution:
package pkg;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
public class Humpty {
public static void main(String[] args) throws java.io.UnsupportedEncodingException {
char letter = 'ᚙ';
String charset1 = System.out.charset().displayName(); // charset() requires JDK 18
System.out.println("Writing the character " + letter + " to a PrintStream with charset " + charset1); // fails
PrintStream ps = new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8);
String charset2 = ps.charset().displayName(); // charset() requires JDK 18
ps.println("Writing the character " + letter + " to a PrintStream with charset " + charset2); // works
}
}
This is the output in the console when running that code:
C:\Java\jdk-18\bin\java.exe -javaagent:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\221.5080.93\lib\idea_rt.jar=64750:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\221.5080.93\bin -Dfile.encoding=UTF-8 -classpath C:\Users\johndoe\IdeaProjects\HelloIntellij\out\production\HelloIntellij pkg.Humpty
Writing the character ? to a PrintStream with charset windows-1252
Writing the character ᚙ to a PrintStream with charset UTF-8
Process finished with exit code 0
Notes:
PrintStream has a new method in Java 18 named charset() which "returns the charset used in this PrintStream instance". The code above calls charset(), and shows that for my machine my "default charset" is windows-1252, not UTF-8.
I used IntelliJ IDEA 2022.1 Beta (Ultimate Edition) for testing.
In the console I used font DejaVu Sans to ensure that the character "ᚙ" could be rendered.
UPDATE: To address the issue raised in the comments below by Mostafa Zeinali, the PrintStream used by System.out can be redirected to a UTF-8 PrintStream by calling System.setOut(). Here's sample code:
String charsetOut = System.out.charset().displayName();
if (!"UTF-8".equals(charsetOut)) {
System.out.println("The charset for System.out is " + charsetOut + ". Changing System.out to use charset UTF-8");
System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8));
System.out.println("The charset for System.out is now " + System.out.charset().displayName());
}
This is the output from that code on my Windows 10 machine:
The charset for System.out is windows-1252. Changing System.out to use charset UTF-8
The charset for System.out is now UTF-8
Note that System.out is a final variable, so you can't directly assign a new PrintStream to it. This code fails to compile with the error "Cannot assign a value to final variable 'out'":
System.out = new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8); // Won't compile
TLDR: Use this on Java 18:
-Dfile.encoding="UTF-8" -Dsun.stdout.encoding="UTF-8" -Dsun.stderr.encoding="UTF-8"
From JEP 400:
There are three charset-related system properties used internally by the JDK. They remain unspecified and unsupported, but are documented here for completeness:
sun.stdout.encoding and sun.stderr.encoding — the names of the charsets used for the standard output stream (System.out) and standard error stream (System.err), and in the java.io.Console API.
sun.jnu.encoding — the name of the charset used by the implementation of java.nio.file when encoding or decoding filename paths, as opposed to file contents. On macOS its value is "UTF-8"; on other platforms it is typically the default charset.
As you can see, those two system properties "remain unspecified and unsupported", but they solved my problem. So please use them at your own risk, and do NOT use them in a production environment. I'm running Eclipse on Windows 10, by the way.
I think there should be a proper way to set the default charset of the JVM at startup, and it is frustrating that passing -Dfile.encoding="UTF-8" does not accomplish that. As you can read in JEP 400:
If file.encoding is set to "UTF-8" (i.e., java -Dfile.encoding=UTF-8), then the default charset will be UTF-8. This no-op value is defined in order to preserve the behavior of existing command lines.
And this is exactly what it is NOT doing. Passing -Dfile.encoding="UTF-8" does not preserve the behavior of existing command lines! I think this shows that Java 18's implementation of JEP 400 is not doing what it claims, which is the root of your problem in the first place.
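To see what your own JVM is actually doing, a small probe helps (a sketch; PrintStream.charset() requires JDK 18, and the class name is arbitrary):
public class CharsetProbe {
    public static void main(String[] args) {
        // Try running with and without:
        //   -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8
        System.out.println("file.encoding      = " + System.getProperty("file.encoding"));
        System.out.println("System.out charset = " + System.out.charset());
    }
}
On Java 18 you should see that -Dfile.encoding=UTF-8 alone changes the first line but not necessarily the second, which is where sun.stdout.encoding comes in.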
I had this trouble as well. Changing the setting (File > Settings... > Editor > General > Console) to UTF-32 helped solve the issue.
I am experiencing some issues with Java Unicode output. I've tried multiple things and I now see the Unicode characters, but they are preceded by a diamond with a question mark inside.
Here is my test file, created with Notepad:
Here is the file working in Notepad++:
Here is my cmd.exe output:
cmd font settings:
Running cmd /U, still no characters (I found a different font [Consolas] was used here, which is why they appear as question marks in boxes):
Windows version:
I tried PowerShell as well, which seems to think it's a different encoding:
I wrote a small Java program, and that is able to print the Unicode, but in some cases with an extra character preceding it.
import java.nio.charset.Charset;

public class App {
    public static void main(String[] args) {
        // Print the JVM's default charset, then some Unicode test strings
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println(defaultCharset);
        System.out.println("A 😃");
        System.out.println("B ✔ ");
        System.out.println("C ❌");
    }
}
I then run this Java app with the following flag:
java -Dfile.encoding=UTF-8
Here is the output:
Why can my Java app print Unicode but not cmd.exe directly?
How is Java adding characters that cmd.exe doesn't know about?
What else can I test/try/change to get Unicode to behave better?
Notes:
I read that Lucida Console should work; I tried all the fonts in the list, and NSimSun showed the ❌ and ✔, but not the emoji face. That font is a bit hard on my eyes, though.
When I opened the txt file in Word, selecting all the text and setting it to Consolas or Lucida Console seemed to change the emojis to Segoe UI, so maybe those fonts just don't support them, though I've seen other posts suggesting Lucida Console should work.
My code (Qwe.java):
public class Qwe {
public static void main(String[] args) {
System.out.println("тест привет");
}
}
where
тест привет
is Russian text (meaning "test hello"), and Qwe.java is saved in UTF-8.
On my machine (Ubuntu 14.04) the result is:
тест привет
On the server (Ubuntu 12.04) I get:
???? ??????
$ java Qwe > test.txt
In test.txt I see:
???? ??????
I fixed it by using export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
The Java source text must use the same encoding as the javac compiler expects. That seems to have been the case here, and UTF-8 is of course ideal.
The file Qwe.class is fine; internally, String uses Unicode. The output to the console, however, uses the server's platform encoding. That is, Java converts the Unicode text to bytes using (probably) the default platform encoding, and that encoding cannot handle Cyrillic.
So you have to write to a file, never using FileWriter (a convenience class that always uses the platform default encoding), but using:
... new OutputStreamWriter(new FileOutputStream(file), "UTF-8")
You can also set the user locale on the server, but that is not my department.
In general I would switch to a file logger.
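A minimal, self-contained sketch of that approach (the file name test.txt is arbitrary):
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteUtf8 {
    public static void main(String[] args) throws IOException {
        // Encode explicitly as UTF-8 instead of relying on the platform default
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("test.txt"), StandardCharsets.UTF_8)) {
            w.write("тест привет");
        }
    }
}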
I am not sure, but it might only accept ASCII characters from the English alphabet unless you have some extension or something. But like I said, my best guess is that it is not finding the characters and is just outputting garbage in their place.
"In Java, any unknown character which is passed through the write() methods of an OutputStream gets printed as a plain question mark '?'"
as taken from here
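That replacement behavior is easy to demonstrate by printing through an encoder that cannot map the character (a small sketch using US-ASCII; not taken from the linked source):
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        // é is unmappable in US-ASCII, so the encoder substitutes '?'
        PrintStream ascii = new PrintStream(System.out, true, StandardCharsets.US_ASCII);
        ascii.println("féminin"); // prints: f?minin
    }
}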
I am trying to retrieve some UTF-8 encoded Chinese characters from a database using a Java program. When I do this, the characters are returned as question marks.
However, when I display the characters from the database (using select * from ...), the characters are displayed normally. When I print a String consisting of Chinese characters from a Java file, they are also printed normally.
I had this problem in Eclipse: when I ran the program, the characters were printed as question marks. However, this problem was solved when I saved the Java file in UTF-8 format.
Running "locale" in the terminal currently returns this:
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
I have also tried to compile my Java file using this:
javac -encoding UTF-8 [java file]
But still, the output is question marks.
It's quite strange how it will only sometimes display the characters. Does anyone have an explanation for this? Or, even better, how to fix this so that the characters are displayed correctly?
The System.out PrintStream isn't created as a UTF-8 print stream. You can wrap it in one like this:
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class JavaTest {
    public static void main(String[] args) {
        try {
            PrintStream out = new PrintStream(System.out, true, "UTF-8");
            out.println("Hello");
            out.println("施华洛世奇");
            out.println("World");
        } catch (UnsupportedEncodingException uee) {
            // "UTF-8" is guaranteed to be supported, so this should never happen
            uee.printStackTrace();
        }
    }
}
You can also set the default encoding, as described here, by running:
java -Dfile.encoding=UTF-8 -jar JavaTest.jar
I have a problem where I can't write files with accents in the file name on Solaris.
Given the following code:
public static void main(String[] args) {
    System.out.println("Charset = " + Charset.defaultCharset().toString());
    System.out.println("testéörtkuoë");
    FileWriter fw = null;
    try {
        fw = new FileWriter("testéörtkuoë");
        fw.write("testéörtkuoëéörtkuoë");
        fw.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I get the following output:
Charset = ISO-8859-1
test??rtkuo?
and I get a file called "test??rtkuo?"
Based on info I found on StackOverflow, I tried to call the Java app by adding "-Dfile.encoding=UTF-8" at startup.
This returns following output
Charset = UTF-8
testéörtkuoë
But the filename is still "test??rtkuo?"
Any help is much appreciated.
Stef
All these characters are present in ISO-8859-1. I suspect part of the problem is that the code editor is saving files in a different encoding from the one your operating system is using.
If the editor is using ISO-8859-1, I would expect it to encode ëéö as:
eb e9 f6
If the editor is using UTF-8, I would expect it to encode ëéö as:
c3ab c3a9 c3b6
Other encodings will produce different values.
The source file would be more portable if you used Unicode escape sequences. At least be certain your compiler is using the same encoding as the editor.
Examples:
ë \u00EB
é \u00E9
ö \u00F6
You can look up these values using the Unicode charts.
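For example, a string built from those escapes compiles identically no matter which encoding the source file is saved in (a small illustration):
public class EscapeDemo {
    public static void main(String[] args) {
        // "testéörtkuoë" written with Unicode escapes;
        // the source file now contains only ASCII bytes
        String name = "test\u00E9\u00F6rtkuo\u00EB";
        System.out.println(name);
    }
}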
Changing the default file encoding using -Dfile.encoding=UTF-8 might have unintended consequences for how the JVM interacts with the system.
There are parallels here with problems you might see on Windows.
I'm unable to reproduce the problem directly - my version of OpenSolaris uses UTF-8 as the default encoding.
If you attempt to list the filenames with the Java I/O APIs, what do you see? Are they encoded correctly? I'm curious whether the real problem is with encoding the filenames or with the tools you are using to check them.
What happens when you do:
ls > testéörtkuoë
If that works (creates the file correctly), then you know your shell can write files with accented names.
I had a similar problem. Contrary to that example, my program was unable to list the files correctly using System.out.println, even though ls was showing the correct values.
As described in the documentation, the file.encoding property should not be used to define the charset, and in this case the JVM ignores it.
The symptoms:
I could not type accents in the shell.
ls was showing correct values.
File.list() was printing incorrect values.
The file.encoding property was not affecting the output.
The user.language and user.country properties were not affecting the output.
The solution:
Although the LC_* environment variables were set in the shell with values inherited from /etc/default/init (as listed by the set command), locale showed different values.
$ set | grep LC
LC_ALL=pt_BR.ISO8859-1
LC_COLLATE=pt_BR.ISO8859-1
LC_CTYPE=pt_BR.ISO8859-1
LC_MESSAGES=C
LC_MONETARY=pt_BR.ISO8859-1
LC_NUMERIC=pt_BR.ISO8859-1
LC_TIME=pt_BR.ISO8859-1
$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
The solution was simply exporting LANG. This environment variable really does affect the JVM:
LANG=pt_BR.ISO8859-1
export LANG
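As a quick check (my own sketch, not part of the original diagnosis), listing the current directory from Java should now decode accented filenames correctly:
import java.io.File;

public class ListFiles {
    public static void main(String[] args) {
        // With LANG exported, File.list() should return accented names intact
        String[] names = new File(".").list();
        if (names != null) {
            for (String name : names) {
                System.out.println(name);
            }
        }
    }
}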
Java uses the operating system's default encoding when reading and writing files. One should never rely on that; it is always good practice to specify the encoding explicitly.
In Java you can use the following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
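Putting the two together, a minimal sketch of a copy utility with explicit UTF-8 on both ends (input.txt and output.txt are placeholder names):
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf8Copy {
    public static void main(String[] args) throws IOException {
        // Read and write with an explicit charset rather than the platform default
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                     new FileInputStream("input.txt"), StandardCharsets.UTF_8));
             PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("output.txt"), StandardCharsets.UTF_8)))) {
            String line;
            while ((line = br.readLine()) != null) {
                pw.println(line);
            }
        }
    }
}
Using the StandardCharsets overloads also avoids the checked UnsupportedEncodingException that the String-based constructors throw.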