I have a problem where I can't write files with accents in the file name on Solaris.
Given the following code:
public static void main(String[] args) {
    System.out.println("Charset = " + Charset.defaultCharset().toString());
    System.out.println("testéörtkuoë");
    FileWriter fw = null;
    try {
        fw = new FileWriter("testéörtkuoë");
        fw.write("testéörtkuoëéörtkuoë");
        fw.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I get the following output:
Charset = ISO-8859-1
test??rtkuo?
and I get a file called "test??rtkuo?"
Based on info I found on Stack Overflow, I tried calling the Java app with "-Dfile.encoding=UTF-8" added at startup.
This returns the following output:
Charset = UTF-8
testéörtkuoë
But the filename is still "test??rtkuo?"
Any help is much appreciated.
Stef
All these characters are present in ISO-8859-1. I suspect part of the problem is that the code editor is saving files in a different encoding to the one your operating system is using.
If the editor is using ISO-8859-1, I would expect it to encode ëéö as:
eb e9 f6
If the editor is using UTF-8, I would expect it to encode ëéö as:
c3ab c3a9 c3b6
Other encodings will produce different values.
The source file would be more portable if you used Unicode escape sequences. At least be certain your compiler is using the same encoding as the editor.
Examples:
ë \u00EB
é \u00E9
ö \u00F6
You can look up these values using the Unicode charts.
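For example, the filename from the question can be spelled entirely with escapes, so it compiles identically no matter what encoding the editor saved the source in (a minimal sketch; the class name is illustrative):
import java.io.FileWriter;
import java.io.IOException;

public class EscapeDemo {
    public static void main(String[] args) throws IOException {
        // "testéörtkuoë" with é, ö and ë written as \u00E9, \u00F6 and \u00EB
        String name = "test\u00E9\u00F6rtkuo\u00EB";
        try (FileWriter fw = new FileWriter(name)) {
            fw.write(name);
        }
    }
}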
Changing the default file encoding using -Dfile.encoding=UTF-8 might have unintended consequences for how the JVM interacts with the system.
There are parallels here with problems you might see on Windows.
I'm unable to reproduce the problem directly - my version of OpenSolaris uses UTF-8 as the default encoding.
If you attempt to list the filenames with the java io apis, what do you see? Are they encoded correctly? I'm curious as to whether the real problem is with encoding the filenames or with the tools that you are using to check them.
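For example, a quick check along these lines (a minimal sketch, not from the question) would show the names as the java.io APIs see them:
import java.io.File;

public class ListNames {
    public static void main(String[] args) {
        // Print every file name in the current directory as Java decodes it
        String[] names = new File(".").list();
        if (names != null) {
            for (String name : names) {
                System.out.println(name);
            }
        }
    }
}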
What happens when you do:
ls > testéörtkuoë
If that works (writes to the file correctly), then you know you can write to files with accents.
I had a similar problem. Contrary to that example, the program was unable to list the files correctly using System.out.println, even though ls was showing the correct values.
As described in the documentation, the system property file.encoding should not be used to define the charset and, in this case, the JVM ignores it.
The symptoms:
I could not type accents in the shell.
ls was showing the correct values.
File.list() was returning incorrect values.
The file.encoding property was not affecting the output.
The user.language and user.country properties were not affecting the output.
The solution:
Although the LC_* environment variables were set in the shell with values inherited from /etc/default/init, as listed by the set command, the locale command showed different values:
$ set | grep LC
LC_ALL=pt_BR.ISO8859-1
LC_COLLATE=pt_BR.ISO8859-1
LC_CTYPE=pt_BR.ISO8859-1
LC_MESSAGES=C
LC_MONETARY=pt_BR.ISO8859-1
LC_NUMERIC=pt_BR.ISO8859-1
LC_TIME=pt_BR.ISO8859-1
$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
The solution was simply exporting LANG. This environment variable really does affect the JVM:
LANG=pt_BR.ISO8859-1
export LANG
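With LANG exported, the JVM's default charset should follow; a one-liner to verify (a minimal sketch):
import java.nio.charset.Charset;

public class ShowCharset {
    public static void main(String[] args) {
        // Should now report ISO-8859-1 instead of the C locale's US-ASCII
        System.out.println(Charset.defaultCharset());
    }
}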
Java uses the operating system's default encoding when reading and writing files. One should never rely on that; it's always good practice to specify the encoding explicitly.
In Java you can use the following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
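Put together as a complete round trip (a minimal sketch; inputPath and outputPath come from the command line here, and StandardCharsets.UTF_8 also avoids the checked UnsupportedEncodingException that the string form can throw):
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class CopyUtf8 {
    public static void main(String[] args) throws IOException {
        String inputPath = args[0];   // placeholder paths
        String outputPath = args[1];
        try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(new FileInputStream(inputPath), StandardCharsets.UTF_8));
             PrintWriter pw = new PrintWriter(new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(outputPath), StandardCharsets.UTF_8)))) {
            String line;
            while ((line = br.readLine()) != null) {
                pw.println(line); // both sides explicitly UTF-8
            }
        }
    }
}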
tldr: I downgraded to JDK 17 (17.0.2) and now it works...
I was watching a beginner's Java tutorial by Kody Simpson on YT (youtube.com/watch?v=t9LP9Nt9Nco), and in that tutorial Kody prints crazy Unicode symbols like "☯Ωøᚙ", but for me it just prints "?" - a question mark.
char letter = '\u1699';
System.out.println(letter);
I tried pretty much every solution on Stack Overflow, such as:
Changing File Encoding to UTF-8, although mine was using UTF-8 by default.
Putting '-Dconsole.encoding=UTF-8' and '-Dfile.encoding=UTF-8' in the Edit Custom VM options.
Messing with Region Settings in control panel.
None of it worked.
Every post was also from many years ago, such as this one, which is from 12 years ago:
unicode characters appear as question marks in IntelliJ IDEA console
I ended up deleting and re-downloading IntelliJ because I thought I had messed up some settings and wanted a fresh start, but this time I made the Project SDK an older version, Oracle OpenJDK version 14.0.1, and somehow it worked and printed the 'ᚙ' symbol.
Then I realized the problem might be the latest version of the JDK, which is version 18, so I downloaded JDK 17.0.2, and BOOM, it still works and prints out the symbol 'ᚙ', so that's nice :). But when I switched back to JDK version 18 it just prints "?" again.
Also it's strange because I can copy-paste the ᚙ symbol into the code editor area, whatever you call it (on JDK version 18):
char letter = 'ᚙ';
System.out.println(letter);
But when I press RUN and try to PRINT ... it STILL GIVES QUESTION MARK.
I have no clue why this happens. I started learning coding 2 days ago, so I'm probably dumb, or the new version has a bug, but I never found a solution through Google or here, so this is why I'm making my first ever Stack Overflow post.
I can replicate your problem: printing works correctly when running your code if compiled with JDK 17, and fails when running your code if compiled with JDK 18.
One of the changes implemented in Java 18 was JEP 400: UTF-8 by Default. The summary for that JEP stated:
Specify UTF-8 as the default charset of the standard Java APIs. With this change, APIs that depend upon the default charset will behave consistently across all implementations, operating systems, locales, and configurations.
That sounds good, except one of the goals of that change was (with my emphasis added):
Standardize on UTF-8 throughout the standard Java APIs, except for console I/O.
So I think your problem arose because you had ensured that the console's encoding in Intellij IDEA was UTF-8, but the PrintStream that you were using to write to that console (i.e. System.out) was not.
The Javadoc for PrintStream states (with my emphasis added):
All characters printed by a PrintStream are converted into bytes using the given encoding or charset, or the default charset if not specified.
Since your PrintStream was System.out, you had not specified any "encoding or charset", and were therefore using the "default charset", which was presumably not UTF-8. So to get your code to work on Java 18, you just need to ensure that your PrintStream is encoding with UTF-8. Here's some sample code to show the problem and the solution:
package pkg;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
public class Humpty {

    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        char letter = 'ᚙ';
        String charset1 = System.out.charset().displayName(); // charset() requires JDK 18
        System.out.println("Writing the character " + letter + " to a PrintStream with charset " + charset1); // fails
        PrintStream ps = new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8);
        String charset2 = ps.charset().displayName(); // charset() requires JDK 18
        ps.println("Writing the character " + letter + " to a PrintStream with charset " + charset2); // works
    }
}
This is the output in the console when running that code:
C:\Java\jdk-18\bin\java.exe -javaagent:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\221.5080.93\lib\idea_rt.jar=64750:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\221.5080.93\bin -Dfile.encoding=UTF-8 -classpath C:\Users\johndoe\IdeaProjects\HelloIntellij\out\production\HelloIntellij pkg.Humpty
Writing the character ? to a PrintStream with charset windows-1252
Writing the character ᚙ to a PrintStream with charset UTF-8
Process finished with exit code 0
Notes:
PrintStream has a new method in Java 18 named charset() which "returns the charset used in this PrintStream instance". The code above calls charset(), and shows that for my machine my "default charset" is windows-1252, not UTF-8.
I used Intellij IDEA 2022.1 Beta (Ultimate Edition) for testing.
In the console I used font DejaVu Sans to ensure that the character "ᚙ" could be rendered.
UPDATE: To address the issue raised in the comments below by Mostafa Zeinali, the PrintStream used by System.out can be redirected to a UTF-8 PrintStream by calling System.setOut(). Here's sample code:
String charsetOut = System.out.charset().displayName();
if (!"UTF-8".equals(charsetOut)) {
    System.out.println("The charset for System.out is " + charsetOut + ". Changing System.out to use charset UTF-8");
    System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8));
    System.out.println("The charset for System.out is now " + System.out.charset().displayName());
}
This is the output from that code on my Windows 10 machine:
The charset for System.out is windows-1252. Changing System.out to use charset UTF-8
The charset for System.out is now UTF-8
Note that System.out is a final variable, so you can't directly assign a new PrintStream to it. This code fails to compile with the error "Cannot assign a value to final variable 'out'":
System.out = new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8); // Won't compile
TLDR: Use this on Java 18:
-Dfile.encoding="UTF-8" -Dsun.stdout.encoding="UTF-8" -Dsun.stderr.encoding="UTF-8"
From JEP 400:
There are three charset-related system properties used internally by the JDK. They remain unspecified and unsupported, but are documented here for completeness:
sun.stdout.encoding and sun.stderr.encoding — the names of the charsets used for the standard output stream (System.out) and standard error stream (System.err), and in the java.io.Console API.
sun.jnu.encoding — the name of the charset used by the implementation of java.nio.file when encoding or decoding filename paths, as opposed to file contents. On macOS its value is "UTF-8"; on other platforms it is typically the default charset.
As you can see, those two system properties "remain unspecified and unsupported", but they solved my problem. So please use them at your own risk, and DO NOT use them in a production environment. I'm running Eclipse on Windows 10, by the way.
I think there must be a good way to set the default charset of the JVM at startup, and it is stupid that passing -Dfile.encoding="UTF-8" does not do that. As you can read in JEP 400:
If file.encoding is set to "UTF-8" (i.e., java -Dfile.encoding=UTF-8), then the default charset will be UTF-8. This no-op value is defined in order to preserve the behavior of existing command lines.
And this is exactly what it is NOT doing. Passing -Dfile.encoding="UTF-8" does not preserve the behavior of existing command lines! I think this shows that Java 18's implementation of JEP 400 is not doing what it should actually be doing, which is the root of your problem in the first place.
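If you want to see what a given set of flags actually yields, here is a minimal probe of my own (not from JEP 400; System.out.charset() requires JDK 18):
import java.nio.charset.Charset;

public class CharsetProbe {
    public static void main(String[] args) {
        // Compare the JVM default charset with the one System.out actually encodes with
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("stdout charset:  " + System.out.charset()); // JDK 18+
    }
}
Running it as java -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 CharsetProbe should report UTF-8 for both.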
I had this trouble as well. Changing the setting (File > Settings... > Editor > General > Console) to UTF-32 helped solve the issue.
I have a situation where I have to read a CSV file which contains special characters like 'µ'. This has to be done using Java. I am doing:
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath), "UTF-8"));
On Windows it runs OK, but in a Red Hat Linux environment it converts those special characters to '?'. Any help is highly appreciated.
Output written to System.out will be encoded using the "platform default encoding", which on Linux is determined from the locale environment variables (see the output of the locale command), and those in turn are set in user- or system-level configuration files.
On server installations, the default encoding is often ASCII. "µ" is not an ASCII character so it will be converted to "?" when it is printed.
There are a couple of ways to change the default encoding:
Set the Java file.encoding system property when you run your program, e.g.
java -Dfile.encoding=utf-8 yourprogram
Set LC_CTYPE env variable before you run your program, for example:
export LC_CTYPE=en_US.UTF-8
java yourprogram
Those methods also change the default encoding for input and file names etc. You can change the encoding specifically for System.out with Java code:
PrintStream originalOut = System.out; // in case you need it later
System.setOut(new PrintStream(System.out, true, "utf-8"));
I'm writing a project which parses a UTF-8 encoded file.
I'm doing it this way
ArrayList<String> al = new ArrayList<>();
BufferedReader bufferedReader = new BufferedReader(
        new InputStreamReader(new FileInputStream(filename), "UTF8"));
String line = null;
while ((line = bufferedReader.readLine()) != null)
{
    al.add(line);
}
return al;
The strange thing is that it reads the file properly when I run it in IntelliJ, but not when I run it through java -jar (it gives me garbage instead of the UTF-8 text).
What can I do to either
Run my Java through java -jar in the same environment as IntelliJ, or
Fix my code so that it reads UTF-8 into the string
I think that what is going on here is that you just don't have your terminal set up correctly for your default encoding. Basically, if your program runs correctly, then it's grabbing the UTF-8 bytes, storing them as Java strings, then outputting them to the terminal in whatever the default encoding scheme is. To find out what your default encoding scheme is, see this question. Then you need to ensure that the terminal you are running your java -jar command from is compatible with it. For example, see my terminal settings/preferences on my Mac.
Oracle docs give a pretty straightforward answer about Charset:
Standard charsets
Every implementation of the Java platform is required to support the following standard charsets. Consult the release documentation for your implementation to see if any other charsets are supported. The behavior of such optional charsets may differ between implementations.
...
UTF-8
Eight-bit UCS Transformation Format
So you should use new InputStreamReader(new FileInputStream(filename), "UTF-8")
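A minimal alternative using java.nio (my sketch; Path.of requires Java 11+, and StandardCharsets.UTF_8 is the constant form of the same standard charset):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class Utf8Lines {
    public static void main(String[] args) throws IOException {
        // Files.readAllLines decodes with the charset you pass, independent of the default
        List<String> lines = Files.readAllLines(Path.of(args[0]), StandardCharsets.UTF_8);
        lines.forEach(System.out::println);
    }
}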
I download a file from a website using a Java program, and the header looks like this:
Content-Disposition: attachment;filename="Textkürzung.asc";
There is no encoding specified.
What I do is after downloading I pass the name of the file to another application for further processing. I use
System.out.println(filename);
In the standard out the string is printed as Textk³rzung.asc
How can I change the Standard Out to "UTF-8" in Java?
I tried to encode to "UTF-8" and the content is still the same
Update:
I was able to fix this without any code change. In the place where I call my jar file from the other application, I did the following:
java -Dfile.encoding=UTF-8 -jar ....
This seems to have fixed the issue.
Thank you all for your support.
The default encoding of System.out is the operating system default. On international versions of Windows this is usually the windows-1252 codepage. If you're running your code on the command line, that is also the encoding the terminal expects, so special characters are displayed correctly. But if you are running the code some other way, or sending the output to a file or another program, it might be expecting a different encoding. In your case, apparently, UTF-8.
You can actually change the encoding of System.out by replacing it:
try {
    System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    throw new InternalError("VM does not support mandatory encoding UTF-8");
}
This works for cases where using a new PrintStream is not an option, for instance because the output is coming from library code which you cannot change, and where you have no control over system properties, or where changing the default encoding of all files is not appropriate.
The result you're seeing suggests your console expects text to be in Windows "code page 850" encoding - the character ü has Unicode code point U+00FC, which is the byte 0xFC in ISO-8859-1, and that byte renders in Windows code page 850 as ³. So if you want the name to appear correctly on the console then you need to print it using the encoding "Cp850":
PrintWriter consoleOut = new PrintWriter(new OutputStreamWriter(System.out, "Cp850"));
consoleOut.println(filename);
Whether this is what your "other application" expects is a different question - the other app will only see the correct name if it is reading its standard input as Cp850 too.
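If the other application happens to be Java as well, the matching read side would be (a sketch under that assumption):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadCp850 {
    public static void main(String[] args) throws IOException {
        // Decode standard input with the same code page the sender used
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "Cp850"));
        String filename = in.readLine();
        System.out.println("received: " + filename);
    }
}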
Try to use:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(filename);
I am trying to write a Java app that will run on a Linux server but that will process files generated on legacy Windows machines using cp-1252 as the character set. Is there any way to encode these files as UTF-8 instead of the cp-1252 they are generated as?
If the file names as well as the content are a problem, the easiest way to solve it is setting the locale on the Linux machine to something based on ISO-8859-1 rather than UTF-8. You can use locale -a to list the available locales. For example, if you have en_US.iso88591 you could use:
export LANG=en_US.iso88591
This way Java will use ISO-8859-1 for file names, which is probably good enough. To run the Java program you still have to set the file.encoding system property:
java -Dfile.encoding=cp1252 -cp foo.jar:bar.jar blablabla
If no ISO-8859-1 locale is available you can generate one with localedef. Installing it requires root access though. In fact, you could generate a locale that uses CP-1252, if it is available on your system. For example:
sudo localedef -f CP1252 -i en_US en_US.cp1252
export LANG=en_US.cp1252
This way Java should use CP1252 by default for all I/O, including file names.
Expanded further here: http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/
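To confirm what the JVM actually picked up from the environment, you can print the relevant properties (a minimal sketch; sun.jnu.encoding is the unsupported internal property quoted from JEP 400 earlier):
import java.nio.charset.Charset;

public class LocaleCheck {
    public static void main(String[] args) {
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        System.out.println("default charset  = " + Charset.defaultCharset());
    }
}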
You can read and write text data in any encoding that you wish. Here's a quick code example:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

public class Recode {
    public static void main(String[] args) throws Exception {
        // List all supported encodings
        for (String cs : Charset.availableCharsets().keySet())
            System.out.println(cs);

        File file = new File("SomeWindowsFile.txt");
        StringBuilder builder = new StringBuilder();

        // Construct a reader for a specific encoding
        Reader reader = new InputStreamReader(new FileInputStream(file), "windows-1252");
        int c;
        while ((c = reader.read()) != -1) {
            builder.append((char) c); // cast: read() returns an int, not a char
        }
        reader.close();

        String string = builder.toString();

        // Construct a writer for a specific encoding
        Writer writer = new OutputStreamWriter(new FileOutputStream(file), "UTF8");
        writer.write(string);
        writer.flush();
        writer.close();
    }
}
If this still 'chokes' on read, see if you can verify that the original encoding is what you think it is. In this case I've specified windows-1252, which is the Java charset name for cp-1252.