Java program doesn't print cyrillic characters, but only question marks

My code (Qwe.java):
public class Qwe {
    public static void main(String[] args) {
        System.out.println("тест привет");
    }
}
where "тест привет" is Russian text. Qwe.java is saved as UTF-8.
On my machine (Ubuntu 14.04) the result is
тест привет
On the server (Ubuntu 12.04) I get:
???? ??????
After running
$ java Qwe > test.txt
test.txt also contains
???? ??????

I fixed it simply by using export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8

The Java source text must be in the encoding that the javac compiler expects. That seems to have been the case here, and UTF-8 is of course ideal.
The file Qwe.class is fine; internally it uses Unicode for String. The output to the console, however, uses the server's platform encoding. That is, Java converts the Unicode text to bytes using, most likely, the default (platform) encoding, and on the server that encoding cannot represent Cyrillic.
So when you write to a file, do not use FileWriter (a convenience class that relies on the default encoding and is suitable for local files only); specify the encoding explicitly instead:
... new OutputStreamWriter(new FileOutputStream(file), "UTF-8")
You can also set the user locale on the server, but that is not my area.
In general I would switch to a file logger.

I am not sure, but the output might only accept ASCII characters unless some extension is configured. My best guess is that it cannot map the characters and outputs a placeholder in their place.
"Java, any unknown character which is passed through the write() methods of an OutputStream get printed as a plain question mark “?”"
(quoted from a linked source)
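The question-mark substitution described above can be reproduced without a server. The following is a minimal sketch, assuming the server's default encoding behaves like US-ASCII (the thread does not say which encoding it actually was); the class and method names are made up for illustration. String.getBytes(Charset) replaces every character the charset cannot represent with the charset's replacement byte, which for US-ASCII is '?'.

```java
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {

    // Encodes the text as US-ASCII and decodes it again. Characters that
    // US-ASCII cannot represent are replaced with '?' during encoding.
    static String roundTripAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Unicode escapes for the Russian test string keep this source pure ASCII.
        String text = "\u0442\u0435\u0441\u0442 \u043f\u0440\u0438\u0432\u0435\u0442";
        System.out.println(roundTripAscii(text)); // prints "???? ??????"
    }
}
```

The result has the same shape as the server output in the question, which is consistent with a default encoding that has no Cyrillic letters.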

Related

Writing strings with chars like "ñ" to a txt file

I'm having a strange issue trying to write strings that contain characters like "ñ", "á" and so on to text files. Let me first show you my little piece of code:
import java.io.*;
public class test {
    public static void main(String[] args) throws Exception {
        String content = "whatever";
        int c;
        c = System.in.read();
        content = content + (char)c;
        FileWriter fw = new FileWriter("filename.txt");
        BufferedWriter bw = new BufferedWriter(fw);
        bw.write(content);
        bw.close();
    }
}
In this example, I'm just reading a char from the keyboard input and appending it to a given string, then writing the final string into a txt file. The problem is that if I type an "ñ", for example (I have a Spanish-layout keyboard), the txt file shows a strange char "¤" where the "ñ" should be; that is, the content of the file is "whatever¤". The same happens with "ç", "ú", etc. However, it is written fine ("whateverñ") if I forget about the keyboard input and write:
...
String content = "whateverñ";
...
or
...
content = content + "ñ";
...
It makes me think that there might be something wrong with the read() method. Or maybe I'm using it wrongly? Or should I use a different method to get the keyboard input? I'm a bit lost here.
(I'm using JDK 7u45 on Windows 7 Pro x64)

So ...
It works (i.e. you can read the accented characters in the output file) if you write them as literal strings.
It doesn't work when you read them from System.in and then write them.
This suggests that the problem is on the input side. Specifically, I think your console / keyboard must be using a character encoding for the input stream that does not match the encoding that Java thinks should be used.
You should be able to confirm this tentative diagnosis by outputting the characters you are reading in hexadecimal, and then checking the codes against the Unicode tables (which you can find at unicode.org, for example).
It strikes me as "odd" that the "platform default encoding" appears to be working on the output side, but not the input side. Maybe someone else can explain ... and offer a concrete suggestion for fixing it. My gut feeling is that the problem is in the way your keyboard is configured, not in Java or your application.
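The hexadecimal check suggested above could look like the sketch below; the class and helper names are hypothetical. Note that System.in.read() returns one raw byte (0 to 255), and the cast to char in the question treats that byte value as a Unicode code point. If the program prints U+00A4 after "ñ" is typed, the console sent byte 0xA4, which is where "ñ" sits in the OEM code pages 437 and 850; whether the asker's console used one of those is an assumption.

```java
public class CodePointDump {

    // Formats each code point of the text as U+XXXX, separated by spaces.
    static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws java.io.IOException {
        // One raw byte from the console, or -1 at end of input.
        int c = System.in.read();
        if (c >= 0) {
            // Same cast as in the question, so the dump shows what gets written.
            System.out.println(toHex(String.valueOf((char) c)));
        }
    }
}
```

Comparing the printed value with U+00F1, the code point of "ñ", shows directly whether the byte was misinterpreted on the way in.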

Files do not remember their encoding; when you look at a .txt file, the text editor makes a "best guess" at the encoding used.
If you read the file back into your program, the text should be back to normal.
Also, try printing the "strange" character directly.

Chinese Characters Displayed as Question Marks in Mac Terminal

I am trying to retrieve some UTF-8 encoded Chinese characters from a database using a Java program. When I do this, the characters are returned as question marks.
However, when I display the characters from the database (using select * from ...) the characters are displayed normally. When I print a String in a Java file consisting of Chinese characters, they are also printed normally.
I had this problem in Eclipse: when I ran the program, the characters were being printed as question marks. However this problem was solved when I saved the Java file in UTF-8 format.
Running "locale" in the terminal currently returns this:
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
I have also tried to compile my java file using this:
javac -encoding UTF-8 [java file]
But still, the output is question marks.
It's quite strange how it will only sometimes display the characters. Does anyone have an explanation for this? Or even better, how to fix this so that the characters are correctly displayed?

The System.out PrintStream isn't created as a UTF-8 print stream. You can wrap it in one like this:
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
public class JavaTest {
    public static void main(String[] args) {
        try {
            PrintStream out = new PrintStream(System.out, true, "UTF-8");
            out.println("Hello");
            out.println("施华洛世奇");
            out.println("World");
        } catch (UnsupportedEncodingException e) {
            // Should not happen: every Java platform is required to support UTF-8.
            throw new AssertionError(e);
        }
    }
}
You can also set the default encoding when launching the JVM:
java -Dfile.encoding=UTF-8 -jar JavaTest.jar

Failed to check whether a file with a German name exists in the file system

Background:
I have two machines: one running German Windows 7, and my PC running English Windows 7 (with a Hebrew locale).
In my Perl code I'm trying to check whether a file that I got from the German machine exists on my machine.
The file name is ßßßzllpoöäüljiznppü.txt
Why does the following code fail?
use Encode;
use Encode::Locale;
sub UTF8ToLocale
{
    my $str = decode("utf8", $_[0]);
    return encode("locale", $str);
}
if (!-e UTF8ToLocale($read_file))
{
    print "failed to open the file";
}
else
{
    print $read_file;
}
The same thing happens when I try to open the file:
open (wtFile, ">", UTF8ToLocale($read_file));
binmode wtFile;
shift @_;
print wtFile @_;
close wtFile;
The file name is converted from German to UTF-8 in my Java application and then passed to the Perl script.
The Perl script takes this file name and converts it from UTF-8 to the system locale (see the UTF8ToLocale($read_file) call), and I believe that is where the problem is.
Questions:
What charset encoding does the OS file system use?
When I create a German file name on an OS whose locale is Hebrew, in which charset is it saved?
How do I solve this problem?
Update:
Here is another piece of code that I ran on my PC with a hard-coded file name; the script file is UTF-8 encoded:
use Encode;
use Encode::Locale;
my $string = encode("utf-16", decode("utf8", "C:\\TestPerl\\ßßßzllpoöäüljiznppü.txt"));
if (-e $string)
{
    print "exists\r\n";
}
else
{
    print "not exists\r\n";
}
The output is "not exists".
I also tried different charsets (cp1252, cp850, utf-16le); nothing works.
If I change the file name to English or Hebrew (my default locale), it works.
Any ideas?

Windows 7 uses UTF-16 internally [citation needed] (I don't remember the byte order), so you don't need to convert file names. However, if you transport the file via a FAT file system (e.g. an old USB stick) or another file system that is not Unicode-aware, this benefit is lost.
The locale setting you are talking about only affects the language of the user interface and the apparent folder names (Programme (x86) vs. Program Files (x86) with the latter being the real name in the file system).
The larger problem I can see is the internal encoding of the file contents that you want to transfer as some applications may default to different encodings depending on the locale. There is no solution to that except being explicit when the file is created. Sticking to UTF-8 is generally a good idea.
And why do you convert the file names with another tool? Any Unicode encoding should be sufficient for transfer.
Your script does not work because you reference an undefined global variable called $read_file. Assuming your second code block is not enclosed in any scope, especially not in a sub, the @_ variable is not available. To get command-line arguments, consider using the @ARGV array. The logic of your script isn't clear anyway: you print error messages to STDOUT, not STDERR; you "decode" the file name and then print the non-decoded string in your else branch; and you are paranoid about encodings (which is generally good) but you don't specify an encoding for your output stream.

Greek String doesn't match regex when read from keyboard

public static void main(String[] args) throws IOException {
    String str1 = "ΔΞ123456";
    System.out.println(str1+"-"+str1.matches("^\\p{InGreek}{2}\\d{6}")); //ΔΞ123456-true
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    String str2 = br.readLine(); //ΔΞ123456 same as str1.
    System.out.println(str2+"-"+str2.matches("^\\p{InGreek}{2}\\d{6}")); //Ξ”Ξ�123456-false
    System.out.println(str1.equals(str2)); //false
}
The same String doesn't match regex when read from keyboard.
What causes this problem, and how can we solve this?
Thanks in advance.
EDIT: I used System.console() for input and output.
public static void main(String[] args) throws IOException {
    PrintWriter pr = System.console().writer();
    String str1 = "ΔΞ123456";
    pr.println(str1+"-"+str1.matches("^\\p{InGreek}{2}\\d{6}")+"-"+str1.length());
    String str2 = System.console().readLine();
    pr.println(str2+"-"+str2.matches("^\\p{InGreek}{2}\\d{6}")+"-"+str2.length());
    pr.println("str1.equals(str2)="+str1.equals(str2));
}
Output:
ΔΞ123456-true-8
ΔΞ123456
ΔΞ123456-true-8
str1.equals(str2)=true

There are multiple places where transcoding errors can take place here.
Ensure that your class is being compiled correctly (unlikely to be an issue in an IDE):
  - Ensure that the compiler is using the same encoding as your editor (i.e. if you save as UTF-8, set your compiler to use that encoding)
  - Or switch to escaping to the ASCII subset that most encodings are a superset of (i.e. change the string literal to "\u0394\u039e123456")
Ensure you are reading input using the correct encoding:
  - Use the Console to read input - this class will detect the console encoding
  - Or configure your Reader to use the correct encoding (probably windows-1253) or set the console to Java's default encoding
Note that System.console() returns null in an IDE, but there are things you can do about that.
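To see why a decoding mismatch makes the regex fail, here is a small sketch that simulates it in memory. It assumes the console delivers UTF-8 bytes and the reader decodes them with a different single-byte charset. The charsets actually involved on the asker's machine are not known, and ISO-8859-1 stands in for the wrong charset here only because every Java platform must support it. The class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

public class GreekDecodingDemo {

    static final String PATTERN = "^\\p{InGreek}{2}\\d{6}";

    // Simulates a reader that decodes UTF-8 input bytes with the wrong charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Escaped form of the literal from the question (Delta, Xi, six digits).
        String expected = "\u0394\u039e123456";
        String garbled = misdecode(expected);
        System.out.println(expected.matches(PATTERN)); // true
        System.out.println(garbled.matches(PATTERN));  // false
        System.out.println(garbled.length());          // 10
    }
}
```

Each Greek letter occupies two bytes in UTF-8, so the wrongly decoded string starts with four non-Greek characters instead of two Greek ones, and neither matches() nor equals() can succeed.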

If you use Windows, it may be caused by the fact that the console character encoding ("OEM code page") is not the same as the system encoding ("ANSI code page").
An InputStreamReader without an explicit encoding parameter assumes the input data to be in the system default encoding, so characters read from the console are decoded incorrectly.
To read non-US-ASCII characters correctly in the Windows console, you need to specify the console encoding explicitly when constructing the InputStreamReader (the required code page number can be found by executing mode con cp on the command line):
BufferedReader br = new BufferedReader(
    new InputStreamReader(System.in, "CP737"));
The same problem applies to the output; you need to construct the PrintWriter with the proper encoding:
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, "CP737"));
Note that since Java 1.6 you can avoid these workarounds by using the Console object obtained from System.console(). It provides a Reader and a Writer with correctly configured encoding, as well as some utility methods.
However, System.console() returns null when streams are redirected (for example, when running from IDE). A workaround for this problem can be found in McDowell's answer.
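One possible way to combine the two approaches is sketched below: prefer the Console when one is attached and fall back to an explicitly decoded stream otherwise. This is not necessarily the workaround the answer refers to. The helper name lineReader and the choice of UTF-8 as the fallback charset are assumptions for illustration, and the right fallback depends on where the redirected input comes from.

```java
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ConsoleInput {

    // Uses the Console reader when a console is attached; otherwise decodes
    // the raw stream with the charset supplied by the caller.
    static BufferedReader lineReader(Console console, InputStream in, Charset fallback) {
        if (console != null) {
            return new BufferedReader(console.reader());
        }
        return new BufferedReader(new InputStreamReader(in, fallback));
    }

    public static void main(String[] args) throws IOException {
        // Assumption for this sketch: redirected input is UTF-8.
        Charset fallback = Charset.forName("UTF-8");
        BufferedReader reader = lineReader(System.console(), System.in, fallback);
        String line = reader.readLine();
        System.out.println(line != null && line.matches("^\\p{InGreek}{2}\\d{6}"));
    }
}
```

Passing the console and the stream in as parameters keeps the helper usable in an IDE or a test, where System.console() may be null.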
See also:
Code page

I get true in both cases with nothing changed in your code. (I tested with a Greek-layout keyboard - I'm from Greece :])
Probably your keyboard input arrives encoded as ISO 8859-7 and not UTF-8. Mine arrives as UTF-8.
EDIT: I still get true after adding the equals check:
System.out.println(str1.equals(str2));
Check whether you can get it working by changing everything to Greek in the regional options (if you are using Windows):
Rundll32 Shell32.dll,Control_RunDLL Intl.cpl,,0
If this is the case, then you can act accordingly, as 'axtavt' said.

The keyboard is likely not sending the characters as UTF-8, but as the operating system's default character encoding.
See also
Java : How to determine the correct charset encoding of a stream
Java App : Unable to read iso-8859-1 encoded file correctly

Accents in file name using Java on Solaris

I have a problem where I can't write files with accents in the file name on Solaris.
Given the following code
public static void main(String[] args) {
    System.out.println("Charset = "+ Charset.defaultCharset().toString());
    System.out.println("testéörtkuoë");
    FileWriter fw = null;
    try {
        fw = new FileWriter("testéörtkuoë");
        fw.write("testéörtkuoëéörtkuoë");
        fw.close();
    } catch (IOException e) {
        // The original post was cut off here; minimal handling added to close the block.
        e.printStackTrace();
    }
}
I get the following output
Charset = ISO-8859-1
test??rtkuo?
and I get a file called "test??rtkuo?"
Based on info I found on StackOverflow, I tried to call the Java app by adding "-Dfile.encoding=UTF-8" at startup.
This returns following output
Charset = UTF-8
testéörtkuoë
But the filename is still "test??rtkuo?"
Any help is much appreciated.
Stef

All these characters are present in ISO-8859-1. I suspect part of the problem is that the code editor is saving files in a different encoding to the one your operating system is using.
If the editor is using ISO-8859-1, I would expect it to encode ëéö as:
eb e9 f6
If the editor is using UTF-8, I would expect it to encode ëéö as:
c3ab c3a9 c3b6
Other encodings will produce different values.
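The byte values above can be checked with a few lines of Java. This is a small sketch with a made-up helper name; it encodes the three characters (written as Unicode escapes so that the source file's own encoding does not matter) in both charsets and prints the bytes in hex.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedBytes {

    // Returns the bytes of the text in the given charset as lowercase hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e-diaeresis, e-acute, o-diaeresis
        String text = "\u00eb\u00e9\u00f6";
        System.out.println(hex(text, StandardCharsets.ISO_8859_1)); // eb e9 f6
        System.out.println(hex(text, StandardCharsets.UTF_8));      // c3 ab c3 a9 c3 b6
    }
}
```

Running the same helper on a file name returned by the java.io APIs is one way to find out which encoding the JVM used for it.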
The source file would be more portable if you used Unicode escape sequences. At least be certain your compiler is using the same encoding as the editor.
Examples:
ë \u00EB
é \u00E9
ö \u00F6
You can look up these values using the Unicode charts.
Changing the default file encoding using -Dfile.encoding=UTF-8 might have unintended consequences for how the JVM interacts with the system.
There are parallels here with problems you might see on Windows.
I'm unable to reproduce the problem directly - my version of OpenSolaris uses UTF-8 as the default encoding.

If you attempt to list the filenames with the java io apis, what do you see? Are they encoded correctly? I'm curious as to whether the real problem is with encoding the filenames or with the tools that you are using to check them.

What happens when you do:
ls > testéörtkuoë
If that works (writes to the file correctly), then you know you can write to files with accents.

I had a similar problem. Contrary to that example, the program was unable to list the files correctly using System.out.println, even though ls showed the correct values.
As described in the documentation, the system property file.encoding should not be used to define the charset and, in this case, the JVM ignores it.
The symptoms:
I could not type accents in the shell.
ls showed the correct values.
File.list() printed incorrect values.
The file.encoding property did not affect the output.
The user.(language|country) properties did not affect the output.
The solution:
Although the LC_* environment variables were set in the shell with values inherited from /etc/default/init, as listed by the set command, locale showed different values.
$ set | grep LC
LC_ALL=pt_BR.ISO8859-1
LC_COLLATE=pt_BR.ISO8859-1
LC_CTYPE=pt_BR.ISO8859-1
LC_MESSAGES=C
LC_MONETARY=pt_BR.ISO8859-1
LC_NUMERIC=pt_BR.ISO8859-1
LC_TIME=pt_BR.ISO8859-1
$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
The solution was simply to export LANG. This environment variable really does affect the JVM:
LANG=pt_BR.ISO8859-1
export LANG

Java uses the operating system's default encoding when reading and writing files unless told otherwise. One should never rely on that; it is always good practice to specify the encoding explicitly.
In Java you can use the following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
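Put together, the two constructions above give a round trip that does not depend on the default encoding. The following is a minimal sketch; the class name, the helper names and the temporary file are made up for illustration, and the sample text is the accented string from the question written with Unicode escapes.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

public class ExplicitEncodingRoundTrip {

    // Writes one line of text to the file, encoded as UTF-8.
    static void writeLine(File file, String text) throws IOException {
        try (PrintWriter pw = new PrintWriter(new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), "UTF-8")))) {
            pw.println(text);
        }
    }

    // Reads the first line of the file, decoding it as UTF-8.
    static String readLine(File file) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("roundtrip", ".txt");
        String original = "test\u00e9\u00f6rtkuo\u00eb";
        writeLine(file, original);
        System.out.println(original.equals(readLine(file))); // true
        file.delete();
    }
}
```

This covers the file contents only; how the file name itself is encoded is decided by the JVM and the locale, as discussed in the answers above.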
