Can't properly print non-English characters like ë in Windows console


For some weird reason I can't seem to print ë in Java.
public class Eindopdracht0002test
{
    public static void main(String[] args)
    {
        System.out.println("\u00EB");
    }
}
It's supposed to print "België" (Dutch for Belgium), but it prints "Belgi├½" instead.
Does anyone know how to resolve this?

In UTF-8 ë is written as 11000011 10101011 (source: https://unicode-table.com/en/00EB).
The Windows console uses code pages, which are 8-bit mappings from byte values to characters (you can check your console's code page with the chcp command). This means that when ë is sent to the output stream (console) as the two bytes 11000011 10101011, the console treats them as two separate characters, which in code page 850 (based on your comments) are mapped to:
├ - 11000011 (195 in decimal)
½ - 10101011 (171 in decimal)
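The effect can be reproduced without a console. The sketch below (class and method names are made up for illustration) encodes ë as UTF-8 and then decodes those two bytes with code page 850, which is roughly what the console does. It assumes the JDK's extended charsets, including IBM850, are available; that is the case for a standard JDK but may not hold for a stripped-down runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text as UTF-8, then decode the bytes with another charset,
    // mimicking a console that interprets UTF-8 output using its code page.
    static String misread(String text, String consoleCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(consoleCharset));
    }

    public static void main(String[] args) {
        String shown = misread("Belgi\u00EB", "IBM850");
        // Print code points so the result does not depend on the console.
        shown.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

The last two code points printed are U+251C (├) and U+00BD (½), matching the garbled output in the question.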
If you don't want to use UTF-8 encoding, you can create a separate Writer and specify a different encoding, which will translate characters to bytes according to that encoding. To do so you can use the constructor
OutputStreamWriter(OutputStream out, String charsetName)
which in your case may look like
OutputStreamWriter osw = new OutputStreamWriter(System.out, "cp850");
// needed encoding ------------------------------------------^^^^^
since you want to send characters with the specified encoding to the standard output stream.
To use the println method and make sure it flushes its data automatically, you can wrap the OutputStreamWriter in
PrintWriter(Writer out, boolean autoFlush)
like
PrintWriter out = new PrintWriter(osw, true);
You can also do both these things in one line:
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, "cp850"), true);
Now if you use out.println("\u00EB"); it should recognize the ë character, use the cp850 encoding to look up its mapping (which is 137), and send the proper byte representation (here 10001001) to System.out (the console).
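To check the byte that such a writer produces, the following sketch sends the output to an in-memory buffer instead of the console. The class and helper names are hypothetical, and the charset is passed as a Charset object rather than a name so that no checked exception has to be handled; it again assumes cp850 is available in the running JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

public class Cp850WriterDemo {
    // Build a PrintWriter that encodes characters with the given charset
    // and flushes automatically after every println.
    static PrintWriter writerFor(OutputStream target, String charsetName) {
        return new PrintWriter(
                new OutputStreamWriter(target, Charset.forName(charsetName)), true);
    }

    // Return the bytes produced when text is printed through such a writer.
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintWriter out = writerFor(buffer, charsetName);
        out.print(text);
        out.flush(); // print() does not auto-flush, so flush explicitly
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("\u00EB", "cp850");
        System.out.println(bytes[0] & 0xFF); // unsigned value of the single byte: 137

        // On a console that really uses code page 850, this shows the word correctly.
        PrintWriter console = writerFor(System.out, "cp850");
        console.println("Belgi\u00EB");
    }
}
```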

Related

Reading characters with UTF-8 standards from console in java

I want to read some Unicode characters from console (Farsi Characters).
I have used System.in but it didn't work. It looks like standard input does not understand the characters I'm typing, so it just returns garbage to my String variable. I am absolutely sure that the String variable's encoding is set to "UTF-8". Believe me, I double-checked.
Some pieces of code that I tried.
String t = new String (new Scanner(System.in).nextLine().getBytes() , "UTF-8");
didn't work.
byte b[] = new byte[4];
System.in.read(b);
String st = new String (b , "UTF-16");
System.out.println(st);
I wrote the above code for reading just one Farsi character. didn't work either.

First of all, the console must be in UTF-8 mode.
If using NetBeans, edit the file <NetBeansRoot>/etc/netbeans.conf.
Under netbeans_default_options, add -J-Dfile.encoding=UTF-8.
Once you're sure the console and your project encoding are set to UTF-8, try this:
Scanner console = new Scanner(new InputStreamReader(System.in, "UTF-8"));
while (console.hasNextLine())
    System.out.println(console.nextLine());
Note: System.in is an InputStream, i.e. a stream of bytes. It delivers the bytes from the console 1-to-1.
To read characters you need a Reader. A Reader takes an InputStream and an encoding, and produces characters.
If it doesn't help, try another console (e.g. Windows cmd, but first run chcp 65001).
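The difference between raw bytes and decoded characters can be seen with a simulated input stream. This is only a sketch: the class name, the helper and the sample word are made up for illustration, and a ByteArrayInputStream stands in for System.in.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderDecodeDemo {
    // Decode one line from a byte stream using an explicit charset.
    static String readLine(InputStream in, Charset charset) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String typed = "\u0633\u0644\u0627\u0645"; // a four-letter Farsi word
        byte[] raw = (typed + "\n").getBytes(StandardCharsets.UTF_8);

        String ok = readLine(new ByteArrayInputStream(raw), StandardCharsets.UTF_8);
        String bad = readLine(new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1);

        System.out.println(ok.equals(typed));  // true: the decoder matches the bytes
        System.out.println(bad.equals(typed)); // false: same bytes, wrong charset
    }
}
```

Once the bytes have been decoded with the wrong charset, calling getBytes() on the resulting String and decoding again, as in the question's first attempt, generally cannot be relied on to repair it. The charset has to be right at the point where bytes become characters.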

Java program doesn't print cyrillic characters, but only question marks

My code (Qwe.java):
public class Qwe {
    public static void main(String[] args) {
        System.out.println("тест привет");
    }
}
where
тест привет
is Russian text.
Qwe.java is saved in UTF-8.
On my machine (Ubuntu 14.04) the result is
тест привет
On the server (Ubuntu 12.04) I get:
???? ??????
$ java Qwe > test.txt
In test.txt I see
???? ??????

I fixed it by setting export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8

The java source text must use the same encoding as the javac compiler. That seems to have been the case, and UTF-8 is of course ideal.
The file Qwe.class is then fine, since Strings internally use Unicode. The output to the console uses the server's platform encoding. That is, Java converts the Unicode text to bytes using, most likely, the default (platform) encoding, and that encoding cannot handle Cyrillic.
So you have to write to a file, never using FileWriter (a utility class for local files only), but using:
... new OutputStreamWriter(new FileOutputStream(file), "UTF-8")
You can also set user locales on the server, but that is not my area of expertise.
In general I would switch to a file logger.
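A minimal sketch of writing the text to a file with an explicit encoding, as suggested above. The class name, the helper names and the temporary file are assumptions made for illustration; the file is read back as UTF-8 to show that nothing was lost, whatever the platform default is.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class Utf8FileDemo {
    // Write text to a file as UTF-8, independent of the platform encoding.
    static void writeUtf8(File file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Write text to a temporary file and read it back as UTF-8.
    static String roundTrip(String text) {
        try {
            File file = File.createTempFile("utf8-demo", ".txt");
            try {
                writeUtf8(file, text);
                byte[] bytes = Files.readAllBytes(file.toPath());
                return new String(bytes, StandardCharsets.UTF_8);
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The Russian text from the question, written with Unicode escapes.
        String text = "\u0442\u0435\u0441\u0442 \u043F\u0440\u0438\u0432\u0435\u0442";
        System.out.println(text.equals(roundTrip(text))); // true
    }
}
```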

I am not sure but it might only accept ASCII characters from the english language unless you have some extension or something. But like I said my best guess is it is not finding the characters and just outputting garbage in their place.
"Java, any unknown character which is passed through the write() methods of an OutputStream get printed as a plain question mark “?”"
as taken from here
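The question marks can be reproduced directly. In this sketch (names are illustrative) the replacement happens when the characters are encoded to bytes with a charset that has no mapping for them; US-ASCII is used here as a stand-in for whatever the server's default encoding was, which the question does not state.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encode text with the given charset and decode it again with the same
    // charset, showing what survives the trip through that encoding.
    static String throughCharset(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The Russian text from the question, written with Unicode escapes.
        String text = "\u0442\u0435\u0441\u0442 \u043F\u0440\u0438\u0432\u0435\u0442";
        System.out.println(throughCharset(text, StandardCharsets.US_ASCII)); // ???? ??????
        System.out.println(throughCharset(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```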

reading unicode data from a file

The default encoding is ISO-8859-1.
BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream("file having unicode characters"), "UTF-8"));
String strTemp = bis.readLine(); // when debugging, strTemp holds the actual Unicode data
System.out.println(strTemp); // uses the default encoding (ISO-8859-1), so the actual data is not printed
PrintStream psTemp = new PrintStream(System.out, true, "UTF-8");
psTemp.println(strTemp); // the encoding is given as UTF-8 here, but the Unicode data is still not printed
Even if I specify UTF-8 in the PrintStream constructor, I am not able to print the Unicode data. If I change the default encoding from ISO-8859-1 to UTF-8, it works. Why is this so?

"If I change the default encoding from ISO-8859-1 to UTF-8, it works. Why is this so?"
I expect that this works because it is telling your console / shell / whatever is displaying the characters to expect UTF-8 characters. If the default behaviour is to expect ISO-8859-1, then sending it UTF-8 is not going to work.

Are you printing to the Eclipse console or to the shell? Try printing to a file and check the result.
For example, the Windows shell typically uses an OEM code page such as "cp850" by default. The problem might be caused by the OS shell, not the JVM.
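One way to separate the JVM from the console is to capture what the PrintStream actually emits. The sketch below (class and method names are hypothetical) prints into an in-memory buffer and compares the result with the expected UTF-8 bytes. If they match, the PrintStream is doing its job and the display problem lies in whatever interprets those bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PrintStreamBytesDemo {
    // Capture the bytes a PrintStream with the given encoding writes for text.
    static byte[] printed(String text, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream ps = new PrintStream(buffer, true, charsetName);
            ps.print(text);
            ps.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u0420\u0443\u0441\u0441\u043A\u0438\u0439"; // a Russian word
        byte[] bytes = printed(text, "UTF-8");
        // The stream produced valid UTF-8.
        System.out.println(Arrays.equals(bytes, text.getBytes(StandardCharsets.UTF_8))); // true
        // A consumer that assumes ISO-8859-1 sees one character per byte instead.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1).equals(text)); // false
    }
}
```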

Character Encoding in Java

In eclipse, I changed the default encoding to ISO-8859-1. Then I wrote this:
String str = "Русский язык ";
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
ps.print(str);
It should print the String correctly, as I am specifying UTF-8 encoding. However, it is not printing.

The ISO-8859-1 character encoding only supports characters between 0 and 255, and anything else is likely to be turned into '?'

If you save the source file (the .java file) as ISO-8859-1, then str will be encoded by javac using ISO-8859-1. Your problem does not lie in the creation of the PrintStream: the str you are printing is already wrong before it gets there.

Yes, it looks like the terminal that you are sending this output to does not support this encoding.
If you are running Eclipse, you could set the encoding as follows:
In Run Configurations...->Common ->Encoding->Other
Select UTF-8

You are basically telling the PrintStream to expect the input characters to be UTF-8 encoded and to output them as UTF-8. There is no conversion. If you set your IDE to use ISO-8859-1 as the character encoding for your source file, which in turn contains the input string, then you pipe ISO-8859-1 encoded characters into a writer that expects UTF-8. The writer then treats the bytes it receives as UTF-8 encoded characters, which results in junk data.
Either set your IDE to encode your source files in UTF-8 and check that your characters are displayed and stored correctly, or tell your writer to treat them as ISO-8859-1. Either way should do.
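To find out whether the string is already damaged before it reaches the PrintStream, it can help to print its code points instead of its characters, since that output is plain ASCII and does not depend on the console. This is only a debugging sketch with made-up names; if the literal shows up as U+003F values, the characters were lost when the source file was saved or compiled.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Render each code point of the string as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Unicode escapes keep the literal independent of the source file encoding.
        String str = "\u0420\u0443\u0441\u0441\u043A\u0438\u0439";
        System.out.println(codePoints(str));
    }
}
```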

Greek String doesn't match regex when read from keyboard

public static void main(String[] args) throws IOException {
    String str1 = "ΔΞ123456";
    System.out.println(str1+"-"+str1.matches("^\\p{InGreek}{2}\\d{6}")); //ΔΞ123456-true
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    String str2 = br.readLine(); //ΔΞ123456 same as str1.
    System.out.println(str2+"-"+str2.matches("^\\p{InGreek}{2}\\d{6}")); //Ξ”Ξ�123456-false
    System.out.println(str1.equals(str2)); //false
}
The same String doesn't match regex when read from keyboard.
What causes this problem, and how can we solve this?
Thanks in advance.
EDIT: I used System.console() for input and output.
public static void main(String[] args) throws IOException {
    PrintWriter pr = System.console().writer();
    String str1 = "ΔΞ123456";
    pr.println(str1+"-"+str1.matches("^\\p{InGreek}{2}\\d{6}")+"-"+str1.length());
    String str2 = System.console().readLine();
    pr.println(str2+"-"+str2.matches("^\\p{InGreek}{2}\\d{6}")+"-"+str2.length());
    pr.println("str1.equals(str2)="+str1.equals(str2));
}
Output:
ΔΞ123456-true-8
ΔΞ123456
ΔΞ123456-true-8
str1.equals(str2)=true

There are multiple places where transcoding errors can take place here.
Ensure that your class is being compiled correctly (unlikely to be an issue in an IDE):
Ensure that the compiler is using the same encoding as your editor (i.e. if you save as UTF-8, set your compiler to use that encoding)
Or switch to escaping to the ASCII subset that most encodings are a superset of (i.e. change the string literal to "\u0394\u039e123456")
Ensure you are reading input using the correct encoding:
Use the Console to read input - this class will detect the console encoding
Or configure your Reader to use the correct encoding (probably windows-1253) or set the console to Java's default encoding
Note that System.console() returns null in an IDE, but there are things you can do about that.
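The two failure points listed above can be illustrated in isolation. In the sketch below, the escaped literal removes any dependence on the source file encoding, and a helper simulates input whose UTF-8 bytes were decoded with the wrong charset. The class and method names are made up, windows-1253 is taken from the suggestion above, and its availability in the running JDK is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GreekRegexDemo {
    static final String PATTERN = "^\\p{InGreek}{2}\\d{6}";

    static boolean isValid(String s) {
        return s.matches(PATTERN);
    }

    // Simulate input whose UTF-8 bytes were decoded with the wrong charset.
    static String misdecode(String s, String wrongCharset) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(wrongCharset));
    }

    public static void main(String[] args) {
        String literal = "\u0394\u039e123456"; // escaped form of the question's string
        String damaged = misdecode(literal, "windows-1253");

        System.out.println(isValid(literal));        // true
        System.out.println(isValid(damaged));        // false
        System.out.println(literal.equals(damaged)); // false
    }
}
```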

If you use Windows, it may be caused by the fact that console character encoding ("OEM code page") is not the same as a system encoding ("ANSI code page").
InputStreamReader without explicit encoding parameter assumes input data to be in the system default encoding, therefore characters read from the console are decoded incorrectly.
In order to correctly read non-us-ascii characters in Windows console you need to specify console encoding explicitly when constructing InputStreamReader (required codepage number can be found by executing mode con cp in the command line):
BufferedReader br = new BufferedReader(
    new InputStreamReader(System.in, "CP737"));
The same problem applies to the output; you need to construct a PrintWriter with the proper encoding:
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, "CP737"));
Note that since Java 1.6 you can avoid these workarounds by using Console object obtained from System.console(). It provides Reader and Writer with correctly configured encoding as well as some utility methods.
However, System.console() returns null when streams are redirected (for example, when running from IDE). A workaround for this problem can be found in McDowell's answer.
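One possible way to combine both approaches is to prefer the Console and fall back to a Reader with an explicit charset when System.console() returns null. This is a sketch under assumptions, not the workaround referenced above: the class and method names are hypothetical, Cp737 is simply the code page used in this answer, and main feeds simulated bytes so that it runs without a keyboard.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.Console;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class ConsoleFallbackDemo {
    // Read one line: use the Console when present, otherwise decode the
    // byte stream with the explicitly supplied fallback charset.
    static String readLine(Console console, InputStream in, Charset fallback) {
        if (console != null) {
            return console.readLine();
        }
        try {
            return new BufferedReader(new InputStreamReader(in, fallback)).readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset oem = Charset.forName("Cp737"); // assumed console code page
        String typed = "\u0394\u039e123456";
        // Simulated keyboard input; a real program would pass
        // System.console() and System.in instead.
        InputStream fake = new ByteArrayInputStream((typed + "\n").getBytes(oem));

        String line = readLine(null, fake, oem);
        System.out.println(line.matches("^\\p{InGreek}{2}\\d{6}")); // true
    }
}
```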
See also:
Code page

I get true in both cases with nothing changed in your code. (I tested with a Greek keyboard layout; I'm from Greece :])
Probably your keyboard input arrives encoded as ISO 8859-7 and not as UTF-8. Mine arrives as UTF-8.
EDIT: I still get true after adding the equals call:
System.out.println(str1.equals(str2));
Check whether you can get it working by changing everything to Greek in the regional options (if you are using Windows):
Rundll32 Shell32.dll,Control_RunDLL Intl.cpl,,0
If this is the case, then you can act accordingly, as 'axtavt' said.

The keyboard is likely not sending the characters as UTF-8, but as the operating system's default character encoding.
See also
Java : How to determine the correct charset encoding of a stream
Java App : Unable to read iso-8859-1 encoded file correctly
