Reading characters with UTF-8 standards from console in java

Reading characters with UTF-8 standards from console in java - java

I want to read some Unicode characters from console (Farsi Characters).
I have used System.in but it didn't work. Looks like that Standard Input does not understand the characters I'm writing in the input so its just returns some mumbo jumbo to my String variable. I am absolutely sure that String variable's standard is set to "UTF-8". Believe me i doubled check.
Some pieces of code that I tried.
String t = new String (new Scanner(System.in).nextLine().getBytes() , "UTF-8");
didn't work.
byte b[] = new byte[4];
System.in.read(b);
String st = new String (b , "UTF-16");
System.out.println(st);
I wrote the above code for reading just one Farsi character. didn't work either.

First of all, the console must be in UTF-8 mode.
If using NetBeans, edit the file <NetBeansRoot>/etc/netbeans.conf.
Under netbeans_default_options, add -J-Dfile.encoding=UTF-8.
Once you're sure the console and your project encoding are set to UTF-8, try this:
Scanner console = new Scanner(new InputStreamReader(System.in, "UTF-8"));
while (console.hasNextLine())
System.out.println(console.nextLine());
Note: System.in is an InputStream, i.e. a stream of bytes, it produces the bytes from the console 1-to-1.
To read characters you need a Reader. A Reader takes an InputStream and an encoding, and produces characters.
If it doesn't help, try another console (e.g. Windows cmd, but first run chcp 65001).

Related

Java changes special characters when using FileReader

I have a problem with Java because I have a file with ASCII encoding and when I pass that value to the output file it changes special characters that I need to keep:
Original file:
Output file:
The code I use to read an ASCII file and pass it to a string that has a length of 7000 and the problem with that file where it reaches the special characters that within the frame or string that is the position 486 to 498 the FileRender does not bring the special characters correctly changes them for others and does not keep them (as I understand it is a binary):
fr = new FileReader(sourceFile);
//BufferedReader br = new BufferedReader(fr);
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream(sourceFile), "UTF-8"));
String asciiString;
asciiString = br.readLine();
Edit:
I am doing a conversion from ASCII to EBCDIC. I am using CharFormatConverter.java
I really don't understand why the special characters are lost and not maintained. I found the UTF-8 code in another forum, but characters are still lost. Read file utf-8
Edit:
I was thinking about using FileReader for the ASCII data and FileInputStream to get the binary (but I can't figure out how to get it out with respect to the positions) that is in the ASCII file and thus have the two formats separated and then merge them after the conversion.
Regards.

If your info in the file is a binary info and not textual you can not read it as a String and no charset will help you. As charset is a schema that tells you how to interpret particular character into numeric code and vise-versa. If your info is not textual charset won't help you. You will need to read your info as binary - a sequence of bytes - and write them the same way. you will need to use InputStream implementation that reads info as binary. In your case a good candidate might be FileInputStream. But some other options may be used

Since your base code (CharFormatConverter) is byte-oriented, and it looks like your input files are binary, you should replace Readers by InputStreams, which produce bytes (not characters).
This is the ordinary way to read and process an InputStream:
private void convertFileToEbcdic(File sourceFile)
throws IOException
{
try (InputStream input=new FileInputStream(sourceFile))
{
byte[] buffer=new byte[4096];
int len;
do {
len=input.read(buffer);
if (len>0)
{
byte[] ebcdic=convertBufferFromAsciiToEbcdic(buffer, len);
// Now ebcdic contains the buffer converted to EBCDIC. You may use it.
}
} while (len>=0);
}
}
private byte[] convertBufferFromAsciiToEbcdic(byte[] ascii, int length)
{
// Create an array of same input as received
// and fill it with the input data converted to EBCDIC
}

Can't properly print non-English characters like ë in Windows console

For some weird reason I can't seem to print ë in Java.
public class Eindopdracht0002test
{
public static void main(String[] args)
{
System.out.println("\u00EB");
}
}
It's supposed to print "België" (dutch for Belgium) however it returns "Belgi├½".
Does anyone know how to resolve this?

In UTF-8 ë is written as 11000011 10101011 (source: https://unicode-table.com/en/00EB).
Console in Windows is using code pages which are 8-bit mappings to characters (you can check code page of your console with chcp command). This means when ë is sent to output stream (console) as 11000011 10101011 bits, console sees it as two characters, which in 850 code page (based on your comments) are mapped to:
├ - 11000011 (195 in decimal)
½ - 10101011 (171 in decimal)
If you don't want to use UTF-8 encoding you can create separate Writer and specify different encoding which will translate characters to bytes according to that encoding. To do so you can use
OutputStreamWriter(OutputStream out, String charsetName)
which in your case may look like
OutputStreamWriter(System.out, "cp850") osw = OutputStreamWriter(System.out, "cp850");
// needed encoding ------------^^^^^
since you want send characters with specified encoding to standard output stream.
To use println method and ensure it will automatically flush its data you can wrap created OutputStreamWriter in
PrintWriter(OutputStream out, boolean autoFlush)
like
PrintWriter out = new PrintWriter(osw, true);
You can also do both these things in one line:
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, "cp850"), true);
Now if you use out.println("\u00EB"); it should use recognize ë character and use cp850 encoding to locate its mapping (which is 137) and send proper byte representation (here 10001001) to System.out (console).

Error when reading non-English language character from file

I am building an app where users have to guess a secret word. I have *.txt files in assets folder. The problem is that words are in Albanian language. Our language uses letters like "ë" and "ç", so whenever I try to read from the file some word containing any of those characters I get some wicked symbol and I can not implement string.compare() for these characters. I have tried many options with UTF-8, changed Eclipse setting but still the same error.
I wold really appreciate if someone has got any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
while ((aDataRow = reader.readLine()) != null) {
aBuffer += aDataRow + "\n";
stringList.add(aDataRow);
}
Otherwise the code works fine, except for mentioned characters

It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is in ISO/IEC_8859-1 or ISO/IEC_8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)

You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're under Java 7.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).

You should know encoding of the file.
InputStream class reads file binary. Although you can interpet input as character, it will be implicit guessing, which may be wrong.
InputStreamReader class converts binary to chars. But it should know character set.
You should use the following version to feed it by character set.
UPDATE
Don't suggest you have UTF-8 encoded file, which may be wrong. Here in Russia we have such encodings as CP866, WIN1251 and KOI8, which are all differ from UTF8. Probably you have some popular Albanian encoding of text files. Check your OS setting to guess.

Display Hindi language in console using Java

StringBuffer contents=new StringBuffer();
BufferedReader input = new BufferedReader(new FileReader("/home/xyz/abc.txt"));
String line = null; //not declared within while loop
while (( line = input.readLine()) != null){
contents.append(line);
}
System.out.println(contents.toString());
File abc.txt contains
\u0905\u092d\u0940 \u0938\u092e\u092f \u0939\u0948 \u091c\u0928\u0924\u093e \u091c\u094b \u091a\u093e\u0939\u0924\u0940 \u0939\u0948 \u092
I want to dispaly in Hindi language in console using Java.
if i simply print like this
String str="\u0905\u092d\u0940 \u0938\u092e\u092f \u0939\u0948 \u091c\u0928\u0924\u093e \u091c\u094b \u091a\u093e\u0939\u0924\u0940 \u0939\u0948 \u092";
System.out.println(str);
then it works fine but when i try to read from a file it doesn't work.
help me out.

Use Apache Commons Lang.
import org.apache.commons.lang3.StringEscapeUtils;
// open the file as ASCII, read it into a string, then
String escapedStr; // = "\u0905\u092d\u0940 \u0938\u092e\u092f \u0939\u0948 ..."
// (to include such a string in a Java program you would have to double each \)
String hindiStr = StringEscapeUtils.unescapeJava( escapedStr );
System.out.println(hindiStr);
(Make sure your console is set up to display Hindi (correct fonts, etc) and the console's encoding matches your Java encoding. The Java code above is just the bare bones.)

You should store the contents in the file as UTF-8 encoded Hindi characters. For instance, in your case it would be अभी समय है जनता जो चाहती है. That is, instead of saving unicode escapes, directly save the raw Hindi characters. You can then simply read like normal.
You just have to make sure that the editor you use saves it using UTF-8 encoding. See Spanish language chars are not displayed properly?
Otherwise, you'll have to make the file a .properties file and read using java.util.Properties as it offers unicode unescaping support inherently.
Also read Reading unicode character in java

Greek String doesn't match regex when read from keyboard

public static void main(String[] args) throws IOException {
String str1 = "ΔΞ123456";
System.out.println(str1+"-"+str1.matches("^\\p{InGreek}{2}\\d{6}")); //ΔΞ123456-true
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String str2 = br.readLine(); //ΔΞ123456 same as str1.
System.out.println(str2+"-"+str2.matches("^\\p{InGreek}{2}\\d{6}")); //Ξ”Ξ�123456-false
System.out.println(str1.equals(str2)); //false
}
The same String doesn't match regex when read from keyboard.
What causes this problem, and how can we solve this?
Thanks in advance.
EDIT: I used System.console() for input and output.
public static void main(String[] args) throws IOException {
PrintWriter pr = System.console().writer();
String str1 = "ΔΞ123456";
pr.println(str1+"-"+str1.matches("^\\p{InGreek}{2}\\d{6}")+"-"+str1.length());
String str2 = System.console().readLine();
pr.println(str2+"-"+str2.matches("^\\p{InGreek}{2}\\d{6}")+"-"+str2.length());
pr.println("str1.equals(str2)="+str1.equals(str2));
}
Output:
ΔΞ123456-true-8
ΔΞ123456
ΔΞ123456-true-8
str1.equals(str2)=true

There are multiple places where transcoding errors can take place here.
Ensure that your class is being compiled correctly (unlikely to be an issue in an IDE):
Ensure that the compiler is using the same encoding as your editor (i.e. if you save as UTF-8, set your compiler to use that encoding)
Or switch to escaping to the ASCII subset that most encodings are a superset of (i.e. change the string literal to "\u0394\u039e123456")
Ensure you are reading input using the correct encoding:
Use the Console to read input - this class will detect the console encoding
Or configure your Reader to use the correct encoding (probably windows-1253) or set the console to Java's default encoding
Note that System.console() returns null in an IDE, but there are things you can do about that.

If you use Windows, it may be caused by the fact that console character encoding ("OEM code page") is not the same as a system encoding ("ANSI code page").
InputStreamReader without explicit encoding parameter assumes input data to be in the system default encoding, therefore characters read from the console are decoded incorrectly.
In order to correctly read non-us-ascii characters in Windows console you need to specify console encoding explicitly when constructing InputStreamReader (required codepage number can be found by executing mode con cp in the command line):
BufferedReader br = new BufferedReader(
new InputStreamReader(System.in, "CP737"));
The same problem applies to the output, you need to construct PrintWriter with proper encoding:
PrintWriter out = new PrintWrtier(new OutputStreamWriter(System.out, "CP737"));
Note that since Java 1.6 you can avoid these workarounds by using Console object obtained from System.console(). It provides Reader and Writer with correctly configured encoding as well as some utility methods.
However, System.console() returns null when streams are redirected (for example, when running from IDE). A workaround for this problem can be found in McDowell's answer.
See also:
Code page

I get true in both cases with nothing changed on your code. (I tested with greek layout keyboard - I'm from Greece :])
Probably your keyboard is sending ascii in 8859-7 ISO and not UTF-8. Mine sends UTF-8.
EDIT: I still get true with the addition of the equals command..
System.out.println(str1.equals(str2));
Check if you can get it working by changing everything to greek in the regional options (if you are using windows).
Rundll32 Shell32.dll,Control_RunDLL Intl.cpl,,0
If this is the case then you can act accordingly.. as 'axtavt' said

The keyboard is likely not sending the characters as UTF-8, but as the operating system's default character encoding.
See also
Java : How to determine the correct charset encoding of a stream
Java App : Unable to read iso-8859-1 encoded file correctly

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Reading characters with UTF-8 standards from console in java - java

Related

Java changes special characters when using FileReader

Can't properly print non-English characters like ë in Windows console

Error when reading non-English language character from file

Display Hindi language in console using Java

Greek String doesn't match regex when read from keyboard

Categories

Resources