Reading Unicode data from a file - Java

My default encoding is ISO-8859-1.
BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream("file having unicode characters"), "UTF-8"));
String strTemp = bis.readLine(); // when debugging, strTemp holds the actual Unicode data
System.out.println(strTemp); // uses the default encoding, which is ISO-8859-1, so it does not print the actual data
PrintStream psTemp = new PrintStream(System.out, true, "UTF-8");
psTemp.println(strTemp); // here I specify the encoding as UTF-8, but it still does not print the Unicode data
Even when I specify the encoding as UTF-8 in the PrintStream constructor, I am not able to print the Unicode data. If I change the default encoding from ISO-8859-1 to UTF-8, it works. Why is that?

if i change default encoding from ISO-8859-1 to UTF-8, it works. Why this is so?
I expect this works because it tells your console / shell / whatever is displaying the characters to expect UTF-8. If its default behaviour is to expect ISO-8859-1, then sending it UTF-8 is not going to work.

Are you printing to the Eclipse console or to a shell? Try printing to a file and check the result.
For example, the Windows shell is typically limited to the "cp850" charset. The problem might be caused by the OS shell, not the JVM.
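As a quick check, a sketch along these lines (the file name check.txt is made up for this example) writes the line through an explicit UTF-8 writer; if the file then looks right in a UTF-8 aware editor, the data read by the JVM is fine and only the console is at fault:
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

try (Writer out = new OutputStreamWriter(new FileOutputStream("check.txt"), "UTF-8")) {
    out.write(strTemp); // strTemp is the line read in the question's code
}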

Related

Java URLEncode giving different results

I have this code stub:
System.out.println(param+"="+value);
param = URLEncoder.encode(param, "UTF-8");
value = URLEncoder.encode(value, "UTF-8");
System.out.println(param+"="+value);
This gives this result in Eclipse:
p=指甲油
p=%E6%8C%87%E7%94%B2%E6%B2%B9
But when I run the same code from the command line, I get the following output:
p=指甲油
p=%C3%8A%C3%A5%C3%A1%C3%81%C3%AE%E2%89%A4%C3%8A%E2%89%A4%CF%80
What could be the problem?
Your Mac was using the Mac OS Roman encoding in the terminal. Those Chinese characters were incorrectly interpreted using Mac OS Roman instead of UTF-8 before being sent to Java.
As evidence, those Chinese characters exist in UTF-8 encoding of the following (hex) bytes:
指 = 0xE6 0x8C 0x87
甲 = 0xE7 0x94 0xB2
油 = 0xE6 0xB2 0xB9
Then check the Mac OS Roman codepage layout, those (hex) bytes represent the following characters:
0xE6 0x8C 0x87 = Ê å á
0xE7 0x94 0xB2 = Á î ≤
0xE6 0xB2 0xB9 = Ê ≤ π
Now, put those misinterpreted characters together and URL-encode them using UTF-8:
System.out.println(URLEncoder.encode("ÊåáÁî≤Ê≤π", "UTF-8"));
Look what it prints?
%C3%8A%C3%A5%C3%A1%C3%81%C3%AE%E2%89%A4%C3%8A%E2%89%A4%CF%80
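As a sanity check you can reproduce the garbling in a few lines; this is only a sketch, and it assumes your source file is saved as UTF-8 and that your JRE ships the extended x-MacRoman charset:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// take the UTF-8 bytes of the original text and decode them as Mac OS Roman
byte[] utf8Bytes = "指甲油".getBytes(StandardCharsets.UTF_8);
String mojibake = new String(utf8Bytes, Charset.forName("x-MacRoman"));
System.out.println(mojibake); // prints ÊåáÁî≤Ê≤π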
To fix your problem, tell your Mac to use UTF-8 encoding in the terminal. Honestly, I can't answer that part off the top of my head as I don't use a Mac. Your Eclipse encoding configuration is totally fine; in case you need it, you can configure it via Window > Preferences > General > Workspace > Text File Encoding.
Update: I missed a comment:
I am reading the value from a text file
If those variables originate from a text file instead of from command-line input, as I initially assumed, then you need to solve the problem differently. Apparently you were using a Reader implementation that relies on the runtime environment's default character encoding, like so:
Reader reader = new FileReader("/file.txt");
// ...
You should instead explicitly specify the desired encoding when creating the reader. You can do that with the InputStreamReader constructor.
Reader reader = new InputStreamReader(new FileInputStream("/file.txt"), "UTF-8");
// ...
This explicitly tells Java to read /file.txt using UTF-8 instead of the runtime environment's default encoding, which you can check with Charset#defaultCharset():
System.out.println("This runtime environment uses as default charset " + Charset.defaultCharset());

Reading Unicode characters in Java

I am using "FileInputStream" and "FileReader" to read a data from a file which contains unicode characters.
When i am setting the default encoding to "cp-1252" both are reading junk data, when i am setting default encoding to UTF-8 both are reading fine.
Is it true that both these use System Default Encoding to read the data?
Then whats the benifit of using Character stream if it depends on System Encoding.
Is there any way apart from:
BufferedReader fis = new BufferedReader(new InputStreamReader(new FileInputStream("some unicode file"),"UTF-8"));
to read the data correctly when the default encoding is something other than UTF-8?
FileReader and FileWriter should IMHO be deprecated.
Use
new InputStreamReader(new FileInputStream(file), "UTF-8")
or so.
There is also an overloaded version without the encoding parameter; it uses the default platform encoding, System.getProperty("file.encoding").
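On Java 7 or later, a slightly shorter alternative is to let NIO build the reader with an explicit charset; this is a minimal sketch using the same placeholder file name as above:
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

BufferedReader fis = Files.newBufferedReader(Paths.get("some unicode file"), StandardCharsets.UTF_8);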

Error when reading non-English language characters from a file

I am building an app where users have to guess a secret word. I have *.txt files in the assets folder. The problem is that the words are in Albanian. Our language uses letters like "ë" and "ç", so whenever I try to read a word containing any of those characters from the file, I get a garbled symbol and I cannot use string comparison for these characters. I have tried many options with UTF-8 and changed Eclipse settings, but I still get the same error.
I would really appreciate any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
while ((aDataRow = reader.readLine()) != null) {
    aBuffer += aDataRow + "\n";
    stringList.add(aDataRow);
}
Otherwise the code works fine, except for the characters mentioned above.
It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is in ISO/IEC_8859-1 or ISO/IEC_8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)
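If you go the configuration-property route, one possible sketch (the property name app.file.encoding is invented for this example) is to fall back to a sensible default unless the user overrides it on the command line:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// override with e.g. -Dapp.file.encoding=ISO-8859-1, otherwise assume UTF-8
Charset cs = Charset.forName(System.getProperty("app.file.encoding", "UTF-8"));
reader = new BufferedReader(new InputStreamReader(fins, cs));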
You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're on Java 7 or later.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).
You should know the encoding of the file.
The InputStream class reads the file as binary. Although you can interpret the input as characters, that would be implicit guessing, which may be wrong.
The InputStreamReader class converts binary data to characters, but it needs to know the character set.
You should use the constructor variant that takes a character set to feed it the correct encoding.
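For the file in this question, assuming it really is UTF-8 encoded, that would look like:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));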
UPDATE
Don't assume you have a UTF-8 encoded file; that may be wrong. Here in Russia we have encodings such as CP866, WIN1251 and KOI8, which all differ from UTF-8. You probably have some popular Albanian encoding for text files. Check your OS settings to make a guess.
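One pragmatic way to make that guess is to read the first line with a few candidate charsets and see which one renders the Albanian letters correctly. This is only a rough sketch with an example candidate list, and it reads a plain file for simplicity; on Android you would open the stream through AssetManager as in the question:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

for (String name : new String[] {"UTF-8", "ISO-8859-1", "windows-1252"}) {
    try (BufferedReader r = new BufferedReader(
            new InputStreamReader(new FileInputStream("fjalet.txt"), name))) {
        System.out.println(name + ": " + r.readLine());
    }
}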

Character Encoding in Java

In Eclipse, I changed the default encoding to ISO-8859-1. Then I wrote this:
String str = "Русский язык ";
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
ps.print(str);
It should print the String correctly, since I am specifying UTF-8 encoding. However, it does not print correctly.
The ISO-8859-1 character encoding only supports characters between 0 and 255; anything else is likely to be turned into '?'.
If you save the source file (the .java file) as ISO-8859-1, then str will be encoded by javac using ISO-8859-1. Your problem does not lie in the creation of the PrintStream: the str you are printing is wrong from the beginning.
It looks like the terminal that you are sending this output to does not support this encoding.
If you are running Eclipse, you could set the encoding as follows:
In Run Configurations... > Common > Encoding > Other, select UTF-8.
You are basically telling the PrintStream writer to expect the input characters to be UTF-8 encoded and to output them as UTF-8; there is no conversion. If you set your IDE to use ISO-8859-1 as the character encoding for your source file, which in turn contains the input string, then you pipe ISO-8859-1 encoded characters into a writer that expects UTF-8. The writer treats the bytes it receives as UTF-8 encoded characters, which results in junk data.
Either set your IDE to encode your source files in UTF-8 and check that your characters are correctly displayed and stored, or tell your writer to treat them as ISO-8859-1; either way should do.
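Another way to take the source-file encoding out of the equation entirely is to write the literal with Unicode escapes; they survive any source encoding, so javac reconstructs the same characters regardless of how the file is saved. A small sketch (the escapes spell the same Russian text as in the question):
import java.io.PrintStream;

// same text as in the question, written with Unicode escapes
String str = "\u0420\u0443\u0441\u0441\u043A\u0438\u0439 \u044F\u0437\u044B\u043A";
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
ps.print(str);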

Reading Arabic chars from text file

I finished a project in which I read from a text file written with Notepad.
The characters in my text file are in Arabic, and the file encoding is UTF-8.
When launching my project inside NetBeans (7.0.1) everything seemed to be OK, but when I built the project as a .jar file the characters were displayed like this: ÇáãæÇÞÚááÊØæíÑ.
How can I solve this problem?
Most likely you are using the JVM default character encoding somewhere. If you are 100% sure your file is encoded using UTF-8, make sure you explicitly specify UTF-8 when reading as well. For example, this piece of code is broken:
new FileReader("file.txt")
because it uses the JVM default character encoding, which you might not have control over; apparently NetBeans uses UTF-8 while your operating system defines something different. Note that this makes the FileReader class practically useless if you want your code to be portable.
Instead use the following code snippet:
new InputStreamReader(new FileInputStream("file.txt"), "UTF-8");
You did not provide your code, but this should give you a general impression of how it should be implemented.
Maybe this example will help a little. I will try to print the content of a UTF-8 file to the IDE console and to a system console that is encoded in "Cp852".
My d:\data.txt contains ąźżćąś adsfasdf
Let's check this code:
// I will read chars using utf-8 encoding
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("d:\\data.txt"), "utf-8"));
// and write to the console using Cp852 encoding (the code page used by my Windows 7 console)
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, "Cp852"), true);

// ok, let's read data from the file
String line;
while ((line = in.readLine()) != null) {
    // here I use the IDE encoding
    System.out.println(line);
    // here I print data using Cp852 encoding
    out.println(line);
}
When I run it in Eclipse, the output will be
ąźżćąś adsfasdf
Ą«ľ†Ą? adsfasdf
but output from system console will be
