Reading Arabic chars from text file

Reading Arabic chars from text file - java

I had finished a project in which I read from a text file written with notepad.
The characters in my text file are in Arabic language,and the file encoding type is UTF-8.
When launching my project inside Netbeans(7.0.1) everything seemed to be ok,but when I built the project as a (.jar) file the characters where displayed in this way: ÇáãæÇÞÚááÊØæíÑ.
How could I solve This problem please?

Most likely you are using JVM default character encoding somewhere. If you are 100% sure your file is encoded using UTF-8, make sure you explicitly specify UTF-8 when reading as well. For example this piece of code is broken:
new FileReader("file.txt")
because it uses JVM default character encoding - which you might not have control over and apparently Netbeans uses UTF-8 while your operating system defines something different. Note that this makes FileReader class completely useless if you want your code to be portable.
Instead use the following code snippet:
new InputStreamReader(new FileInputStream("file.txt"), "UTF-8");
You are not providing your code, but this should give you a general impression how this should be implemented.

Maybe this example will help a little. I will try to print content of utf-8 file to IDE console and system console that is encoded in "Cp852".
My d:\data.txt contains ąźżćąś adsfasdf
Lets check this code
//I will read chars using utf-8 encoding
BufferedReader in = new BufferedReader(new InputStreamReader(
new FileInputStream("d:\\data.txt"), "utf-8"));
//and write to console using Cp852 encoding (works for my windows7 console)
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out,
"Cp852"),true); // "Cp852" is coding used in
// my console in Win7
// ok, lets read data from file
String line;
while ((line = in.readLine()) != null) {
// here I use IDE encoding
System.out.println(line);
// here I print data using Cp852 encoding
out.println(line);
}
When I run it in Eclipse output will be
ąźżćąś adsfasdf
Ą«ľ†Ą? adsfasdf
but output from system console will be

Related

UTF-8 encoding CSV file

I have a CSV file, which using Excel to save as CSV UTF-8 encoded.
I have my java code read the file as byte array
then
String result = new String(b, 0, b.length, "UTF-8");
But somehow the content "Montréal" becomes "Montr?al" when save to DB, what might be the problem?
The environment is unix with:
-bash-4.1$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
BTW it works on my windows machine when I run my code and see in DB the correct "Montréal". So my guess is that the environment has some default locale setting that forces the use of dedault encoding.
Thanks

I don't have your complete code, but I tried the following code and it works for me:
String x = "c:/Book2.csv";
BufferedReader br = null;
try{
br = new BufferedReader(new InputStreamReader(new FileInputStream(
x), "UTF8"));
String b;
while ((b = br.readLine()) != null) {
System.out.println(b);
}
} finally {
if (br != null){
br.close();
}
}
If you see "Montr?al" printed on your console, don't worry. It does not mean that the program is not working. Now, you may want to check if your console supports printing UTF-8 characters. So, you can put a debug and inspect the variable to check if has what you want.
If you see correct value in debug and it prints a "?" in your output, you can rest assured that the String variable is having the right value and you can write it back to any file or DB as needed.
If you see "?" when you query your DB, the tool you may be using is not printing the output correctly. Try reading the DB value in java code an check by putting a debug in you code. I usually use putty to query the DB to see the double byte characters correctly. That's all I have, hope that helps.

You have to use ISO/IEC 8859, not UTF-8, if you look at the list of character encodings on Wikipedia page you'll understand the difference.
Basically, UTF-8 its the commom encoding used by western country...
Also, you can check your terminal encoding, maybe the problem is there.

jar java program encoded

I have made a small java program in netbeans that's read a text file. When I run the program in my netbeans, everything goes fine. So I made an executable jar of my program, but when I run that jar I get wired characters when the program read the text file.
For example:
I get "CÃ©leste" but it has to be Céleste.
That's my code to read the file:
private void readFWFile(File file){
try {
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
String ligne;
while((ligne = br.readLine()) != null) {
System.out.println(ligne);
}
fr.close();
} catch (IOException ex) {
Logger.getLogger(FWFileReader.class.getName()).log(Level.SEVERE, null, ex);
}
}

The FileReader class uses the "platform default character encoding" to decode bytes in the file into characters. It seems that your file is encoded in UTF-8, while the default encoding is something else on your system.
You can read the file in a specific encoding using InputStreamReader:
Reader fr = new InputStreamReader(new FileInputStream(file), "UTF-8");

This kind of output is caused by a mismatch somewhere - your file is encoded in UTF-8 but the console where you print the data expects a single-byte encoding such as Windows-1252.
You need to (a) ensure you read the file as UTF-8 and (b) ensure you write to the console using the encoding it expects.
FileReader always uses the platform default encoding when reading files. If this is UTF-8 then
your Java code reads the file as UTF-8 and sees Céleste
you then print out that data as UTF-8
in NetBeans the console clearly expects UTF-8 and displays the data correctly
outside NetBeans the console expects a single-byte encoding and displays the incorrect rendering.
Or if your default encoding is a single byte one then
your Java code reads the file as a single byte encoding and sees CÃ©leste
you then print out that data as the same encoding
NetBeans treats the bytes you wrote as UTF-8 and displays Céleste
outside NetBeans you see the wrong data you originally read.
Use an InputStreamReader with a FileInputStream to ensure you read the data in the correct encoding, and make sure that when you print data to the console you do so using the encoding that the console expects.

How can I change the text-coding of my Java Programm?

I have a Java-Programm, which I develop with Netbeans.
I changed the settings on Netbeans, so that it will understand UTF-8.
But if I clean, and build my Programm and use it with my Windows System, the textcoding changes and letters like: "ü", "ä", and "ö" aren't displayed and used properly anymore.
How can I communicate with my OS and tell him to use UTF-8?
Or is there any good workaround?
EDIT: Sry for beeing so unspecific.
Well, first of all: I use Docx4j and the Apache POI with the getText() Methods to get some Texts from doc, docx, and pdf's and save them in a String.
Then Im trying to match Keywords within those texts, that I read out of an .txt file.
Those Keywords are displayed in a Combobox in the runnable Java-file.
I can see the encoding problems there. It wont match any of Keywords using the words described above.
In my IDE its working fine.
Im trying to post some code here, after I redesign it.
TXT-File is in UTF-8. If I convert it ti ANSI I see the same Problems like in the Jar.
reading out of it:
if(inputfile.exists() && inputfile.canRead())
{
try {
FileReader reader = new FileReader(inputfilepath);
BufferedReader in = new BufferedReader(reader);
String zeile = null;
while ((zeile = in.readLine()) != null) {
while(zeile.startsWith("#"))
{
if (zeile.startsWith(KUERZELTITEL)) {
int cut = zeile.indexOf('=');
zeile = zeile.substring(cut, zeile.length());
eingeleseneTagzeilen.put(KUERZELTITEL, zeile.substring(1));
kuerzel = zeile.substring(1);
}
...
this did it for me:
File readfile = new File(inputfilepath);
BufferedReader in = new BufferedReader(
new InputStreamReader(
new FileInputStream(readfile), "UTF8"));
Thx!

Congratulations, I also use UTF-8 for my projects, which seems best.
Simply make sure that editor and compiler use the same encoding. This ensures that string literals in java are correctly encoded in the jar, .class files.
In NetBeans 7.3 there is now one setting (I am using maven builds).
Properties files are historically in ISO-8859-1 or encoded as \uXXXX. So there you have to take care.
Internally Java uses Unicode, so there might be no other problems.
FileReader reader = new FileReader(inputfilepath);
should be
BufferedReader reader = new BufferedReader(new InputStreamReader(
new FileInputStream(inputfilepath), "UTF-8")));
The same procedure (explicit extra encoding parameter) for FileWriter (OutputStreamWriter + encoding), String.getBytes(encoding), new String(bytes, encoding).

Try passing -Dfile.encoding=utf-8 as JVM argument.

Error when reading non-English language character from file

I am building an app where users have to guess a secret word. I have *.txt files in assets folder. The problem is that words are in Albanian language. Our language uses letters like "ë" and "ç", so whenever I try to read from the file some word containing any of those characters I get some wicked symbol and I can not implement string.compare() for these characters. I have tried many options with UTF-8, changed Eclipse setting but still the same error.
I wold really appreciate if someone has got any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
while ((aDataRow = reader.readLine()) != null) {
aBuffer += aDataRow + "\n";
stringList.add(aDataRow);
}
Otherwise the code works fine, except for mentioned characters

It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is in ISO/IEC_8859-1 or ISO/IEC_8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)

You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're under Java 7.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).

You should know encoding of the file.
InputStream class reads file binary. Although you can interpet input as character, it will be implicit guessing, which may be wrong.
InputStreamReader class converts binary to chars. But it should know character set.
You should use the following version to feed it by character set.
UPDATE
Don't suggest you have UTF-8 encoded file, which may be wrong. Here in Russia we have such encodings as CP866, WIN1251 and KOI8, which are all differ from UTF8. Probably you have some popular Albanian encoding of text files. Check your OS setting to guess.

Java linux character encoding issue

I'm facing an issue with character encoding in linux. I'm retrieving a content from amazon S3, which was saved using UTF-8 encoding. The content is in Chinese and I'm able to see the content correctly in the browser.
I'm using amazon SDK to retrieve the content and do some update to it.Here's the code I'm using:
StringBuilder builder = new StringBuilder();
S3Object object = client.getObject(new GetObjectRequest(bucketName, key));
BufferedReader reader = new BufferedReader(new
InputStreamReader(object.getObjectContent(), "utf-8"));
while (true) {
String line = reader.readLine();
if (line == null)
break;
builder.append(line);
}
This piece of code works fine in Windows environment as I was able to update the content and save it back without messing up any chinese characters in it.
But, its acting differently in linux enviroment. The code is unable to translate the characters properly, the chinese characters are rendered as ???
I'm not sure what's going wrong here. Any pointers will be appreciated.
-Thanks

The default charset is different for the 2 OS's your using.
To start off, you can confirm the difference by printing out the default charset.
Charset.defaultCharset.name()
Somewhere in your code, I think this default charset is being used for some String conversion. The correct procedure should be to track that down, and specify UTF-8.
Without seeing that code, I can only suggest the 'cheating' way to do it: set the default charset explicitly, near the beginning of your code, or at Java startup. See here for changing default charset: Setting the default Java character encoding?
HTH

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Reading Arabic chars from text file - java

Related

UTF-8 encoding CSV file

jar java program encoded

How can I change the text-coding of my Java Programm?

Error when reading non-English language character from file

Java linux character encoding issue

Categories

Resources