Ok, I am reading a .docx file via a BufferedReader and want to store the text in an EditText. The .docx is not in English but in Greek. I use:
File file = new File(file_Path);
try {
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line;
    StringBuilder text = new StringBuilder();
    while ((line = br.readLine()) != null) {
        text.append(line);
    }
    et1.setText(text);
} catch (IOException e) {
    e.printStackTrace();
}
And the result I get is this:
If the characters are in English, it works fine, but in my case they aren't. How can I fix this? Thanks a lot.
Ok, I am reading a .docx file via a BufferedReader
Well, that's the first problem. BufferedReader is for plain text files. docx files are binary files in a specific format: a ZIP archive of XML parts (assuming you mean the kind of file that Microsoft Word saves). You can't just read them like text files. Open the file in Notepad (not WordPad) and you'll see what I mean.
You might want to look at Apache POI.
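To make the "not a text file" point concrete, here is a minimal sketch that uses only the JDK to pull the raw XML of the main document part out of a .docx. The class and method names are made up for illustration, and it assumes the part is stored as word/document.xml in UTF-8. It returns XML markup rather than clean text, so a library such as Apache POI is still the better route for real documents.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, for illustration only.
class DocxPeek {
    // Returns the raw XML of the main document part, or null if the part is absent.
    static String readDocumentXml(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                // Assumption: the part is UTF-8; its XML declaration states the real encoding.
                return buffer.toString("UTF-8");
            }
        }
    }
}
```

The text itself sits inside XML elements in that string, which is why reading the .docx line by line with a BufferedReader only ever shows compressed bytes.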
From comments:
Testing to read a .txt file with the same text gave same results too
That's probably due to using the wrong encoding. FileReader uses the platform default encoding unless told otherwise (and before Java 11 it offered no way to tell it), which is annoying. Assuming you're using Java 7 or higher, you'd be better off with Files.newBufferedReader:
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
...
}
Adjust the charset to match the one you used when saving your text file, of course - if you have the option of using UTF-8, that's a pretty good choice. (Aside from anything else, pretty much everything can handle UTF-8.)
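A slightly fuller sketch of the same idea, reading a whole file into a String with an explicit charset. The class and method names are illustrative only; on Android, java.nio.file is, as far as I know, only available from API level 26, so the InputStreamReader form may be needed on older devices.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
class ReadTextFile {
    // Reads the whole file with an explicit charset instead of the platform default.
    static String readAll(Path path, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so put one back.
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }
}
```

Usage would look like readAll(Paths.get(file_Path), StandardCharsets.UTF_8), with the result passed to setText.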
If I have a delimited text file containing apostrophes (’), as in:
BB;Art’s Tavern;6487 Western Ave., Glen Arbor, MI 49636;
what do I need to do to have those parsed correctly through a BufferedReader in Java?
The code I'm currently using to open the file for reading, in an Android application, is:
StringBuffer buf = new StringBuffer();
InputStream is = context.getResources().openRawResource(R.raw.lvpa);
BufferedReader reader = new BufferedReader(new InputStreamReader(is,"UTF-8"));
Currently the apostrophes are being returned as question marks ? in a black box.
The contents of the file are then parsed into a model.
Any help would be appreciated :)
Thanks
The file you are reading is not encoded in UTF-8. You need to know which encoding your file is in before you attempt to read it. If possible, open it in whatever text editor you use to examine it, save it as UTF-8, and try reading it again. (Some text editors let you choose the encoding when you save the file.)
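One way to check that claim in code is to decode strictly instead of letting InputStreamReader silently substitute the replacement character U+FFFD (typically drawn as a question mark in a black diamond). The sketch below uses a hypothetical helper name. As one plausible cause, a curly apostrophe saved as windows-1252 is the single byte 0x92, which is not valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class Utf8Check {
    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // At least one byte sequence is not legal UTF-8.
            return false;
        }
    }
}
```

A false result means the file should be re-saved as UTF-8 or read with its real encoding. A true result does not prove the file is UTF-8, since pure ASCII passes too.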
I have a Java program that I develop with NetBeans.
I changed the NetBeans settings so that it uses UTF-8.
But when I clean and build my program and run it on my Windows system, the text encoding changes and letters like "ü", "ä", and "ö" are no longer displayed or handled properly.
How can I tell my OS to use UTF-8?
Or is there a good workaround?
EDIT: Sorry for being so unspecific.
First of all: I use Docx4j and Apache POI with the getText() methods to extract text from doc, docx, and PDF files and store it in a String.
Then I try to match keywords, which I read from a .txt file, within those texts.
Those keywords are displayed in a combo box in the runnable jar file.
I can see the encoding problems there. It won't match any keywords that contain the letters mentioned above.
In my IDE it works fine.
I will try to post some code here after I redesign it.
The .txt file is in UTF-8. If I convert it to ANSI, I see the same problems as in the jar.
Reading from it:
if (inputfile.exists() && inputfile.canRead())
{
    try {
        FileReader reader = new FileReader(inputfilepath);
        BufferedReader in = new BufferedReader(reader);
        String zeile = null;
        while ((zeile = in.readLine()) != null) {
            while (zeile.startsWith("#"))
            {
                if (zeile.startsWith(KUERZELTITEL)) {
                    int cut = zeile.indexOf('=');
                    zeile = zeile.substring(cut, zeile.length());
                    eingeleseneTagzeilen.put(KUERZELTITEL, zeile.substring(1));
                    kuerzel = zeile.substring(1);
                }
                ...
This did it for me:
File readfile = new File(inputfilepath);
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(readfile), "UTF8"));
Thx!
Congratulations, I also use UTF-8 for my projects, which seems best.
Simply make sure that the editor and the compiler use the same encoding. This ensures that string literals in your Java sources are correctly encoded in the .class files inside the jar.
In NetBeans 7.3 there is now one setting (I am using maven builds).
Properties files are historically in ISO-8859-1 or encoded as \uXXXX. So there you have to take care.
Internally Java strings are Unicode (UTF-16), so once the text has been decoded correctly there should be no other problems.
FileReader reader = new FileReader(inputfilepath);
should be
BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(inputfilepath), "UTF-8"));
The same applies (pass the encoding explicitly) to FileWriter (use OutputStreamWriter plus an encoding), String.getBytes(encoding) and new String(bytes, encoding).
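A minimal round-trip sketch of that advice, with made-up class and method names: both the writer and the reader name UTF-8 explicitly, so the result does not depend on whether the code runs inside the IDE or from the jar.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class ExplicitEncoding {
    // Writes text as UTF-8, replacing a FileWriter that would use the default charset.
    static void write(File file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Reads the first line as UTF-8, replacing a FileReader.
    static String firstLine(File file) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8))) {
            return in.readLine();
        }
    }
}
```

The same pairing holds in memory: text.getBytes(StandardCharsets.UTF_8) and new String(bytes, StandardCharsets.UTF_8) should always name the same charset.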
Try passing -Dfile.encoding=UTF-8 as a JVM argument.
I am building an app where users have to guess a secret word. I have *.txt files in the assets folder. The problem is that the words are in Albanian. Our language uses letters like "ë" and "ç", so whenever I try to read a word containing any of those characters from the file, I get some weird symbol and I cannot use string.compare() on these characters. I have tried many options with UTF-8 and changed the Eclipse settings, but I still get the same error.
I would really appreciate any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
while ((aDataRow = reader.readLine()) != null) {
aBuffer += aDataRow + "\n";
stringList.add(aDataRow);
}
Otherwise the code works fine, except for mentioned characters
It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is in ISO/IEC 8859-1 or ISO/IEC 8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)
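If you go for the "configuration property" option, a sketch could look like the following. The property name app.input.encoding and the class name are invented for this example; the fallback is UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class EncodingConfig {
    // Made-up property name; pick whatever fits your application.
    static final String PROPERTY = "app.input.encoding";

    // Returns the configured charset, or UTF-8 when nothing is configured.
    static Charset inputCharset() {
        String name = System.getProperty(PROPERTY);
        if (name == null || name.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        // Throws an unchecked exception if the name is illegal or unsupported.
        return Charset.forName(name.trim());
    }
}
```

The reader would then be built as new InputStreamReader(fins, EncodingConfig.inputCharset()), and the encoding can be overridden with a -Dapp.input.encoding=... JVM argument without recompiling.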
You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're on Java 7 or later.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).
You need to know the encoding of the file.
The InputStream class reads the file as raw bytes. You can interpret that input as characters, but doing so means implicitly guessing the encoding, and the guess may be wrong.
The InputStreamReader class converts bytes to chars, but to do that it has to know the character set.
You should use the InputStreamReader constructor that takes the character set as an argument.
UPDATE
Don't assume your file is UTF-8 encoded; that may be wrong. Here in Russia we have encodings such as CP866, Windows-1251 and KOI8, all of which differ from UTF-8. There is probably some popular Albanian encoding for text files as well. Check your OS settings to make a guess.
I am trying to read text from a text file. It contains some special characters like å, ä and ö. When I build a string and print it out, I get ?? instead of these special characters. I am using the following code:
File fileDir = new File("files/myfile.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(
new FileInputStream(fileDir), "UTF8"));
String strLine;
while ((strLine = br.readLine()) != null) {
System.out.println("strLine: "+strLine);
}
Can anybody tell me what the problem is? I want strLine to show and store å, ä and ö as they are in the text file. Thanks in advance.
The problem might not be with the file but with the console you are printing to. I suggest the following steps:
Make sure the file you are reading is encoded in UTF-8.
Make sure the console you are printing to has the proper encoding/charset to display these characters
Finally, this article Unicode - How to get characters right? is a must read.
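For the console step above, one option is to wrap System.out in a PrintStream that encodes as UTF-8. This is only a sketch with invented names, and it assumes the terminal itself is set to UTF-8 (on Windows, for example, via chcp 65001); otherwise the bytes are correct but still rendered wrongly.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, for illustration only.
class Utf8Console {
    // Wraps any stream in a PrintStream that encodes text as UTF-8.
    static PrintStream utf8(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8(System.out);
        // Escapes for a-ring, a-umlaut and o-umlaut keep this source file pure ASCII.
        out.println("strLine: \u00e5 \u00e4 \u00f6");
    }
}
```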
Check here for the lists of Java supported encodings
The most common single-byte encoding that includes non-ASCII characters is ISO8859_1; maybe your file uses that, in which case you should specify that encoding on the InputStreamReader that wraps your FileInputStream.
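The list of encodings can also be obtained from the running JVM itself. A small sketch (the class name is illustrative) that prints every charset the JVM supports, followed by the default one:

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class ListCharsets {
    // Canonical names of all charsets this JVM can decode and encode.
    static List<String> names() {
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    public static void main(String[] args) {
        for (String name : names()) {
            System.out.println(name);
        }
        System.out.println("default: " + Charset.defaultCharset().name());
    }
}
```

The exact list varies between JVMs and platforms, so it is worth running this on the machine where the problem shows up.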
I have finished a project in which I read from a text file written with Notepad.
The characters in my text file are Arabic, and the file encoding is UTF-8.
When launching my project inside NetBeans (7.0.1) everything seemed to be OK, but when I built the project as a .jar file the characters were displayed like this: ÇáãæÇÞÚááÊØæíÑ.
How can I solve this problem, please?
Most likely you are using the JVM default character encoding somewhere. If you are 100% sure your file is encoded in UTF-8, make sure you explicitly specify UTF-8 when reading it as well. For example, this piece of code is broken:
new FileReader("file.txt")
because it uses the JVM default character encoding, which you might not have control over; apparently NetBeans uses UTF-8 while your operating system defines something different. Note that this makes the FileReader class completely useless if you want your code to be portable.
Instead use the following code snippet:
new InputStreamReader(new FileInputStream("file.txt"), "UTF-8");
You have not posted your code, but this should give you a general idea of how to implement it.
Maybe this example will help a little. I will print the content of a UTF-8 file both to the IDE console and to a system console that uses the "Cp852" encoding.
My d:\data.txt contains ąźżćąś adsfasdf
Let's look at this code:
// I will read chars using utf-8 encoding
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("d:\\data.txt"), "utf-8"));
// and write to the console using Cp852 encoding
// ("Cp852" is the encoding used by my Windows 7 console)
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out,
        "Cp852"), true);
// ok, let's read data from the file
String line;
while ((line = in.readLine()) != null) {
    // here I use the IDE encoding
    System.out.println(line);
    // here I print data using Cp852 encoding
    out.println(line);
}
When I run it in Eclipse output will be
ąźżćąś adsfasdf
Ą«ľ†Ą? adsfasdf
but output from system console will be