How to read and write UTF-8 to disk on the Android?

I cannot read and write extended characters (French accented characters, for example) to a text file using the standard InputStreamReader methods shown in the Android API examples. When I read back the file using:
InputStreamReader tmp = new InputStreamReader(in);
BufferedReader reader = new BufferedReader(tmp);
String str;
while ((str = reader.readLine()) != null) {
...
the string read is truncated at the extended characters instead of at the end-of-line. The second half of the string then comes on the next line. I'm assuming that I need to persist my data as UTF-8 but I cannot find any examples of that, and I'm new to Java.
Can anyone provide me with an example or a link to relevant documentation?

Very simple and straightforward. :)
String filePath = "/sdcard/utf8_file.txt";
String UTF8 = "UTF-8";
int BUFFER_SIZE = 8192;
// Write first: opening a FileOutputStream truncates an existing file.
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(filePath), UTF8), BUFFER_SIZE);
// ... write the data and call bw.close() before reading it back ...
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), UTF8), BUFFER_SIZE);
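As a fuller sketch of the same idea, the following writes a string to a file as UTF-8 and reads it back with the same charset. The class name Utf8RoundTrip, the method roundTrip and the temporary file are illustrative assumptions, not part of the answer above; on Android you would typically build the path from something like Context.getFilesDir() instead.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;

class Utf8RoundTrip {
    // Writes one line of text to a temporary file as UTF-8, then reads it back.
    static String roundTrip(String text) {
        try {
            File file = File.createTempFile("utf8_demo", ".txt"); // hypothetical location
            file.deleteOnExit();
            // The writer is closed (and flushed) before the file is opened for reading.
            try (BufferedWriter bw = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(file), "UTF-8"))) {
                bw.write(text);
                bw.newLine();
            }
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
                return br.readLine();
            }
        } catch (IOException e) {
            // UnsupportedEncodingException is an IOException, so it is covered here too.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Accented text written with Unicode escapes so the source file encoding does not matter.
        String original = "\u00e9t\u00e9 fran\u00e7ais";
        System.out.println(original.equals(roundTrip(original)));
    }
}
```

If the line comes back intact, the accented characters survived the trip to disk; the key point is that the reader and the writer name the same charset.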

When you instantiate the InputStreamReader, use the constructor that takes a character set.
InputStreamReader tmp = new InputStreamReader(in, "UTF-8");
And do a similar thing with OutputStreamWriter.
I like to have a
public static final Charset UTF8 = Charset.forName("UTF-8");
in some utility class in my code, so that I can call
InputStreamReader tmp = new InputStreamReader(in, MyUtils.UTF8);
and not have to handle UnsupportedEncodingException every single time.
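On Java 7 and later (and, as far as I know, Android API level 19 and later), java.nio.charset.StandardCharsets already provides such a constant, so a home-made utility class may not be needed. A minimal sketch, where the class name ReadUtf8 and the helper readAll are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadUtf8 {
    // Decodes everything from the stream as UTF-8. The constructor that takes
    // a Charset does not declare UnsupportedEncodingException.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "na\u00efve".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes)).equals("na\u00efve"));
    }
}
```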

This should just work on Android, even without explicitly specifying UTF-8, because the default charset is UTF-8. If you can reproduce this problem, please raise a bug with a reproducible test case here:
http://code.google.com/p/android/issues/entry

If you face this kind of problem, try encoding your data as Base64 before writing it and decoding it after reading it. This worked for me. I can share the code if you need it.

Check the encoding of your file by right-clicking it in the Project Explorer and selecting Properties. If it's not the right encoding, you'll need to re-enter your special characters after you change it, or at least that was my experience.

Related

Reading from a file doesn't get the expected text [duplicate]

I tried to use java.io.FileReader to read some text files and convert them into a string, but I found the result is wrongly encoded and not readable at all.
Here's my environment:
Windows 2003, OS encoding: CP1252
Java 5.0
My files are UTF-8 encoded or CP1252 encoded, and some of them (UTF-8 encoded files) may contain Chinese (non-Latin) characters.
I use the following code to do my work:
private static String readFileAsString(String filePath)
throws java.io.IOException{
StringBuffer fileData = new StringBuffer(1000);
FileReader fileReader = new FileReader(filePath);
//System.out.println(fileReader.getEncoding());
BufferedReader reader = new BufferedReader(fileReader);
char[] buf = new char[1024];
int numRead=0;
while((numRead=reader.read(buf)) != -1){
String readData = String.valueOf(buf, 0, numRead);
fileData.append(readData);
buf = new char[1024];
}
reader.close();
return fileData.toString();
}
The above code doesn't work. I found that the FileReader's encoding is CP1252 even if the text is UTF-8 encoded. But the JavaDoc of java.io.FileReader says:
"The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate."
Does this mean that I am not required to set the character encoding myself if I am using FileReader? But I currently get wrongly encoded data. What's the correct way to deal with my situation? Thanks.

Yes, you need to specify the encoding of the file you want to read.
Yes, this means that you have to know the encoding of the file you want to read.
No, there is no general way to guess the encoding of any given "plain text" file.
The one-argument constructors of FileReader always use the platform default encoding, which is generally a bad idea.
Since Java 11, FileReader has also gained constructors that accept an encoding: new FileReader(file, charset) and new FileReader(fileName, charset).
In earlier versions of Java, you need to use new InputStreamReader(new FileInputStream(pathToFile), <encoding>).
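For the specific situation in the question, where a file is known to be either UTF-8 or CP1252, one possible heuristic (not a general detector) is to attempt a strict UTF-8 decode and fall back to CP1252 when the bytes are not valid UTF-8. The sketch below assumes the whole file fits in memory and that the runtime provides the windows-1252 charset; the class and method names are made up for illustration. A CP1252 file whose bytes happen to form valid UTF-8 would still be misread, which is why this remains a guess.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8OrCp1252 {
    // Tries a strict UTF-8 decode first; falls back to windows-1252 on failure.
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of replacing
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException notUtf8) {
            // Assumption: anything that is not valid UTF-8 is CP1252.
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9}; // U+00E9 encoded as UTF-8
        byte[] cp1252 = {(byte) 0xE9};            // U+00E9 encoded as CP1252
        System.out.println(decode(utf8).equals(decode(cp1252)));
    }
}
```

In real use the bytes could come from Files.readAllBytes(Paths.get(filePath)).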

FileReader uses Java's platform default encoding, which depends on the system settings of the computer it's running on and is generally the most popular encoding among users in that locale.
If this "best guess" is not correct, then you have to specify the encoding explicitly. Unfortunately, before Java 11 FileReader did not allow this (a major oversight in the API). Instead, you had to use new InputStreamReader(new FileInputStream(filePath), encoding), ideally getting the encoding from metadata about the file.

For Java 7+ you can use this:
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
The StandardCharsets documentation lists the available constants.
For example, if your file is in CP1252, obtain the charset with
Charset.forName("windows-1252");
The Java documentation also lists the other canonical names of the supported encodings, for both the IO and NIO APIs.
If you do not know exactly which encoding a file uses, you may use a third-party library, such as a tool from Google, which works fairly neatly.
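Putting those two pieces together, a CP1252 file could be read as follows. This is only a sketch: the temporary file stands in for a real one, the names ReadCp1252, firstLine and demo are hypothetical, and it assumes the runtime ships the windows-1252 charset (it is not among the charsets every Java platform is required to support).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadCp1252 {
    // Reads the first line of the file using the given charset.
    static String firstLine(Path path, Charset charset) {
        try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes raw bytes to a temporary file and reads them back as CP1252.
    static String demo(byte[] raw) {
        try {
            Path path = Files.createTempFile("cp1252_demo", ".txt");
            Files.write(path, raw);
            String line = firstLine(path, Charset.forName("windows-1252"));
            Files.delete(path);
            return line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Byte 0x80 is the euro sign (U+20AC) in CP1252.
        String line = demo(new byte[]{(byte) 0x80, (byte) '5'});
        System.out.println(line.equals("\u20ac5"));
    }
}
```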

Since Java 11 you may use this constructor:
public FileReader(String fileName, Charset charset) throws IOException;

Before Java 11, FileInputStream with InputStreamReader is better than using FileReader directly, because the latter doesn't let you specify the charset.
Here is an example using BufferedReader, FileInputStream and InputStreamReader together to read lines from a file.
List<String> words = new ArrayList<>();
List<String> meanings = new ArrayList<>();
public void readAll() throws IOException {
    String fileName = "College_Grade4.txt";
    String charset = "UTF-8";
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(
                    new FileInputStream(fileName), charset))) {
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.length() == 0) continue;
            int idx = line.indexOf("\t");
            if (idx < 0) continue; // skip lines without a tab separator
            words.add(line.substring(0, idx));
            meanings.add(line.substring(idx + 1));
        }
    }
}

For non-Latin scripts, for example Cyrillic, you can use something like this (Java 11+):
FileReader fr = new FileReader("src/text.txt", StandardCharsets.UTF_8);
and be sure that your .txt file is saved in UTF-8 format (not the default ANSI). Cheers!

While reading from file in java, input is prefixed by two junk characters

FileInputStream fin = new FileInputStream("D:\\testout.txt");
BufferedInputStream bin = new BufferedInputStream(fin);
int i;
while((i = bin.read())!=-1) {
System.out.print((char)i);
}
bin.close();
fin.close();
output: ÿþGreat
I have checked the file testout.txt; it contains only one word, i.e. Great.

When you're working with text, you should use a Reader, e.g.
try(
BufferedReader reader = Files.newBufferedReader(
Paths.get("D:\\testout.txt"),
StandardCharsets.UTF_8)
){
int i;
while((i = reader.read())!=-1) {
System.out.print((char)i);
}
}

That's most probably the byte order mark, which is optional but allowed in files using the UTF-8 character encoding. Some programs (e.g. Notepad) account for this possibility, some don't. Java by default doesn't strip it.
One utility to solve this is the BOMInputStream from Apache Commons IO.
Also, Notepad will write the byte order mark in the file when you save it as UTF-8.

ÿþ is the byte order mark of UTF-16 (little-endian), so the file is most likely saved as UTF-16 rather than UTF-8. You can read it with that charset and convert the text to UTF-8 using java.io.
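To illustrate: the bytes FF FE print as ÿþ when each byte is cast to a char, and Java's UTF-16 charset consumes them as a byte order mark when decoding. A minimal sketch with hypothetical names, using an in-memory byte array in place of the file from the question:

```java
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {
    // Decodes bytes as UTF-16. A leading BOM selects the byte order and is
    // not returned as part of the text.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // FF FE (little-endian BOM) followed by "Great" in UTF-16LE.
        byte[] file = {(byte) 0xFF, (byte) 0xFE,
                'G', 0, 'r', 0, 'e', 0, 'a', 0, 't', 0};
        System.out.println(decodeUtf16(file)); // prints "Great"
    }
}
```

For an actual file, the same charset can be passed to a reader, for example Files.newBufferedReader(Paths.get("D:\\testout.txt"), StandardCharsets.UTF_16).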

Please use the UTF-8 character encoding to resolve this kind of issue.
byte[] utf_8 = input.getBytes("UTF-8"); // encode the string as UTF-8 bytes
String test = new String(utf_8, "UTF-8"); // decode with the same charset, not the platform default

java encoding issue while reading stream

I am trying to download contents from an FTP folder. There is one XML file which starts with the standard XML declaration.
< ?xml version="1.0" encoding="utf-8"?>
When I read these files (using java.net.Socket), get the input stream and then try to convert it to a String, I somehow get some new characters. The whole XML document starts with '?', e.g. "?< ?xml version="1.0" encoding="utf-8"?>....."
BufferedInputStream reader = new BufferedInputStream(sock.getInputStream());
Then i am getting a string from this reader using following code.
StringBuilder sb = new StringBuilder();
String line;
BufferedReader br = new BufferedReader(new InputStreamReader(reader));
while ((line = br.readLine()) != null) {
sb.append(line);
}
System.out.println(sb.toString());
Not sure what's happening here. Why am I getting special characters introduced? Any suggestions would be appreciated.
I then used the following code to read the downloaded file, and in the console I see some special characters:
BufferedReader reader = new BufferedReader(new FileReader("c:/Users/appd922/DocumentMeta06122014.xml"));
StringBuffer sb = new StringBuffer();
String line = null;
while ((line = reader.readLine()) != null) {
sb.append(line);
}
String output = sb.toString();
System.out.println("reading from file"+output);
I got output starting
"reading from fileï»¿< ?xml version.....
Where are these special characters coming from?
Note: ignore the space in the XML declaration above. I could not post it here properly without that space.

Specify the encoding when creating the InputStreamReader that reads the file from the FTP server, for example:
BufferedReader br = new BufferedReader(new InputStreamReader(reader, "utf-8"));
Otherwise, InputStreamReader uses default encoding. Also, specify the encoding when reading the downloaded file. FileReader uses default platform encoding. Use InputStreamReader and specify encoding, for example:
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "utf-8"));

Those characters are the BOM, the byte order mark. If you set the encoding of the InputStreamReader to "UTF-8", you will see that they are interpreted as a single character, the BOM character (U+FEFF).
Unfortunately, you have to handle this character yourself, because Java won't do it for you (see "Java UTF-8 and BOM"). Usually you just strip it from your stream. Good luck.
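One way to do the stripping, shown here only as a sketch with made-up names (BomStripper, firstLineWithoutBom): decode as UTF-8, then drop a leading U+FEFF if there is one.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BomStripper {
    // Reads the first line as UTF-8 and removes a leading BOM (U+FEFF) if present.
    static String firstLineWithoutBom(InputStream in) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line = br.readLine();
            if (line != null && line.startsWith("\uFEFF")) {
                line = line.substring(1); // the BOM decodes to exactly one char
            }
            return line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 encoding of U+FEFF, followed by "<?xml".
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        System.out.println(firstLineWithoutBom(new ByteArrayInputStream(data)));
    }
}
```

Only the first line of a stream can carry the BOM, so the check is needed once, before the read loop.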

Java decode special characters Â¡ and Ì§ becomes A?¡ and I?§

I'm trying to read a file name off XML, whose encoding can be changed.
The file name on the XML has string such as "Ì§oÌ" which is supposed to be read by my code as "Ì§oÌ". However, I keep getting I?§.
Similar problem for Â and A?¡
Below is my code:
Socket s = new Socket();
InputStream is = s.getInputStream();
ByteArrayInputStream bAis = new ByteArrayInputStream(buf, 0, rlen);
BufferedReader bReader = new BufferedReader( new InputStreamReader( hbis, "ISO-8859-1" ));
String theStringINeed = bReader.readLine();
Any help would be appreciated.

new InputStreamReader( hbis, "ISO-8859-1" )
If you lie about the encoding of a file, bad things will happen.
You need to read the file using the encoding it was actually written in, which is probably UTF-8.
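The effect is easy to reproduce. In this sketch (class and method names are hypothetical), the same UTF-8 bytes are decoded once with the right charset and once as ISO-8859-1; the second result contains one character per byte, which is the kind of garbage shown in the question.

```java
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
    // Correct: decode UTF-8 bytes as UTF-8.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Wrong on purpose: decode UTF-8 bytes as ISO-8859-1.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the two bytes C3 A9 in UTF-8.
        byte[] bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeAsUtf8(bytes).length());   // 1
        System.out.println(decodeAsLatin1(bytes).length()); // 2
    }
}
```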
