Java FileReader encoding issue

I tried to use java.io.FileReader to read some text files and convert them into a string, but I found the result is wrongly encoded and not readable at all.
Here's my environment:
Windows 2003, OS encoding: CP1252
Java 5.0
My files are UTF-8 encoded or CP1252 encoded, and some of them (UTF-8 encoded files) may contain Chinese (non-Latin) characters.
I use the following code to do my work:
private static String readFileAsString(String filePath)
        throws java.io.IOException {
    StringBuffer fileData = new StringBuffer(1000);
    FileReader fileReader = new FileReader(filePath);
    //System.out.println(fileReader.getEncoding());
    BufferedReader reader = new BufferedReader(fileReader);
    char[] buf = new char[1024];
    int numRead = 0;
    while ((numRead = reader.read(buf)) != -1) {
        String readData = String.valueOf(buf, 0, numRead);
        fileData.append(readData);
        buf = new char[1024];
    }
    reader.close();
    return fileData.toString();
}
The above code doesn't work. I found the FileReader's encoding is CP1252 even when the text is UTF-8 encoded. But the JavaDoc of java.io.FileReader says that:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate.
Does this mean that I am not required to set the character encoding myself if I am using FileReader? But I am getting wrongly encoded data; what's the correct way to deal with my situation? Thanks.

Yes, you need to specify the encoding of the file you want to read.
Yes, this means that you have to know the encoding of the file you want to read.
No, there is no general way to guess the encoding of any given "plain text" file.
The one-argument constructors of FileReader always use the platform default encoding, which is generally a bad idea.
Since Java 11, FileReader has also gained constructors that accept an encoding: new FileReader(file, charset) and new FileReader(fileName, charset).
In earlier versions of Java, you need to use new InputStreamReader(new FileInputStream(pathToFile), <encoding>).
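For example, here is a minimal sketch of that pre-Java-11 approach; the method name mirrors the one in the question, and it uses Java 7 features (try-with-resources, StandardCharsets), so on Java 5/6 substitute a finally block and the string "UTF-8":

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

static String readFileAsString(String pathToFile) throws IOException {
    // Decode the bytes explicitly as UTF-8 instead of the platform default.
    StringBuilder sb = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(pathToFile), StandardCharsets.UTF_8))) {
        char[] buf = new char[1024];
        int numRead;
        while ((numRead = reader.read(buf)) != -1) {
            sb.append(buf, 0, numRead);
        }
    }
    return sb.toString();
}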

FileReader uses Java's platform default encoding, which depends on the system settings of the computer it's running on and is generally the most popular encoding among users in that locale.
If this "best guess" is not correct then you have to specify the encoding explicitly. Unfortunately, FileReader does not allow this (major oversight in the API). Instead, you have to use new InputStreamReader(new FileInputStream(filePath), encoding) and ideally get the encoding from metadata about the file.

For Java 7+ you can use this:
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
All the standard charsets are listed in the Charset documentation.
For example, if your file is in CP1252, use:
Charset.forName("windows-1252");
The supported-encodings documentation lists the other canonical names for Java encodings, for both IO and NIO.
If you do not know exactly which encoding a file uses, you can try a third-party charset-detection library (the original answer links a tool from Google), which works fairly neatly.
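As an illustration, here is a sketch using juniversalchardet, the detection library that also appears in an answer further down; the buffer size is arbitrary, and detection is heuristic, so the result may be null:

import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

static String guessCharset(String filePath) throws IOException {
    UniversalDetector detector = new UniversalDetector(null);
    try (FileInputStream fis = new FileInputStream(filePath)) {
        byte[] buf = new byte[4096];
        int nread;
        // Feed the detector until it is confident or the file ends.
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
    }
    detector.dataEnd();
    return detector.getDetectedCharset(); // null if nothing was detected
}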

Since Java 11 you may use this:
public FileReader(String fileName, Charset charset) throws IOException;
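A minimal usage sketch of the Java 11 constructor; the file name is a placeholder:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

static void printUtf8File() throws IOException {
    // Java 11+: FileReader finally accepts an explicit charset.
    try (BufferedReader reader = new BufferedReader(
            new FileReader("input.txt", StandardCharsets.UTF_8))) {
        reader.lines().forEach(System.out::println);
    }
}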

FileInputStream with InputStreamReader is better than directly using FileReader, because the latter doesn't allow you to specify the encoding.
Here is an example using BufferedReader, FileInputStream and InputStreamReader together, so that you can read lines from a file.
List<String> words = new ArrayList<>();
List<String> meanings = new ArrayList<>();

public void readAll() throws IOException {
    String fileName = "College_Grade4.txt";
    String charset = "UTF-8";
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(
                    new FileInputStream(fileName), charset));
    String line;
    while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (line.length() == 0) continue;
        int idx = line.indexOf("\t");
        words.add(line.substring(0, idx));
        meanings.add(line.substring(idx + 1));
    }
    reader.close();
}

For non-Latin scripts, for example Cyrillic, you can use something like this (the two-argument FileReader constructor requires Java 11+):
FileReader fr = new FileReader("src/text.txt", StandardCharsets.UTF_8);
and be sure that your .txt file is saved in UTF-8 (not the default ANSI) format. Cheers!

Related

Reading from a file doesn't get the expected text [duplicate]


While reading from file in java, input is prefixed by two junk characters

FileInputStream fin = new FileInputStream("D:\\testout.txt");
BufferedInputStream bin = new BufferedInputStream(fin);
int i;
while ((i = bin.read()) != -1) {
    System.out.print((char) i);
}
bin.close();
fin.close();
Output: ÿþGreat
I have checked the file testout.txt; it contains only one word, i.e., Great.
When you're reading text, you should use a Reader, e.g.:
try (
    BufferedReader reader = Files.newBufferedReader(
            Paths.get("D:\\testout.txt"),
            StandardCharsets.UTF_8)
) {
    int i;
    while ((i = reader.read()) != -1) {
        System.out.print((char) i);
    }
}
That's most probably the byte order mark (BOM), which is optional but allowed at the start of Unicode-encoded files. Some programs (e.g. Notepad) account for this possibility, some don't. Java by default doesn't strip it.
One utility to solve this is the BOMInputStream from Apache Commons IO.
Also, Notepad will write the byte order mark in the file when you save it as UTF-8.
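A sketch of the BOMInputStream approach, assuming Apache Commons IO is on the classpath; since ÿþ is the bytes 0xFF 0xFE, the extra ByteOrderMark arguments tell it to look for UTF-16 BOMs as well as the UTF-8 default:

import java.io.FileInputStream;
import java.io.IOException;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

static String detectBomCharset(String path) throws IOException {
    try (BOMInputStream bomIn = new BOMInputStream(new FileInputStream(path),
            ByteOrderMark.UTF_8, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_16BE)) {
        // The BOM (if any) is consumed here, so the remaining stream is clean text.
        return bomIn.hasBOM() ? bomIn.getBOMCharsetName() : "UTF-8";
    }
}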
ÿþ is the byte order mark of UTF-16 little-endian: the bytes 0xFF 0xFE, printed here as CP1252 characters. You can convert your string to UTF-8 with java.io as explained here.
You may also refer to the linked answer for more detail.
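If the file really is UTF-16, Java's built-in "UTF-16" charset reads the BOM itself and picks the right byte order; a minimal sketch using the path from the question:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

static void printUtf16File() throws IOException {
    // StandardCharsets.UTF_16 consumes the BOM, so "Great" prints without ÿþ.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream("D:\\testout.txt"), StandardCharsets.UTF_16))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    }
}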
Please use the UTF-8 character encoding to resolve this kind of issue:
byte[] utf8Bytes = input.getBytes("UTF-8"); // encode the Unicode string as UTF-8
String test = new String(utf8Bytes, "UTF-8"); // decode it back with the same charset

Java - Problems with "ü/ä/ö" after SCP

I created a program which can load local or remote log files.
If I load a local file there is no error.
But if I first copy the file with SCP to my local machine (where I use this code: http://www.jcraft.com/jsch/examples/ScpFrom.java.html) and read it out, I get an error and the letters "ü/ä/ö" are shown as �.
How can I fix this?
Remote : Linux-Server
Local: Windows-PC
Code for SCP :
http://www.jcraft.com/jsch/examples/ScpFrom.java.html
Code for reading out :
protected void openTempRemoteFile() throws IOException {
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(lfile)));
    String strLine;
    DefaultTableModel dtm = new DefaultTableModel(0, 0);
    String header[] = new String[]{ "Timestamp", "Session-ID", "Log" };
    dtm.setColumnIdentifiers(header);
    table.setModel(dtm);
    while ((strLine = reader.readLine()) != null) {
        String[] sparts = strLine.split(" ");
        String[] bparts = strLine.split(" : ");
        String timestamp = sparts[0] + " " + sparts[1];
        String sessionID = sparts[4];
        String log = bparts[1];
        dtm.addRow(new Object[] { timestamp, sessionID, log });
    }
    reader.close();
}
EDIT :
Encoding Format of the Local-Files: UTF-8
Encoding Format of the SCP-Remote-Files from Linux-Server: WINDOWS-1252
Supply an appropriate Charset to the InputStreamReader constructor, e.g.:
import java.nio.charset.StandardCharsets;
...
BufferedReader reader = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(lfile),
                StandardCharsets.UTF_8)); // try also ISO_8859_1 if UTF_8 doesn't help
To fix your problem you have at least two options:
You can specify the encoding for your files directly in your code, updating it as follows:
BufferedReader reader = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(lfile),
                "UTF8"));
or set the default file encoding when starting the JVM with:
java -Dfile.encoding=UTF-8 … com.example.Main
I definitely prefer the first way, and you can parametrize the "UTF8" value too, if you need to.
With the latter approach you could still face the same issue on any machine where you forget to specify the flag.
You can replace the encoding with whatever you prefer (Refer to https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html for Supported Encodings) and, on Windows, "Cp1252" is usually the default encoding.
Remember, you can always query the file.encoding property or Charset.defaultCharset() to find the current default encoding for your application, e.g.:
byte[] byteArray = "blablabla".getBytes();
InputStream inputStream = new ByteArrayInputStream(byteArray);
InputStreamReader reader = new InputStreamReader(inputStream);
String defaultEncoding = reader.getEncoding(); // name of the platform default charset
Working with encodings is a tricky business. If your system regularly receives this kind of file (from a different environment) then you should first detect the charset and then read the file with the detected charset. I had a similar problem and I used
juniversalchardet
to detect the charset and then used InputStreamReader(stream, Charset).
In your case it would be something like:
protected void openTempRemoteFile() throws IOException {
    String encoding = UniversalDetector.detectCharset(lfile);
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(lfile), Charset.forName(encoding)));
    ....
If it is only a one-time job, open the file in a text editor (Notepad++ for example) and save it in your encoding, then use it in the program.

Read and write files with accents

I have an input file in XML format and it is well formed, with accents written correctly. The file is created with a PHP script that works fine.
But when I read the XML file and write it to another XML file using a Java program, strange characters appear instead of the accented characters.
This is the method that reads the XML file:
public static String getArchivo(FileInputStream fileinputstream)
{
    String s = null;
    try
    {
        byte abyte0[] = new byte[1024];
        int i = fileinputstream.read(abyte0);
        if (i != -1)
        {
            s = new String(abyte0, 0, i);
            for (int j = fileinputstream.read(abyte0); j != -1; j = fileinputstream.read(abyte0))
            {
                s = s + new String(abyte0, 0, j);
            }
        }
    }
    catch (IOException ioexception)
    {
        s = null;
    }
    return s;
}
Since the file is read byte by byte, how do I replace the "bad" bytes with the correct bytes for the accented characters?
If reading files like these byte by byte is not a good idea, how can I do it better?
The characters that I need are: á, é, í, ó, ú, Á, É, Í, Ó, Ú, ñ, Ñ and °.
Thanks in advance
Probably you are reading the file with the wrong charset. If decoding as UTF-8 produces garbage, the file may actually be UTF-16, so change from UTF-8 to UTF-16.
Something like
InputStream in = ...
InputSource is = new InputSource(new InputStreamReader(in, "utf-16"));
As Jordi said, this is a charset mismatch; note, though, that there are no special characters outside of UTF-8, which can encode every Unicode character.
So consider the first part as information for other special chars.
Looking deeper at your code I see that you read bytes and convert them to a String using the platform default charset. If you only need to copy the file, don't convert at all: read bytes and write bytes, so the data cannot be changed.
Works for me using charset ISO-8859-1. Syntax in Kotlin:
val inputStream: InputStream = FileInputStream(filePath)
val json = inputStream.bufferedReader(Charsets.ISO_8859_1).use { it.readText() }
When you read the file, it is best to use the UTF-8 encoding:
BufferedReader rd = new BufferedReader(new InputStreamReader(is, "utf-8"));
When writing, also use UTF-8:
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(filePath, true), "utf-8");
This worked for me.
When reading the file in vi or another editor, make sure the default encoding is UTF-8; you can check with:
locale charmap
LANG=en_US.UTF-8

How to read and write UTF-8 to disk on the Android?

I cannot read and write extended characters (French accented characters, for example) to a text file using the standard InputStreamReader methods shown in the Android API examples. When I read back the file using:
InputStreamReader tmp = new InputStreamReader(in);
BufferedReader reader = new BufferedReader(tmp);
String str;
while ((str = reader.readLine()) != null) {
    ...
the string read is truncated at the extended characters instead of at the end-of-line. The second half of the string then comes on the next line. I'm assuming that I need to persist my data as UTF-8 but I cannot find any examples of that, and I'm new to Java.
Can anyone provide me with an example or a link to relevant documentation?
Very simple and straightforward. :)
String filePath = "/sdcard/utf8_file.txt";
String UTF8 = "utf8";
int BUFFER_SIZE = 8192;
BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(filePath), UTF8), BUFFER_SIZE);
BufferedWriter bw = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream(filePath), UTF8), BUFFER_SIZE);
When you instantiate the InputStreamReader, use the constructor that takes a character set.
InputStreamReader tmp = new InputStreamReader(in, "UTF-8");
And do a similar thing with OutputStreamWriter
I like to have a
public static final Charset UTF8 = Charset.forName("UTF-8");
in some utility class in my code, so that I can call (see the Charset documentation for more)
InputStreamReader tmp = new InputStreamReader(in, MyUtils.UTF8);
and not have to handle UnsupportedEncodingException every single time.
This should just work on Android, even without explicitly specifying UTF-8, because the default charset is UTF-8. If you can reproduce this problem, please raise a bug with a reproducible test case here:
http://code.google.com/p/android/issues/entry
If you face this kind of problem, try encoding and decoding your data as Base64. This worked for me. I can share the code if you need it.
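A minimal sketch of that Base64 round-trip, assuming java.util.Base64 (Java 8+; on Android it needs API level 26+, older devices can use android.util.Base64 instead):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

static void base64RoundTrip() {
    // Encode the text over UTF-8 bytes, then decode it back the same way.
    String original = "héllo wörld";
    String encoded = Base64.getEncoder()
            .encodeToString(original.getBytes(StandardCharsets.UTF_8));
    String decoded = new String(Base64.getDecoder().decode(encoded),
            StandardCharsets.UTF_8);
    System.out.println(decoded.equals(original)); // true
}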
Check the encoding of your file by right-clicking it in the Project Explorer and selecting Properties. If it's not the right encoding you'll need to re-enter your special characters after you change it, or at least that was my experience.
