Android read file encoding issue

Android read file encoding issue - java

I'm trying to read a file from the SD card and I've been told it's in unicode format. However, when I try to read the file I get the following:
This is the code I'm using to read the file:
InputStreamReader fw = new InputStreamReader(new FileInputStream(root.getAbsolutePath()+"/Drive/sdk/cmd.62.out"), "UTF-8");
char[] buf = new char[255];
fw.read(buf);
String readString = new String(buf);
Log.d("courierread",readString);
fw.close();
If I write that output to a file this is what I get when I open it in a hex editor:
Any thoughts on what I need to do to read the file correctly?

Does the file have a byte-order mark? In that case look at Reading UTF-8 - BOM marker
EDIT (from comment): That looks like little-endian UTF-16 to me. Try the charset "UTF-16LE".

The file you show in the hex editor is not UTF-8 encoded, it looks more like UTF-16. This means you must specify UTF-16 as the encoding in your code (probably the UTF-16LE variant).
If it were UTF-8 encoded, then it would represent all characters representable in ASCII using just a single byte.

Related

How to read file with saving encoding?

So, I have file in ISO8859-1 encoding. I do the next:
InputStreamReader isr = new InputStreamReader(new FileInputStream(fileLocation));
System.out.println(isr.getEncoding());
And I get UTF8... Looks like FileInputStream or InputStreamReader convert it to UTF8.
Yes, I know about the next one way:
BufferedReader br = new BufferedReader(
new InputStreamReader(
new FileInputStream(fileLocation), "ISO-8859-1");
But I don't know beforehand what encoding my file will have.
How can I read file with saving encoding?

Binary files (bytes) that are actually text in some encoding for those bytes, unfortunately do not store the encoding (charset) somewhere.
Sometimes there is an encoding somewhere: Unicode text could have an optional BOM character at the begin of the file. HTML and XML can specify the charset.
If you downloaded the file from the internet in the header lines the charset could be mentioned. Say it were an HTML file, and Content-Type: text/html; charset=Windows-1251. Then you could read the file with Windows-1251, and always store it as UTF-8, modifying/adding a <meta charset="UTF-8">.
But in general there is no solution for determining some file's encoding. You could do:
read the bytes
if convertible to UTF-8 without error in the multibyte sequences, it is UTF-8
otherwise it is a single byte encoding, default to Windows-1252 (rather than ISO-8859-1)
maybe use word frequency tables of some languages together with encodings, and try those
write the bytes in the determined encoding to file as UTF-8
There might be a library doing such a thing; combining language recognition and charset recognition.

How to create a utf-8 encoded file in java such that it shows as UTF-8 encoded when opened in notepad++/notepad or any other text editor

I have tried to create UTF-8 file using java using different readers.But after creating when I open the file it is not read as being UTF-8 encoded(I opened it in notepad++ and it was UTF-8 without BOM).
File fileDir = new File("c:\\temp\\test.txt");
Writer out1 = new BufferedWriter(
new OutputStreamWriter(
new FileOutputStream(fileDir),
Charset.forName("UTF-8").newEncoder())
);
Writer out = new OutputStreamWriter(
new FileOutputStream(fileDir),
Charset.forName("UTF-8")
);
out.append("Website UTF-8").append("\r\n");
out.append("?? UTF-8").append("\r\n");
out.append("??????? UTF-8").append("\r\n");
out.flush();
out.close();

You are correctly writing a file in the UTF-8 encoding. (Note that you're not using out1 and it's unnecessary).
Notepad++ tells you that the file is "UTF-8 without BOM". Why do you think this is not UTF-8?
BOM stands for byte order mark. It's a special Unicode character to indicate if the bytes in a file are in little-endian or big-endian order. But for UTF-8 it has no meaning and its use is not recommended. From the Wikipedia article:
The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
The Unicode Standard permits the BOM in UTF-8, but does not require nor recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8. The BOM may also appear when UTF-8 data is converted from other encodings that use a BOM.
Is there a special reason why you need a BOM to be included? If not, then don't worry about it. Some Java XML parsers cannot deal with an UTF-8 BOM properly and will give an error when you try to parse an XML document encoded in UTF-8 when it starts with a BOM.

Character Encoding in Java

In eclipse, I changed the default encoding to ISO-8859-1. Then I wrote this:
String str = "Русский язык ";
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
ps.print(str);
It should print the String correctly, as I am specifying UTF-8 encoding. However, it is not printing.

The ISO-8859-1 character encoding only supports characters between 0 and 255, and anything else is likely to be turned into '?'

If you save the source file (the .java file) as ISO-8859-1 than str will be encoded by javac using ISO-8859-1. Your problem does not lie in the creation of PrintStream: the str you are printing is wrong from the beginning.

Yes, it looks like the terminal that your are sending this output does not support this encoding.
If you are running Eclipse, you could set the encoding as follows:
In Run Configurations...->Common ->Encoding->Other
Select UTF-8

You are basically telling the PrintStream writer to expect the input characters to be UTF-8 encoded and to output it as UTF-8. There is no conversion. If you set your IDE to use ISO-8859-1 as character encoding for your file, which in turns contains the input string than you pipe ISO-8859-1 encoded characters into an UTF-8 expecting writer. So the writer treats the bytes receiving as UTF encoded characters which will result in data junk.
Either set your IDE to encode your source files in UTF-8 and check that your characters are correctly displayed and stored. Or tell your writer to treat them as ISO-8859-1, either way should do.

opening xls file and saving it as tsv file using java and UTF-16LE to UTF-8 conversion

I've two questions:
Is there a way through which we can open a xls file and save it as a tsv file through Java?
EDIT:
Or is there a way through which we can convert a xls file into an tsv file through Java?
Is there a way in which we can convert a UTF-16LE file to UTF-8 using java ?
Thank you

I've two questions:
On StackOverflow you should split that into two different questions...
I'll answer your second question:
Is there a way in which we can convert a UTF-16LE file to UTF-8 using
java?
Yes of course. And there's more than one way.
Basically you want to read your input file specifying the input encoding (UTF-16LE) and then write the file specifying the output encoding (UTF-8).
Say you have some UTF-16LE encoded file:
... $ file testInput.txt
testInput.txt: Little-endian UTF-16 Unicode character data
You then basically could do something like this in Java (it's just an example: you'll want to fill in missing exception handling code, maybe not put a last newline at the end, maybe discard the BOM if any, etc.):
FileInputStream fis = new FileInputStream(new File("/home/.../testInput.txt") );
InputStreamReader isr = new InputStreamReader( fis, Charset.forName("UTF-16LE") );
BufferedReader br = new BufferedReader( isr );
FileOutputStream fos = new FileOutputStream(new File("/home/.../testOutput.txt"));
OutputStreamWriter osw = new OutputStreamWriter( fos, Charset.forName("UTF-8") );
BufferedWriter bw = new BufferedWriter( osw );
String line = null;
while ( (line = br.readLine()) != null ) {
bw.write(line);
bw.newLine(); // will add an unnecessary newline at the end of your file, fix this
}
bw.flush();
// take care of closing the streams here etc.
This shall create a UTF-8 encoded file.
$ file testOutput.txt
testOutput.txt: UTF-8 Unicode (with BOM) text
The BOM can clearly be seen using, for example, hexdump:
$ hexdump testOutput.txt -C
00000000 ef bb bf ... (snip)
The BOM is encoded on three bytes in UTF-8 (ef bb fb) while it's encoded on two bytes in UTF-16. In UTF16-LE the BOM looks like this:
$ hexdump testInput.txt -C
00000000 ff fe ... (snip)
Note that UTF-8 encoded files may or may not (both are totally valid) have a "BOM" (byte order mask). A BOM in a UTF-8 file is not that silly: you don't care about the byte order but it can help quickly identify a text file as being UTF-8 encoded. UTF-8 files with a BOM are fully legit according to the Unicode specs and hence readers unable to deal with UTF-8 files starting with a BOM are broken. Plain and simple.
If for whatever reason you're working with broken UTF-8 readers unable to cope with BOMs, then you may want to remove the BOM from the first String before writing it to disk.
More infos on BOMs here:
http://unicode.org/faq/utf_bom.html

There is a library called jexcelapi that allows you to open/edit/save .xls files.
Once you have read the .xls file it would not be hard to write something that would output it as .tsv.

Java character conversion to UTF-8

I am using:
InputStreamReader isr = new InputStreamReader(fis, "UTF8");
to read in characters from a text file and converting them to UTF8 characters.
My question is, what if one of the characters being read cannot be converted to utf8, what happens? Will there be an exception? or will get the character get dropped off?

You are not converting from one charset to another. You are just indicating that the file is UTF 8 encoded so that you can read it correctly.
If you want to convert from 1 encoding to the other then you should do something like below
File infile = new File("x-utf8.txt");
File outfile = new File("x-utf16.txt");
String fromEncoding="UTF-8";
String toEncoding="UTF-16";
Reader in = new InputStreamReader(new FileInputStream(infile), fromEncoding);
Writer out = new OutputStreamWriter(new FileOutputStream(outfile), toEncoding);
After going through the David Gelhar's response, I feel this code can be improved a bit. If you doesn't know the encoding of the "inFile" then use the GuessEncoding library to detect the encoding and then construct the reader in the encoding detected.

If the input file contains bytes that are not valid utf-8, read() will by default replace the invalid characters with a value of U+FFFD (65533 decimal; the Unicode "replacement character").
If you need more control over this behavior, you can use:
InputStreamReader(InputStream in, CharsetDecoder dec)
and supply a CharsetDecoder configured to your liking.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Android read file encoding issue - java

Does the file have a byte-order mark? In that case look at Reading UTF-8 - BOM marker EDIT (from comment): That looks like little-endian UTF-16 to me. Try the charset "UTF-16LE".

The file you show in the hex editor is not UTF-8 encoded, it looks more like UTF-16. This means you must specify UTF-16 as the encoding in your code (probably the UTF-16LE variant). If it were UTF-8 encoded, then it would represent all characters representable in ASCII using just a single byte.

Related

How to read file with saving encoding?

How to create a utf-8 encoded file in java such that it shows as UTF-8 encoded when opened in notepad++/notepad or any other text editor

Character Encoding in Java

opening xls file and saving it as tsv file using java and UTF-16LE to UTF-8 conversion

Java character conversion to UTF-8

Categories

Resources