I am working with a CSV file right now.
In my program I am using an OutputStreamWriter to write data to the CSV file.
OutputStreamWriter myOutWriter = new OutputStreamWriter(fOut, Charset.forName("UTF-8").newEncoder());
I tried printing out the encoding of this writer and got the following:
Log.i(TAG, "BODY ENCODING: " + myOutWriter.getEncoding());
Logcat: BODY ENCODING: UTF-8
But when I try to open the CSV file on my desktop it says that the file is in windows-1252, so I can't read the æøå characters that I need.
Am I missing something obvious here, or am I not understanding the concept of OutputStreamWriter? I have tried different types of encoding, but it doesn't seem to work :)
When I try to open it in Excel:
Your file is actually UTF-8, not CP-1252. Your text editor/viewer merely guessed CP-1252 (without a BOM there is nothing in the bytes to identify the file as UTF-8). You can help your editor by adding a byte order mark (BOM) at the beginning of the file, i.e.
static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
...
fOut.write(UTF8_BOM);
OutputStreamWriter myOutWriter = new OutputStreamWriter(fOut, Charset.forName("UTF-8").newEncoder());
Did you try opening it in Excel? For Excel to recognize the file as UTF-8 it needs to have a BOM (https://en.wikipedia.org/wiki/Byte_order_mark).
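Putting the pieces together, a minimal sketch of the whole write path (assuming fOut is the FileOutputStream from your code; the file name and the sample rows are made up for illustration):
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CsvBomExample {
    // UTF-8 BOM bytes; the casts are needed because Java byte literals are signed
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    public static void main(String[] args) throws Exception {
        FileOutputStream fOut = new FileOutputStream("export.csv"); // hypothetical path
        fOut.write(UTF8_BOM); // write the BOM before wrapping the stream in a writer
        OutputStreamWriter myOutWriter =
                new OutputStreamWriter(fOut, Charset.forName("UTF-8").newEncoder());
        myOutWriter.write("navn;by\r\n");     // hypothetical header row
        myOutWriter.write("Bjørn;Århus\r\n"); // æøå characters survive as UTF-8
        myOutWriter.flush();
        myOutWriter.close();
    }
}
With the BOM in place, desktop editors and Excel should detect the file as UTF-8 instead of guessing windows-1252.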
Related
I have tried to create a UTF-8 file in Java using different writers. But when I open the file after creating it, it is not read as being UTF-8 encoded (I opened it in Notepad++ and it was UTF-8 without BOM).
File fileDir = new File("c:\\temp\\test.txt");
Writer out1 = new BufferedWriter(
        new OutputStreamWriter(
                new FileOutputStream(fileDir),
                Charset.forName("UTF-8").newEncoder())
);
Writer out = new OutputStreamWriter(
        new FileOutputStream(fileDir),
        Charset.forName("UTF-8")
);
out.append("Website UTF-8").append("\r\n");
out.append("?? UTF-8").append("\r\n");
out.append("??????? UTF-8").append("\r\n");
out.flush();
out.close();
You are correctly writing a file in the UTF-8 encoding. (Note that you're not using out1 and it's unnecessary).
Notepad++ tells you that the file is "UTF-8 without BOM". Why do you think this is not UTF-8?
BOM stands for byte order mark. It's a special Unicode character to indicate if the bytes in a file are in little-endian or big-endian order. But for UTF-8 it has no meaning and its use is not recommended. From the Wikipedia article:
The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
The Unicode Standard permits the BOM in UTF-8, but does not require nor recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8. The BOM may also appear when UTF-8 data is converted from other encodings that use a BOM.
Is there a special reason why you need a BOM to be included? If not, then don't worry about it. Some Java XML parsers cannot deal with a UTF-8 BOM properly and will give an error when you try to parse an XML document encoded in UTF-8 if it starts with a BOM.
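If you do have to feed such a parser, one workaround is to strip a leading BOM from the stream before handing it over. A minimal sketch, assuming a helper method (the name skipUtf8Bom is made up for illustration):
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Returns a stream positioned after a leading UTF-8 BOM, if one is present.
static InputStream skipUtf8Bom(InputStream in) throws IOException {
    PushbackInputStream pushback = new PushbackInputStream(in, 3);
    byte[] maybeBom = new byte[3];
    int read = pushback.read(maybeBom, 0, 3);
    boolean isBom = read == 3
            && maybeBom[0] == (byte) 0xEF
            && maybeBom[1] == (byte) 0xBB
            && maybeBom[2] == (byte) 0xBF;
    if (!isBom && read > 0) {
        pushback.unread(maybeBom, 0, read); // not a BOM: push the bytes back
    }
    return pushback;
}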
I am creating a CSV and writing content in UTF-8 to support German and English, specifying the encoding as below:
BufferedWriter outFile = new BufferedWriter( new OutputStreamWriter( outputStream, "UTF-8" ) );
The above works fine until I add the separator indication (;) below in the header of the CSV:
outFile.write( "sep=;" );
outFile.newLine();
Without this delimiter (;) my CSV will be wrong, but when I include it the encoding fails and UTF-8 is no longer in place.
Is there any other keyword like "sep=" to put in the header of the CSV that also specifies the encoding?
I tried encoding="UTF-8" and it is not working.
Thanks.
You cannot open a UTF-8 CSV file with Excel 2007. Microsoft have no understanding of the word "standards". Because of this, it is notoriously difficult to generate a CSV file which opens in every possible application that reads .csv files and keeps the correct encoding.
If you must use Excel 2007, I would suggest encoding with Microsoft's own "windows-1252", as it supports German characters. Don't use the header, and also look into using tab as a separator. Yes, I know the C stands for comma, but tab seems to be more consistent with Excel 2007 if you save the file back again.
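For illustration, a small sketch of that suggestion (windows-1252 encoding, tab as separator; file name and rows are hypothetical):
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;

public class Cp1252TsvExample {
    public static void main(String[] args) throws Exception {
        BufferedWriter outFile = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("report.csv"), "windows-1252"));
        outFile.write("Name\tOrt");      // tab-separated header, no "sep=" line
        outFile.newLine();
        outFile.write("Müller\tKöln");   // German characters encode fine in windows-1252
        outFile.newLine();
        outFile.close();
    }
}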
I've two questions:
Is there a way to open an .xls file and save it as a .tsv file through Java?
EDIT:
Or is there a way to convert an .xls file into a .tsv file through Java?
Is there a way to convert a UTF-16LE file to UTF-8 using Java?
Thank you
I've two questions:
On StackOverflow you should split that into two different questions...
I'll answer your second question:
Is there a way to convert a UTF-16LE file to UTF-8 using Java?
Yes of course. And there's more than one way.
Basically you want to read your input file specifying the input encoding (UTF-16LE) and then write the file specifying the output encoding (UTF-8).
Say you have some UTF-16LE encoded file:
$ file testInput.txt
testInput.txt: Little-endian UTF-16 Unicode character data
You then basically could do something like this in Java (it's just an example: you'll want to fill in missing exception handling code, maybe not put a last newline at the end, maybe discard the BOM if any, etc.):
FileInputStream fis = new FileInputStream(new File("/home/.../testInput.txt") );
InputStreamReader isr = new InputStreamReader( fis, Charset.forName("UTF-16LE") );
BufferedReader br = new BufferedReader( isr );
FileOutputStream fos = new FileOutputStream(new File("/home/.../testOutput.txt"));
OutputStreamWriter osw = new OutputStreamWriter( fos, Charset.forName("UTF-8") );
BufferedWriter bw = new BufferedWriter( osw );
String line = null;
while ( (line = br.readLine()) != null ) {
    bw.write(line);
    bw.newLine(); // will add an unnecessary newline at the end of your file, fix this
}
bw.flush();
// take care of closing the streams here etc.
This shall create a UTF-8 encoded file.
$ file testOutput.txt
testOutput.txt: UTF-8 Unicode (with BOM) text
The BOM can clearly be seen using, for example, hexdump:
$ hexdump testOutput.txt -C
00000000 ef bb bf ... (snip)
The BOM is encoded as three bytes in UTF-8 (ef bb bf) while it's encoded as two bytes in UTF-16. In UTF-16LE the BOM looks like this:
$ hexdump testInput.txt -C
00000000 ff fe ... (snip)
Note that UTF-8 encoded files may or may not (both are totally valid) have a "BOM" (byte order mark). A BOM in a UTF-8 file is not that silly: you don't care about the byte order, but it can help quickly identify a text file as being UTF-8 encoded. UTF-8 files with a BOM are fully legit according to the Unicode specs, and hence readers unable to deal with UTF-8 files starting with a BOM are broken. Plain and simple.
If for whatever reason you're working with broken UTF-8 readers unable to cope with BOMs, then you may want to remove the BOM from the first String before writing it to disk.
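A one-line sketch of that removal, applied to the first line read in the loop above (U+FEFF is the character a decoded BOM turns into):
if (line.startsWith("\uFEFF")) {
    line = line.substring(1); // drop the decoded BOM character
}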
More infos on BOMs here:
http://unicode.org/faq/utf_bom.html
There is a library called jexcelapi that allows you to open/edit/save .xls files.
Once you have read the .xls file it would not be hard to write something that would output it as .tsv.
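For example, a rough sketch of an .xls-to-.tsv conversion using the jxl API (file names are hypothetical, and real code should also handle BiffException and I/O errors):
import java.io.File;
import java.io.PrintWriter;
import jxl.Cell;
import jxl.Sheet;
import jxl.Workbook;

public class XlsToTsv {
    public static void main(String[] args) throws Exception {
        Workbook workbook = Workbook.getWorkbook(new File("input.xls"));
        Sheet sheet = workbook.getSheet(0); // first sheet
        try (PrintWriter out = new PrintWriter(new File("output.tsv"), "UTF-8")) {
            for (int r = 0; r < sheet.getRows(); r++) {
                Cell[] row = sheet.getRow(r);
                StringBuilder line = new StringBuilder();
                for (int c = 0; c < row.length; c++) {
                    if (c > 0) {
                        line.append('\t'); // tab separator between cells
                    }
                    line.append(row[c].getContents());
                }
                out.println(line);
            }
        }
        workbook.close();
    }
}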
I'm trying to read a file from the SD card and I've been told it's in Unicode format. However, when I try to read the file I get the following:
This is the code I'm using to read the file:
InputStreamReader fw = new InputStreamReader(new FileInputStream(root.getAbsolutePath()+"/Drive/sdk/cmd.62.out"), "UTF-8");
char[] buf = new char[255];
int count = fw.read(buf);
String readString = new String(buf, 0, count);
Log.d("courierread",readString);
fw.close();
If I write that output to a file this is what I get when I open it in a hex editor:
Any thoughts on what I need to do to read the file correctly?
Does the file have a byte-order mark? In that case look at Reading UTF-8 - BOM marker
EDIT (from comment): That looks like little-endian UTF-16 to me. Try the charset "UTF-16LE".
The file you show in the hex editor is not UTF-8 encoded, it looks more like UTF-16. This means you must specify UTF-16 as the encoding in your code (probably the UTF-16LE variant).
If it were UTF-8 encoded, then it would represent all characters representable in ASCII using just a single byte.
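Concretely, that means changing the charset passed to your InputStreamReader. A minimal sketch based on the snippet from the question (UTF-16LE is the guess; adjust if the hex dump says otherwise):
BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(root.getAbsolutePath() + "/Drive/sdk/cmd.62.out"), "UTF-16LE"));
StringBuilder content = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
    content.append(line).append('\n'); // keep line breaks while accumulating
}
reader.close();
Log.d("courierread", content.toString());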
I have some strings in Java (originally from an Excel sheet) that I presume are in the Windows-1252 codepage. I want them converted to Java's own Unicode format. The Excel file was parsed using the JXL package, in case that matters.
I will clarify: apparently the strings obtained from the Excel file already look like they are in some kind of Unicode.
WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger);
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
row = s.getRow(4);
String contents = row[0].getContents();
This is where contents seems to contain something Unicode: the åäö are multi-byte characters, while the ASCII ones are normal single-byte characters. It is most definitely not Latin-1. If I print the "contents" string with println and redirect it to a hello.txt file, I find that the letter "ö" is represented with two bytes, C3 B6 in hex (195 and 182 in decimal).
[edit]
I have tried the suggestions with different codepages etc given below, tried converting from Cp1252 etc. There was some kind of conversion, because I would get some other kind of gibberish instead. As reference I always printed an "ö" string hand coded into the source code, to verify that there was not something wrong with my terminal or typefaces or anything. The manually typed "ö" always worked.
[edit]
I also tried WorkbookSettings as suggested in the comments, but I looked in the code for JXL and the character set seems to be ignored by the parsing code. I think the parsing code just looks at whatever encoding the XLS file is supposed to be in.
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("CP1250");
Worked for me.
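In context, that setting goes on the same WorkbookSettings object the question already passes to the factory method (the encoding name is whatever matches how the sheet was written, e.g. "Cp1252" for Windows-1252):
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("Cp1252"); // or "CP1250", depending on the source file
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
String contents = s.getRow(4)[0].getContents();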
If none of the answers above solve the problem, the trick might be done like this:
String myOutput = new String(myInput, "UTF-8");
This should decode the incoming bytes as UTF-8 (note that myInput has to be a byte[], not a String).
When Java parses a file it uses some encoding to read the bytes on the disk and create characters in memory. The default encoding varies from platform to platform. Java's internal String representation is already Unicode, so if it parses the file with the right encoding then you are already done; just write out the data in any encoding you want.
If your strings appear corrupted when you look at them in Java, it is probably because you are using the wrong encoding to read the data. Excel is probably using UTF-16 (little-endian, I think), but I'd expect a library like JXL to be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't do anything with character encodings, so I imagine it auto-detects encodings as it needs to.
Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:
String text = getCP1252Text(); // doesn't matter what the original encoding was, Java always uses Unicode
FileOutputStream fos = new FileOutputStream("test.txt"); // Open file
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-16"); // Specify character encoding
PrintWriter pw = new PrintWriter(osw);
pw.print(text ); // repeat as needed
pw.close(); // cleanup
osw.close();
fos.close();
If your problem is something else please edit your question and provide more details.
You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.
JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.
"windows-1252"/"Cp1252" is not required to be supported by JREs, but is by Sun's (and presumably most others). See the "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.
FileInputStream fis = new FileInputStream (yourFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis,"CP1250"));
And do with reader whatever you'd do directly with file.
Your description indicates that the encoding is UTF-8 and indeed C3 B6 is the UTF-8 encoding for 'ö'.
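You can verify that quickly from Java itself:
import java.nio.charset.StandardCharsets;

public class UtfCheck {
    public static void main(String[] args) {
        for (byte b : "ö".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b & 0xFF); // prints: c3 b6
        }
    }
}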