Unable to write multibyte character in Excel with UTF8/UTF16 Encoding

I have been trying to write Simplified Chinese characters into an Excel file using
OutputStreamWriter(OutputStream out, String charsetName).write(String str,int off,int len);
OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(file), "UTF-16");
osw.write((vt.get(index)).toString());
Unfortunately this is not working: Excel shows junk characters instead. Does anyone have any idea about this?
Is this a problem with Excel, or can I fix it within my code?

My version of Excel is having trouble with Chinese so I decided to pick on the Russians instead. Cyrillic is far enough into Unicode that if you can get this to work you should be able to get Chinese to work.
Your code is close but there are two things wrong:
UTF-16 can be either big-endian or little-endian. When encoding, the Java charset name "UTF-16" really means UTF-16 with big-endian byte order. Microsoft always uses little-endian as their default. You need to use the charset "UTF-16LE".
You need to warn Excel that you are using this encoding by putting a byte order mark (BOM) at the beginning of the file. For UTF-16LE it is just two bytes: 0xFF followed by 0xFE.
Here is a simple program that prints "War and Peace" in Russian with each word in a separate column. The resulting file can be imported into Excel. Just replace the Russian text with your Chinese text.
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
public class Russian
{
    public static void main(String [] args) throws Exception
    {
        // Little-endian UTF-16 byte order mark
        byte [] bom = { (byte) 0xFF, (byte) 0xFE};
        String text = "ВОЙНА,И,МИР";
        FileOutputStream fout = new FileOutputStream("WarAndPeace.csv");
        // Write the BOM as raw bytes before any encoded text
        fout.write(bom);
        OutputStreamWriter out = new OutputStreamWriter(fout, "UTF-16LE");
        out.write(text);
        out.close();
    }
}
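To make the difference between the two charset names visible, here is a minimal sketch that prints the bytes Java produces for a single Chinese character. The class name Utf16Bytes and the helper hex are made up for this illustration and are not part of the original program.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class Utf16Bytes {

    // Encodes the text with the given charset and formats the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "\u4E2D"; // a single Chinese character (U+4E2D)
        // "UTF-16": Java emits a big-endian BOM followed by big-endian data.
        System.out.println(hex(text, StandardCharsets.UTF_16));   // FE FF 4E 2D
        // "UTF-16LE": little-endian data and no BOM, so the BOM is written by hand.
        System.out.println(hex(text, StandardCharsets.UTF_16LE)); // 2D 4E
    }
}
```

Because "UTF-16LE" writes no BOM of its own, the two bytes 0xFF 0xFE have to be written to the underlying stream first, as the program above does.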

Related

Java changes special characters when using FileReader

I have a problem with Java: I have an ASCII-encoded file, and when I pass its content to the output file, special characters that I need to keep are changed:
Original file:
Output file:
This is the code I use to read the ASCII file into a string of length 7000. The problem occurs where the record reaches the special characters, at positions 486 to 498: FileReader does not return them correctly and replaces them with other characters (as I understand it, that part is binary):
fr = new FileReader(sourceFile);
//BufferedReader br = new BufferedReader(fr);
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream(sourceFile), "UTF-8"));
String asciiString;
asciiString = br.readLine();
Edit:
I am doing a conversion from ASCII to EBCDIC. I am using CharFormatConverter.java
I really don't understand why the special characters are lost and not maintained. I found the UTF-8 code in another forum, but characters are still lost. Read file utf-8
Edit:
I was thinking about using FileReader for the ASCII data and FileInputStream for the binary data in the same file (although I can't figure out how to extract it by position), so that the two formats are kept separate and then merged after the conversion.
Regards.

If the data in your file is binary rather than textual, you cannot read it as a String, and no charset will help you. A charset is a scheme that tells you how to map a particular character to a numeric code and vice versa, so it only applies to text. You will need to read your data as binary, that is, as a sequence of bytes, and write it the same way, using an InputStream implementation. In your case a good candidate might be FileInputStream, but other options can be used as well.
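As a rough illustration of why decoding binary data as text loses information, the sketch below pushes a few arbitrary bytes through a UTF-8 decode and encode cycle. The class name BinaryRoundTrip and the sample bytes are assumptions chosen for the demonstration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names and sample data are illustrative only.
final class BinaryRoundTrip {

    // Decodes the bytes to a String and encodes them back with the same charset.
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // Arbitrary binary data; 0xFF can never appear in valid UTF-8.
        byte[] binary = { 0x41, (byte) 0xFF, 0x00, (byte) 0x9C };
        byte[] viaUtf8 = roundTrip(binary, StandardCharsets.UTF_8);
        // The invalid bytes were replaced during decoding, so the arrays differ.
        System.out.println(Arrays.equals(binary, viaUtf8)); // false
    }
}
```

Pure ASCII bytes survive the same round trip unchanged, which is why the damage only shows up at the binary positions of a record.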

Since your base code (CharFormatConverter) is byte-oriented, and it looks like your input files are binary, you should replace Readers by InputStreams, which produce bytes (not characters).
This is the ordinary way to read and process an InputStream:
private void convertFileToEbcdic(File sourceFile)
    throws IOException
{
    try (InputStream input = new FileInputStream(sourceFile))
    {
        byte[] buffer = new byte[4096];
        int len;
        do {
            // read() returns the number of bytes read, or -1 at end of file
            len = input.read(buffer);
            if (len > 0)
            {
                byte[] ebcdic = convertBufferFromAsciiToEbcdic(buffer, len);
                // Now ebcdic contains the buffer converted to EBCDIC. You may use it.
            }
        } while (len >= 0);
    }
}
private byte[] convertBufferFromAsciiToEbcdic(byte[] ascii, int length)
{
    // Create an array of the same length as the input received
    // and fill it with the input data converted to EBCDIC
    byte[] ebcdic = new byte[length];
    // (conversion logic goes here)
    return ebcdic;
}

Java IO with UTF characters

I have a weird problem with files.
I intend to modify the timing of an .srt file, but writing the new file seems to be a weird task.
Here's a sample code I wrote:
import java.io.*;
import java.nio.charset.Charset;
public class ReaderWriter {
    public static void main(String[] args) throws IOException {
        InputStream inputStream = new FileInputStream("D:\\E\\Movies\\English\\1960's\\TheApartment1960.srt");
        Reader reader = new InputStreamReader(inputStream,
                Charset.forName("UTF-8"));
        OutputStream outputStream = new FileOutputStream("output.srt");
        Writer writer = new OutputStreamWriter(outputStream,
                Charset.forName("UTF-8"));
        int data = reader.read();
        while (data != -1) {
            char theChar = (char) data;
            writer.write(theChar);
            data = reader.read();
        }
        reader.close();
        writer.close();
    }
}
This is an image from the original file:
However, the resulting file looks like this:
I searched a lot for a solution, but in vain. Any help, please.

First a few points:
There is nothing wrong with your Java code. If I use it to read an input file containing Arabic text encoded in UTF-8 it creates the output file encoded in UTF-8 with no problems.
I don't think there is a font issue. Since you can successfully display the content of the input file there is no reason you cannot also successfully display the content of a valid output file.
Those black diamonds with question marks in the output file are replacement characters which are "used to replace an incoming character whose value is unknown or unrepresentable in Unicode". This indicates that the input file you are reading is not UTF-8 encoded, even though the code explicitly states that it is. I can reproduce similar results to yours if the input file is UTF-16 encoded, but specified as UTF-8 in the code.
The reverse also reproduces the problem: the input file truly is UTF-8 encoded, but it is specified as UTF-16 in the code. For example, here is a valid UTF-8 input file with some Arabic text, read by code that (incorrectly) stated Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-16"));:
يونكود في النظم القائمة وفيما يخص التطبيقات الحاسوبية، الخطوط، تصميم النصوص والحوسبة متعددة اللغات.
And here is the output file, containing the replacement characters because the input stream of the UTF-8 file was incorrectly processed as UTF-16:
���⃙臙訠���ꟙ蓙苘Ꟙꛙ藘ꤠ���諘께딠�����ꟙ蓘귘Ꟙ동裘꣙諘꧘谠����꫘뗙藙諙蔠���⃙裘ꟙ蓘귙裘돘꣘ꤠ���⃘ꟙ蓙蓘뫘Ꟙꨮ�
Given all that, simply ensuring that the encoding of the input file is specified correctly in the InputStreamReader() constructor should solve your problem. To verify this, just create another input file and save it with UTF-8 character encoding, then run your code. If it works, then you know that the problem was that the encoding of the input file was not UTF-8.
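Another way to test the input file is to decode its bytes strictly and see whether the decoder complains. The following is only a sketch: the class name Utf8Check and the method isValidUtf8 are hypothetical, and the sample bytes are generated in memory rather than read from the .srt file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class Utf8Check {

    // Returns true only if the bytes decode as UTF-8 without malformed input.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "\u0645\u0643\u062A\u0628\u0629"; // an Arabic sample word
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));  // true
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_16))); // false
    }
}
```

For a real file the bytes could come from Files.readAllBytes. A passing check does not prove the file is UTF-8 (plain ASCII passes too), but a failing one shows that UTF-8 is the wrong charset to pass to InputStreamReader.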

How to make a frequency table from file content using fileInputStream

My assignment is to create a program that does compression using the Huffman algorithm. My program must be able to compress any type of file, which is why I'm not using a Reader, which works with characters.
I don't understand how to build a frequency table when encoding a binary file.
EDIT!! Problem solved.
public static void main(String args[]){
    try{
        FileInputStream in = new FileInputStream("./src/hello.jpg");
        int currentByte;
        while((currentByte = in.read())!=-1){ //in.read()
            //read all byte streams in file and create a frequency
            //table
        }
    }catch (IOException e){
        e.printStackTrace();
    }
}

I'm not sure what you mean by "reading from an image and look at the characters", but for text files (you are reading one in your code example) this usually works by casting the byte you read to a char:
char charVal = (char) currentByte;
This mostly works because most data is ASCII and most charsets contain ASCII. It gets more complicated with non-ASCII characters, because a simple cast is equivalent to decoding with the charset ISO-8859-1. This will still produce correct results most of the time, because e.g. Windows' cp1252 (used on German systems) differs from ISO-8859-1 only in the range 0x80 to 0x9F, which holds the Euro sign and a few other punctuation characters.
Things start to go wrong with charsets like UTF-8, where non-ASCII characters are encoded with multiple bytes, so you will see things like Ã¤ instead of an ä. The same goes for files encoded in UTF-16 (often just called "Unicode"), where every second byte is most likely a binary zero.
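The effect of the cast can be reproduced in a few lines. This is a sketch with made-up names (ByteCastDemo, castEachByte); it encodes ä as UTF-8 and then turns each byte into a char, the way the cast above would.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class ByteCastDemo {

    // Turns every byte into one char by a plain cast.
    static String castEachByte(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            // Mask to 0..255, the same range InputStream.read() returns.
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00E4".getBytes(StandardCharsets.UTF_8); // ä is C3 A4 in UTF-8
        String wrong = castEachByte(utf8);                        // two chars: Ã and ¤
        String right = new String(utf8, StandardCharsets.UTF_8);  // one char: ä
        System.out.println(wrong.length()); // 2
        System.out.println(right.length()); // 1
        // The cast gives the same result as decoding with ISO-8859-1.
        System.out.println(wrong.equals(new String(utf8, StandardCharsets.ISO_8859_1))); // true
    }
}
```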

You could use Files.readAllBytes and then iterate over this array.
Path path = Paths.get("hello.txt");
try {
    // Reads the whole file into memory as raw bytes
    byte[] array = Files.readAllBytes(path);
} catch (IOException e) {
    e.printStackTrace();
}
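For the frequency table itself, one possible approach is an array with 256 counters, one per byte value. The sketch below assumes this layout; the class name ByteFrequency and the method count are hypothetical, and the sample data is an in-memory stream standing in for the real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class; the names are illustrative only.
final class ByteFrequency {

    // Counts how often each byte value (0..255) occurs in the stream.
    static long[] count(InputStream in) throws IOException {
        long[] freq = new long[256];
        byte[] buffer = new byte[4096];
        int len;
        while ((len = in.read(buffer)) != -1) {
            for (int i = 0; i < len; i++) {
                // Mask so that negative byte values map to indexes 128..255.
                freq[buffer[i] & 0xFF]++;
            }
        }
        return freq;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = { 1, 1, 2, (byte) 0xFF };
        try (InputStream in = new ByteArrayInputStream(sample)) {
            long[] freq = count(in);
            System.out.println(freq[1]);    // 2
            System.out.println(freq[0xFF]); // 1
        }
    }
}
```

Passing a FileInputStream instead of the ByteArrayInputStream would count the bytes of any file, text or binary, which is what a Huffman encoder needs as input.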

Reading file with charset encoding

I am trying to write an Arabic word to a file with a buffered output stream in Java and open it in Windows Notepad. After writing, Notepad reports the file's encoding as UTF-8, so apparently the default charset for writing files in Java is UTF-8. The strange thing is that when I read the file back with a buffered input stream, it does not seem to be read as UTF-8, because the result is strange symbols.
class writeFile extends BufferedOutputStream {
    public writeFile(OutputStream out) {
        super(out);
    }
    public static void main(String arg[]) throws IOException {
        writeFile out = new writeFile(new FileOutputStream(new File("path_String")));
        // getBytes() encodes with the platform default charset
        out.write("مكتبة".getBytes());
        out.close();
    }
}
The word is written correctly, but when I read it back:
class readFile extends BufferedInputStream {
    public readFile(InputStream in) {
        super(in);
    }
    public static void main(String arg[]) throws IOException {
        readFile in = new readFile(new FileInputStream(new File("path_String")));
        int c;
        // read() returns one byte at a time; each byte is cast to a char
        while ((c = in.read()) != -1)
            System.out.print((char) c);
        in.close();
    }
}
the result is not what was written to the file: ÙÙØªØ¨Ø©
So does this mean that Java uses UTF-8 encoding when writing but a different encoding when reading?

The issue is not that it is not reading with UTF-8; it's that you are trashing the encoding in your read operation. FileInputStream.read() is very clearly documented to read one byte at a time. Converting individual bytes to characters is not going to work if you have multi-byte sequences in your file (which you almost certainly do, since it is in Arabic).
As you figured out, the easiest solution is to use InputStreamReader, which reads the bytes from an underlying FileInputStream (or other stream), and correctly decodes the character sequences. The default encoding here is of course the same as for the writer:
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
You can do a similar thing by reading the entire file into a byte buffer and then decoding the entire thing using something like String(byte[]). The results should be identical if you read the entire file because now the decoder will have enough information to correctly parse out all the multi-byte characters.
There is a reference on encoding and decoding that I found very useful in understanding the subject: http://kunststube.net/encoding/
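A minimal sketch of the InputStreamReader approach, with the charset given explicitly instead of relying on the platform default, might look like this. The class name DecodeStream and the method readAll are made up for the example, and a ByteArrayInputStream stands in for the FileInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class DecodeStream {

    // Reads the whole stream and decodes it with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int len;
            while ((len = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, len);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String word = "\u0645\u0643\u062A\u0628\u0629"; // the Arabic word from the question
        byte[] fileBytes = word.getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8);
        System.out.println(decoded.equals(word)); // true
    }
}
```

Naming the charset on both the writing and the reading side keeps the result independent of the machine's default encoding.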

Android read file encoding issue

I'm trying to read a file from the SD card and I've been told it's in unicode format. However, when I try to read the file I get the following:
This is the code I'm using to read the file:
InputStreamReader fw = new InputStreamReader(new FileInputStream(root.getAbsolutePath()+"/Drive/sdk/cmd.62.out"), "UTF-8");
char[] buf = new char[255];
fw.read(buf);
String readString = new String(buf);
Log.d("courierread",readString);
fw.close();
If I write that output to a file this is what I get when I open it in a hex editor:
Any thoughts on what I need to do to read the file correctly?

Does the file have a byte-order mark? In that case look at Reading UTF-8 - BOM marker
EDIT (from comment): That looks like little-endian UTF-16 to me. Try the charset "UTF-16LE".

The file you show in the hex editor is not UTF-8 encoded, it looks more like UTF-16. This means you must specify UTF-16 as the encoding in your code (probably the UTF-16LE variant).
If it were UTF-8 encoded, then it would represent all characters representable in ASCII using just a single byte.
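If the file does start with a byte-order mark, the charset can be picked from its first bytes. The following is a sketch under that assumption; BomSniffer and fromBom are hypothetical names, only the UTF-8 and UTF-16 marks are checked (UTF-32 is ignored), and a file without a BOM yields null.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class BomSniffer {

    // Returns the charset implied by a leading BOM, or null if none is recognized.
    static Charset fromBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] littleEndian = { (byte) 0xFF, (byte) 0xFE, 0x41, 0x00 };
        System.out.println(fromBom(littleEndian));            // UTF-16LE
        System.out.println(fromBom(new byte[] { 0x41, 0x42 })); // null
    }
}
```

The returned charset could then be passed to the InputStreamReader constructor in place of the hard-coded "UTF-8". Note that the BOM bytes themselves still have to be skipped before decoding with UTF-16LE or UTF-16BE.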
