I am trying to write an Arabic word to a file (opened in Windows Notepad) using a buffered output stream in Java. After writing, Notepad reports the file's encoding as UTF-8, so it seems obvious that the default charset Java used for writing the file is UTF-8. But to my surprise, when I read the file back with a buffered input stream, it is apparently not read as UTF-8, because the result is strange symbols.
import java.io.*;

class writeFile extends BufferedOutputStream {
    public writeFile(OutputStream out) {
        super(out);
    }

    public static void main(String arg[]) throws IOException {
        writeFile out = new writeFile(new FileOutputStream(new File("path_String")));
        out.write("مكتبة".getBytes()); // getBytes() encodes with the platform default charset
        out.close();
    }
}
The file is written correctly, but when I read it back:
import java.io.*;

class readFile extends BufferedInputStream {
    public readFile(InputStream in) {
        super(in);
    }

    public static void main(String arg[]) throws IOException {
        readFile in = new readFile(new FileInputStream(new File("path_String")));
        int c;
        while ((c = in.read()) != -1)   // read() returns a single byte per call
            System.out.print((char) c);
        in.close();
    }
}
the result is not what was written to the file: ÙÙتبة
So does this mean that Java uses UTF-8 encoding when writing and another encoding when reading?
The issue is not that it is not reading with UTF-8; it's that you are trashing the encoding in your read operation. FileInputStream.read() is very clearly stated to read one byte at a time. Casting bytes to characters is not going to work if you have multi-byte sequences in your file (which you almost certainly do, since it is in Arabic).
As you figured out, the easiest solution is to use InputStreamReader, which reads the bytes from an underlying FileInputStream (or other stream), and correctly decodes the character sequences. The default encoding here is of course the same as for the writer:
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
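Applied to the code in the question, a minimal sketch could look like this ("path_String" is the same hypothetical path as above; no charset is passed, so the platform default used by the writer applies):

import java.io.*;

class readFileWithReader {
    public static void main(String arg[]) throws IOException {
        // Decode bytes into characters instead of casting single bytes to char
        Reader reader = new InputStreamReader(new FileInputStream(new File("path_String")));
        int c;
        while ((c = reader.read()) != -1)
            System.out.print((char) c);
        reader.close();
    }
}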
You can do a similar thing by reading the entire file into a byte buffer and then decoding the entire thing using something like String(byte[]). The results should be identical if you read the entire file because now the decoder will have enough information to correctly parse out all the multi-byte characters.
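A sketch of that whole-file approach, assuming the file is small enough to fit in memory and was written as UTF-8 (as Notepad reported for the file in the question):

import java.io.*;

class readFileAllBytes {
    public static void main(String arg[]) throws IOException {
        File file = new File("path_String"); // hypothetical path from the question
        byte[] bytes = new byte[(int) file.length()];
        InputStream in = new FileInputStream(file);
        int off = 0, n;
        while (off < bytes.length && (n = in.read(bytes, off, bytes.length - off)) > 0)
            off += n;
        in.close();
        // Decoding everything at once keeps multi-byte sequences intact
        System.out.println(new String(bytes, "UTF-8"));
    }
}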
There is a reference on encoding and decoding that I found very useful in understanding the subject: http://kunststube.net/encoding/
Related
I have a problem in Java: I have a file with ASCII encoding, and when I pass its contents through to the output file, the special characters that I need to keep are changed:
Original file:
Output file:
The code I use reads the ASCII file into a string of length 7000. The problem is that when it reaches the special characters within the frame (string positions 486 to 498), the FileReader does not return them correctly: it changes them for other characters instead of keeping them (as I understand it, that portion is binary):
fr = new FileReader(sourceFile);
//BufferedReader br = new BufferedReader(fr);
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream(sourceFile), "UTF-8"));
String asciiString;
asciiString = br.readLine();
Edit:
I am doing a conversion from ASCII to EBCDIC. I am using CharFormatConverter.java
I really don't understand why the special characters are lost and not maintained. I found the UTF-8 code in another forum, but characters are still lost. Read file utf-8
Edit:
I was thinking of using a FileReader for the ASCII data and a FileInputStream to get the binary data that is inside the ASCII file (though I can't figure out how to extract it by position), so as to keep the two formats separate and then merge them again after the conversion.
Regards.
If the info in your file is binary and not textual, you cannot read it as a String, and no charset will help you. A charset is a scheme that tells you how to interpret a particular character as a numeric code and vice versa; if your info is not textual, a charset is of no use. You will need to read your info as binary - a sequence of bytes - and write it back the same way. For that you need an InputStream implementation that reads the info as binary. In your case a good candidate is FileInputStream, but other options may be used.
Since your base code (CharFormatConverter) is byte-oriented, and it looks like your input files are binary, you should replace Readers by InputStreams, which produce bytes (not characters).
This is the ordinary way to read and process an InputStream:
private void convertFileToEbcdic(File sourceFile)
    throws IOException
{
    try (InputStream input = new FileInputStream(sourceFile))
    {
        byte[] buffer = new byte[4096];
        int len;
        do {
            len = input.read(buffer);
            if (len > 0)
            {
                byte[] ebcdic = convertBufferFromAsciiToEbcdic(buffer, len);
                // Now ebcdic contains the buffer converted to EBCDIC. You may use it.
            }
        } while (len >= 0);
    }
}
private byte[] convertBufferFromAsciiToEbcdic(byte[] ascii, int length)
{
    // Create an array of the same length as the input received
    // and fill it with the input data converted to EBCDIC
    byte[] ebcdic = new byte[length];
    // ... conversion logic goes here (e.g. via CharFormatConverter) ...
    return ebcdic;
}
I have a weird problem with files.
I intend to modify the timing of an .srt file, but writing the new file seems to be a weird task.
Here's a sample code I wrote:
import java.io.*;
import java.nio.charset.Charset;

public class ReaderWriter {
    public static void main(String[] args) throws IOException {

        InputStream inputStream = new FileInputStream("D:\\E\\Movies\\English\\1960's\\TheApartment1960.srt");
        Reader reader = new InputStreamReader(inputStream,
                Charset.forName("UTF-8"));

        OutputStream outputStream = new FileOutputStream("output.srt");
        Writer writer = new OutputStreamWriter(outputStream,
                Charset.forName("UTF-8"));

        int data = reader.read();
        while (data != -1) {
            char theChar = (char) data;
            writer.write(theChar);
            data = reader.read();
        }
        reader.close();
        writer.close();
    }
}
This is an image from the original file:
However, the resulting file looks like this:
I searched a lot for a solution but in vain. Any help, please.
First a few points:
There is nothing wrong with your Java code. If I use it to read an input file containing Arabic text encoded in UTF-8 it creates the output file encoded in UTF-8 with no problems.
I don't think there is a font issue. Since you can successfully display the content of the input file there is no reason you cannot also successfully display the content of a valid output file.
Those black diamonds with question marks in the output file are replacement characters which are "used to replace an incoming character whose value is unknown or unrepresentable in Unicode". This indicates that the input file you are reading is not UTF-8 encoded, even though the code explicitly states that it is. I can reproduce similar results to yours if the input file is UTF-16 encoded, but specified as UTF-8 in the code.
Alternatively, the same corruption appears if the input file truly is UTF-8 encoded but the code specifies UTF-16. For example, here is a valid UTF-8 input file with some Arabic text, read by code that (incorrectly) stated Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-16")):
يونكود في النظم القائمة وفيما يخص التطبيقات الحاسوبية، الخطوط، تصميم النصوص والحوسبة متعددة اللغات.
And here is the output file, containing the replacement characters because the input stream of the UTF-8 file was incorrectly processed as UTF-16:
���⃙臙訠���ꟙ蓙苘Ꟙꛙ藘ꤠ���諘께딠�����ꟙ蓘귘Ꟙ동裘꣙諘꧘谠����뗙藙諙蔠���⃙裘ꟙ蓘귙裘돘꣘ꤠ���⃘ꟙ蓙蓘뫘Ꟙꨮ�
Given all that, simply ensuring that the encoding of the input file is specified correctly in the InputStreamReader() constructor should solve your problem. To verify this, just create another input file and save it with UTF-8 character encoding, then run your code. If it works, you know that the problem was that the encoding of the original input file was not UTF-8.
I have a sample method which copies one file to another using InputStream and OutputStream. In this case, the source file is encoded in 'UTF-8'. Even if I don't specify the encoding while writing to the disk, the destination file has the correct encoding. But, if I have to write a java.lang.String to a file, I need to specify the encoding. Why is that ?
public static void copyFile() {
    String sourceFilePath = "C://my_encoded.txt";
    InputStream inStream = null;
    OutputStream outStream = null;

    try {
        String targetFilePath = "C://my_target.txt";
        File sourcefile = new File(sourceFilePath);
        outStream = new FileOutputStream(targetFilePath);
        inStream = new FileInputStream(sourcefile);

        byte[] buffer = new byte[1024];
        int length;
        // copy the file content in bytes
        while ((length = inStream.read(buffer)) > 0) {
            outStream.write(buffer, 0, length);
        }

        inStream.close();
        outStream.close();

        System.out.println("File " + targetFilePath + " is copied successful!");
    } catch (IOException e) {
        e.printStackTrace();
    }
}
My guess is that since the source file has the correct encoding, and since we read and write one byte at a time, it works fine. And java.lang.String is 'UTF-16' by default, so if we write it to the file it ends up handled one byte at a time instead of 2 bytes, hence the garbage values. Is that correct, or am I completely wrong in my understanding?
You are copying the file byte per byte, so you don't need to care about character encoding.
As a rule of thumb:
Use the various InputStream and OutputStream implementations for byte-wise processing (like file copy).
There are some convenience methods to handle text directly like PrintStream.println(). Be careful because most of them use the default platform specific encoding.
Use the various Reader and Writer implementations for reading and writing text.
If you need to convert between byte-wise and text processing use InputStreamReader and OutputStreamWriter with explicit file encoding.
Do not rely on the default encoding. The default character encoding is platform specific (e.g. Windows-ANSI aka Cp1252 for Windows, usually UTF-8 on Linux).
Example: If you need to read a UTF-8 text file:
BufferedReader reader =
new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF-8"));
Avoid using a FileReader, because a FileReader always uses the default encoding.
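For the writing direction, a minimal counterpart sketch (outFile mirrors the inFile above, and the text is just a placeholder):

Writer writer =
    new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outFile), "UTF-8"));
writer.write("café");   // encoded as UTF-8 regardless of the platform default
writer.close();

Similarly, avoid FileWriter, which also always uses the default encoding.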
A special case: If you need random access to a file you should use RandomAccessFile. With it you can read and write data blocks at arbitrary positions. You can read and write raw byte blocks or you can use convenience methods to read and write text. But you should read the documentation carefully. E.g. the methods readUTF() and writeUTF() use a modified UTF-8 encoding.
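A small sketch of block-wise random access under those caveats (the file name and offset are just examples):

RandomAccessFile raf = new RandomAccessFile("data.bin", "rw");
raf.seek(1024);                 // jump to an arbitrary byte position
byte[] block = new byte[16];
int read = raf.read(block);     // read a raw byte block from that position
if (read > 0) {
    raf.seek(1024);
    raf.write(block, 0, read);  // write bytes back at the same position
}
raf.close();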
InputStream, OutputStream, Reader, Writer and RandomAccessFile form the basic IO functionality, enough for most use cases. For advanced IO (e.g. memory mapped files, ...) have a look at package java.nio.
Just read your code! (For the copy part at least ;-) )
When you copy the two files, you copy them byte by byte. There is thus no conversion to a String.
When you write a String into a file, you need to convert it (sometimes indirectly) into an array of bytes (byte[]). There you need to specify your encoding.
When you read a file to get a String, you need to know its encoding in order to do it properly. Java doesn't 'skip' any bytes, but you need to make the conversion once again: from a byte[] to a String.
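A short illustration of both directions with the encoding made explicit (the text and the choice of UTF-8 are only examples):

String text = "café";
byte[] bytes = text.getBytes("UTF-8");          // String -> byte[]: the encoding decides which bytes are produced
String roundTrip = new String(bytes, "UTF-8");  // byte[] -> String: the same encoding recovers the text
System.out.println(text.equals(roundTrip));     // prints true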
I have made a small Java program in NetBeans that reads a text file. When I run the program in NetBeans, everything goes fine. So I made an executable jar of my program, but when I run that jar I get weird characters when the program reads the text file.
For example:
I get "CÃ©leste" but it has to be Céleste.
That's my code to read the file:
private void readFWFile(File file){
    try {
        FileReader fr = new FileReader(file);
        BufferedReader br = new BufferedReader(fr);
        String ligne;
        while((ligne = br.readLine()) != null) {
            System.out.println(ligne);
        }
        fr.close();
    } catch (IOException ex) {
        Logger.getLogger(FWFileReader.class.getName()).log(Level.SEVERE, null, ex);
    }
}
The FileReader class uses the "platform default character encoding" to decode bytes in the file into characters. It seems that your file is encoded in UTF-8, while the default encoding is something else on your system.
You can read the file in a specific encoding using InputStreamReader:
Reader fr = new InputStreamReader(new FileInputStream(file), "UTF-8");
This kind of output is caused by a mismatch somewhere - your file is encoded in UTF-8 but the console where you print the data expects a single-byte encoding such as Windows-1252.
You need to (a) ensure you read the file as UTF-8 and (b) ensure you write to the console using the encoding it expects.
FileReader always uses the platform default encoding when reading files. If this is UTF-8 then
your Java code reads the file as UTF-8 and sees Céleste
you then print out that data as UTF-8
in NetBeans the console clearly expects UTF-8 and displays the data correctly
outside NetBeans the console expects a single-byte encoding and displays the incorrect rendering.
Or if your default encoding is a single byte one then
your Java code reads the file as a single-byte encoding and sees CÃ©leste
you then print out that data as the same encoding
NetBeans treats the bytes you wrote as UTF-8 and displays Céleste
outside NetBeans you see the wrong data you originally read.
Use an InputStreamReader with a FileInputStream to ensure you read the data in the correct encoding, and make sure that when you print data to the console you do so using the encoding that the console expects.
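A sketch of both halves of that advice, reusing the file parameter from the question's method; the console encoding (UTF-8 here) is an assumption you would adapt to whatever your terminal actually uses:

BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "UTF-8"));   // read the file as UTF-8
PrintStream console = new PrintStream(System.out, true, "UTF-8");     // print with the console's encoding
String ligne;
while ((ligne = br.readLine()) != null) {
    console.println(ligne);
}
br.close();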
I have been trying to write simplified chinese characters into the excel file using
OutputStreamWriter(OutputStream out, String charsetName).write(String str,int off,int len);
OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(file), "UTF-16");
osw.write((vt.get(index)).toString());
But unfortunately this is not working; it shows junk characters instead. Does anyone have any idea about this?
Is this a problem with Excel, or can I rectify it within my code?
My version of Excel is having trouble with Chinese so I decided to pick on the Russians instead. Cyrillic is far enough into Unicode that if you can get this to work you should be able to get Chinese to work.
Your code is close but there are two things wrong:
UTF-16 can be either big-endian or little-endian. The Java charset name "UTF-16" really means UTF-16 with big-endian encoding. Microsoft always uses little-endian as their default. You need to use the charset "UTF-16LE".
You need to warn Excel that you are using this encoding by putting a byte order mark (BOM) at the beginning of the file. It's just two bytes 0xFF followed by 0xFE.
Here is a simple program that prints "War and Peace" in Russian with each word in a separate column. The resulting file can be imported into Excel. Just replace the Russian text with your Chinese text.
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;

public class Russian
{
    public static void main(String[] args) throws Exception
    {
        byte[] bom = { (byte) 0xFF, (byte) 0xFE };
        String text = "ВОЙНА,И,МИР";

        FileOutputStream fout = new FileOutputStream("WarAndPeace.csv");
        fout.write(bom);
        OutputStreamWriter out = new OutputStreamWriter(fout, "UTF-16LE");
        out.write(text);
        out.close();
    }
}