Java IO with UTF characters

I have a weird problem with files.
I intend to modify the timing of an .srt file, but writing the new file seems to be a weird task.
Here's a sample code I wrote:
import java.io.*;
import java.nio.charset.Charset;
public class ReaderWriter {
public static void main(String[] args) throws IOException {
InputStream inputStream = new FileInputStream("D:\\E\\Movies\\English\\1960's\\TheApartment1960.srt");
Reader reader = new InputStreamReader(inputStream,
Charset.forName("UTF-8"));
OutputStream outputStream = new FileOutputStream("output.srt");
Writer writer = new OutputStreamWriter(outputStream,
Charset.forName("UTF-8"));
int data = reader.read();
while (data != -1) {
char theChar = (char) data;
writer.write(theChar);
data = reader.read();
}
reader.close();
writer.close();
}
}
This is an image from the original file:
However, the resulting file looks like this:
I searched a lot for a solution, but in vain. Any help would be appreciated.

First a few points:
There is nothing wrong with your Java code. If I use it to read an input file containing Arabic text encoded in UTF-8 it creates the output file encoded in UTF-8 with no problems.
I don't think there is a font issue. Since you can successfully display the content of the input file there is no reason you cannot also successfully display the content of a valid output file.
Those black diamonds with question marks in the output file are replacement characters which are "used to replace an incoming character whose value is unknown or unrepresentable in Unicode". This indicates that the input file you are reading is not UTF-8 encoded, even though the code explicitly states that it is. I can reproduce similar results to yours if the input file is UTF-16 encoded, but specified as UTF-8 in the code.
The same thing happens in the other direction, when the input file truly is UTF-8 encoded but the code specifies UTF-16. For example, here is a valid UTF-8 input file with some Arabic text, which I read with code that (incorrectly) stated Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-16"));:
يونكود في النظم القائمة وفيما يخص التطبيقات الحاسوبية، الخطوط، تصميم النصوص والحوسبة متعددة اللغات.
And here is the output file, containing the replacement characters because the input stream of the UTF-8 file was incorrectly processed as UTF-16:
���⃙臙訠���ꟙ蓙苘Ꟙꛙ藘ꤠ���諘께딠�����ꟙ蓘귘Ꟙ동裘꣙諘꧘谠����꫘뗙藙諙蔠���⃙裘ꟙ蓘귙裘돘꣘ꤠ���⃘ꟙ蓙蓘뫘Ꟙꨮ�
Given all that, simply ensuring that the encoding of the input file is specified correctly in the InputStreamReader() constructor should solve your problem. To verify this, create another input file, save it with UTF-8 character encoding, and run your code. If it works, then you know the problem was that the encoding of the original input file was not UTF-8.
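One way to check whether a file really is UTF-8 before trusting the charset named in the code is to decode its bytes strictly, so that malformed input raises an error instead of silently turning into replacement characters. The following is a minimal sketch and not part of the original code; the class name Utf8Check and the method isValidUtf8 are hypothetical names chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the question): reports whether bytes decode cleanly as UTF-8.
final class Utf8Check {
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // At least one byte sequence is not legal UTF-8.
            return false;
        }
    }
}
```

Passing the bytes of the input file (for example, as returned by Files.readAllBytes) to this method and getting false back would confirm that the file is not UTF-8. A result of true only means the bytes are legal UTF-8, which is also the case for plain ASCII.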

Related

Java changes special characters when using FileReader

I have a problem with Java: I have a file with ASCII encoding, and when I pass its contents to the output file, the special characters that I need to keep are changed:
Original file:
Output file:
This is the code I use to read the ASCII file into a string with a length of 7000. The problem occurs where the file contains special characters, at positions 486 to 498 of the frame (string): the FileReader does not return the special characters correctly. It replaces them with others instead of keeping them (as I understand it, that part is binary):
fr = new FileReader(sourceFile);
//BufferedReader br = new BufferedReader(fr);
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream(sourceFile), "UTF-8"));
String asciiString;
asciiString = br.readLine();
Edit:
I am doing a conversion from ASCII to EBCDIC. I am using CharFormatConverter.java
I really don't understand why the special characters are lost and not maintained. I found the UTF-8 code in another forum (Read file utf-8), but characters are still lost.
Edit:
I was thinking about using FileReader for the ASCII data and FileInputStream for the binary data in the same file (although I can't figure out how to extract it by position), so that the two formats are separated and can be merged again after the conversion.
Regards.

If the content of your file is binary rather than textual, you cannot read it as a String, and no charset will help you. A charset is a scheme that defines how characters are mapped to numeric codes and vice versa, so it only applies to text. You will need to read the data as binary, that is, as a sequence of bytes, and write it the same way, using an InputStream implementation. In your case a good candidate is FileInputStream, but other options exist.
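To illustrate why a charset cannot preserve binary data, here is a small sketch that pushes a few arbitrary bytes through a String and back. The class BinaryRoundTrip, the method throughString, and the sample byte values are hypothetical and chosen only for this demonstration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: shows what happens to arbitrary bytes when they pass through a String.
final class BinaryRoundTrip {
    // Decodes the bytes to a String with the given charset, then encodes the String back to bytes.
    static byte[] throughString(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // Sample binary values; 0xC1 and 0xFF never occur in valid UTF-8.
        byte[] binary = {0x41, (byte) 0xC1, (byte) 0xFF, 0x42};
        byte[] viaUtf8 = throughString(binary, StandardCharsets.UTF_8);
        System.out.println("unchanged after UTF-8 round trip: " + Arrays.equals(binary, viaUtf8));
    }
}
```

The invalid bytes are replaced during decoding, so the comparison reports false: the data that comes out is no longer the data that went in. Reading and writing byte arrays directly avoids the decoding step altogether.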

Since your base code (CharFormatConverter) is byte-oriented, and it looks like your input files are binary, you should replace Readers by InputStreams, which produce bytes (not characters).
This is the ordinary way to read and process an InputStream:
private void convertFileToEbcdic(File sourceFile)
throws IOException
{
try (InputStream input=new FileInputStream(sourceFile))
{
byte[] buffer=new byte[4096];
int len;
do {
len=input.read(buffer);
if (len>0)
{
byte[] ebcdic=convertBufferFromAsciiToEbcdic(buffer, len);
// Now ebcdic contains the buffer converted to EBCDIC. You may use it.
}
} while (len>=0);
}
}
private byte[] convertBufferFromAsciiToEbcdic(byte[] ascii, int length)
{
// Create an array of the same size as the input received
byte[] ebcdic=new byte[length];
// and fill it with the input data converted to EBCDIC (conversion not shown here)
return ebcdic;
}

special characters in utf-8 text file

I have an input file in ANSI (UNIX) file format. I convert that file to UTF-8.
Before converting to UTF-8, there is a special character like this in the input file:
»
After converting to UTF-8, it becomes this:
Ã»
When I process my file as it is, without converting to UTF-8, all special characters disappear and data is lost as well.
But when I process my file after converting it to UTF-8, all the data appears in the output file, with the special character shown the same way as after the conversion.
ANSI to UTF-8 conversion (this could be wrong, please correct me if I am wrong somewhere):
FileInputStream fis = new FileInputStream("inputtextfile.txt");
InputStreamReader isr = new InputStreamReader(fis, "ISO-8859-1");
Reader in = new BufferedReader(isr);
FileOutputStream fos = new FileOutputStream("outputfile.txt");
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
Writer out = new BufferedWriter(osw);
int ch;
out.write("\uFEFF"); // write a BOM at the start of the output
while ((ch = in.read()) > -1) {
out.write(ch);
}
out.close();
in.close();
After this, I process the file further to produce the final output.
I'm using the Talend ETL tool (a Java-based ETL tool) to create the final output from the generated UTF-8 file.
What I want is to process my file so that the output contains the same special characters as the input file.
I'm using Java 1.8 for this whole process. I'm really stuck in this situation and have never dealt with special characters like this before.
Any suggestion would be helpful.

Cannot find ZERO WIDTH NO-BREAK SPACE when reading file

I've run into a problem when trying to parse a JSON string that I grab from a file. My problem is that the Zero width no-break space character (unicode 0xfeff) is at the beginning of my string when I read it in, and I cannot get rid of it. I don't want to use regex because of the chance there may be other hidden characters with different unicodes.
Here's what I have:
StringBuilder content = new StringBuilder();
try {
BufferedReader br = new BufferedReader(new FileReader("src/test/resources/getStuff.json"));
String currentLine;
while((currentLine = br.readLine()) != null) {
content.append(currentLine);
}
br.close();
} catch(Exception e) {
Assert.fail();
}
And this is the start of the JSON file (it's too long to copy and paste the whole thing, but I have confirmed it is valid):
{"result":{"data":{"request":{"year":null,"timestamp":1413398641246,...
Here's what I've tried so far:
Copying the JSON file to Notepad++ and showing all characters
Copying the file to Notepad++ and converting it to UTF-8 without BOM, and to ISO 8859-1
Opening the JSON file in other text editors such as Sublime and saving it as UTF-8
Copying the JSON file to a txt file and reading that in
Using Scanner instead of BufferedReader
In IntelliJ, trying View -> Active Editor -> Show Whitespaces
How can I read this file in without having the Zero width no-break space character at the beginning of the string?

0xEF 0xBB 0xBF is the UTF-8 BOM, 0xFE 0xFF is the UTF-16BE BOM, and 0xFF 0xFE is the UTF-16LE BOM. If 0xFEFF exists at the front of your String, it means you created a UTF encoded text file with a BOM. A UTF-16 BOM could appear as-is as 0xFEFF, whereas a UTF-8 BOM would only appear as 0xFEFF if the BOM itself were being decoded from UTF-8 to UTF-16 (meaning the reader detected the BOM but did not skip it). In fact, it is known that Java does not handle UTF-8 BOMs (see bugs JDK-4508058 and JDK-6378911).
If you read the FileReader documentation, it says:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
You need to read the file content using a reader that recognizes charsets, preferably one that will read the BOM for you and adjust itself internally as needed. In the worst case, you could open the file yourself, read the first few bytes to detect whether a BOM is present, and then construct a reader using an appropriate charset to read the rest of the file. Here is an example using org.apache.commons.io.input.BOMInputStream that does exactly that:
(from https://stackoverflow.com/a/13988345/65863)
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
ByteOrderMark bom = bOMInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
//use reader
} finally {
inputStream.close();
}
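If adding Apache Commons IO is not an option, a fallback is to decode the file explicitly as UTF-8 and then drop a leading U+FEFF, since Java's UTF-8 decoder passes the BOM through as that character. This is only a sketch under the assumption that the file is UTF-8; the class BomStripper and its two methods are hypothetical names, not part of any library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not from the answer): removes a byte order mark that was decoded as U+FEFF.
final class BomStripper {
    // Returns the text without a leading U+FEFF; any other text is returned unchanged.
    static String stripLeadingBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    // Reads the whole file as UTF-8 (an assumption about the file) and strips the BOM if present.
    static String readUtf8WithoutBom(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return stripLeadingBom(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

Only the first character is inspected, so a U+FEFF that appears later in the content is left alone, and no regular expression is involved.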

jar java program encoded

I have made a small Java program in NetBeans that reads a text file. When I run the program in NetBeans, everything works fine. So I made an executable jar of my program, but when I run that jar I get weird characters when the program reads the text file.
For example:
I get "CÃ©leste" but it has to be Céleste.
That's my code to read the file:
private void readFWFile(File file){
try {
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
String ligne;
while((ligne = br.readLine()) != null) {
System.out.println(ligne);
}
fr.close();
} catch (IOException ex) {
Logger.getLogger(FWFileReader.class.getName()).log(Level.SEVERE, null, ex);
}
}

The FileReader class uses the "platform default character encoding" to decode bytes in the file into characters. It seems that your file is encoded in UTF-8, while the default encoding is something else on your system.
You can read the file in a specific encoding using InputStreamReader:
Reader fr = new InputStreamReader(new FileInputStream(file), "UTF-8");

This kind of output is caused by a mismatch somewhere - your file is encoded in UTF-8 but the console where you print the data expects a single-byte encoding such as Windows-1252.
You need to (a) ensure you read the file as UTF-8 and (b) ensure you write to the console using the encoding it expects.
FileReader always uses the platform default encoding when reading files. If this is UTF-8 then
your Java code reads the file as UTF-8 and sees Céleste
you then print out that data as UTF-8
in NetBeans the console clearly expects UTF-8 and displays the data correctly
outside NetBeans the console expects a single-byte encoding and displays the incorrect rendering.
Or if your default encoding is a single byte one then
your Java code reads the file as a single byte encoding and sees CÃ©leste
you then print out that data as the same encoding
NetBeans treats the bytes you wrote as UTF-8 and displays Céleste
outside NetBeans you see the wrong data you originally read.
Use an InputStreamReader with a FileInputStream to ensure you read the data in the correct encoding, and make sure that when you print data to the console you do so using the encoding that the console expects.
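The garbled text from the question can be reproduced in a few lines, which makes the mismatch easier to see. In this sketch the class MojibakeDemo and its method are hypothetical, and ISO-8859-1 stands in for the single-byte encoding (it maps the two bytes involved here to the same characters as Windows-1252).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: decodes UTF-8 bytes with a single-byte charset to reproduce the garbled text.
final class MojibakeDemo {
    static String decodeUtf8BytesAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Each byte becomes one character, so a two-byte UTF-8 sequence turns into two characters.
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00e9 is the letter e with an acute accent; escapes keep this source file pure ASCII.
        String garbled = decodeUtf8BytesAsLatin1("C\u00e9leste");
        // \u00c3\u00a9 is the two-character sequence seen in the question's output.
        System.out.println(garbled.equals("C\u00c3\u00a9leste"));
    }
}
```

The accented letter occupies two bytes in UTF-8 (0xC3 0xA9), and a single-byte decoder turns each of them into its own character, which is exactly the pattern reported in the question.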

Android read file encoding issue

I'm trying to read a file from the SD card and I've been told it's in unicode format. However, when I try to read the file I get the following:
This is the code I'm using to read the file:
InputStreamReader fw = new InputStreamReader(new FileInputStream(root.getAbsolutePath()+"/Drive/sdk/cmd.62.out"), "UTF-8");
char[] buf = new char[255];
fw.read(buf);
String readString = new String(buf);
Log.d("courierread",readString);
fw.close();
If I write that output to a file this is what I get when I open it in a hex editor:
Any thoughts on what I need to do to read the file correctly?

Does the file have a byte-order mark? In that case look at Reading UTF-8 - BOM marker
EDIT (from comment): That looks like little-endian UTF-16 to me. Try the charset "UTF-16LE".

The file you show in the hex editor is not UTF-8 encoded, it looks more like UTF-16. This means you must specify UTF-16 as the encoding in your code (probably the UTF-16LE variant).
If it were UTF-8 encoded, then it would represent all characters representable in ASCII using just a single byte.
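The difference is easy to see by dumping the bytes of a short ASCII-range string in both encodings. The sketch below uses hypothetical names (Utf16LeDemo, toHex, decodeUtf16Le), and the sample text used with it is made up rather than taken from the file in the question.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: compares how ASCII-range text is laid out in UTF-8 and UTF-16LE.
final class Utf16LeDemo {
    // Formats bytes the way a hex editor would show them, e.g. "63 00 6D 00".
    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b));
        }
        return hex.toString().trim();
    }

    // Decodes bytes as little-endian UTF-16, the charset suggested for the file in the question.
    static String decodeUtf16Le(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }
}
```

For the sample text "cmd", toHex gives 63 6D 64 for UTF-8 and 63 00 6D 00 64 00 for UTF-16LE. A hex dump in which every other byte is 00 therefore points to UTF-16, and decoding it with the matching charset recovers the text.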
