Java character conversion to UTF-8

I am using:
InputStreamReader isr = new InputStreamReader(fis, "UTF8");
to read characters from a text file and convert them to UTF-8.
My question is: what happens if one of the characters being read cannot be converted to UTF-8? Will there be an exception, or will the character be dropped?

You are not converting from one charset to another. You are just indicating that the file is UTF-8 encoded so that you can read it correctly.
If you want to convert from one encoding to another, you should do something like this:
File infile = new File("x-utf8.txt");
File outfile = new File("x-utf16.txt");
String fromEncoding="UTF-8";
String toEncoding="UTF-16";
Reader in = new InputStreamReader(new FileInputStream(infile), fromEncoding);
Writer out = new OutputStreamWriter(new FileOutputStream(outfile), toEncoding);
After going through David Gelhar's response, I feel this code can be improved a bit. If you don't know the encoding of the "inFile", use the GuessEncoding library to detect it and then construct the reader with the detected encoding.
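The snippet above only opens the reader and writer; the actual copy might look like this minimal sketch, reusing the variables declared above (the buffer size is an illustrative assumption, and in real code you would close the streams in a finally block or use try-with-resources):
// Copy characters across; the decode/encode happens inside the reader and writer.
char[] buffer = new char[8192];
int read;
while ((read = in.read(buffer)) != -1) {
    out.write(buffer, 0, read);
}
in.close();
out.close();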

If the input file contains bytes that are not valid UTF-8, read() will by default replace each malformed sequence with U+FFFD (65533 decimal; the Unicode "replacement character").
If you need more control over this behavior, you can use:
InputStreamReader(InputStream in, CharsetDecoder dec)
and supply a CharsetDecoder configured to your liking.
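For example, a small sketch (assuming you would rather get an exception than silent replacement; the classes come from java.nio.charset):
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
Reader reader = new InputStreamReader(new FileInputStream("input.txt"), decoder);
// reader.read() now throws MalformedInputException on invalid UTF-8 bytes instead of substituting U+FFFD.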

Related

Java changes special characters when using FileReader

I have a problem in Java: I have a file with ASCII encoding, and when I write its contents to the output file, special characters that I need to keep get changed:
Original file:
Output file:
The code below reads the ASCII file into a string of length 7000. The problem is that when it reaches the special characters (positions 486 to 498 within the frame/string), FileReader does not read them correctly; it replaces them with other characters instead of keeping them (as I understand it, that portion is binary):
fr = new FileReader(sourceFile);
//BufferedReader br = new BufferedReader(fr);
BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(sourceFile), "UTF-8"));
String asciiString;
asciiString = br.readLine();
Edit:
I am doing a conversion from ASCII to EBCDIC. I am using CharFormatConverter.java
I really don't understand why the special characters are lost instead of being kept. I found the UTF-8 code in another forum, but characters are still lost. Read file utf-8
Edit:
I was thinking about using FileReader for the ASCII data and FileInputStream to get the binary data that is in the ASCII file (although I can't figure out how to extract it by position), so that the two formats are kept separate and can be merged again after the conversion.
Regards.
If the info in your file is binary rather than textual, you cannot read it as a String, and no charset will help you. A charset is a scheme that tells you how to map a particular character to a numeric code and vice versa; if your info is not textual, a charset is irrelevant. You will need to read the info as binary (a sequence of bytes) and write it back the same way. That means using an InputStream implementation that reads the data as bytes; in your case a good candidate is FileInputStream, though other options can be used.
Since your base code (CharFormatConverter) is byte-oriented, and it looks like your input files are binary, you should replace Readers with InputStreams, which produce bytes (not characters).
This is the ordinary way to read and process an InputStream:
private void convertFileToEbcdic(File sourceFile)
        throws IOException
{
    try (InputStream input = new FileInputStream(sourceFile))
    {
        byte[] buffer = new byte[4096];
        int len;
        do {
            len = input.read(buffer);
            if (len > 0)
            {
                byte[] ebcdic = convertBufferFromAsciiToEbcdic(buffer, len);
                // Now ebcdic contains the buffer converted to EBCDIC. You may use it.
            }
        } while (len >= 0);
    }
}

private byte[] convertBufferFromAsciiToEbcdic(byte[] ascii, int length)
{
    // Create an array of the same length as the received input
    // and fill it with the input data converted to EBCDIC.
}
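One possible way to fill in that stub, if you don't want to rely on CharFormatConverter: decode the bytes as ASCII and re-encode them with an EBCDIC charset. Note that "IBM037" is just one EBCDIC code page (typically available in the JDK) and may not match the mapping your converter expects; the classes come from java.nio.charset:
private byte[] convertBufferFromAsciiToEbcdic(byte[] ascii, int length)
{
    // Interpret the incoming bytes as ASCII text...
    String text = new String(ascii, 0, length, StandardCharsets.US_ASCII);
    // ...and re-encode that text using an EBCDIC code page.
    return text.getBytes(Charset.forName("IBM037"));
}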

Japanese character change to junk after writing to file in java

Japanese characters display correctly when I read the file through an InputStreamReader (specifying only the charset the file was encoded with). However, when I check the physical output file, it contains junk characters.
I have the questions below; please help me understand.
Why does the physical file turn to junk? e.g. ¥Ô¡¼¥¿¡¼¡¦¥¸¥ç¡¼¥º when opened with Notepad. Note: the same happens when I open it in OpenOffice Calc, but after setting the charset it displays the actual Japanese characters.
While reading with InputStreamReader, if I specify a charset other than the one the file was encoded with, the actual content turns to junk, e.g. �ԡ����������硼��. So, as per my understanding, the encoding charset and the decoding charset must be the same.
I have checked this answer: String encoding conversion UTF-8 to SHIFT-JIS. But what I need to know is: if the file was encoded with a different charset, is it possible to display the actual content by decoding it with UTF-8?
OutputStream os = new FileOutputStream("japanese.txt");
OutputStreamWriter osw = new OutputStreamWriter(os, "EUC-JP");
osw.write("ピーター・ジョーズ");
osw.flush();
osw.close();
InputStream ir = new FileInputStream("japanese.txt");
InputStreamReader isr = new InputStreamReader(ir, "EUC-JP");
int i = isr.read();
while (i != -1) {
    System.out.print((char) i);
    i = isr.read();
}
isr.close();
encoding & decoding - (EUC-JP)
ピーター・ジョーズ
encoding - EUC-JP : decoding - UTF-8
�ԡ����������硼��

How to read a file while preserving its encoding?

So, I have a file in ISO-8859-1 encoding. I do the following:
InputStreamReader isr = new InputStreamReader(new FileInputStream(fileLocation));
System.out.println(isr.getEncoding());
And I get UTF8... It looks like FileInputStream or InputStreamReader converts it to UTF-8.
Yes, I know about this way:
BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(fileLocation), "ISO-8859-1"));
But I don't know beforehand what encoding my file will have.
How can I read the file while preserving its encoding?
Files are just bytes; when those bytes are actually text in some encoding, the file unfortunately does not store that encoding (charset) anywhere.
Sometimes there is an encoding somewhere: Unicode text can have an optional BOM character at the beginning of the file, and HTML and XML can specify the charset.
If you downloaded the file from the internet, the charset may be mentioned in the header lines. Say it were an HTML file with Content-Type: text/html; charset=Windows-1251. Then you could read the file with Windows-1251 and always store it as UTF-8, modifying/adding a <meta charset="UTF-8">.
But in general there is no reliable way to determine a file's encoding. You could do:
read the bytes
if the bytes are convertible to UTF-8 without errors in the multibyte sequences, it is UTF-8 (see the sketch below)
otherwise it is a single byte encoding, default to Windows-1252 (rather than ISO-8859-1)
maybe use word frequency tables of some languages together with encodings, and try those
write the bytes in the determined encoding to file as UTF-8
There might be a library doing such a thing; combining language recognition and charset recognition.
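A minimal sketch of the "convertible to UTF-8" check mentioned above (the fallback charset and method name are illustrative assumptions; the classes come from java.nio and java.nio.charset):
// Attempt a strict UTF-8 decode; fall back to a single-byte encoding on failure.
static Charset guessCharset(byte[] bytes) {
    CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        utf8.decode(ByteBuffer.wrap(bytes));
        return StandardCharsets.UTF_8;          // decoded cleanly: treat as UTF-8
    } catch (CharacterCodingException e) {
        return Charset.forName("windows-1252"); // default single-byte fallback
    }
}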

Setting Charset for String

I use forbiddenapis to check my code. It gives an error:
[forbiddenapis] Forbidden class/interface use: java.lang.String#<init>(byte[])
[forbiddenapis] in org.a.b.MyObject (MyObject.java:14)
Which points to:
String finalString = new String(((ByteArrayOutputStream) out).toByteArray());
How can I resolve it? I know that I can set a Charset, i.e.:
Charset.forName("UTF-8").encode(myString);
However, since bytes are involved here, which charset should I use to avoid problems with different characters?
You'll need insight into the charset with which the bytes were encoded in the first place. If you're confident it will always be UTF-8, you can just use the String constructor:
new String(bytes, StandardCharsets.UTF_8)
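Applied to the flagged line above (assuming the bytes written to the stream really are UTF-8), that would be:
String finalString = new String(((ByteArrayOutputStream) out).toByteArray(), StandardCharsets.UTF_8);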
Do not use FileReader. It is an old utility class that reads files in the default platform encoding, which is not suited for portable files; such code is unportable.
String / Reader / Writer hold Unicode text. When converting to or from byte[] / InputStream / OutputStream, one needs to indicate the encoding of those bytes (binary data).
String s = new String(bytes, charset);
byte[] bytes = s.getBytes(charset);
It seems that the message mentions FileReader and complains about its
new String(bytes);
which uses the default encoding, as would:
string.getBytes();

Android read file encoding issue

I'm trying to read a file from the SD card and I've been told it's in unicode format. However, when I try to read the file I get the following:
This is the code I'm using to read the file:
InputStreamReader fw = new InputStreamReader(new FileInputStream(root.getAbsolutePath()+"/Drive/sdk/cmd.62.out"), "UTF-8");
char[] buf = new char[255];
fw.read(buf);
String readString = new String(buf);
Log.d("courierread",readString);
fw.close();
If I write that output to a file this is what I get when I open it in a hex editor:
Any thoughts on what I need to do to read the file correctly?
Does the file have a byte-order mark? In that case look at Reading UTF-8 - BOM marker
EDIT (from comment): That looks like little-endian UTF-16 to me. Try the charset "UTF-16LE".
The file you show in the hex editor is not UTF-8 encoded, it looks more like UTF-16. This means you must specify UTF-16 as the encoding in your code (probably the UTF-16LE variant).
If it were UTF-8 encoded, then it would represent all characters representable in ASCII using just a single byte.
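A minimal adjustment to the reading code from the question (assuming the file really is little-endian UTF-16; only the charset name and the use of the read count are changed):
InputStreamReader fw = new InputStreamReader(
        new FileInputStream(root.getAbsolutePath() + "/Drive/sdk/cmd.62.out"), "UTF-16LE");
char[] buf = new char[255];
int n = fw.read(buf);                        // number of chars actually read (assumes a non-empty file)
String readString = new String(buf, 0, n);   // avoids trailing '\0' characters from the unused buffer
Log.d("courierread", readString);
fw.close();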
