I am trying to send a message that contains English and Russian text, but the Russian text is displayed as "?? ???????"
PrintWriter writer = new PrintWriter(clientSocket.getOutputStream());
writer.println("English" + "На русском");
writer.flush();
Is there a way to fix this?
So, your problem has nothing to do with IntelliJ IDEA.
The PrintWriter constructor that accepts only an OutputStream creates a PrintWriter that uses the default character encoding of the JVM. You can check what that default is by invoking Charset.defaultCharset(), but you should not rely on it having any particular value, either at the sending end or at the receiving end. It is best to either set the default character encoding of the JVM explicitly, or to supply a specific character encoding when creating your PrintWriter. The following should do it:
Charset charset = StandardCharsets.UTF_8;
OutputStreamWriter osw = new OutputStreamWriter(outputStream, charset); // encode explicitly, independent of the JVM default
PrintWriter printWriter = new PrintWriter(new BufferedWriter(osw));
If you are in control of both the sending side and the receiving side, then you should make the corresponding change on the other side, creating the InputStreamReader with the same Charset.
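For example, the receiving end could mirror the sender like this (a minimal sketch; clientSocket and the use of readLine() are assumptions based on the snippet above):
Charset charset = StandardCharsets.UTF_8;
InputStreamReader isr = new InputStreamReader(clientSocket.getInputStream(), charset);
BufferedReader reader = new BufferedReader(isr);
String line = reader.readLine(); // decoded with the same charset the sender used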
Related
Japanese characters display as the actual content when I read the file through an InputStreamReader object (specifying the charset the file was encoded with), but when I inspect the physical file itself, it contains junk characters.
I have a few questions below; please help me understand.
Why does the physical file turn to junk, e.g. ¥Ô¡¼¥¿¡¼¡¦¥¸¥ç¡¼¥º when opened with Notepad (the same happens in OpenOffice Calc), while a reader set to the correct charset displays the actual Japanese characters?
While reading with an InputStreamReader, if I specify a charset other than the one the file was encoded with, the actual content turns to junk, e.g. �ԡ����������硼��. So, as per my understanding, the encoding charset and the decoding charset must be the same.
I have checked this answer, String encoding conversion UTF-8 to SHIFT-JIS. But what I need to know is: supposing the file was encoded with a different charset, is it possible to display the actual content if we decode it as UTF-8?
OutputStream os = new FileOutputStream("japanese.txt");
OutputStreamWriter osw = new OutputStreamWriter(os, "EUC-JP");
osw.write("ピーター・ジョーズ");
osw.flush();
osw.close();

InputStream ir = new FileInputStream("japanese.txt");
InputStreamReader isr = new InputStreamReader(ir, "EUC-JP");
int i = isr.read();
while (i != -1) {
    System.out.print((char) i);
    i = isr.read();
}
isr.close();
Encoding and decoding both EUC-JP:
ピーター・ジョーズ
Encoding EUC-JP, decoding UTF-8:
�ԡ����������硼��
I have an input file in ANSI (UNIX) file format, and I convert that file to UTF-8.
Before converting to UTF-8, the input file contains a special character like this:
»
After converting to UTF-8, it becomes:
û
When I process my file as-is, without converting to UTF-8, all the special characters disappear and data is lost as well.
But when I process my file after converting to UTF-8, all the data appears in the output file, with the special character looking the same as it does after the UTF-8 conversion.
ANSI to UTF-8 (this could be wrong; please correct me if I'm wrong somewhere):
FileInputStream fis = new FileInputStream("inputtextfile.txt");
InputStreamReader isr = new InputStreamReader(fis, "ISO-8859-1");
Reader in = new BufferedReader(isr);
FileOutputStream fos = new FileOutputStream("outputfile.txt");
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
Writer out = new BufferedWriter(osw);
out.write("\uFEFF"); // write a UTF-8 byte order mark
int ch;
while ((ch = in.read()) > -1) {
    out.write(ch);
}
out.close();
in.close();
After this, I process my file further for the final output.
I'm using the Talend ETL tool (a Java-based ETL tool) to create the final output from the generated UTF-8 file.
What I want is to process my file so that I get the same special characters in the output as in the input file.
I'm using Java 1.8 for all this processing. I'm stuck in this situation and have never dealt with special characters like this before.
Any suggestion would be helpful.
I use forbiddenapis to check my code. It gives an error:
[forbiddenapis] Forbidden class/interface use: java.lang.String#<init>(byte[])
[forbiddenapis] in org.a.b.MyObject (MyObject.java:14)
Which points to:
String finalString = new String(((ByteArrayOutputStream) out).toByteArray());
How can I resolve it? I know that I can set a Charset, e.g.:
Charset.forName("UTF-8").encode(myString);
However, since this operates on raw bytes, which charset should I use to avoid problems with different characters?
You'll need insight into the charset with which the bytes were encoded in the first place. If you're confident it will always be UTF-8, you can just use the String constructor:
new String(bytes, StandardCharsets.UTF_8)
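Applied to the flagged line above, that would be (a sketch, assuming the bytes really are UTF-8):
String finalString = new String(((ByteArrayOutputStream) out).toByteArray(), StandardCharsets.UTF_8);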
Do not use FileReader. It is an old utility class that reads files in the default platform encoding, which is not suited for portable files; such code is unportable.
String / Reader / Writer hold Unicode text. When converting to or from byte[] / InputStream / OutputStream, one needs to indicate the encoding of that binary data.
String s = new String(bytes, charset);
byte[] bytes = s.getBytes(charset);
It seems that the message mentions FileReader and complains about its
new String(bytes);
which uses the default encoding, as would:
string.getBytes();
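To avoid the default platform encoding entirely, be explicit in all three places (a sketch, assuming UTF-8 is the intended encoding; file is a placeholder):
Reader reader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8); // instead of new FileReader(file)
String s = new String(bytes, StandardCharsets.UTF_8); // instead of new String(bytes)
byte[] encoded = s.getBytes(StandardCharsets.UTF_8); // instead of string.getBytes()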
If I were to create an InputStreamReader with the following code,
new InputStreamReader(anInputStream, "UTF-8")
I would have to catch UnsupportedEncodingException, which is reasonable. I can avoid this by using
new InputStreamReader(anInputStream, StandardCharsets.UTF_8)
which doesn't throw UnsupportedEncodingException as the charset is already known to be valid. All good so far.
Now enter its counterpart, the PrintWriter:
new PrintWriter("filename", StandardCharsets.UTF_8)
doesn't compile because the PrintWriter constructor doesn't take a Charset argument. I can do this:
new PrintWriter("filename", StandardCharsets.UTF_8.name())
but then I can't avoid having to catch UnsupportedEncodingException, even though the charset name has just come from a valid charset.
The StandardCharsets utility class was added later on in Java's lifetime, and InputStreamReader has a constructor overload that accepts a Charset directly. Why is there such an overload for InputStreamReader but not for PrintWriter?
Is there another class I can use instead, which takes a charset instead of a charset name?
The counterpart to InputStreamReader is not PrintWriter.
Use OutputStreamWriter instead.
If you want to use PrintWriter, you can wrap an OutputStreamWriter: new PrintWriter(new OutputStreamWriter(anOutputStream, StandardCharsets.UTF_8));
The counterpart of java.io.InputStreamReader is java.io.OutputStreamWriter, not java.io.PrintWriter.
That said, you can create the PrintWriter safely like this:
Reader reader = new InputStreamReader(anyInputStream, StandardCharsets.UTF_8);
Writer writer = new OutputStreamWriter(anyOutputStream, StandardCharsets.UTF_8);
PrintWriter printWriter = new PrintWriter(writer);
"but then I can't avoid having to catch UnsupportedEncodingException, even though the charset name has just come from a valid charset."
Which makes sense, right? Since it's still a String.
As suggested by Stewart, using the java.io.OutputStreamWriter would be the way to go.
new PrintWriter(new OutputStreamWriter(anOutputStream, StandardCharsets.UTF_8), isAutoFlush)
I am using:
InputStreamReader isr = new InputStreamReader(fis, "UTF8");
to read characters from a text file, decoding them as UTF-8.
My question is: what happens if one of the characters being read cannot be decoded as UTF-8? Will there be an exception, or will the character get dropped?
You are not converting from one charset to another. You are just indicating that the file is UTF-8 encoded so that you can read it correctly.
If you want to convert from one encoding to another, then you should do something like the following:
File infile = new File("x-utf8.txt");
File outfile = new File("x-utf16.txt");
String fromEncoding="UTF-8";
String toEncoding="UTF-16";
Reader in = new InputStreamReader(new FileInputStream(infile), fromEncoding);
Writer out = new OutputStreamWriter(new FileOutputStream(outfile), toEncoding);
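The setup above only creates the reader and writer; a minimal copy loop to perform the actual conversion could look like this (a sketch building on the declarations above):
int ch;
while ((ch = in.read()) != -1) {
    out.write(ch); // each character is re-encoded as UTF-16 on write
}
in.close();
out.close();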
After going through David Gelhar's response, I feel this code can be improved a bit. If you don't know the encoding of the inFile, use the GuessEncoding library to detect it, and then construct the reader with the detected encoding.
If the input file contains bytes that are not valid utf-8, read() will by default replace the invalid characters with a value of U+FFFD (65533 decimal; the Unicode "replacement character").
If you need more control over this behavior, you can use:
InputStreamReader(InputStream in, CharsetDecoder dec)
and supply a CharsetDecoder configured to your liking.
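For example, to fail fast instead of silently substituting U+FFFD (a sketch; the file name is a placeholder):
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
Reader reader = new InputStreamReader(new FileInputStream("input.txt"), decoder);
// reader.read() will now throw a MalformedInputException on invalid input bytes.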