Setting Charset for String


I use forbiddenapis to check my code. It gives an error:
[forbiddenapis] Forbidden class/interface use: java.lang.String#<init>(byte[])
[forbiddenapis] in org.a.b.MyObject (MyObject.java:14)
Which points to:
String finalString = new String(((ByteArrayOutputStream) out).toByteArray());
How can I resolve it? I know that I can set a Charset, i.e.:
Charset.forName("UTF-8").encode(myString);
However, since the input here is a byte array, which charset should I use to avoid problems with non-ASCII characters?

You need to know which charset the bytes were encoded with in the first place. If you're confident it will always be UTF-8, you can use the String constructor that takes a charset:
new String(bytes, StandardCharsets.UTF_8)
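Applied to the line from the question, that could look like the sketch below. It assumes the bytes in the ByteArrayOutputStream really are UTF-8; the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {

    // Decodes the buffered bytes as UTF-8 instead of the platform default.
    static String asUtf8(ByteArrayOutputStream out) {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        out.write(data, 0, data.length);
        System.out.println(asUtf8(out));
    }
}
```

Because the charset is named explicitly, the call no longer matches the forbidden String(byte[]) signature.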

Do not use FileReader. It is an old utility class that reads files in the platform default encoding, so it is not suited for files that move between machines and makes the code unportable.
String, Reader and Writer hold Unicode text. When converting from or to byte[], InputStream or OutputStream, you need to state the encoding of those bytes, which are binary data.
String s = new String(bytes, charset);
byte[] bytes = s.getBytes(charset);
It seems that the message mentions FileReader and complains about its
new String(bytes);
which uses the default encoding, as would:
string.getBytes();
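To see why the charset matters, here is a small sketch (names are hypothetical) that encodes text with one charset and decodes it with another. When the two differ, the text comes back changed, which is what can happen silently when the default encoding differs between machines.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {

    // Encodes text with one charset and decodes the resulting bytes with another.
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // e with acute accent
        // Same charset both ways: the text survives.
        System.out.println(text.equals(
                recode(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
        // Mismatched charsets: the two UTF-8 bytes are read as two Latin-1 characters.
        System.out.println(
                recode(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).length());
    }
}
```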

Related

When do I need to specify the encoding while writing the file to the disk?

I have a sample method which copies one file to another using InputStream and OutputStream. In this case, the source file is encoded in UTF-8. Even if I don't specify the encoding while writing to the disk, the destination file has the correct encoding. But if I have to write a java.lang.String to a file, I need to specify the encoding. Why is that?
public static void copyFile() {
    String sourceFilePath = "C://my_encoded.txt";
    InputStream inStream = null;
    OutputStream outStream = null;
    try {
        String targetFilePath = "C://my_target.txt";
        File sourcefile = new File(sourceFilePath);
        outStream = new FileOutputStream(targetFilePath);
        inStream = new FileInputStream(sourcefile);
        byte[] buffer = new byte[1024];
        int length;
        // copy the file content in bytes
        while ((length = inStream.read(buffer)) > 0) {
            outStream.write(buffer, 0, length);
        }
        inStream.close();
        outStream.close();
        System.out.println("File " + targetFilePath + " is copied successful!");
    } catch (IOException e) {
        e.printStackTrace();
    }
}
My guess is that since the source file has the correct encoding and since we read and write one byte at a time, it works fine. And java.lang.String is UTF-16 by default, so if we write it to the file, it reads one byte at a time instead of 2 bytes, hence the garbage values. Is that correct, or am I completely wrong in my understanding?

You are copying the file byte per byte, so you don't need to care about character encoding.
As a rule of thumb:
Use the various InputStream and OutputStream implementations for byte-wise processing (like file copy).
There are some convenience methods to handle text directly like PrintStream.println(). Be careful because most of them use the default platform specific encoding.
Use the various Reader and Writer implementations for reading and writing text.
If you need to convert between byte-wise and text processing use InputStreamReader and OutputStreamWriter with explicit file encoding.
Do not rely on the default encoding. The default character encoding is platform specific (e.g. Windows-ANSI aka Cp1252 for Windows, usually UTF-8 on Linux).
Example: If you need to read a UTF-8 text file:
BufferedReader reader =
new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF-8"));
Avoid using a FileReader, because a FileReader always uses the default encoding.
A special case: If you need random access to a file you should use RandomAccessFile. With it you can read and write data blocks at arbitrary positions. You can read and write raw byte blocks or you can use convenience methods to read and write text. But you should read the documentation carefully. E.g. the methods readUTF() and writeUTF() use a modified UTF-8 encoding.
InputStream, OutputStream, Reader, Writer and RandomAccessFile form the basic IO functionality, enough for most use cases. For advanced IO (e.g. memory mapped files, ...) have a look at package java.nio.
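As a rough illustration of the "explicit encoding" rule, here is a sketch that writes a String to a file and reads it back, naming UTF-8 on both sides. The class and method names are invented for the example, and it uses Files.newOutputStream / Files.newInputStream rather than FileOutputStream / FileInputStream, which is only a matter of taste.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class TextFileRoundTrip {

    // Writes text to a file, encoding the characters as UTF-8.
    static void writeUtf8(Path file, String text) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(Files.newOutputStream(file), StandardCharsets.UTF_8))) {
            writer.write(text);
        }
    }

    // Reads the first line of a file, decoding the bytes as UTF-8.
    static String readFirstLineUtf8(Path file) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("charset-demo", ".txt");
        try {
            writeUtf8(file, "Gr\u00fc\u00dfe");
            System.out.println(readFirstLineUtf8(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The try-with-resources blocks close the streams even when an exception is thrown, which the manual close() calls in the question do not.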

Just read your code! (For the copy part at least ;-) )
When you copy the file, you copy it byte by byte, so there is no conversion to String.
When you write a String into a file, you need to convert it (sometimes indirectly) into a byte array (byte[]). That is where you need to specify your encoding.
When you read a file to get a String, you need to know its encoding in order to do it properly. Java doesn't 'skip' any bytes, but you need to make a conversion once again: from a byte[] to a String.

Send textfile over socket using writer/reader?

Is there a way to send a textfile from client to server using XXXwriter and XXXreader instead of sending bytes?
Any suggestions?

You can wrap the InputStream in an InputStreamReader, and the OutputStream in an OutputStreamWriter. These classes bridge binary (byte[], *Stream) from/to java's Unicode text (String, char, *Reader, *Writer). Use the constructor with the correct encoding.
Charset encoding = StandardCharsets.UTF_8;
// or, by name: String encoding = "Windows-1252";
... new InputStreamReader(inputStream, encoding);
This assumes, however, that the stream transfer itself is done correctly. Possible errors are:
forgetting to close, so that not all data is transferred;
use of available(), which is not needed;
using a buffer to read and not writing the actual number of bytes read, which leaves old data at the end.
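A minimal sketch of the wrapping, assuming both sides agree on UTF-8. The class and method names are hypothetical; in a real program you would pass socket.getOutputStream() and socket.getInputStream() where the raw streams are expected.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TextOverStreams {

    // Sender side: wraps the raw stream in a Writer and sends one line per entry.
    static void sendLines(OutputStream rawOut, List<String> lines) throws IOException {
        BufferedWriter writer =
                new BufferedWriter(new OutputStreamWriter(rawOut, StandardCharsets.UTF_8));
        for (String line : lines) {
            writer.write(line);
            writer.write('\n');
        }
        writer.flush(); // push buffered characters to the underlying stream
    }

    // Receiver side: wraps the raw stream in a Reader and collects lines until end of stream.
    static List<String> receiveLines(InputStream rawIn) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(rawIn, StandardCharsets.UTF_8));
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) { // null means end of stream
            lines.add(line);
        }
        return lines;
    }
}
```

Note that receiveLines only returns once the sender closes the connection (or shuts down its output), which ties in with the "forgetting to close" pitfall above.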

How to convert String encoded in windows-1250/Cp1250 to utf-8?

As the title says.
I read the content from an HTTP response:
InputStream is = response.getEntity().getContent();
String cw = IOUtils.toString(is);
byte[] b = cw.getBytes("Cp1250");
String x = StringUtils.newStringUtf8(b);
String content = new String(b, "UTF-8");
System.out.println(content);
I have tried plenty of variations. I am a little confused about which encoding names are the correct ones to pass as strings: windows-1250 or Cp1250? UTF-8, utf-8 or utf8?

You seem to think that a String object has an encoding. That's not correct. An encoding is used as part of the translation from binary data (a byte[] or InputStream) to text data (a String or char[] etc).
It's not clear what IOUtils.toString is doing, but it's almost certainly losing data or at least handling it inappropriately. If your data is originally in Windows-1250, then you should use an InputStreamReader wrapping the InputStream, specifying the charset in the InputStreamReader constructor call.
It's not clear where UTF-8 comes in - you might want to write out the data in UTF-8 afterwards, but the result of that would be byte[], not a string.

You're converting backwards. You need to get the input data as a byte array and then use new String(byteArray, "Cp1250") to create the String object. Then, if you want UTF-8, use String.getBytes("UTF-8").
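Put together, the two steps could look like this sketch. The class name is made up, and it assumes the JVM provides the windows-1250 charset, which standard JDK builds do but stripped-down runtime images may not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1250ToUtf8 {

    // Assumption: this JVM ships the windows-1250 charset.
    private static final Charset WINDOWS_1250 = Charset.forName("windows-1250");

    // Re-encodes windows-1250 bytes as UTF-8 bytes by going through a String.
    static byte[] toUtf8(byte[] cp1250Bytes) {
        String text = new String(cp1250Bytes, WINDOWS_1250); // bytes -> text
        return text.getBytes(StandardCharsets.UTF_8);        // text -> bytes
    }
}
```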

An encoding has one canonical (unique) name plus a number of alternative names, and the names are case-insensitive. For instance, "UTF-8" is the canonical name, but some Java versions back it was "UTF8"; it was changed to match common usage. The same goes for "Windows-1250", which you may also see in HTML pages. "Cp1250" (code page) is a Java-internal name.
In Java, byte[] is binary data and String (internally Unicode) is text.
Conversion between the two needs an encoding. The encoding argument is often optional, in which case the operating system default is used.
byte, InputStream, OutputStream <-> String, char, Reader, Writer
String cw = IOUtils.toString(is, "UTF-8"); // InputStream is binary gives byte[], hence give encoding
byte[] b = cw.getBytes("Cp1250");
String x = new String(b, "Cp1250");
String content = x;
System.out.println(content);
To be able to hold text from any encoding, String internally uses char, that is UTF-16.
String constants are stored in the .class file as UTF-8 (more compact).
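If you want to check which spellings your JVM accepts, something like the following sketch can help. The helper name is invented; Charset.forName, name() and aliases() are standard java.nio.charset API.

```java
import java.nio.charset.Charset;

final class CharsetNames {

    // Resolves any accepted spelling to the charset's canonical name.
    static String canonicalName(String anyName) {
        return Charset.forName(anyName).name();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "utf-8", "UTF8", "utf8"}) {
            System.out.println(name + " -> " + canonicalName(name));
        }
        // Lists the alternative names this JVM knows for UTF-8.
        System.out.println(Charset.forName("UTF-8").aliases());
    }
}
```

An unknown name makes Charset.forName throw UnsupportedCharsetException, so a typo fails fast instead of silently falling back to the default.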

Assuming Apache Commons IO, use one of the methods that specifies an encoding:
String cw = IOUtils.toString(is, "windows-1250");
All strings are implicitly UTF-16 in Java. Other encodings are generally represented using byte arrays.

I find it better to use Scanner for reading in different charsets.
FileInputStream is = new FileInputStream(fileOrPath);
Scanner scanner = new Scanner(is, "cp1250");
String out = scanner.next();
The next() method returns an ordinary Java String, decoded from the charset given to the Scanner.
Tested on Czech text, converting from "cp1250" to "UTF-8".

Java character conversion to UTF-8

I am using:
InputStreamReader isr = new InputStreamReader(fis, "UTF8");
to read in characters from a text file and convert them to UTF-8 characters.
My question is: what happens if one of the characters being read cannot be converted to UTF-8? Will there be an exception, or will the character get dropped?

You are not converting from one charset to another. You are just indicating that the file is UTF-8 encoded so that you can read it correctly.
If you want to convert from one encoding to another, then you should do something like this:
File infile = new File("x-utf8.txt");
File outfile = new File("x-utf16.txt");
String fromEncoding="UTF-8";
String toEncoding="UTF-16";
Reader in = new InputStreamReader(new FileInputStream(infile), fromEncoding);
Writer out = new OutputStreamWriter(new FileOutputStream(outfile), toEncoding);
After going through David Gelhar's response, I feel this code can be improved a bit. If you don't know the encoding of the input file, use the GuessEncoding library to detect it and then construct the reader with the detected encoding.
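The snippet above only opens the reader and the writer. One possible way to finish the conversion is the copy loop sketched below, where the class and method names are hypothetical and the streams are passed in so that files, sockets or in-memory buffers all work.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

final class Transcoder {

    // Reads characters from 'in' using 'from' and writes them to 'out' using 'to'.
    static void transcode(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        try (Reader reader = new InputStreamReader(in, from);
             Writer writer = new OutputStreamWriter(out, to)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count); // write only the chars actually read
            }
        }
    }
}
```

Be aware that the "UTF-16" charset from the snippet writes a byte order mark when encoding; UTF-16BE or UTF-16LE do not.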

If the input file contains bytes that are not valid utf-8, read() will by default replace the invalid characters with a value of U+FFFD (65533 decimal; the Unicode "replacement character").
If you need more control over this behavior, you can use:
InputStreamReader(InputStream in, CharsetDecoder dec)
and supply a CharsetDecoder configured to your liking.
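For example, if you would rather get an exception than a silent U+FFFD, a decoder configured with CodingErrorAction.REPORT should do it. This is a sketch with made-up class and method names, not the only way to set it up.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Reader {

    // Reads all text from 'in'; invalid UTF-8 causes a CharacterCodingException
    // (a subclass of IOException) instead of being replaced.
    static String readStrict(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }
}
```

CodingErrorAction.IGNORE would drop the invalid bytes instead, and CodingErrorAction.REPLACE gives the default replacement behaviour described above.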

Java: convert UTF8 String to byte array in another encoding

I have a UTF-8 encoded String, but I need to post parameters to a Runtime process in cp1251. How can I convert the String or byte array?
I need something like: bytesInCp1251 = encodeTo(stringInUtf8, "cp1251");
Thanks to all! This is my own solution:
OutputStreamWriter writer = new OutputStreamWriter(out, "cp1251");
writer.write(s);

There is no such thing as a "UTF-8 encoded String" in Java. Java Strings use UTF-16 internally, but should be seen as an abstraction without a specific encoding. If you have a String, it's already decoded. If you want to encode it, use string.getBytes(encoding). If your original data is UTF-8, you have to take that into account when you convert that data from bytes to String.

byte[] bytesInCp1251 = stringInUtf8.getBytes("cp1251");
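A slightly fuller sketch of the same call, using a Charset object instead of a name so there is no checked UnsupportedEncodingException to handle. The class name is hypothetical, and it assumes the JVM ships windows-1251, which is the canonical name behind "cp1251".

```java
import java.nio.charset.Charset;

final class Cp1251Encoder {

    // Assumption: this JVM ships the windows-1251 charset.
    private static final Charset WINDOWS_1251 = Charset.forName("windows-1251");

    // Encodes text to windows-1251 bytes. Characters that windows-1251 cannot
    // represent are replaced with the charset's replacement byte ('?').
    static byte[] encode(String text) {
        return text.getBytes(WINDOWS_1251);
    }
}
```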

This is the solution:
OutputStreamWriter writer = new OutputStreamWriter(out, "cp1251");
writer.write(s);
