How to convert a String encoded in windows-1250/Cp1250 to UTF-8? - java

As the title says ...
I read content from an HTTP response:
InputStream is = response.getEntity().getContent();
String cw = IOUtils.toString(is);
byte[] b = cw.getBytes("Cp1250");
String x = StringUtils.newStringUtf8(b);
String content = new String(b, "UTF-8");
System.out.println(content);
I have tried plenty of variations. I am a little confused about which encoding names are correct when used as strings: windows-1250 or Cp1250? UTF-8, utf-8 or utf8?

You seem to think that a String object has an encoding. That's not correct. An encoding is used as part of the translation from binary data (a byte[] or InputStream) to text data (a String or char[] etc).
IOUtils.toString(is) without a charset argument decodes using the platform default charset, so it's almost certainly losing data or at least handling it inappropriately. If your data is originally in Windows-1250, then you should use an InputStreamReader wrapping the InputStream, specifying the charset in the InputStreamReader constructor call.
It's not clear where UTF-8 comes in - you might want to write out the data in UTF-8 afterwards, but the result of that would be byte[], not a string.
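For instance, a minimal sketch of that approach (readWindows1250 is a hypothetical helper name; response is the HTTP response object from the question):

import java.io.*;
import java.nio.charset.Charset;

static String readWindows1250(InputStream is) throws IOException {
    // Decode the Windows-1250 bytes while reading; the resulting String
    // is plain Unicode text with no encoding attached to it.
    StringBuilder sb = new StringBuilder();
    try (Reader reader = new InputStreamReader(is, Charset.forName("windows-1250"))) {
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
    }
    return sb.toString();
}

String content = readWindows1250(response.getEntity().getContent());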

You're converting backwards. You need to get the input data as a byte array and then use new String(byteArray, "Cp1250") to create the String object. Then if you want UTF-8 bytes, use String.getBytes("UTF-8").
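A sketch of that order of operations, assuming Apache Commons IO (which the question already uses) for reading the raw bytes:

import java.io.InputStream;
import org.apache.commons.io.IOUtils;

InputStream is = response.getEntity().getContent();
byte[] raw = IOUtils.toByteArray(is);    // the bytes exactly as the server sent them
String text = new String(raw, "Cp1250"); // decode Windows-1250 bytes -> Unicode String
byte[] utf8 = text.getBytes("UTF-8");    // encode Unicode String -> UTF-8 bytes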

Encodings have a canonical (unique) name and a number of aliases, all matched case-insensitively. For instance "UTF-8" is the canonical name, but some Java versions back it was "UTF8"; it was later brought in line with common usage. The same holds for "windows-1250", which you may also see in HTML pages. "Cp1250" (Code Page 1250) is a Java-internal name.
In Java, byte[] holds binary data and String (internally Unicode) holds text.
Converting between the two needs an encoding, though the parameter is often optional, in which case the operating-system default is taken.
byte, InputStream, OutputStream <-> String, char, Reader, Writer
String cw = IOUtils.toString(is, "UTF-8"); // InputStream is binary and yields byte[], hence the encoding must be given
byte[] b = cw.getBytes("Cp1250");
String x = new String(b, "Cp1250");
String content = x;
System.out.println(content);
To make String universal with respect to encodings, it internally uses char values, in UTF-16.
String constants are stored in the .class file as UTF-8 (more compact).

Assuming Apache Commons IO, use one of the methods that specifies an encoding:
String cw = IOUtils.toString(is, "windows-1250");
All strings are implicitly UTF-16 in Java. Other encodings are generally represented using byte arrays.

I find it better to use Scanner for reading in different charsets.
FileInputStream is = new FileInputStream(fileOrPath);
Scanner scanner = new Scanner(is, "cp1250");
String out = scanner.next();
The next() method then returns a String value as ordinary Unicode text.
Tested on Czech-language text converted from "cp1250" to "UTF-8".
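Note that next() only returns a single token. If the goal is to read the whole stream, one option is to set the delimiter to the start-of-input anchor \A, so a single next() returns everything; a sketch with a hypothetical file path:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

try (Scanner scanner = new Scanner(new FileInputStream("input-cp1250.txt"), "cp1250")) {
    scanner.useDelimiter("\\A"); // matches the start of input, so next() reads to the end
    String out = scanner.hasNext() ? scanner.next() : "";
    System.out.println(out);
} catch (IOException e) {
    e.printStackTrace();
}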

Related

Java changes special characters when using FileReader

I have a problem with Java: I have a file with ASCII encoding, and when I write its contents to the output file, it changes special characters that I need to keep:
(Original and output file samples omitted.)
The code I use reads the ASCII file into a string with a length of 7000. The problem is that when it reaches the special characters, at positions 486 to 498 within the frame (the string), the FileReader does not bring them through correctly: it changes them into other characters instead of keeping them (as I understand it, that part is binary):
FileReader fr = new FileReader(sourceFile);
//BufferedReader br = new BufferedReader(fr);
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream(sourceFile), "UTF-8"));
String asciiString;
asciiString = br.readLine();
Edit:
I am doing a conversion from ASCII to EBCDIC. I am using CharFormatConverter.java
I really don't understand why the special characters are lost and not maintained. I found the UTF-8 code in another forum, but characters are still lost. Read file utf-8
Edit:
I was thinking about using FileReader for the ASCII data and FileInputStream for the binary data in the same file (though I can't figure out how to extract it by position), so as to keep the two formats separate and then merge them after the conversion.
Regards.
If the info in your file is binary and not textual, you cannot read it as a String, and no charset will help you. A charset is a scheme that tells you how to map a particular character to a numeric code and vice versa; if your info is not textual, a charset won't help you. You will need to read your info as binary - a sequence of bytes - and write it the same way. You will need an InputStream implementation that reads the info as binary; in your case a good candidate is FileInputStream, but other options could be used.
Since your base code (CharFormatConverter) is byte-oriented, and it looks like your input files are binary, you should replace Readers by InputStreams, which produce bytes (not characters).
This is the ordinary way to read and process an InputStream:
private void convertFileToEbcdic(File sourceFile)
    throws IOException
{
    try (InputStream input = new FileInputStream(sourceFile))
    {
        byte[] buffer = new byte[4096];
        int len;
        do {
            len = input.read(buffer);
            if (len > 0)
            {
                byte[] ebcdic = convertBufferFromAsciiToEbcdic(buffer, len);
                // Now ebcdic contains the buffer converted to EBCDIC. You may use it.
            }
        } while (len >= 0);
    }
}

private byte[] convertBufferFromAsciiToEbcdic(byte[] ascii, int length)
{
    // Create an array of the same length as the input and fill it
    // with the input data converted to EBCDIC (one possible
    // implementation is sketched below).
    byte[] ebcdic = new byte[length];
    // ... fill ebcdic from ascii ...
    return ebcdic;
}
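If a hand-written table is not required, one way to fill in that method is to let Java's own charsets do the conversion: decode the bytes as US-ASCII and re-encode them with an EBCDIC code page. A sketch, assuming Cp037 is the EBCDIC variant your target system expects:

import java.nio.charset.Charset;

private byte[] convertBufferFromAsciiToEbcdic(byte[] ascii, int length)
{
    // Cp037 (IBM037) ships with the JDK; substitute the code page
    // your mainframe actually uses.
    String text = new String(ascii, 0, length, Charset.forName("US-ASCII"));
    return text.getBytes(Charset.forName("Cp037"));
}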

Setting Charset for String

I use forbiddenapis to check my code. It gives an error:
[forbiddenapis] Forbidden class/interface use: java.lang.String#<init>(byte[])
[forbiddenapis] in org.a.b.MyObject (MyObject.java:14)
Which points to:
String finalString = new String(((ByteArrayOutputStream) out).toByteArray());
How can I resolve it? I know that I can set a Charset, i.e.:
Charset.forName("UTF-8").encode(myString);
However since there is used byte, which charset should I use to avoid a problem with different characters?
You'll need insight into the charset with which the bytes were encoded in the first place. If you're confident it'd always be UTF8, you could just use the String constructor:
new String(bytes, StandardCharsets.UTF_8)
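Applied to the flagged line, a minimal sketch (assuming the stream really does hold UTF-8 text; StandardCharsets avoids both the forbidden default-charset constructor and the checked UnsupportedEncodingException):

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

byte[] bytes = ((ByteArrayOutputStream) out).toByteArray();
String finalString = new String(bytes, StandardCharsets.UTF_8); // charset is explicit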
Do not use FileReader. This is an old utility class to read files in the default platform encoding. That is not suited for portable files. The code is unportable.
String / Reader / Writer holds Unicode. When converting from byte[] / InputStream / OutputStream one needs to indicate the encoding of those bytes, binary data.
String s = new String(bytes, charset);
byte[] bytes = s.getBytes(charset);
It seems that the message complains about
new String(bytes);
which uses the default encoding, as would:
string.getBytes();

Convert UTF-16 to UTF-8 strings with data loss using Java

I have to insert text which is 99.9% UTF-8 but has 0.01% UTF-16 characters. So when I try to save it in my MySQL database using Hibernate and Spring, an exception occurs. I can even remove these chars, there is no problem, so I want to convert all my text to UTF-8 and save it to my database with data loss, so that the problem chars are removed. I tried
String string = "😈 Devil Emoji";
byte[] converttoBytes = string.getBytes("UTF-16");
string = new String(converttoBytes, "UTF-8");
System.out.println(string);
But nothing happens.
😈 Devil Emoji
Is there any external library in order to do that?
😈 probably has nothing to do with UTF-16. Its hex is F09F9888. Notice that that is 4 bytes. Also notice that that is a UTF-8 encoding, not a "Unicode" encoding: the character is U+1F608 (or \u1F608). UTF-16 would be none of the above. More (scarfboy).
MySQL's utf8 handles only 3-byte (or shorter) UTF-8 characters. MySQL's utf8mb4 also handles 4-byte characters like that little devil.
You need to change the CHARACTER SET of the column you are storing him into. And you need to establish that your connection is charset=UTF-8.
Note: things outside MySQL call it UTF-8, but MySQL calls it utf8mb4.
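A sketch of the two changes, with hypothetical table/column names and MySQL Connector/J assumed as the driver:

// 1. Tell Connector/J to send/receive UTF-8 on the connection:
String url = "jdbc:mysql://localhost:3306/mydb"
           + "?useUnicode=true&characterEncoding=UTF-8";

// 2. Make the column 4-byte capable (run once, e.g. in a migration):
//    ALTER TABLE posts MODIFY body TEXT CHARACTER SET utf8mb4;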
String holds Unicode in java, so all scripts can be combined.
byte[] converttoBytes = string.getBytes("UTF-16");
These bytes are binary data, but they are actually used to store text encoded in UTF-16.
string = new String(converttoBytes, "UTF-8");
Now the String constructor assumes that the bytes represent text encoded in UTF-8, and converts them accordingly. This is wrong.
Detecting the encoding, either UTF-8 or UTF-16, is best done on the bytes, not on a String, as that String has already gone through an erroneous conversion, with possible loss.
As UTF-8 has the stricter format of the two, we'll check that one.
Also, UTF-16 encodes ASCII characters with a 0 byte, which almost never occurs in normal text.
So something like
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public static String string(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        String s = decoder.decode(buffer).toString();
        if (!s.contains("\u0000")) { // No NUL bytes, so unlikely to be UTF-16
            return s;
        }
    } catch (CharacterCodingException e) { // Not valid UTF-8
    }
    return new String(bytes, StandardCharsets.UTF_16LE);
}
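For instance, fed the devil emoji's UTF-8 bytes, the method returns the text unchanged:

byte[] utf8Devil = "😈 Devil Emoji".getBytes(StandardCharsets.UTF_8);
System.out.println(string(utf8Devil)); // valid UTF-8, no NULs: decoded as UTF-8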
If you only have a String (for instance from the database), then
if (!s.contains("\u0000")) { // Could be UTF-16
s = new String(s.getBytes("Windows-1252"), "UTF-16LE");
}
might work or make a larger mess.

StringBufferInputStream Question in Java

I want to read an input string and return it as a UTF-8 encoded string. So I found an example on the Oracle/Sun website that used FileInputStream. I didn't want to read a file but a string, so I changed it to StringBufferInputStream and used the code below. The method parameter, jtext, is some Japanese text. Actually this method works great. The question is about the deprecated code. I had to add @SuppressWarnings because StringBufferInputStream is deprecated. I want to know: is there a better way to get a string input stream? Is it OK just to leave it as is? I've spent so long trying to fix this problem that I don't want to change anything now that I seem to have cracked it.
@SuppressWarnings("deprecation")
private String readInput(String jtext) {
    StringBuffer buffer = new StringBuffer();
    try {
        StringBufferInputStream sbis = new StringBufferInputStream(jtext);
        InputStreamReader isr = new InputStreamReader(sbis, "UTF8");
        Reader in = new BufferedReader(isr);
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
        }
        in.close();
        return buffer.toString();
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
I think I found a solution - of sorts:
private String readInput(String jtext) {
    String n;
    try {
        n = new String(jtext.getBytes("8859_1"));
        return n;
    } catch (UnsupportedEncodingException e) {
        return null;
    }
}
Before, I was desperately using getBytes("UTF8"). But by chance I used Latin-1, "8859_1", and it worked. Why it worked, I can't fathom. This is what I did step-by-step:
OpenOffice CSV(utf8)------>SQLite(utf8, apparently)------->java encoded as Latin-1, somehow readable.
The reason that StringBufferInputStream is deprecated is because it is fundamentally broken ... for anything other than Strings consisting entirely of Latin-1 characters. According to the javadoc it "encodes" characters by simply chopping off the top 8 bits! You don't want to use it if your application needs to handle Unicode, etc correctly.
If you want to create an InputStream from a String, then the correct way to do it is to use String.getBytes(...) to turn the String into a byte array, and then wrap that in a ByteArrayInputStream. (Make sure that you choose an appropriate encoding!).
But your sample application immediately takes the InputStream, converts it to a Reader and then adds a BufferedReader. If this is your real aim, then a simpler and more efficient approach is simply this:
Reader in = new StringReader(jtext);
This avoids the unnecessary encoding and decoding of the String, and also the "buffer" layer which serves no useful purpose in this case.
(A buffered stream is much more efficient than an unbuffered stream if you are doing small I/O operations on a file, network or console stream. But for a stream that is served from an in-memory data structure the benefits are much smaller, and possibly even negative.)
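Putting that together, the whole method from the question collapses to a sketch like this (same result, no deprecated class, no encode/decode round trip):

import java.io.*;

private String readInput(String jtext) {
    StringBuilder buffer = new StringBuilder();
    try (Reader in = new StringReader(jtext)) {
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
        }
        return buffer.toString();
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}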
FOLLOWUP
I realized what you are trying to do now ... work around a character encoding / decoding issue.
My advice would be to try to figure out definitively the actual encoding of the character data that is being delivered by the database, then make sure that the JDBC drivers are configured to use the same encoding. Trying to undo the mis-translation by encoding with one encoding and decoding with another is dodgy, and can give you only a partial correction of the problems.
You also need to consider the possibility that the characters got mangled on the way into the database. If this is the case, then you may be unable to de-mangle them.
Is this what you are trying to do? Here is a previous answer on a similar question. I am not sure why you want to convert a String to exactly the same String.
A Java String holds a sequence of chars in which each char represents a Unicode number. So it is possible to construct the same string from two different byte sequences, say one encoded with UTF-8 and the other encoded with US-ASCII.
If you want to write it to a file, you can always convert it with String.getBytes("encoding");
private static String readInput(String jtext) {
    byte[] bytes = jtext.getBytes();
    try {
        String string = new String(bytes, "UTF-8");
        return string;
    } catch (UnsupportedEncodingException ex) {
        // do something
        return null;
    }
}
Update
Here is my assumption.
According to your comment, your SQLite DB stores text values using one encoding, say UTF-16. For some reason, your SQLite API cannot determine which encoding was used to encode the Unicode values into the sequence of bytes.
So when you use the getString method from your SQLite API, it reads a set of bytes from your DB and converts them into a Java String using an incorrect encoding. If this is the case, you should use the getBytes method and reconstruct the String yourself, i.e. new String(bytes, "encoding used in your DB"). If your DB is stored in UTF-16, then new String(bytes, "UTF-16") should be readable.
Update
I wasn't talking about the getBytes method on the String class. I was talking about the getBytes method on your SQL result object, e.g. result.getBytes(String columnLabel).
ResultSet result = .... // from SQL query
String readableString = readInput(result.getBytes("my_table_column"));
You will need to change the signature of your readInput method to
private static String readInput(byte[] bytes) {
    try {
        // Change the encoding to your DB encoding;
        // this can be UTF-8, UTF-16, 8859_1, etc.
        String string = new String(bytes, "UTF-8");
        return string;
    } catch (UnsupportedEncodingException ex) {
        // do something; at least return the (possibly garbled) text
        return new String(bytes);
    }
}
Whatever encoding you set here that makes your String readable is, by definition, the encoding of your column in the DB. This involves no unexplainable phenomenon, and you know exactly what your column encoding is.
But it would be better to configure your JDBC driver to use the correct encoding so that you will not need this readInput method to convert at all.
If no encoding makes your string readable, you will need to consider the possibility that the characters got mangled when they were written to the DB, as @Stephen C said. If that is the case, a workaround method may cause you to lose some of the characters during conversion, and you will need to solve the encoding problem on the writing side as well.
The StringReader class is the new alternative to the deprecated StringBufferInputStream class.
However, you state that what you actually want to do is take an existing String and return it encoded as UTF-8. You should be able to do that much more simply I expect. Something like:
byte[] s8 = jtext.getBytes("UTF8"); // the UTF-8 encoded form of the text

Java: convert UTF8 String to byte array in another encoding

I have UTF8 encoded String, but I need to post parameters to Runtime process in cp1251. How can I decode String or byte array?
I need something like: bytesInCp1251 = encodeTo(stringInUtf8, "cp1251");
Thanks to all! This is my own solution:
OutputStreamWriter writer = new OutputStreamWriter(out, "cp1251");
writer.write(s);
There is no such thing as a "UTF8 encoded String" in Java. Java Strings use UTF-16 internally, but should be seen as an abstraction without a specific encoding. If you have a String, it's already decoded. If you want to encode it, use string.getBytes(encoding). If your original data is UTF-8, you have to take that into account when you convert that data from bytes to a String.
byte[] bytesInCp1251 = stringInUtf8.getBytes("cp1251");
