Convert UTF-16 to UTF-8 strings with data loss using Java - java

I have to insert text which is 99.9% UTF-8 but has 0.01% UTF-16 characters. So when I try to save it in my MySQL database using Hibernate and Spring, an exception occurs. I can even remove these chars, there is no problem, so I want to convert all my text to UTF-8 and save it to my database with data loss, so that the problem chars are removed. I tried
String string = "😈 Devil Emoji";
byte[] converttoBytes = string.getBytes("UTF-16");
string = new String(converttoBytes, "UTF-8");
System.out.println(string);
But nothing happens.
😈 Devil Emoji
Is there any external library in order to do that?

😈 probably has nothing to do with UTF-16. Its hex is F0 9F 98 88. Notice that that is 4 bytes. Also notice that that is a UTF-8 encoding, not a "Unicode" code point, which would be U+1F608. UTF-16 would be none of the above. More (scarfboy).
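To see this from Java (a small illustrative snippet; the emoji is written as its surrogate-pair literal so the source-file encoding doesn't matter):
String devil = "\uD83D\uDE08"; // U+1F608
System.out.println(Integer.toHexString(devil.codePointAt(0))); // 1f608
for (byte b : devil.getBytes(StandardCharsets.UTF_8)) {
    System.out.printf("%02X ", b); // F0 9F 98 88
}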
MySQL's utf8 handles only 3-byte (or shorter) UTF-8 characters. MySQL's utf8mb4 also handles 4-byte characters like that little devil.
You need to change the CHARACTER SET of the column you are storing him into. And you need to establish that your connection is charset=UTF-8.
Note: things outside MySQL call it UTF-8, but MySQL calls it utf8mb4.
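For illustration, a rough sketch of both pieces (the table and column names are made up, and the exact Connector/J properties you need can vary by driver version):
// Run once against the database so the column can hold 4-byte characters:
//   ALTER TABLE messages MODIFY body TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

// And make sure the JDBC connection itself talks UTF-8 (MySQL Connector/J style URL):
String url = "jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=UTF-8";
// With Hibernate/Spring this usually goes into the datasource / hibernate.connection.url configuration.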

String holds Unicode in Java, so all scripts can be combined.
byte[] converttoBytes = string.getBytes("UTF-16");
These bytes are binary data, but actually used to store text, encoded in UTF-16.
string = new String(converttoBytes, "UTF-8");
Now String thinks that the bytes represent text encoded in UTF-8, and converts those. This is wrong.
To detect the encoding, either UTF-8 or UTF-16, it is best to work on the bytes, not on a String, because by then the String has already gone through an erroneous conversion with possible loss.
As UTF-8 has the stricter format of the two, we'll check that one.
Also, UTF-16 encodes ASCII characters with a 0 byte, which almost never occurs in normal text.
So something like
public static String string(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        String s = decoder.decode(buffer).toString();
        if (!s.contains("\u0000")) { // No 0 chars, so not mis-read UTF-16
            return s;
        }
    } catch (CharacterCodingException e) { // Not valid UTF-8
    }
    return new String(bytes, StandardCharsets.UTF_16LE);
}
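For example, assuming the helper above:
System.out.println(string("héllo".getBytes(StandardCharsets.UTF_8)));    // héllo (valid UTF-8, no 0 chars)
System.out.println(string("hello".getBytes(StandardCharsets.UTF_16LE))); // hello (0 bytes present, so decoded as UTF-16LE)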
If you only have a String (for instance from the database), then
if (s.contains("\u0000")) { // 0 chars: probably mis-read UTF-16
    s = new String(s.getBytes(Charset.forName("windows-1252")), StandardCharsets.UTF_16LE);
}
might work or make a larger mess.


How to detect encoding mismatch

I have a bunch of old AES-encrypted Strings encrypted roughly like this:
String is converted to bytes with ISO-8859-1 encoding
Bytes are encrypted with AES
Result is converted to BASE64 encoded char array
Now I would like to change the encoding to UTF-8 for new values (e.g. '€' does not work with ISO-8859-1). This will of course cause problems if I try to decrypt the old ISO-8859-1 encoded values with UTF-8 encoding:
org.junit.ComparisonFailure: expected:<!#[¤%&/()=?^*ÄÖÖÅ_:;>½§#${[]}<|'äöå-.,+´¨]'-Lorem ipsum dolor ...> but was:<!#[�%&/()=?^*����_:;>��#${[]}<|'���-.,+��]'-Lorem ipsum dolor ...>
I'm thinking of creating some automatic encoding fallback for this.
So the main question is: is it enough to inspect the decrypted char array for '�' characters to figure out an encoding mismatch? And what is the 'correct' way to declare that '�' symbol when comparing?
if (new String(utf8decryptedCharArray).contains("�")) {
// Revert to doing the decrypting with ISO-8859-1
decryptAsISO...
}
When decrypting, you get back the original byte sequence (result of your step 1), and then you can only guess whether these bytes denote characters according to the ISO-8859-1 or the UTF-8 encoding.
From a byte sequence, there's no way to clearly tell how it is to be interpreted.
A few ideas:
You could migrate all the old encrypted strings (decrypt, decode to string using ISO-8859-1, encode to byte array using UTF-8, encrypt). Then the problem is solved once and forever.
You could try to decode the byte array in both versions, see if one version is illegal, or if both versions are equal, and if it is still ambiguous, take the one with the higher probability according to expected characters. I wouldn't recommend going that way, as it needs a lot of work and there's still some probability of error.
For the new entries, you could prepend the string / byte sequence with some marker that doesn't appear in ISO-8859-1 text. E.g. some people follow the convention of prepending a Byte Order Mark at the beginning of UTF-8 encoded files. Although the resulting bytes (EF BB BF) aren't strictly illegal in ISO-8859-1 (being read as ï»¿), they are highly unlikely. Then, when your decrypted bytes start with EF BB BF, decode to string using UTF-8, otherwise using ISO-8859-1. Still, there's a non-zero probability of error. (A sketch of this follows below.)
If ever possible, I'd go for migrating the existing entries. Otherwise, you'll have to carry on with "old-format compatibility stuff" in your code base forever, and still can't absolutely guarantee correct behaviour.
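A minimal sketch of that marker idea (the three byte values are the real UTF-8 BOM; where you hook these helpers into your encrypt/decrypt code depends on your setup):
private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

// New entries: prepend the marker to the UTF-8 bytes before encrypting.
static byte[] toTaggedBytes(String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    byte[] tagged = new byte[UTF8_BOM.length + utf8.length];
    System.arraycopy(UTF8_BOM, 0, tagged, 0, UTF8_BOM.length);
    System.arraycopy(utf8, 0, tagged, UTF8_BOM.length, utf8.length);
    return tagged;
}

// After decrypting: the marker means UTF-8, otherwise assume old ISO-8859-1 data.
static String fromTaggedBytes(byte[] bytes) {
    if (bytes.length >= 3 && bytes[0] == UTF8_BOM[0] && bytes[1] == UTF8_BOM[1] && bytes[2] == UTF8_BOM[2]) {
        return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
    }
    return new String(bytes, StandardCharsets.ISO_8859_1);
}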
When decoding bytes to text, don't rely on the � character to detect malformed input. Use a strict decoder. Here is a helper method for that:
static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
    return charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
}
Here is the corresponding strict encoder helper method, in case you need it:
static byte[] encodeStrict(String str, Charset charset) throws CharacterCodingException {
    ByteBuffer buf = charset.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(CharBuffer.wrap(str));
    byte[] bytes = buf.array();
    if (bytes.length == buf.limit())
        return bytes;
    return Arrays.copyOfRange(bytes, 0, buf.limit());
}
Since ISO-8859-1 allows all bytes, you can't use it to detect malformed input. UTF-8, however, is validating, so it is very likely to detect malformed input. It is not 100% guaranteed, but it's the best we can do.
So, try decoding using strict UTF-8, and then fall back to ISO-8859-1 if it fails:
static String decode(byte[] bytes) {
    try {
        return decodeStrict(bytes, StandardCharsets.UTF_8);
    } catch (CharacterCodingException e) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
Test
System.out.println(decode("señor".getBytes(StandardCharsets.ISO_8859_1))); // prints: señor
System.out.println(decode("señor".getBytes(StandardCharsets.UTF_8))); // prints: señor
System.out.println(decode("€100".getBytes(StandardCharsets.UTF_8))); // prints: €100

Base64.Decoder returning foreign characters

I am building a small application to turn the text in a text file to Base64 then back to normal. The decoded text always returns some Chinese characters in the beginning of the first line.
public EncryptionEngine(File appFile) {
    this.appFile = appFile;
}

public void encrypt() {
    try {
        byte[] fileText = Files.readAllBytes(appFile.toPath()); // get file text as bytes
        Base64.Encoder encoder = Base64.getEncoder();
        PrintWriter writer = new PrintWriter(appFile);
        writer.print(""); // erase old, readable text
        writer.print(encoder.encodeToString(fileText)); // insert encoded text
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public void deycrpt() {
    try {
        byte[] fileText = Files.readAllBytes(appFile.toPath());
        String s = new String(fileText, StandardCharsets.UTF_8); // String s = new String(fileText);
        Base64.Decoder decoder = Base64.getDecoder();
        byte[] decodedByteArray = decoder.decode(s);
        PrintWriter writer = new PrintWriter(appFile);
        writer.print("");
        writer.print(new String(decodedByteArray, StandardCharsets.UTF_8)); // writer.print(new String(decodedByteArray));
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Text file before encrypt():
cheese
tomatoes
potatoes
hams
yams
Text file after encrypt():
//5jAGgAZQBlAHMAZQANAAoAdABvAG0AYQB0AG8AZQBzAA0ACgBwAG8AdABhAHQAbwBlAHMADQAKAGgAYQBtAHMADQAKAHkAYQBtAHMA
Text file after decrypt():
뿯붿cheese
tomatoes
potatoes
hams
yams
Your input file is UTF-16, not UTF-8. It begins with FF FE, the little-endian byte order mark. StandardCharsets.UTF_16 will handle this correctly. (Or instead, set your text editor to UTF-8 instead of UTF-16.)
When you decoded fffe as UTF-8, you got two replacement characters "��", one for each of the two bytes that was not valid in UTF-8. Then when you printed this out, each replacement character '�' was encoded as ef bf bd in UTF-8. Then you interpreted the result as UTF-16, taking them in groups of two, reading it as efbf bdef bfbd. The remainder of the file was UTF-16 the whole time, but the null bytes will safely round-trip.
(If the file had been ASCII text encoded as UTF-16 without a byte-order mark, you would not have noticed how broken this was!)
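For instance, reading the file as text with the BOM-aware UTF-16 charset (a small illustrative one-liner, not part of the original code) recognizes the leading FF FE instead of turning it into stray characters:
String text = new String(Files.readAllBytes(appFile.toPath()), StandardCharsets.UTF_16);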
Your encrypt and decrypt functions don't make the same assumptions. encrypt Base64-encodes any file and is just fine, except for the variable names and comments that suggest the file is a text file. It need not be.
decrypt reverses the Base64-encoded data back to bytes but then "overprocesses" by assuming that the bytes were text encoded with UTF-8, decoding them and re-encoding them before writing them to the file. If the assumption were true, it would just be a no-op; it's clearly not true in your case and it mangles the data.
Perhaps you did that because you were trying to use a PrintWriter. In Java (and .NET), the many stream and file I/O classes are often confusing, especially considering their decades-long evolution. Sometimes there is one that does exactly what you need, but it can be hard to find; other times, there isn't. And, sometimes, a commonly used library like Apache Commons fills the gap.
So, just write the bytes to the file. There are lots of modern and historical options, as explained in the answers to the question byte[] to file in Java. Here's one with Files.write:
Files.write(appFile.toPath(), decodedByteArray, StandardOpenOption.CREATE);
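Along those lines, a hedged rewrite of the decrypt step (staying at the byte level throughout; appFile is the same field as in the question):
public void decrypt() {
    try {
        byte[] base64Bytes = Files.readAllBytes(appFile.toPath());
        // Base64 is pure ASCII, so decoding straight from the raw bytes is safe:
        byte[] originalBytes = Base64.getDecoder().decode(base64Bytes);
        // Write the original bytes back untouched; no charset is assumed:
        Files.write(appFile.toPath(), originalBytes);
    } catch (IOException e) {
        e.printStackTrace();
    }
}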
Note: While Base64 possibly would have been considered encryption (and cracked) a couple of hundred years ago, it's not intended for that purpose. It's a bit dangerous (and confusing) to call it that.

Java functions to encode Windows-1252 to UTF-8 getting the same symbol

I am new to this forum. I have a problem with the conversion between the Windows-1252 encoding and UTF-8.
I have a string encoded in Windows-1252 (e.g. the character: ¢). I would like to obtain the same symbol, but encoded in UTF-8. I mean: the source character and the destination character should always appear the same (¢), but with different encodings.
Is it possible? In addition: is there a Java function which performs this conversion automatically (e.g. by passing the source encoding and the target encoding)?
Thank you in advance for all of your help.
Hello,
Simone
You can transcode between various encodings using strings as an intermediary:
byte[] windows1252 = { (byte) 0xA2 };
String utf16 = new String(windows1252, Charset.forName("windows-1252"));
byte[] utf8 = utf16.getBytes(StandardCharsets.UTF_8);
char data is always UTF-16 in Java.
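To check the result (¢ is U+00A2, which is the single byte A2 in windows-1252 and the two bytes C2 A2 in UTF-8):
for (byte b : utf8) {
    System.out.printf("%02X ", b); // prints: C2 A2
}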

Java Strings Character Encoding - For French - Dutch Locales

I have the following piece of code
public static void main(String[] args) throws UnsupportedEncodingException {
    System.out.println(Charset.defaultCharset().toString());
    String accentedE = "é";
    String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes("utf-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes());
    System.out.println(utf8);
}
The output of the above is as follows
windows-1252
é
?
é
é
Can someone help me understand what this does? Why this output?
If you already have a String, there is no need to encode and decode it right back, the string is already a result from someone having decoded raw bytes.
In the case of a string literal, the someone is the compiler reading your source as raw bytes and decoding it in the encoding you have specified to it. If you have physically saved your source file in Windows-1252 encoding, and the compiler decodes it as Windows-1252, all is well. If not, you need to fix this by declaring the correct encoding for the compiler to use when compiling your source...
The line
String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8"));
Does absolutely nothing. (Encode as UTF-8, Decode as UTF-8 == no-op)
The line
utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8"));
Encodes string as Windows-1252, and then decodes it as UTF-8. The result must only be decoded in Windows-1252 (because it is encoded in Windows-1252, duh), otherwise you will get strange results.
The line
utf8 = new String(accentedE.getBytes("utf-8"));
Encodes a string as UTF-8, and then decodes it as Windows-1252. Same principles apply as in previous case.
The line
utf8 = new String(accentedE.getBytes());
Does absolutely nothing. (Encode as Windows-1252, Decode as Windows-1252 == no-op)
Analogy with integers that might be easier to understand:
int a = 555;
//The case of encoding as X and decoding right back as X
a = Integer.parseInt(String.valueOf(a), 10);
//a is still 555
int b = 555;
//The case of encoding as X and decoding right back as Y
b = Integer.parseInt(String.valueOf(b), 15);
//b is now 1205 I.E. strange result
Both of these are useless because we already have what we needed before doing any of the code, the integer 555.
There is a need for encoding your string into raw bytes when it leaves your system, and there is a need for decoding raw bytes into a string when they come into your system. There is no need to encode and decode right back within the system.
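As a small illustration of encoding only at the boundary (the file name is made up):
String text = "é stays a plain Java String inside the program";
// Encode once, when the text leaves the program:
Files.write(Paths.get("out.txt"), text.getBytes(StandardCharsets.UTF_8));
// Decode once, when it comes back in:
String back = new String(Files.readAllBytes(Paths.get("out.txt")), StandardCharsets.UTF_8);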
Line #1 - the default character set on your system is windows-1252.
Line #2 - you created a String by encoding a String literal to UTF-8 bytes, and then decoding it using the UTF-8 scheme. The result is a correctly formed String, which can be output correctly using windows-1252 encoding.
Line #3 - you created a String by encoding a string literal as windows-1252, and then decoding it using UTF-8. The UTF-8 decoder has detected a sequence that cannot possibly be UTF-8, and has replaced the offending character with a question mark "?". (The UTF-8 format says that any byte that has the top bit set to 1 is part of a multi-byte character. But the windows-1252 encoding is just one byte long ... ergo, this is bad UTF-8.)
Line #4 - you created a String by encoding in UTF-8 and then decoding in windows-1252. In this case the decoding has not "failed", but it has produced garbage (aka mojibake). The reason you got 2 characters of output is that the UTF-8 encoding of "é" is a 2 byte sequence.
Line #5 - you created a String by encoding as windows-1252 and decoding as windows-1252. This produces the correct output.
And the overall lesson is that if you encode characters to bytes with one character encoding, and then decode with a different character encoding you are liable to get mangling of one form or another.
When you call String's getBytes method, it:
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
So whenever you do:
accentedE.getBytes()
it takes the contents of accentedE String as bytes encoded in the default OS code page, in your case cp-1252.
This line:
new String(accentedE.getBytes(), Charset.forName("UTF-8"))
takes the accentedE bytes (encoded in cp1252) and tries to decode them in UTF-8, hence the error. The same situation from the other side for:
new String(accentedE.getBytes("utf-8"))
The getBytes method takes the accentedE bytes encoded in cp-1252 and re-encodes them in UTF-8, but then the String constructor decodes those bytes with the default OS codepage, which is cp-1252.
Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.
I strongly recommend reading this excellent article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
UPDATE:
In short, every character is stored as a number. In order to know which character is which number the OS uses the codepages. Consider the following snippet:
String accentedE = "é";
System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[0]));
System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[1]));
System.out.println(String.format("%02X ", accentedE.getBytes("windows-1252")[0]));
which outputs:
C3
A9
E9
That is because small accented e in UTF-8 is stored as two bytes of value C3 A9, while in cp-1252 it is stored as a single byte of value E9. For a detailed explanation read the linked article.

How can I generate 'un-mappable' input for a Java CharsetDecoder?

I'm writing a set of unit tests for a text decoding class. I'd like to write a test that correctly exercises the handling of un-mappable input to a CharsetDecoder. However, I've struggled to initialize a byte buffer that does this. Example:
CharsetDecoder decoder = Charset.forName("utf-8").newDecoder();
decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
ByteBuffer in = ?
CharBuffer out = CharBuffer.allocate(256);
CoderResult result = decoder.decode(in, out, true);
assertTrue(result.isUnmappable());
How can I initialize the byte buffer (line 3) to pass the assertion (line 6)?
Things that don't work:
NULL characters (e.g. \u0000 encoded as utf-8)
Control characters (e.g. \u0001 encoded as utf-8)
Undefined characters (e.g. \u2065 encoded as utf-8)
Non-characters (e.g. \ufdd0 encoded as utf-8)
Private use characters (e.g. \ue000 encoded as utf-8)
Standalone combining characters (e.g. \u0305 encoded as utf-8).
I think that the unmappable character condition is relevant for encoding tasks only. Here, the character 256 (U+0100) is not defined in iso-8859-1:
public void testUnmappableCharacter() {
    CharsetEncoder encoder = Charset.forName("iso-8859-1").newEncoder();
    CharBuffer in = CharBuffer.wrap(new char[] { 256 });
    ByteBuffer out = ByteBuffer.allocate(1);
    CoderResult result = encoder.encode(in, out, false);
    System.out.println(result);
}
For UTF-8 decoding, the only error you'll be able to produce is a malformed-input condition: UTF-8 can encode every Unicode code point, so no byte sequence decodes to an unmappable character; invalid byte sequences are reported as malformed instead.
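A sketch of that malformed case, mirroring the question's snippet (0xFF can never appear in valid UTF-8):
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
ByteBuffer in = ByteBuffer.wrap(new byte[] { (byte) 0xFF });
CharBuffer out = CharBuffer.allocate(256);
CoderResult result = decoder.decode(in, out, true);
System.out.println(result.isMalformed());  // true
System.out.println(result.isUnmappable()); // false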
