Java cyrillic encoding - java

I have an input string, UAH;"Ãîëüô 855229-7", which should be displayed as UAH;"Гольф 855229-7". I'm trying to use Cp1251 encoding, but I get the output UAH;"????? 855229-7".
String cyrillic = row[0] + row[1];
String utf8String = new String(cyrillic.getBytes("Cp1251"), "UTF-8");
lbl1.setText(utf8String);

UTF-8 has nothing to do with this. All of the characters in cyrillic are being represented as single bytes.
Currently, those bytes are being interpreted as ISO 8859-1, also known as Latin-1, which is effectively a subset of the Windows Western European code page, Cp1252. So, you want to encode the string as Cp1252, then decode the resulting bytes as Cp1251:
String corrected8String = new String(cyrillic.getBytes("Cp1252"), "Cp1251");
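As a sketch, here is the whole round trip using the mojibake from the question (class name is arbitrary; the garbled literal is written with \u escapes so the snippet compiles regardless of the source file's own encoding):

```java
import java.nio.charset.Charset;

public class FixMojibake {
    public static void main(String[] args) {
        // "Гольф" wrongly decoded as windows-1252 shows up as "Ãîëüô"
        String garbled = "\u00C3\u00EE\u00EB\u00FC\u00F4"; // Ãîëüô
        // Encode with the charset that produced the garbage, then decode
        // the same bytes with the charset the text was really in.
        String fixed = new String(garbled.getBytes(Charset.forName("windows-1252")),
                                  Charset.forName("windows-1251"));
        System.out.println(fixed); // Гольф
    }
}
```

The getBytes(Charset) and String(byte[], Charset) overloads avoid the checked UnsupportedEncodingException that the String-name overloads declare.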

Related

Convert Utf-16 to UTF-8 strings with data losing using Java

I have to insert text which is 99.9% UTF-8 but has 0.01% UTF-16 characters. So when I try to save it in my MySQL database using Hibernate and Spring, an exception occurs. I can even remove these characters, that is no problem, so I want to convert all my text to UTF-8 and save it to my database, accepting data loss, so that the problem characters are removed. I tried
String string = "😈 Devil Emoji";
byte[] converttoBytes = string.getBytes("UTF-16");
string = new String(converttoBytes, "UTF-8");
System.out.println(string);
But nothing happens.
😈 Devil Emoji
Is there any external library in order to do that?
😈 probably has nothing to do with UTF-16. Its hex (in UTF-8) is F0 9F 98 88. Notice that that is 4 bytes. Also notice that that is a UTF-8 encoding, not a "Unicode" code point: the code point is U+1F608. UTF-16 would be none of the above. More (scarfboy).
MySQL's utf8 handles only 3-byte (or shorter) UTF-8 characters. MySQL's utf8mb4 also handles 4-byte characters like that little devil.
You need to change the CHARACTER SET of the column you are storing it in. And you need to establish that your connection uses charset=UTF-8.
Note: things outside MySQL call it UTF-8, but MySQL calls it utf8mb4.
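The 4-byte point is easy to verify in Java itself (a small sketch with an arbitrary class name; the emoji is written as a surrogate pair so the file stays ASCII):

```java
import java.nio.charset.StandardCharsets;

public class DevilBytes {
    public static void main(String[] args) {
        String devil = "\uD83D\uDE08"; // U+1F608 as a UTF-16 surrogate pair
        // 4 UTF-8 bytes: exactly what MySQL's 3-byte utf8 rejects
        // and utf8mb4 accepts.
        System.out.println(devil.getBytes(StandardCharsets.UTF_8).length); // 4
        System.out.println(devil.length());                          // 2 UTF-16 code units
        System.out.println(devil.codePointCount(0, devil.length())); // 1 character
    }
}
```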
A String holds Unicode in Java, so all scripts can be combined.
byte[] converttoBytes = string.getBytes("UTF-16");
These bytes are binary data, but actually used to store text, encoded in UTF-16.
string = new String(converttoBytes, "UTF-8");
Now the String constructor assumes that the bytes represent text encoded in UTF-8, and converts them accordingly. This is wrong.
To detect the encoding, either UTF-8 or UTF-16, the check is best done on the bytes, not on a String, since such a String has already gone through an erroneous conversion with possible loss.
As UTF-8 has the stricter format of the two, we'll check that one first.
Also, UTF-16 encodes every ASCII character with a zero byte, which almost never occurs in normal text.
So something like
// Requires: java.nio.ByteBuffer, java.nio.charset.CharacterCodingException,
// java.nio.charset.CharsetDecoder, java.nio.charset.CodingErrorAction,
// java.nio.charset.StandardCharsets
public static String string(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        String s = decoder.decode(buffer).toString();
        if (!s.contains("\u0000")) { // no NULs, so unlikely to be UTF-16
            return s;
        }
    } catch (CharacterCodingException e) { // not valid UTF-8; fall through
    }
    // StandardCharsets avoids the checked UnsupportedEncodingException
    return new String(bytes, StandardCharsets.UTF_16LE);
}
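The zero-byte heuristic can be seen directly by dumping the bytes of plain ASCII text in both encodings (a small illustrative sketch, class name arbitrary):

```java
import java.nio.charset.StandardCharsets;

public class ZeroBytes {
    public static void main(String[] args) {
        // ASCII text in UTF-16LE interleaves each character with a zero byte,
        // which is the signal the detection above relies on.
        byte[] utf16le = "Hi".getBytes(StandardCharsets.UTF_16LE);
        for (byte b : utf16le) System.out.printf("%02X ", b & 0xFF); // 48 00 69 00
        System.out.println();
        byte[] utf8 = "Hi".getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) System.out.printf("%02X ", b & 0xFF);    // 48 69
        System.out.println();
    }
}
```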
If you only have a String (for instance from the database), then
if (!s.contains("\u0000")) { // Could be UTF-16
s = new String(s.getBytes("Windows-1252"), "UTF-16LE");
}
might work or make a larger mess.

How does Java print a String?

Java uses a char array to store a String, and String uses UTF-16 to store characters.
For my ubuntu:
$ echo $LANG
en_US.UTF-8
If the encoding of my java source file is UTF-8, and the main content is:
System.out.println("你好");
你好 means "hello". With UTF-8, 你 and 好 each need 3 bytes of storage. With UTF-16, each needs 2 bytes.
When 你好 is printed to screen, is the data that Java sends to Linux OS encoded with UTF-8 or UTF-16 ?
System.out is a PrintStream, which in turn uses a StreamEncoder to encode the String (at least in Java 6).
The StreamEncoder is asked to use the encoding the OS expects. So in your case, it outputs in UTF-8.
String text = "你好";
byte[] array = text.getBytes("UTF-8");
String s = new String(array, Charset.forName("UTF-8"));
System.out.println(s);
You can try the same with UTF-16 if you want UTF-16 output.
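To see exactly which bytes each encoding produces for 你好, a small sketch (class name arbitrary; the characters are written as \u escapes to keep the file ASCII):

```java
import java.nio.charset.StandardCharsets;

public class HelloBytes {
    public static void main(String[] args) {
        String text = "\u4F60\u597D"; // 你好
        // UTF-8: 3 bytes per character here
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // 6
        // UTF-16 without a byte order mark: 2 bytes per character
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length); // 4
        // getBytes with plain "UTF-16" prepends a 2-byte byte order mark
        System.out.println(text.getBytes(StandardCharsets.UTF_16).length);   // 6
    }
}
```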

Java functions to encode Windows-1252 to UTF-8 getting the same symbol

I am new to this forum. I have a problem with the conversion between the Windows-1252 encoding and UTF-8.
I have a string encoded in Windows-1252 (e.g. containing the character ¢). I would like to obtain the same symbol, but encoded in UTF-8. I mean: the source character and the destination character should always appear the same (¢), but with different encodings.
Is it possible? In addition: is there a Java function which performs this conversion automatically (e.g. by passing the source encoding and the target encoding)?
Thank you in advance for all of your help.
Hello,
Simone
You can transcode between various encodings using strings as an intermediary:
byte[] windows1252 = { (byte) 0xA2 };
String utf16 = new String(windows1252, Charset.forName("windows-1252"));
byte[] utf8 = utf16.getBytes(StandardCharsets.UTF_8);
char data is always UTF-16 in Java.
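A minimal sketch of that transcoding round trip (class name arbitrary), printing the resulting UTF-8 bytes in hex:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    public static void main(String[] args) {
        byte[] windows1252 = { (byte) 0xA2 };  // ¢ in windows-1252
        // Decode the windows-1252 bytes into a (UTF-16) String ...
        String s = new String(windows1252, Charset.forName("windows-1252"));
        // ... then encode that String as UTF-8.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF); // C2 A2
        }
        System.out.println();
    }
}
```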

Java Strings Character Encoding - For French - Dutch Locales

I have the following piece of code
public static void main(String[] args) throws UnsupportedEncodingException {
    System.out.println(Charset.defaultCharset().toString());
    String accentedE = "é";
    String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes("utf-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes());
    System.out.println(utf8);
}
The output of the above is as follows
windows-1252
é
?
é
é
Can someone help me understand what this does? Why this output?
If you already have a String, there is no need to encode and decode it right back, the string is already a result from someone having decoded raw bytes.
In the case of a string literal, the someone is the compiler reading your source as raw bytes and decoding it in the encoding you have specified to it. If you have physically saved your source file in Windows-1252 encoding, and the compiler decodes it as Windows-1252, all is well. If not, you need to fix this by declaring the correct encoding for the compiler to use when compiling your source...
The line
String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8"));
Does absolutely nothing. (Encode as UTF-8, Decode as UTF-8 == no-op)
The line
utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8"));
Encodes the string as Windows-1252, and then decodes it as UTF-8. The resulting bytes must only be decoded as Windows-1252 (because they are encoded in Windows-1252, duh), otherwise you will get strange results.
The line
utf8 = new String(accentedE.getBytes("utf-8"));
Encodes a string as UTF-8, and then decodes it as Windows-1252. Same principles apply as in previous case.
The line
utf8 = new String(accentedE.getBytes());
Does absolutely nothing. (Encode as Windows-1252, Decode as Windows-1252 == no-op)
Analogy with integers that might be easier to understand:
int a = 555;
//The case of encoding as X and decoding right back as X
a = Integer.parseInt(String.valueOf(a), 10);
//a is still 555
int b = 555;
//The case of encoding as X and decoding right back as Y
b = Integer.parseInt(String.valueOf(b), 15);
//b is now 1205, i.e. a strange result
Both of these are useless because we already have what we needed before doing any of the code, the integer 555.
There is a need for
encoding your string into raw bytes when it leaves your system and there is a need for decoding raw bytes into a string when they come into your system. There is no need to encode and decode right back within the system.
Line #1 - the default character set on your system is windows-1252.
Line #2 - you created a String by encoding a String literal to UTF-8 bytes, and then decoding it using the UTF-8 scheme. The result is correctly formed String, which can be output correctly using windows-1252 encoding.
Line #3 - you created a String by encoding a string literal as windows-1252, and then decoding it using UTF-8. The UTF-8 decoder has detected a sequence that cannot possibly be UTF-8, and has replaced the offending character with a question mark "?". (The UTF-8 format says that any byte with the top bit set is one byte of a multi-byte character. But the windows-1252 encoding of "é" is a single byte with the top bit set and no continuation bytes following it ... ergo, this is bad UTF-8.)
Line #4 - you created a String by encoding in UTF-8 and then decoding in windows-1252. In this case the decoding has not "failed", but it has produced garbage (aka mojibake). The reason you got 2 characters of output is that the UTF-8 encoding of "é" is a 2 byte sequence.
Line #5 - you created a String by encoding as windows-1252 and decoding as windows-1252. This produces the correct output.
And the overall lesson is that if you encode characters to bytes with one character encoding, and then decode with a different character encoding you are liable to get mangling of one form or another.
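That lesson can be demonstrated in a few lines (class name arbitrary): encode "é" as UTF-8, decode as windows-1252, and you get the classic "é" mojibake; reversing the mistake recovers the original, though only because every byte here happens to map to a windows-1252 character:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        Charset win1252 = Charset.forName("windows-1252");
        String original = "\u00E9"; // é
        // Encode as UTF-8 (C3 A9), decode as windows-1252: classic mangling.
        String mangled = new String(original.getBytes(StandardCharsets.UTF_8), win1252);
        System.out.println(mangled); // é
        // Undo the mistake: re-encode as windows-1252, decode as UTF-8.
        String recovered = new String(mangled.getBytes(win1252), StandardCharsets.UTF_8);
        System.out.println(recovered); // é
    }
}
```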
When you call the String getBytes method, it:
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
So whenever you do:
accentedE.getBytes()
it takes the contents of accentedE String as bytes encoded in the default OS code page, in your case cp-1252.
This line:
new String(accentedE.getBytes(), Charset.forName("UTF-8"))
takes the accentedE bytes (encoded in cp1252) and tries to decode them in UTF-8, hence the error. The same situation from the other side for:
new String(accentedE.getBytes("utf-8"))
The getBytes method takes the accentedE bytes encoded in cp-1252 and re-encodes them in UTF-8, but then the String constructor decodes those bytes using the default OS codepage, which is cp-1252.
Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.
I strongly recommend reading this excellent article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
UPDATE:
In short, every character is stored as a number. In order to know which character is which number, the OS uses code pages. Consider the following snippet:
String accentedE = "é";
System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[0]));
System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[1]));
System.out.println(String.format("%02X ", accentedE.getBytes("windows-1252")[0]));
which outputs:
C3
A9
E9
That is because a small accented e in UTF-8 is stored as two bytes with the value C3 A9, while in cp-1252 it is stored as a single byte with the value E9. For a detailed explanation, read the linked article.

Character Encoding in Java

In eclipse, I changed the default encoding to ISO-8859-1. Then I wrote this:
String str = "Русский язык ";
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
ps.print(str);
It should print the String correctly, as I am specifying UTF-8 encoding. However, it is not printing.
The ISO-8859-1 character encoding only supports characters between 0 and 255, and anything else is likely to be turned into '?'
If you save the source file (the .java file) as ISO-8859-1, then str will be decoded by javac using ISO-8859-1. Your problem does not lie in the creation of the PrintStream: the str you are printing is wrong from the beginning.
Yes, it looks like the terminal that you are sending this output to does not support this encoding.
If you are running Eclipse, you could set the encoding as follows:
In Run Configurations...->Common ->Encoding->Other
Select UTF-8
You are basically telling the PrintStream writer to expect the input characters to be UTF-8 encoded and to output them as UTF-8. There is no conversion. If you set your IDE to use ISO-8859-1 as the character encoding for your file, which in turn contains the input string, then you pipe ISO-8859-1 encoded characters into a UTF-8-expecting writer. So the writer treats the bytes it receives as UTF-8 encoded characters, which results in junk data.
Either set your IDE to encode your source files in UTF-8 and check that your characters are correctly displayed and stored, or tell your writer to treat them as ISO-8859-1; either way should do.
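One way to sidestep the source-file encoding entirely is to write the literal with \u escapes, which are plain ASCII and therefore read the same under any source encoding (a sketch of the snippet from the question, class name arbitrary):

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Cyrillic {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // \uXXXX escapes are ASCII, so javac decodes them identically
        // whether the file is saved as ISO-8859-1 or UTF-8.
        String str = "\u0420\u0443\u0441\u0441\u043A\u0438\u0439"; // Русский
        PrintStream ps = new PrintStream(System.out, true, "UTF-8");
        ps.println(str); // displays correctly on a UTF-8 terminal
    }
}
```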
