Java Encoding for "GB2312" CHARACTER ® replacing with question mark(?)

Java Encoding for "GB2312" CHARACTER ® replacing with question mark(?) - java

I'm trying to get encoded value using GB2312 characterset but I'm getting '? 'instead of '®'
Below is my sample code:
new String("Test ®".getBytes("GB2312"));
but I'm getting Test ? instead of Test ®.
Does any one faced this issue?
Java version- JDK6
Platform: Window 7
I'm not aware of Chinese character encoding so need suggestion.

For better understanding, the statement divided in two parts:
byte[] bytes = "Test ®".getBytes("GB2312"); // bytes, encoding the string to GB2312
new String(bytes); // back to string, using default encoding
Probably ® is not a valid GB2312 character, so it is converted to ?. See the result of
Charset.forName("GB2312").newEncoder().canEncode("®")
Based on documentation of getBytes:
The behavior of this method when this string cannot be encoded in the given charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
which also suggest using CharsetEncoder.

Related

Illegal base64 character "a" using java.util.Base64 from within Scala

Suppose I have the following Base64 encoded String from a github API call to a file:
LyoKICogQ29weXJpZ2h0IDIwMTkgY29tLmdpdGh1Yi50aGVvcnlkdWRlcwog
KgogKiBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNp
b24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKICogeW91IG1heSBub3QgdXNlIHRo
aXMgZmlsZSBleGNlcHQgaW4gY29tcGxpYW5jZSB3aXRoIHRoZSBMaWNlbnNl
LgogKiBZb3UgbWF5IG9idGFpbiBhIGNvcHkgb2YgdGhlIExpY2Vuc2UgYXQK
ICoKICogICAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNF
TlNFLTIuMAogKgogKiBVbmxlc3MgcmVxdWlyZWQgYnkgYXBwbGljYWJsZSBs
YXcgb3IgYWdyZWVkIHRvIGluIHdyaXRpbmcsIHNvZnR3YXJlCiAqIGRpc3Ry
aWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFu
ICJBUyBJUyIgQkFTSVMsCiAqIFdJVEhPVVQgV0FSUkFOVElFUyBPUiBDT05E
SVRJT05TIE9GIEFOWSBLSU5ELCBlaXRoZXIgZXhwcmVzcyBvciBpbXBsaWVk
LgogKiBTZWUgdGhlIExpY2Vuc2UgZm9yIHRoZSBzcGVjaWZpYyBsYW5ndWFn
ZSBnb3Zlcm5pbmcgcGVybWlzc2lvbnMgYW5kCiAqIGxpbWl0YXRpb25zIHVu
ZGVyIHRoZSBMaWNlbnNlLgogKi8KCnBhY2thZ2UgY29tLmdpdGh1Yi50aGVv
cnlkdWRlcy5tb2RlbAoKaW1wb3J0IGNvbS5naXRodWIudGhlb3J5ZHVkZXMu
dXRpbC5LaXZ5UHJldHR5UHJpbnRlcgppbXBvcnQgb3JnLmJpdGJ1Y2tldC5p
bmt5dG9uaWsua2lhbWEuPT0+CmltcG9ydCBvcmcuYml0YnVja2V0Lmlua3l0
b25pay5raWFtYS5yZXdyaXRpbmcuUmV3cml0ZXIuXwppbXBvcnQgb3JnLmJp
dGJ1Y2tldC5pbmt5dG9uaWsua2lhbWEucmV3cml0aW5nLlN0cmF0ZWd5Cgov
KioKICogQmFzZSBUeXBlIGZvciBhbGwgbm9kZXMgb2YgYSBLaXZ5LUFTVAog
Ki8KdHJhaXQgQVNUTm9kZSBleHRlbmRzIEZvbGRhYmxlQVNUIHsgc2VsZiA9
PgogIC8qKgogICAqIFRyYXZlcnNlcyB0aGUgQVNUTm9kZSBhbmQgYXBwbGll
cyBTdHJhdGVneSBgc2Agb250byBgc2VsZmAgYW5kIGFsbCBjaGlsZHJlbiBv
ZiBzZWxmLgogICAqCiAgICogYHNgIGlzIGhlcmVieSBhcHBsaWVkIGJvdHRv
bSB1cCBpbiBsZWZ0IHRvIHJpZ2h0IG9yZGVyLgogICAqCiAgICogQHNlZSBb
W2h0dHBzOi8vYml0YnVja2V0Lm9yZy9pbmt5dG9uaWsva2lhbWEvc3JjLzAz
MjYzMGZhMjFkZGFkNWNmMzNjYmQ2ZWY5YzJmMDI3ODY2MWE2NzUvd2lraS9S
ZXdyaXRpbmcubWRdXQogICAqIEBwYXJhbSBzIHN0cmF0ZWd5IHRoYXQgaXMg
YXBwbGllZCB0byBgc2VsZmAgYW5kIGFsbCBjaGlsZHJlbi4KICAgKiBAcmV0
dXJuIGEgcmV3cml0dGVuIEFTVE5vZGUgYWNjb3JkaW5nIHRvIHRoZSBzdHJh
dGVneSBgc2AKICAgKi8KICBwcml2YXRlW3RoZW9yeWR1ZGVzXSBkZWYgdHJh
dmVyc2VBbmRBcHBseShzOlN0cmF0ZWd5KTpBU1ROb2RlCgogIC8qKgogICAq
IFJld3JpdGUgdGhlIEFTVE5vZGUgYHNlbGZgIGJ5IHRoZSBzcGVjaWZpY2F0
aW9uIG9mIGEgcGFydGlhbCBmdW5jdGlvbiBgZnBgLgogICAqCiAgICogSWYg
d2Ugd2FudCB0byBjaGFuZ2UgYSBzcGVjaWZpYyBbW21vZGVsLlB5dGhvbl1d
LW5vZGUgaW4gdGhlIEFTVCBmb3IgZXhhbXBsZSB3ZSBjb3VsZAogICAqIGFw
cGx5IHRoZSBmb2xsb3dpbmcgcmV3cml0ZSBzdHJhdGVneToKICAgKnt7ewog
ICAqICAgYXN0LnJld3JpdGUoewogICAqICAgIGNhc2UgUHl0aG9uKCJbMSwy
LDNdIikgPT4gUHl0aG9uKCJbMSwyLDMsNF0iKQogICAqICAgfSkKICAgKn19
fQogICAqCiAgICogUGxlYXNlIG5vdGUsIHRoYXQgQVNUTm9kZXMgY2FuIG5v
dCBiZSByZXdyaXR0ZW4gYXJiaXRyYXJpbHkuIFNpbmNlIGVhY2ggQVNUTm9k
ZSBpbXBsaWVzCiAgICogYSBzcGVjaWZpYyBwYXJhbWV0ZXIgbGlzdC4gQW4g
QVNUIGhhcyB0byBzdGF5IHN0cnVjdHVyZS1jb25zaXN0ZW50IGFmdGVyIGFw
cGx5aW5nIHJld3JpdGluZyBydWxlcy4KICAgKiBBIHJld3JpdGluZyBydWxl
IGFzOgogICAqIHt7ewogICAqICAgewogICAqICAgIGNhc2UgUHl0aG9uKHMp
ID0+IFRvcExldmVsKE5pbCkKICAgKiAgIH0KICAgKiB9fX0KICAgKiBpcyBu
b3QgdmFsaWQgYXMgYSBbW21vZGVsLlRvcExldmVsXV0tbm9kZSBjYW4gbm90
IG9jY3VyIGF0IHBvc2l0aW9ucyB3aGVyZSBhIFtbbW9kZWwuUHl0aG9uXV0t
bm9kZSBjYW4uCiAgICoKICAgKiBAc2VlIFtbaHR0cHM6Ly9iaXRidWNrZXQu
b3JnL2lua3l0b25pay9raWFtYS9zcmMvMDMyNjMwZmEyMWRkYWQ1Y2YzM2Ni
ZDZlZjljMmYwMjc4NjYxYTY3NS93aWtpL1Jld3JpdGluZy5tZF1dCiAgICog
QHBhcmFtIGZwIFBhcnRpYWwgZnVuY3Rpb24gdGhhdCBkZWZpbmVzIGhvdyB0
aGUgYXN0IHNob3VsZCBiZSByZXdyaXR0ZW4uCiAgICogQHJldHVybiBBIHJl
d3JpdHRlbiBBU1QgYWNjb3JkaW5nIHRvIHRoZSBzcGVjaWZpY2F0aW9uIGlu
IGBmcGAgb3IgdGhlIHNhbWUgYXN0IGlmIGBmcGAgY291bGQgbm90IGJlIGFw
cGxpZWQuCiAgICovCiAgZGVmIHJld3JpdGUoZnA6QVNUTm9kZSA9PT4gQVNU
Tm9kZSk6IEFTVE5vZGUgPSBzZWxmLnRyYXZlcnNlQW5kQXBwbHkocnVsZShm
cCkpCgogIC8qKgogICAqIFRyYW5zZm9ybXMgYHNlbGZgIGludG8gYSB3ZWxs
IGZvcm1hdHRlZCBraXZ5IHByb2dyYW0gdGhhdCBjYW4gYmUgd3JpdHRlbgog
ICAqIGludG8gYSBmaWxlLgogICAqCiAgICogVGhlIGZvbGxvd2luZyBBU1RO
b2RlIGZvciBleGFtcGxlOgogICAqIHt7ewogICAqICAgVG9wTGV2ZWwoCiAg
ICogICAgTGlzdCgKICAgKiAgICAgIFJvb3QoCiAgICogICAgICAgIFdpZGdl
dCgKICAgKiAgICAgICAgICBQbG90LAogICAqICAgICAgICAgIExpc3QoCiAg
ICogICAgICAgICAgICBXaWRnZXQoCiAgICogICAgICAgICAgICAgIExpbmVH
cmFwaCwKICAgKiAgICAgICAgICAgICAgTGlzdCgKICAgKiAgICAgICAgICAg
ICAgICBQcm9wZXJ0eShiYWNrZ3JvdW5kX25vcm1hbCxMaXN0KCcnKSksCiAg
ICogICAgICAgICAgICAgICAgUHJvcGVydHkoYmFja2dyb3VuZF9jb2xvcixM
aXN0KFswLDAsMCwxXSkpCiAgICogICApKSkpKSkpCiAgICogfX19CiAgICoK
ICAgKiBpcyBwcmludGVkOgogICAqIHt7ewogICAqIFBsb3Q6CiAgICogIExp
bmVHcmFwaDoKICAgKiAgICBiYWNrZ3JvdW5kX25vcm1hbDogJycKICAgKiAg
ICBiYWNrZ3JvdW5kX2NvbG9yOiBbMCwwLDAsMV0KICAgKiB9fX0KICAgKgog
ICAqIEByZXR1cm4gQSBmb3JtYXR0ZWQgQVNUTm9kZSB0aGF0IGNhbiBiZSBp
bnRlcnByZXRlZCBhcyBhIEtpdnkgZmlsZS4KICAgKi8KICBkZWYgcHJldHR5
OlN0cmluZyA9IEtpdnlQcmV0dHlQcmludGVyLmZvcm1hdChzZWxmKS5sYXlv
dXQKfQ==
As far as I see, this encoding is correct and only contains the standard alphabet of characters for a Base64 encoding. If I decode this encoding here, I get a correct translation. However, I tried various approaches to decode it programmatically and did not find a solution yet.
Let contentEncoded be the string containing the encoded file. I tried the following:
java.util.Base64.getDecoder.decode(contentEncoded)
java.util.Base64.getDecoder.decode(contentEncoded.getBytes)
java.util.Base64.getDecoder.decode(contentEncoded.getBytes(StandardCharsets.UTF_8))
java.util.Base64.getUrlDecoder.decode(contentEncoded))
java.util.Base64.getUrlDecoder.decode(contentEncoded.getBytes(StandardCharsets.UTF_8))
java.util.Base64.getMimeDecoder.decode(contentEncoded.replaceAll("\\n", "").replaceAll("\\r", ""))
However, all of them resulted in an error message: java.lang.IllegalArgumentException: Illegal base64 character a.
My question is: Am I not seeing something obvious? Are there some hidden control characters? Has anybody had similar issues and was able to fix them?

Just remove line breaks and it should work.
contentEncoded.replace("\n", "")

The following snippet decodes the encoding correctly:
val decodedWithMime = java.util.Base64.getMimeDecoder.decode(contentEncoded)
val convertedByteArray = decodedWithMime.map(_.toChar).mkString
as pointed out by comments, the error message Illegal Base64 character a corresponds to the hex value for the newline character \n. Using the Mime Decoder it is possible to decode the string without removing the newline characters beforehand.

Encode to UTF-8. Encode character eg. ö to Ã¶

I want to encode a string in Android to UTF-8. For example this string:
Grüne Ähren beißen Flöhe
to
GrÃ¼ne Ãhren beiÃen FlÃ¶he
But no matter what I do I encode ü to ü or ü to %C3%BC (online often called 'raw URL encode').
Found solutions to convert to byte[] or URI.toASCIIString(). But non of them work for me.
UPDATE
I am participating in the eBay partner network and try to concat a searchword to my partner url.
The people of eBay must use a wrong character set, as UTF-8 URL encoded string don't work.
A searchword with UTF-8 URL encoding
(Grüne Ähren beißen Flöhe
to
Gr%C3%BCne%20%C3%84hren%20bei%C3%9Fen%20Fl%C3%B6he)
comes out to this result in the eBay searchbox:
If I encode my searchword with ISO_8859_1 it works (GrÃ¼ne Ãhren beiÃen FlÃ¶he):
Thank you very much community

What you essentially want is to convert a String to it's byte representation according to UTF-8 and interpret these bytes using a different Charset, such as ISO-8859-1.
This is usually the cause of many problems. You want to intentionally do what most developers do incorrectly (or they simply ignore the problems of charsets).
Since you just need this to work, use this piece of code:
byte[] bytes = "Grüne Ähren beißen Flöhe".getBytes("UTF-8");
String result = new String(bytes, "ISO-8859-1");
see it at work here.

Why do I have to encode a utf-8 parameter String to iso-Latin and then decode as utf-8 to get Java utf-8 String?

I have a Java servlet that takes a parameter String (inputString) that may contain Greek letters from a web page marked up as utf-8. Before I send it to a database I have to convert it to a new String (utf8String) as follows:
String utf8String = new String(inputString.getBytes("8859_1"), "UTF-8");
This works, but, as I hope will be appreciated, I hate doing something I don't understand, even if it works.
From the method description in the Java doc the getBytes() method "Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array" i.e. I am encoding it in 8859_1 — isoLatin. And from the Constructor description "Constructs a new String by decoding the specified array of bytes using the specified charset" i.e. decodes the byte array to utf-8.
Can someone explain to me why this is necessary?

My question is based on a misconception regarding the character set used for the HTTP request. I had assumed that because I marked up the web page from which the request was sent as UTF-8 the request would be sent as UTF-8, and so the Greek characters in the parameter sent to the servlet would be read as a UTF-8 String (‘inputString’ in my line of code) by the HttpRequest.getParameter() method. This is not the case.
HTTP requests are sent as ISO-8859-1 (POST) or ASCII (GET), which are generally the same. This is part of the URI Syntax specification — thanks to Andreas for pointing me to http://wiki.apache.org/tomcat/FAQ/CharacterEncoding where this is explained.
I had also forgotten that the encoding of Greek letters such as α for the request is URL-encoding, which produces %CE%B1. The getParameter() handles this by decoding it as two ISO-8859-1 characters, %CE and %B1 — Î and ± (I checked this).
I now understand why this needs to be turned into a byte array and the bytes interpreted as UTF-8. 0xCE does not represent a one-byte character in UTF-8 and hence it is addressed with the next byte, 0xB1, to be interpretted as α. (Î is 0xC3 0x8E and ± is 0xC2 0xB1 in UTF-8.)

When decoding, could you not create a class with a decoder method that takes the bytes [] as a parameter and
return it as a string? here is an example that i have used before.
public class Decoder
{
public String decode(byte[] bytes)
{
//Turns the bytes array into a string
String decodedString = new String(bytes);
return decodedString;
}
}
Try use this instead of .getBytes(). hope this works.

Java functions to encode Windows-1252 to UTF-8 getting the same symbol

I am new of this forum. I have a problem about the conversion between the encoding Windows-1252 to UTF-8.
I have a string encoded in Windows-1252 (e.g. the character: ¢). I would like to obtain the same symbol, but encoded in UTF-8. I mean: the source character and the destination character I would like that appear always the same (¢) but with different encoding.
Is it possibile? In addition: it exists a Java function which performs this conversion automatically (e.g. by passing the starting encoding and the end encoding)?
Thank you in advance for all of your help.
Hello,
Simone

You can transcode between various encodings using strings as an intermediary:
byte[] windows1252 = { (byte) 0xA2 };
String utf16 = new String(windows1252, Charset.forName("windows-1252"));
byte[] utf8 = utf16.getBytes(StandardCharsets.UTF_8);
char data is always UTF-16 in Java.

Character Encoding in Java

In eclipse, I changed the default encoding to ISO-8859-1. Then I wrote this:
String str = "Русский язык ";
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
ps.print(str);
It should print the String correctly, as I am specifying UTF-8 encoding. However, it is not printing.

The ISO-8859-1 character encoding only supports characters between 0 and 255, and anything else is likely to be turned into '?'

If you save the source file (the .java file) as ISO-8859-1 than str will be encoded by javac using ISO-8859-1. Your problem does not lie in the creation of PrintStream: the str you are printing is wrong from the beginning.

Yes, it looks like the terminal that your are sending this output does not support this encoding.
If you are running Eclipse, you could set the encoding as follows:
In Run Configurations...->Common ->Encoding->Other
Select UTF-8

You are basically telling the PrintStream writer to expect the input characters to be UTF-8 encoded and to output it as UTF-8. There is no conversion. If you set your IDE to use ISO-8859-1 as character encoding for your file, which in turns contains the input string than you pipe ISO-8859-1 encoded characters into an UTF-8 expecting writer. So the writer treats the bytes receiving as UTF encoded characters which will result in data junk.
Either set your IDE to encode your source files in UTF-8 and check that your characters are correctly displayed and stored. Or tell your writer to treat them as ISO-8859-1, either way should do.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java Encoding for "GB2312" CHARACTER ® replacing with question mark(?) - java

Related

Illegal base64 character "a" using java.util.Base64 from within Scala

Encode to UTF-8. Encode character eg. ö to Ã¶

Why do I have to encode a utf-8 parameter String to iso-Latin and then decode as utf-8 to get Java utf-8 String?

Java functions to encode Windows-1252 to UTF-8 getting the same symbol

Character Encoding in Java

Categories

Resources