How to Convert UTF-16 Surrogate Decimal to Unicode in Java

I have some string data like
&#55357 ;&#56842 ;
These are UTF-16 surrogate pairs in decimal format.
How can I convert them to Unicode code points in Java, so that my client can understand the decimal HTML entity without the surrogate pair?
Example: &#128522 ; is the response I want for the string above.

Assuming you already parsed the string to get the 2 numbers, just create a String from those two char values:
String s = new String(new char[] { 55357, 56842 });
System.out.println(s);
Output
😊
To get the code point of that:
s.codePointAt(0) // returns 128522
You don't have to create a string though:
Character.toCodePoint((char) 55357, (char) 56842) // returns 128522
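The answer above assumes the two numbers are already parsed. As a rough sketch of the full round trip, the following rewrites adjacent decimal entities that form a surrogate pair into a single code-point entity. The class name, method name, and the tolerance for whitespace before the semicolon are assumptions made for illustration, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SurrogateEntities {
    // Two adjacent decimal character references, e.g. "&#55357;&#56842;".
    // Whitespace before ';' is tolerated (an assumption about the input).
    private static final Pattern PAIR =
            Pattern.compile("&#(\\d{1,7})\\s*;&#(\\d{1,7})\\s*;");

    static String mergeSurrogateEntities(String input) {
        Matcher m = PAIR.matcher(input);
        StringBuilder out = new StringBuilder();
        int copiedUpTo = 0;  // input before this index is already in 'out'
        int searchFrom = 0;  // where the next regex search starts
        while (m.find(searchFrom)) {
            int hi = Integer.parseInt(m.group(1));
            int lo = Integer.parseInt(m.group(2));
            boolean isPair = hi <= 0xFFFF && lo <= 0xFFFF
                    && Character.isSurrogatePair((char) hi, (char) lo);
            if (isPair) {
                // Copy untouched text, then emit one entity for the code point.
                out.append(input, copiedUpTo, m.start());
                out.append("&#")
                   .append(Character.toCodePoint((char) hi, (char) lo))
                   .append(';');
                copiedUpTo = m.end();
                searchFrom = m.end();
            } else {
                // Not a surrogate pair: retry one position later so a pair
                // beginning at the second entity is not skipped.
                searchFrom = m.start() + 1;
            }
        }
        out.append(input, copiedUpTo, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mergeSurrogateEntities("&#55357;&#56842;")); // &#128522;
    }
}
```

Entities that are not part of a surrogate pair are left exactly as they were.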

Related

Convert ASCII representation of unicode to unicode

I have an application that gets some Strings via JSON.
The problem is that I think they are being sent as ASCII, while the text really should be in Unicode.
For example, parts of the string contain "\u00f6", which is the Swedish letter "ö".
The Swedish word for "buy" is "köpa", and the string I get is "k\u00f6pa".
Is there an easy way, after I have received this String in Java, to convert it to the correct representation?
That is, I want to convert strings like "k\u00f6pa" to "köpa".
Thanks for all help!
Well, that is easy enough: just use a JSON library. With Jackson, for instance, you would write:
final ObjectMapper mapper = new ObjectMapper();
final JsonNode node = mapper.readTree(your, source, here);
The JsonNode will in fact be a TextNode; you can just retrieve the text as:
node.textValue()
Note that this IS NOT an "ASCII representation" of a String; it just happens that JSON strings can contain UTF-16 code unit character escapes like this one.
(you will lose the quotes around the value, though, but that is probably what you expect anyway)
The hex code is a 16-bit value, which an int can handle just fine, so you can use Integer.parseInt(s, 16), where s is the string without the "\u" prefix. Then narrow that int to a char; four hex digits are guaranteed to fit.
Throw in a regex (to validate the string and also extract the hex code), and you're all done. Note that matches() below succeeds only when the whole input is a single escape; to decode escapes embedded in longer text, loop with find() instead.
Pattern p = Pattern.compile("\\\\u([0-9a-fA-F]{4})");
Matcher m = p.matcher(arg);
if (m.matches()) {
String code = m.group(1);
int i = Integer.parseInt(code, 16);
char c = (char) i;
System.out.println(c);
}
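For a string such as "k\u00f6pa", where the escape sits inside ordinary text, the same regex can be applied with find() and appendReplacement. This is a minimal sketch with hypothetical class and method names; it does not handle an escaped backslash in front of the "u", which is one more reason a real JSON parser is the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapes {
    // A backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Four hex digits always fit in a 16-bit char.
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' and backslash in the result.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal "k\\u00f6pa" is the 10-character text with a real backslash.
        System.out.println(unescape("k\\u00f6pa"));
    }
}
```

Two consecutive escapes that form a surrogate pair are decoded into two chars, which together make up one supplementary code point.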

Handling Strings with octal ASCII code (in Java)

I'm having some trouble with a text file that contains strings like this:
Grandchamp-le-Ch\303\242teau
It's the name of a Wikipedia page, by the way. I think the two octal escapes represent "â".
Is there any piece of software that easily converts the string above into
Grandchamp-le-Château
or maybe
Grandchamp-le-Ch%C3%A2teau
I would prefer a Java-based solution, but any other idea is welcome too!
Any advice or hint is very much appreciated!
This is a slightly hacky way to achieve your goal:
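import static java.lang.Integer.parseInt;
import static java.nio.charset.StandardCharsets.ISO_8859_1;
import static java.nio.charset.StandardCharsets.UTF_8;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String name = "Grandchamp-le-Ch\\303\\242teau";
final Matcher m = Pattern.compile("\\\\([0-3][0-7]{2})").matcher(name);
final StringBuffer out = new StringBuffer();
while (m.find()) m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf((char) parseInt(m.group(1), 8))));
m.appendTail(out);
final String decoded = new String(out.toString().getBytes(ISO_8859_1), UTF_8);
System.out.println(decoded);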
How it works:
the regular expression matches the octal character notation;
the original string is transformed by replacing each such octal notation with a char whose numeric value equals that octal number;
the new string (now in "mojibake" state) is written out as bytes, using a single-byte encoding (any will do, but ISO_8859_1 happens to be the standard one);
the bytes are re-read, now assuming they are a UTF-8-encoded string.
The code will print out
Grandchamp-le-Château
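An alternative that skips the intermediate mojibake string is to collect the bytes directly and decode them once as UTF-8. The sketch below is one way to do that under the assumption that every escape is a backslash followed by exactly three octal digits; the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class OctalEscapes {
    static String decode(String input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < input.length()) {
            if (input.charAt(i) == '\\' && isOctalByte(input, i + 1)) {
                // Three octal digits in the range 000..377 describe one raw byte.
                bytes.write(Integer.parseInt(input.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                // Ordinary text: append its UTF-8 bytes, one code point at a time.
                int cp = input.codePointAt(i);
                byte[] utf8 = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                bytes.write(utf8, 0, utf8.length);
                i += Character.charCount(cp);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    // True if three characters starting at 'from' form an octal value <= 0377.
    private static boolean isOctalByte(String s, int from) {
        if (from + 3 > s.length()) return false;
        char a = s.charAt(from), b = s.charAt(from + 1), c = s.charAt(from + 2);
        return a >= '0' && a <= '3' && b >= '0' && b <= '7' && c >= '0' && c <= '7';
    }

    public static void main(String[] args) {
        System.out.println(decode("Grandchamp-le-Ch\\303\\242teau"));
    }
}
```

Because the escapes become bytes rather than chars, no single-byte charset is needed as a stepping stone.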
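If the string is a Java literal, the compiler already interprets the octal escapes, so only the re-encoding step is needed: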
String myString = "Grandchamp-le-Ch\303\242teau";
byte[] byteArray = myString.getBytes("ISO-8859-1");
String result = new String(byteArray, "UTF-8");
System.out.println(result);
This prints:
Grandchamp-le-Château

Java: Convert a hexadecimal encoded String to a hexadecimal byte

An item ID in hexadecimal and an amount in decimal have to be entered in two JTextFields.
Now I have to convert the item ID, which is hexadecimal-encoded in a String, to a byte.
String str = itemIdField.getText(); // Would be, for example, "5e"
byte b = // Should be 0x5e then.
So if str = "5e", b = 0x5e;
if str = "6b", b = 0x6b, and so on.
Does anybody know what the code for that conversion would be?
Google doesn't know; it thinks I want to convert the text to a byte[].
Thank you, Richie
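You can use Byte.parseByte(str, 16), which returns the byte value represented by the hexadecimal string in str. Because byte is signed, this only accepts values up to "7f"; for "80" through "ff" it throws a NumberFormatException, so use (byte) Integer.parseInt(str, 16) if the full range is needed.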
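A small sketch of a helper that accepts the whole 00 to ff range may make the difference concrete. The helper name and the decision to reject anything outside one byte are assumptions for illustration.

```java
class HexByte {
    // Parses one or two hex digits ("0".."ff") into a byte.
    static byte parseHexByte(String hex) {
        int value = Integer.parseInt(hex.trim(), 16);
        if (value < 0 || value > 0xFF) {
            throw new NumberFormatException("Not a single byte: " + hex);
        }
        // Narrowing keeps the low 8 bits; values above 0x7f become negative.
        return (byte) value;
    }

    public static void main(String[] args) {
        System.out.println(parseHexByte("5e")); // 94
        System.out.println(parseHexByte("ff")); // -1
    }
}
```

To show such a byte as an unsigned number again, mask it with `b & 0xFF`.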

Converting binary data to String

If I have some binary data D and convert it to a string S, I expect that converting S back to binary will give me D. But that is not the case.
public class A {
public static void main(String[] args) throws IOException {
final byte[] bytes = new byte[]{-114, 104, -35};// In hex: 8E 68 DD
System.out.println(bytes.length); //prints 3
System.out.println(new String(bytes, "UTF-8").getBytes("UTF-8").length); //prints 7
}
}
Why does this happen?
Converting a byte array to a String and back again is not a one-to-one mapping. According to the docs, the String implementation uses a CharsetDecoder to convert the incoming byte array into Unicode. The first and last bytes in your input array do not form valid UTF-8 sequences, so the decoder substitutes a replacement character for each.
It's likely that the bytes you're converting to a string don't actually form a valid string. If Java can't decode a byte sequence, it substitutes a replacement character. This means that when you convert back to a byte array, it won't be the same as when you started. If you try with a valid set of bytes, you should be more successful.
Your data can't be decoded into valid Unicode characters using UTF-8. Look at the decoded string. It consists of 3 characters: 0xFFFD, 0x0068 and 0xFFFD. The first and last are "�", the Unicode replacement character, which takes 3 bytes in UTF-8, hence 3 + 1 + 3 = 7 bytes. I think you need to choose another encoding. For example, "CP866" produces a valid string and converts back into the same array.
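To see the difference between a lossy and a lossless round trip, the sketch below runs the same three bytes through UTF-8, ISO-8859-1, and Base64. ISO-8859-1 and Base64 are suggestions added here, not encodings named in the answers above: ISO-8859-1 maps every byte value to exactly one char, and Base64 is the usual choice when binary data has to travel as text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class BinaryRoundTrip {
    // Lossy for arbitrary bytes: invalid sequences become U+FFFD.
    static byte[] viaUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    // Lossless: each byte 0x00..0xFF maps to the char with the same value.
    static byte[] viaLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.ISO_8859_1);
    }

    // Lossless and safe to embed in text formats such as JSON or XML.
    static byte[] viaBase64(byte[] data) {
        String text = Base64.getEncoder().encodeToString(data);
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] bytes = {-114, 104, -35}; // hex: 8E 68 DD
        System.out.println(viaUtf8(bytes).length);                  // 7
        System.out.println(Arrays.equals(bytes, viaLatin1(bytes))); // true
        System.out.println(Arrays.equals(bytes, viaBase64(bytes))); // true
    }
}
```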

Java String.codePointAt returns unexpected value

If I use any ASCII characters from 33 to 127, the codePointAt method gives the correct decimal value, for example:
String s1 = new String("#");
int val = s1.codePointAt(0);
This returns 35 which is the correct value.
But if I try to use ASCII characters from 128 to 255 (extended ASCII/ISO-8859-1), this method gives the wrong value, for example:
String s1 = new String("ƒ"); // Latin small letter f with hook
int val = s1.codePointAt(0);
This should return 159 as per this reference table, but instead returns 402. Why is this?
"But if I try use ASCII characters from 128 to 255"
ASCII doesn't have values in this range. It only uses 7 bits.
Java chars are UTF-16 (and nothing else!). If you want to represent ASCII using Java, you need to use a byte array.
The codePointAt method returns the 32-bit code point. 16-bit chars can't contain the entire Unicode range, so some code points must be split across two chars (as per the encoding scheme for UTF-16). The codePointAt method helps resolve chars to code points.
I wrote a rough guide to encoding in Java here.
Java chars are not encoded in ISO-8859-1. They use UTF-16 which has the same values for 7bit ASCII characters (only values from 0-127).
To get the correct value for ISO-8859-1 you have to convert your string into a byte[] with String.getBytes("ISO-8859-1"); and look in the byte array.
Update
ISO-8859-1 is not the extended ASCII encoding, use String.getBytes("Cp437"); to get the correct values.
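The two numbers in the question can be reproduced side by side: 402 is the Unicode code point, 159 is the byte that code page 437 assigns to the same character. This is a small sketch with hypothetical names; it assumes the running JDK ships the IBM437 charset (the canonical name for Cp437), which is common but not guaranteed, so the code checks first.

```java
import java.nio.charset.Charset;

class CodePointVsByte {
    // The Unicode code point of the first character.
    static int codePoint(String s) {
        return s.codePointAt(0);
    }

    // The first encoded byte in the given charset, as an unsigned value.
    static int byteValueIn(String s, String charsetName) {
        byte[] encoded = s.getBytes(Charset.forName(charsetName));
        return encoded[0] & 0xFF;
    }

    public static void main(String[] args) {
        String fHook = "\u0192"; // LATIN SMALL LETTER F WITH HOOK
        System.out.println(codePoint(fHook)); // 402
        if (Charset.isSupported("IBM437")) {
            System.out.println(byteValueIn(fHook, "IBM437")); // 159
        }
    }
}
```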
In Unicode:
ƒ 0x0192 LATIN SMALL LETTER F WITH HOOK
String.codePointAt returns the Unicode code point at the specified index.
The Unicode code point of ƒ is 402 (0x192), see
http://www.decodeunicode.org/de/u+0192/properties
So
System.out.println("ƒ".codePointAt(0));
printing 402 is correct.
If you are interested in the representation in other charsets, you can print out the byte representation of the character in each charset via getBytes(String charsetName):
final String s = "ƒ";
for (final String csName : Charset.availableCharsets().keySet()) {
try {
final Charset cs = Charset.forName(csName);
final CharsetEncoder encode = cs.newEncoder();
if (encode.canEncode(s))
{
System.out.println(csName + ": " + Arrays.toString(s.getBytes(csName)));
}
} catch (final UnsupportedOperationException uoe) {
} catch (final UnsupportedEncodingException e) {
}
}
