Java String.codePointAt returns unexpected value - java

If I use any ASCII characters from 33 to 127, the codePointAt method gives the correct decimal value, for example:
String s1 = new String("#");
int val = s1.codePointAt(0);
This returns 35 which is the correct value.
But if I try to use ASCII characters from 128 to 255 (extended ASCII/ISO-8859-1), this method gives the wrong value, for example:
String s1 = new String("ƒ"); // Latin small letter f with hook
int val = s1.codePointAt(0);
This should return 159 according to this reference table, but it returns 402 instead. Why is this?

"But if I try to use ASCII characters from 128 to 255"
ASCII doesn't have values in this range. It only uses 7 bits.
Java chars are UTF-16 (and nothing else!). If you want to represent ASCII using Java, you need to use a byte array.
The codePointAt method returns the code point as a 32-bit int. 16-bit chars can't hold the entire Unicode range, so some code points must be split across two chars (as per the encoding scheme for UTF-16). The codePointAt method helps resolve chars to code points.
I wrote a rough guide to encoding in Java here.
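To illustrate the point about code points being split across two chars, here is a minimal sketch. The class and helper names (SurrogatePairDemo, charCount, codePointCount) are made up for this example; only the String methods are standard API.

```java
class SurrogatePairDemo {
    // Number of UTF-16 code units (Java chars) used to store the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points the string represents.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane,
        // so UTF-16 stores it as the surrogate pair D83D DE00.
        String s = "\uD83D\uDE00";
        System.out.println(charCount(s));                           // 2
        System.out.println(codePointCount(s));                      // 1
        System.out.println(Integer.toHexString(s.charAt(0)));       // d83d
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1f600
    }
}
```

For characters inside the BMP, such as ƒ (U+0192), charAt and codePointAt return the same number.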

Java chars are not encoded in ISO-8859-1. They use UTF-16, which has the same values as 7-bit ASCII only for the characters 0-127.
To get the correct value for ISO-8859-1 you have to convert your string into a byte[] with String.getBytes("ISO-8859-1"); and look in the byte array.
Update
ISO-8859-1 is not the extended ASCII encoding, use String.getBytes("Cp437"); to get the correct values.
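A small sketch of the difference between the code point and the byte value in a particular charset. The helper unsignedBytes is hypothetical, and the sketch assumes the running JRE may or may not ship the Cp437 charset, so that lookup is guarded.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class CharsetValueDemo {
    // Encodes the string in the named charset and returns each byte as 0..255.
    static int[] unsignedBytes(String s, String charsetName) {
        byte[] raw = s.getBytes(Charset.forName(charsetName));
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // mask to undo sign extension
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "\u0192"; // LATIN SMALL LETTER F WITH HOOK
        System.out.println(s.codePointAt(0));                           // 402
        System.out.println(Arrays.toString(unsignedBytes(s, "UTF-8"))); // [198, 146]
        // Cp437 is an optional charset; it is only queried if this JRE provides it.
        if (Charset.isSupported("Cp437")) {
            System.out.println(Arrays.toString(unsignedBytes(s, "Cp437")));
        }
    }
}
```

If Cp437 is available, the last line should show the single value from the code page table that the question refers to.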

In Unicode:
ƒ 0x0192 LATIN SMALL LETTER F WITH HOOK

String.codePointAt returns the Unicode code point at the specified index.
The Unicode code point of ƒ is 402 (0x0192), see
http://www.decodeunicode.org/de/u+0192/properties
So
System.out.println("ƒ".codePointAt(0));
printing 402 is correct.
If you are interested in the representation in other charsets, you can print out the byte representation of the character in each of them via getBytes(String charsetName):
final String s = "ƒ";
for (final String csName : Charset.availableCharsets().keySet()) {
    try {
        final Charset cs = Charset.forName(csName);
        final CharsetEncoder encode = cs.newEncoder();
        if (encode.canEncode(s)) {
            System.out.println(csName + ": " + Arrays.toString(s.getBytes(csName)));
        }
    } catch (final UnsupportedOperationException uoe) {
        // this charset does not support encoding; skip it
    } catch (final UnsupportedEncodingException e) {
        // not expected for names taken from availableCharsets(); skip it
    }
}

Related

convert ASCII into Hex

I have a GUI where I want to convert ASCII into hex, but it prints ffffff84 instead of 84. This only happens with ä, ö, ü. What went wrong?
Example input:
ä
Output:
ffffff84
My Code:
asciihex.addActionListener(new ActionListener() {
    public void actionPerformed(ActionEvent e) {
        output6.setText("");
        String hexadecimal2 = input4.getText().replace("\n", "");
        byte[] chars;
        try {
            chars = hexadecimal2.getBytes("CP850");
            StringBuffer hexa = new StringBuffer();
            for (int i = 0; i < chars.length; i++) {
                hexa.append(Integer.toHexString((int) chars[i]));
            }
            output6.append(hexa.toString());
        } catch (UnsupportedEncodingException e1) {
            e1.printStackTrace();
        }
    }
});
Code page 850 is not ASCII. And ä is not an ASCII character. Neither are your other examples of characters that don't work correctly.
What's happening is that the values of those characters, as bytes, are negative, because byte is a signed type in Java. (ä is -124, for instance.) When -124 is widened to an int, its two's complement hex representation is 0xFFFFFF84. You can get the unsigned value by adding 256 to it, which gives 132 (0x84). Then your conversion to hex would work.
You have to make an unsigned conversion of byte value to int, e.g.
hexa.append(Integer.toHexString((int) chars[i] & 0xFF));
or (Java 8)
hexa.append(Integer.toHexString(Byte.toUnsignedInt(chars[i])));
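A self-contained sketch of the unsigned conversion, using a raw byte so it does not depend on CP850 being installed. The class name and the toHex helper are made up for illustration. Unlike Integer.toHexString, the "%02x" format pads each byte to two digits, which is an assumption about the desired output.

```java
class UnsignedHexDemo {
    // Converts each byte to two lowercase hex digits, treating it as 0..255.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte b = (byte) 0x84; // the CP850 byte for 'ä' according to the question
        System.out.println(b);                          // -124
        System.out.println(Integer.toHexString(b));     // ffffff84 (sign-extended)
        System.out.println(toHex(new byte[] { b }));    // 84
    }
}
```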
First of all, in the GSM 7-bit default alphabet the hexadecimal value of "ä" is not 0x84, it is 0x7B.
To check all the hexadecimal values, please refer to the standard "ETSI TS 123 038 V14.0.0 (2017-04)".
Now for the coding part: I already wrote a function which takes any character and returns its hexadecimal value as per the given standard. I do not want to post that code, as that would be spoon-feeding; instead I want to guide you to write your own.
Steps:
1. First refer to the given document and understand the character tables in it.
2. Create a list which contains all the characters given in the table, ordered by index value.
3. Write a function that finds the given character's index position and builds the actual hexadecimal number from it. Keep in mind that the extended character set needs extra handling.
Hope this will help you. :-)

how to use bytes in java?

Why does this code return 221? What is the logic behind this? How does it work? Please explain this to me, as I am new to Java.
import java.io.UnsupportedEncodingException;
public class Checksrting {
    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        byte[] byteArray = new byte[2];
        byteArray[0] = 100;
        byteArray[1] = 100;
        Long ID = null;
        try {
            ID = Long.parseLong(new String(byteArray, "utf-8").trim(), 16);
            System.out.print(ID);
        } catch (NumberFormatException | UnsupportedEncodingException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
So please explain to me what is the use of utf-8 and ,16?
100 is the equivalent of the d character. So your string will become dd.
When you do
ID = Long.parseLong(new String(byteArray, "utf-8").trim(), 16);
you are parsing the string as a hexadecimal number and storing it in a long.
The decimal value of hex dd is 221; that's why you get that output.
what is the use of utf-8 and ,16?
utf-8 is the character encoding that the String constructor will use to build up the string, and 16 is the radix that will be used to convert your string to a long.
As you can see in the documentation, the String constructor takes a charset parameter:
Constructs a new String by decoding the specified array of bytes using the specified charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.
And the 16 is the radix which is to use for the conversion:
See the documentation from Long
It returns 221 because of the conversion of dd string to hexadecimal number.
new String(byteArray, "utf-8").trim();
In this statement, each byte with the value 100 is decoded to the character 'd'. Because there are 2 elements in byteArray, the result is the String "dd", which is then parsed as a hexadecimal number.
new String(byteArray, "utf-8").trim();
returns 'dd'
It is then parsed into a long. Because the radix parameter is 16, the string is interpreted as a hexadecimal number, i.e. 221 in decimal:
Long.parseLong("dd",16);
"utf-8" is the character encoding, i.e. how a String is represented as bytes.
UTF-8 uses a variable length encoding, and ASCII characters can be represented as a single byte with 0 as highest bit. This is the case for d which is represented as 100 (in decimal notation). Since you have 2 bytes with the number 100, this translates to the string "dd"
16 is the radix used for conversion from String to Long, so this translates from strings in hex notation.
A d in hex notation is 13 in decimal notation. So dd becomes 13 * 16 + 13 = 208 + 13 = 221
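The two steps described above can be checked separately with a short sketch. The class and method names (RadixDemo, decode, parseHex) are only illustrative.

```java
import java.nio.charset.StandardCharsets;

class RadixDemo {
    // Step 1: decode the bytes as UTF-8 text.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Step 2: interpret the text as a base-16 number.
    static long parseHex(String s) {
        return Long.parseLong(s, 16);
    }

    public static void main(String[] args) {
        String text = decode(new byte[] { 100, 100 });
        System.out.println(text);            // dd
        System.out.println(parseHex(text));  // 221
        System.out.println(13 * 16 + 13);    // 221, the same arithmetic by hand
    }
}
```

With radix 10 the same call would throw a NumberFormatException, because "dd" contains no decimal digits.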
I agree with the older answers, but am adding some advice on how to figure out this sort of issue.
First, if you are having trouble understanding a complicated expression, extract sub-expressions into local variables, and print those variables:
String s1 = new String(byteArray, "utf-8");
System.out.println("s1: |" + s1 + "|");
String s2 = s1.trim();
System.out.println("s2: |" + s2 + "|");
ID = Long.parseLong(s2, 16);
System.out.print(ID);
It now prints:
s1: |dd|
s2: |dd|
221
Next, look at the individual sub-expressions. If there is anything you do not understand about a call and what it did, look it up in the API documentation.
For example, you asked about the "16". The Long.parseLong(String s, int radix) documentation says: "Parses the string argument as a signed long in the radix specified by the second argument. ". The output from the modified program shows that s is "dd", so it is going to parse "dd" as a hexadecimal number. A programmer's calculator will show you that hex "dd" is decimal 221.

Decoding %E9 to utf8 fails

I'm having some trouble decoding a percent-encoded character.
What I need to decode is %E9. I have strings like D%E9bardeur and degr%E9.
What I do in my Java class is the following:
try
{
    System.out.println(o); // test
    o = URLDecoder.decode((String) o, "UTF-8");
}
catch (UnsupportedEncodingException e)
{
    e.printStackTrace();
}
After this operation, what I get is
D�bardeur and degr�
The very same happens when I don't decode to UTF-8.
Any advice?
thx
%E9 is not UTF-8.
The correct way to decode this would be:
URLDecoder.decode((String) o, "ISO-8859-1")
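A minimal sketch comparing the two decodings. The wrapper method decode is hypothetical; it only turns the checked exception into an unchecked one. The sketch assumes the sender percent-encoded the text as ISO-8859-1, which is what a lone %E9 for é suggests.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlDecodeDemo {
    // Thin wrapper around URLDecoder.decode that avoids the checked exception.
    static String decode(String s, String charsetName) {
        try {
            return URLDecoder.decode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String expected = "D\u00E9bardeur"; // "Débardeur"
        // A single byte E9 is 'é' in ISO-8859-1.
        System.out.println(decode("D%E9bardeur", "ISO-8859-1").equals(expected));  // true
        // The same text percent-encoded from UTF-8 would use two bytes, C3 A9.
        System.out.println(decode("D%C3%A9bardeur", "UTF-8").equals(expected));    // true
        // E9 alone is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println(decode("D%E9bardeur", "UTF-8").equals(expected));       // false
    }
}
```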
By %E9, could you mean there is a byte in your string that evaluates to hex E9? Because if so, that flags as "multibyte" in UTF-8, and there are 2 more "continuation bytes" (within the correct range) that follow.
Because remember, UTF-8 is a variable length encoding, so some code points (character values) are represented by 1 byte, some by 2, 3, etc.
If you have a string you're treating as UTF-8 and E9 is encountered, the next 2 bytes need to be in the correct range. For example, in this string, 00, which follows E9 is not a valid continuation byte:
http://hexutf8.com/?q=0x640x650x670x720xe90x00
Here's an example where E9 in a string is followed by the correct 2 bytes:
http://hexutf8.com/?q=0xc20xa90xe90x810xaa
And the appropriate character is represented.
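The same two cases can be reproduced in Java. This is a sketch with a made-up helper (decodeUtf8) that takes the byte values as ints for readability.

```java
import java.nio.charset.StandardCharsets;

class Utf8LeadByteDemo {
    // Builds a byte array from values 0..255 and decodes it as UTF-8.
    static String decodeUtf8(int... unsignedBytes) {
        byte[] raw = new byte[unsignedBytes.length];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) unsignedBytes[i];
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // E9 followed by two valid continuation bytes forms one character.
        String ok = decodeUtf8(0xE9, 0x81, 0xAA);
        System.out.println(ok.length());                              // 1
        System.out.println(Integer.toHexString(ok.codePointAt(0)));   // 906a

        // E9 followed by 00 is malformed; the decoder emits U+FFFD for the E9.
        String bad = decodeUtf8(0xE9, 0x00);
        System.out.println(Integer.toHexString(bad.charAt(0)));       // fffd
    }
}
```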

Converting binary data to String

If I have some binary data D and convert it to a string S, I expect that converting S back to binary will give me D. But that is not the case.
import java.io.IOException;
public class A {
    public static void main(String[] args) throws IOException {
        final byte[] bytes = new byte[]{-114, 104, -35}; // In hex: 8E 68 DD
        System.out.println(bytes.length); // prints 3
        System.out.println(new String(bytes, "UTF-8").getBytes("UTF-8").length); // prints 7
    }
}
Why does this happen?
Converting a byte array to a String and back again is not a one-to-one mapping. According to the docs, the String implementation uses a CharsetDecoder to convert the incoming byte array into Unicode. The first and last bytes in your input evidently do not form valid UTF-8 sequences, so each of them is replaced with a replacement character.
It's likely that the bytes you're converting to a string don't actually form a valid string. If Java can't figure out what you mean by a byte, it will attempt to fix it. This means that when you convert back to a byte array, the result won't be the same as what you started with. If you try with a valid sequence of bytes, you should be more successful.
Your data can't be decoded into valid Unicode characters using the UTF-8 encoding. Look at the decoded string. It consists of 3 characters: 0xFFFD, 0x0068 and 0xFFFD. The first and last are "�", the Unicode replacement character, which takes 3 bytes in UTF-8; that is why you get 3 + 1 + 3 = 7 bytes back. I think you need to choose another encoding. E.g. "CP866" produces a valid string and converts back into the same array.
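A sketch of the round trip with two charsets. It uses ISO-8859-1 instead of the CP866 mentioned above, purely because ISO-8859-1 is guaranteed to exist on every Java platform and maps each of the 256 byte values to a char. The class and method names are made up.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class RoundTripDemo {
    // Decodes the bytes with the named charset, then encodes the result again.
    static byte[] roundTrip(byte[] data, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(data, cs).getBytes(cs);
    }

    // True if decoding and re-encoding gives back exactly the same bytes.
    static boolean survivesRoundTrip(byte[] data, String charsetName) {
        return Arrays.equals(data, roundTrip(data, charsetName));
    }

    public static void main(String[] args) {
        byte[] bytes = { -114, 104, -35 }; // hex 8E 68 DD
        System.out.println(roundTrip(bytes, "UTF-8").length);          // 7
        System.out.println(survivesRoundTrip(bytes, "UTF-8"));         // false
        System.out.println(survivesRoundTrip(bytes, "ISO-8859-1"));    // true
    }
}
```

A single-byte charset with no unmapped values keeps the data intact, although the resulting string is not meaningful text.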

get char value in java

How can I get the UTF8 code of a char in Java ?
I have the char 'a' and I want the value 97
I have the char 'é' and I want the value 233
here is a table for more values
I tried Character.getNumericValue(a) but for a it gives me 10 and not 97, any idea why?
This seems very basic but any help would be appreciated!
char is actually a numeric type containing the unicode value (UTF-16, to be exact - you need two chars to represent characters outside the BMP) of the character. You can do everything with it that you can do with an int.
Character.getNumericValue() tries to interpret the character as a digit.
You can use the codePointAt(int index) method of java.lang.String for that. Here's an example:
"a".codePointAt(0) --> 97
"é".codePointAt(0) --> 233
If you want to avoid creating strings unnecessarily, the following works as well and can be used for char arrays:
Character.codePointAt(new char[] {'a'},0)
Those "UTF-8" codes are no such thing. They're actually just Unicode values, as per the Unicode code charts.
So an 'é' is actually U+00E9 - in UTF-8 it would be represented by two bytes { 0xc3, 0xa9 }.
Now to get the Unicode value - or to be more precise the UTF-16 value, as that's what Java uses internally - you just need to convert the value to an integer:
char c = '\u00e9'; // c is now e-acute
int i = c; // i is now 233
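To make the distinction between the Unicode value and the UTF-8 bytes concrete, here is a small sketch. The helpers codeUnit and utf8Bytes are hypothetical names; utf8Bytes is only meaningful for chars that are not lone surrogates.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharValueDemo {
    // The numeric value of a char is its UTF-16 code unit.
    static int codeUnit(char c) {
        return c;
    }

    // UTF-8 encoding of a single char, each byte shown as 0..255.
    static int[] utf8Bytes(char c) {
        byte[] raw = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(codeUnit('a'));                          // 97
        System.out.println(codeUnit('\u00e9'));                     // 233
        System.out.println(Arrays.toString(utf8Bytes('\u00e9')));   // [195, 169]
        // getNumericValue treats 'a' as a digit in bases above 10.
        System.out.println(Character.getNumericValue('a'));         // 10
    }
}
```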
This produces good result:
int a = 'a';
System.out.println(a); // outputs 97
Likewise:
System.out.println((int)'é');
prints out 233.
Note that the first example only works for characters included in the standard and extended ASCII character sets. The second works with all Unicode characters. You can achieve the same result by multiplying the char by 1.
System.out.println( 1 * 'é');
Your question is unclear. Do you want the Unicode codepoint for a particular character (which is the example you gave), or do you want to translate a Unicode codepoint into a UTF-8 byte sequence?
If the former, then I recommend the code charts at http://www.unicode.org/
If the latter, then the following program will do it:
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
public class Foo
{
    public static void main(String[] argv)
    throws Exception
    {
        char c = '\u00E9';
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        OutputStreamWriter out = new OutputStreamWriter(bos, "UTF-8");
        out.write(c);
        out.flush();
        byte[] bytes = bos.toByteArray();
        // mask with 0xFF so each byte prints as an unsigned value
        for (int ii = 0 ; ii < bytes.length ; ii++)
            System.out.println(bytes[ii] & 0xFF);
    }
}
(there's also an online Unicode to UTF8 page, but I don't have the URL on this machine)
My method to do it is something like this:
char c = 'c';
int i = Character.codePointAt(String.valueOf(c), 0);
// testing
System.out.println(String.format("%c -> %d", c, i)); // c -> 99
You can create a simple loop to print characters by their numeric value (here 12 through 999) like this:
public class UTF8Characters {
    public static void main(String[] args) {
        for (int i = 12; i <= 999; i++) {
            System.out.println(i + " - " + (char) i);
        }
    }
}
There is an open source library, MgntUtils, that has a utility class StringUnicodeEncoderDecoder. That class provides static methods that convert any String into a Unicode sequence and vice versa. Very simple and useful. To convert a String you just do:
String codes = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(myString);
For example a String "Hello World" will be converted into
"\u0048\u0065\u006c\u006c\u006f\u0020
\u0057\u006f\u0072\u006c\u0064"
It works with any language. Here is the link to the article that explains all the details about the library: MgntUtils. Look for the subtitle "String Unicode converter". The article gives you links to Maven Central, where you can get the artifacts, and to GitHub, where you can get the project itself. The library comes with well-written Javadoc and source code.
