Decoding %E9 to utf8 fails - java

I'm having some trouble decoding a percent-encoded character.
What I need to decode is %E9; I have strings like D%E9bardeur and degr%E9.
What I do in my Java class is the following:
try
{
    System.out.println(o); // test
    o = URLDecoder.decode((String) o, "UTF-8");
}
catch (UnsupportedEncodingException e)
{
    e.printStackTrace();
}
After this operation, what I get is
D�bardeur and degr�
The very same thing happens when I don't decode as UTF-8.
Any advice?
Thanks

%E9 is not UTF-8.
The correct way to decode this would be:
URLDecoder.decode((String) o, "ISO-8859-1")
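For example, a minimal sketch using the strings from the question (the DecodeDemo class name is just for illustration):
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // %E9 is the ISO-8859-1 (Latin-1) byte for 'é', so decode with that charset
        System.out.println(URLDecoder.decode("D%E9bardeur", "ISO-8859-1")); // Débardeur
        System.out.println(URLDecoder.decode("degr%E9", "ISO-8859-1"));     // degré
    }
}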

By %E9, could you mean there is a byte in your string that evaluates to hex E9? If so, that flags the start of a multibyte sequence in UTF-8, and 2 more "continuation bytes" (within the correct range) must follow.
Because remember, UTF-8 is a variable-length encoding, so some code points (character values) are represented by 1 byte, some by 2, 3, etc.
If you have a string you're treating as UTF-8 and E9 is encountered, the next 2 bytes need to be in the correct range. For example, in this string, the 00 that follows E9 is not a valid continuation byte:
http://hexutf8.com/?q=0x640x650x670x720xe90x00
Here's an example where E9 in a string is followed by the correct 2 bytes:
http://hexutf8.com/?q=0xc20xa90xe90x810xaa
And the appropriate character is represented.
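You can reproduce both cases in Java; here is a quick sketch (the byte values are taken from the two links above):
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // E9 followed by 00: 00 is not a valid continuation byte, so E9 decodes to U+FFFD
        byte[] bad = { 0x64, 0x65, 0x67, 0x72, (byte) 0xE9, 0x00 };
        System.out.println(new String(bad, StandardCharsets.UTF_8)); // degr� plus a NUL

        // E9 followed by two valid continuation bytes (81, AA) decodes to one character, U+906A
        byte[] good = { (byte) 0xC2, (byte) 0xA9, (byte) 0xE9, (byte) 0x81, (byte) 0xAA };
        System.out.println(new String(good, StandardCharsets.UTF_8)); // ©遪
    }
}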

Related

convert ASCII into Hex

I have a GUI where I want to convert ASCII into hex, but it prints ffffff84 instead of 84. This only happens with ä, ö, and ü. What went wrong?
Example input:
ä
Output:
ffffff84
My Code:
asciihex.addActionListener(new ActionListener() {
    public void actionPerformed(ActionEvent e) {
        output6.setText("");
        String hexadecimal2 = input4.getText().replace("\n", "");
        byte[] chars;
        try {
            chars = hexadecimal2.getBytes("CP850");
            StringBuffer hexa = new StringBuffer();
            for (int i = 0; i < chars.length; i++) {
                hexa.append(Integer.toHexString((int) chars[i]));
            }
            output6.append(hexa.toString());
        } catch (UnsupportedEncodingException e1) {
            e1.printStackTrace();
        }
    }
});
Code page 850 is not ASCII, and ä is not an ASCII character. Neither are your other examples of characters that don't work correctly.
What's happening is that the values of those characters, as bytes, are negative, because byte is a signed type in Java. (ä is -124, for instance.) -124 in two's complement hex as an int is 0xFFFFFF84. You can get the unsigned version by adding 256, to get 132 (0x84). Then your conversion to hex would work.
You have to make an unsigned conversion of byte value to int, e.g.
hexa.append(Integer.toHexString((int) chars[i] & 0xFF));
or (Java 8)
hexa.append(Integer.toHexString(Byte.toUnsignedInt(chars[i])));
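A tiny check of the masking (an illustration, not from the question; 0x84 is ä's byte in CP850):
byte b = (byte) 0x84; // stored as -124, because byte is signed
System.out.println(Integer.toHexString(b));        // ffffff84 (sign-extended to int)
System.out.println(Integer.toHexString(b & 0xFF)); // 84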
First of all, the hexadecimal value of "ä" is not 0x84; it's 0x7B.
To check all the hexadecimal values, please refer to the standard "ETSI TS 123 038 V14.0.0 (2017-04)".
Now for the coding part: I have already written a function which takes any ASCII character and returns its hexadecimal value as per the given standard. I do not want to post that code, as it would be spoon-feeding; instead I want to guide you to write your own.
Steps:
1. First, refer to the given document and understand the character tables.
2. Create a list containing all the characters given in the table, in their index order.
3. Write a function that finds the given character's index position and builds the actual hexadecimal number. Keep in mind to write extra functionality for the extended character set.
Hope this will help you. :-)

How to read extended ASCII code in Java

Hello, today I have a problem printing extended ASCII codes in Java. When I try to print them, they do not display. How can I print them?
You can use the String constructor that takes a byte array and a character set to convert a code page 437 ("IBM extended ASCII") character to a Java UTF-16 char:
public static char extendedAscii(int codePoint) throws UnsupportedEncodingException {
    return new String(new byte[] { (byte) codePoint }, "Cp437").charAt(0);
}
(Note: Yes, all characters in code page 437 fit in single UTF-16 chars; I checked.)
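For example, assuming the helper above (in code page 437, 0x84 is ä):
System.out.println(extendedAscii(0x84)); // prints ä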

how to use bytes in java?

Why does this code print 221? What is the logic behind this? How is this working? Please explain this to me, as I am new to Java.
import java.io.UnsupportedEncodingException;

public class Checksrting {

    /**
     * @param args
     */
    public static void main(String[] args) {
        byte[] byteArray = new byte[2];
        byteArray[0] = 100;
        byteArray[1] = 100;
        Long ID = null;
        try {
            ID = Long.parseLong(new String(byteArray, "utf-8").trim(), 16);
            System.out.print(ID);
        } catch (NumberFormatException | UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}
So please explain to me: what is the use of "utf-8" and of the ", 16"?
100 is the ASCII value of the d character, so your string will become "dd".
When you do
ID = Long.parseLong(new String(byteArray, "utf-8").trim(), 16);
you are converting the string to a long number, in hexadecimal format.
The decimal value of hex dd is 221; that's why you get that output.
what is the use of "utf-8" and of the ", 16"?
"utf-8" is the character encoding that the String constructor will use to build up the string, and 16 is the radix that will be used to convert your string to a long.
As you can see in the documentation, the String constructor takes a charset parameter:
Constructs a new String by decoding the specified array of bytes using the specified charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.
And 16 is the radix to be used for the conversion:
See the documentation for Long.
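Putting the two pieces together in a minimal sketch (hypothetical, mirroring the question's code, with StandardCharsets.UTF_8 used to avoid the checked exception):
byte[] byteArray = { 100, 100 };                          // each 100 is the UTF-8/ASCII code of 'd'
String s = new String(byteArray, StandardCharsets.UTF_8); // "dd"
System.out.println(Long.parseLong(s, 16));                // 221, i.e. 0xdd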
It returns 221 because of the conversion of the string dd to a hexadecimal number.
new String(byteArray, "utf-8").trim();
With this statement, byteArray[0] contains 100, which converted to a character is 'd'; as there are 2 elements in byteArray, it creates the String "dd".
new String(byteArray, "utf-8").trim();
returns "dd"
Then it is parsed into a Long value; as the radix parameter is given as 16, it is converted from hexadecimal format, i.e. 221:
Long.parseLong("dd", 16);
"utf-8" is the character encoding, i.e. how a String is represented as bytes.
UTF-8 uses a variable length encoding, and ASCII characters can be represented as a single byte with 0 as highest bit. This is the case for d which is represented as 100 (in decimal notation). Since you have 2 bytes with the number 100, this translates to the string "dd"
16 is the radix used for conversion from String to Long, so this translates from strings in hex notation.
A d in hex notation is 13 in decimal notation. So dd becomes 13 * 16 + 13 = 208 + 13 = 221
I agree with the older answers, but am adding some advice on how to figure out this sort of issue.
First, if you are having trouble understanding a complicated expression, extract sub-expressions into local variables, and print those variables:
String s1 = new String(byteArray, "utf-8");
System.out.println("s1: |" + s1 + "|");
String s2 = s1.trim();
System.out.println("s2: |" + s2 + "|");
ID = Long.parseLong(s2, 16);
System.out.print(ID);
It now prints:
s1: |dd|
s2: |dd|
221
Next, look at the individual sub-expressions. If there is anything you do not understand about a call and what it did, look it up in the API documentation.
For example, you asked about the "16". The Long.parseLong(String s, int radix) documentation says: "Parses the string argument as a signed long in the radix specified by the second argument. ". The output from the modified program shows that s is "dd", so it is going to parse "dd" as a hexadecimal number. A programmer's calculator will show you that hex "dd" is decimal 221.

Converting binary data to String

If I have some binary data D and I convert it to a string S, I expect that converting S back to binary will give me D. But that's wrong.
import java.io.IOException;

public class A {
    public static void main(String[] args) throws IOException {
        final byte[] bytes = new byte[] { -114, 104, -35 }; // in hex: 8E 68 DD
        System.out.println(bytes.length); // prints 3
        System.out.println(new String(bytes, "UTF-8").getBytes("UTF-8").length); // prints 7
    }
}
Why does this happen?
Converting between a byte array and a String and back again is not a one-to-one mapping. Reading the docs, the String implementation uses a CharsetDecoder to convert the incoming byte array into Unicode. The first and last bytes in your input byte array do not map to valid Unicode characters under UTF-8, so it replaces them with the replacement character.
It's likely that the bytes you're converting to a string don't actually form a valid string. If Java can't figure out what you mean by each byte, it will attempt to fix them. This means that when you convert back to the byte array, it won't be the same as when you started. If you try with a valid set of bytes, then you should be more successful.
Your data can't be decoded into valid Unicode characters using the UTF-8 encoding. Look at the decoded string: it consists of 3 characters, 0xFFFD, 0x0068 and 0xFFFD. The first and last are "�", the Unicode replacement character. I think you need to choose another encoding, e.g. "CP866" produces a valid string and converts back into the same array.
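A quick way to see the difference is to decode with a charset in which every byte value is a valid character; ISO-8859-1 has that property, so the round trip is lossless (a small sketch, not from the answers above):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] bytes = { -114, 104, -35 };
        // ISO-8859-1 maps every byte 0x00-0xFF to exactly one character, so nothing is lost
        String s = new String(bytes, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(bytes, back)); // prints true
    }
}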

Java String.codePointAt returns unexpected value

If I use any ASCII characters from 33 to 127, the codePointAt method gives the correct decimal value, for example:
String s1 = new String("#");
int val = s1.codePointAt(0);
This returns 35, which is the correct value.
But if I try to use ASCII characters from 128 to 255 (extended ASCII/ISO-8859-1), this method gives the wrong value, for example:
String s1 = new String("ƒ"); // Latin small letter f with hook
int val = s1.codePointAt(0);
This should return 159 as per this reference table, but instead returns 402. Why is this?
But if I try to use ASCII characters from 128 to 255
ASCII doesn't have values in this range. It only uses 7 bits.
Java chars are UTF-16 (and nothing else!). If you want to represent ASCII using Java, you need to use a byte array.
The codePointAt method returns the 32-bit code point. 16-bit chars can't contain the entire Unicode range, so some code points must be split across two chars (as per the encoding scheme for UTF-16). The codePointAt method resolves such a pair of chars into a single code point.
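For instance (a hypothetical example using a character outside the Basic Multilingual Plane):
String s = "𝄞"; // U+1D11E MUSICAL SYMBOL G CLEF
System.out.println(s.length());       // 2: stored as a surrogate pair of chars
System.out.println(s.codePointAt(0)); // 119070 (0x1D11E), the full code point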
I wrote a rough guide to encoding in Java here.
Java chars are not encoded in ISO-8859-1. They use UTF-16, which has the same values for 7-bit ASCII characters (only values from 0-127).
To get the correct value for ISO-8859-1, you have to convert your string into a byte[] with String.getBytes("ISO-8859-1"); and look in the byte array.
Update
ISO-8859-1 is not the "extended ASCII" encoding; use String.getBytes("Cp437"); to get the correct values.
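A short sketch of that lookup (the & 0xFF mask turns the signed byte into the unsigned value the reference table lists; Charset.forName avoids the checked exception):
byte[] b = "ƒ".getBytes(Charset.forName("Cp437")); // java.nio.charset.Charset
System.out.println(b[0] & 0xFF);                   // 159 (0x9F), the code for ƒ in code page 437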
In Unicode:
ƒ 0x0192 LATIN SMALL LETTER F WITH HOOK
String.codePointAt returns the Unicode code point at the specified index.
The Unicode code point of ƒ is 402, see
http://www.decodeunicode.org/de/u+0192/properties
So
System.out.println("ƒ".codePointAt(0));
printing 402 is correct.
If you are interested in the representation in other charsets, you can print out the byte representation of the character in other charsets via getBytes(String charsetName):
final String s = "ƒ";
for (final String csName : Charset.availableCharsets().keySet()) {
    try {
        final Charset cs = Charset.forName(csName);
        final CharsetEncoder encode = cs.newEncoder();
        if (encode.canEncode(s)) {
            System.out.println(csName + ": " + Arrays.toString(s.getBytes(csName)));
        }
    } catch (final UnsupportedOperationException uoe) {
        // some charsets do not support encoding
    } catch (final UnsupportedEncodingException e) {
        // should not happen: the name came from availableCharsets()
    }
}
