I am trying to encode/decode a byte array to a String, but the input and output do not match. Am I doing something wrong?
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));
The output is:
130021000061f8f0001a
130021000061efbfbd
Complete code:
String[] arr = {"13", "00", "21", "00", "00", "61", "F8", "F0", "00", "1A"};
byte[] by = new byte[arr.length];
for (int i = 0; i < arr.length; i++) {
by[i] = (byte)(Integer.parseInt(arr[i],16) & 0xff);
}
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));
The problem here is that f8f0001a isn't a valid UTF-8 byte sequence.
First of all, the f8 opening byte would denote a 5-byte sequence and you've only got four bytes; in fact, since RFC 3629 limits UTF-8 to 4-byte sequences, f8 is never a valid lead byte at all. Secondly, f8 can only be followed by continuation bytes of the 8x, 9x, ax or bx form.
Therefore it gets replaced with the Unicode replacement character (U+FFFD), whose UTF-8 byte sequence is efbfbd.
And there (rightly) is no guarantee that the conversion of an invalid byte sequence to and from a string will result in the same byte sequence. (Note that even with two, seemingly identical strings, you might get different bytes representing them in Unicode, see Unicode equivalence. )
The moral of the story is: if you want to represent bytes, don't convert them to characters, and if you want to represent text, don't use byte arrays.
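If you need a printable, round-trippable form of arbitrary bytes, use an encoding designed for that purpose, such as hex (as already used above) or Base64. A minimal sketch with java.util.Base64 (the class and variable names here are mine):

import java.util.Arrays;
import java.util.Base64;

public class ByteRoundTrip {
    public static void main(String[] args) {
        byte[] by = {0x13, 0x00, 0x21, 0x00, 0x00, 0x61, (byte) 0xF8, (byte) 0xF0, 0x00, 0x1A};

        // Encode the raw bytes to a printable String without interpreting them as text.
        String encoded = Base64.getEncoder().encodeToString(by);

        // Decoding restores exactly the original bytes, even though f8 f0 00 1a is not valid UTF-8.
        byte[] decoded = Base64.getDecoder().decode(encoded);

        System.out.println(Arrays.equals(by, decoded)); // true
    }
}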
My UTF-8 is a bit rusty :-), but the sequence F8 F0 is, IMHO, not a valid UTF-8 encoding.
Look at http://en.wikipedia.org/wiki/Utf-8#Description.
When you build the String from the array of bytes, the bytes are decoded.
Since the bytes from your code do not represent valid characters, the bytes that finally compose the String are not the same as the ones you passed in.
public String(byte[] bytes)
Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.
The behavior of this constructor when the given bytes are not valid in the default charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
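For reference, a minimal sketch of using CharsetDecoder so that invalid input fails loudly instead of being silently replaced (the strict REPORT action is my choice here, not something from the question):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] by = {0x13, 0x00, 0x21, 0x00, 0x00, 0x61, (byte) 0xF8, (byte) 0xF0, 0x00, 0x1A};
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(by));
            System.out.println("valid UTF-8");
        } catch (CharacterCodingException e) {
            // Thrown here, because f8 f0 00 1a is not well-formed UTF-8.
            System.out.println("not valid UTF-8: " + e);
        }
    }
}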
Related
import java.io.UnsupportedEncodingException;
public class TestChar {
public static void main(String[] args) throws UnsupportedEncodingException {
String cnStr = "龙";
String enStr = "a";
byte[] cnBytes = cnStr.getBytes("UTF-8");
byte[] enBytes = enStr.getBytes("UTF-8");
System.out.println("bytes size of Chinese:" + cnBytes.length);
System.out.println("bytes size of English:" + enBytes.length);
// in java, char takes two bytes, the question is:
char cnc = '龙'; // will '龙' take two or three bytes ?
char enc = 'a'; // will 'a' take one or two bytes ?
}
}
Output :
bytes size of Chinese:3
bytes size of English:1
Here my JVM's default charset is UTF-8. From the output, we know the Chinese character '龙' takes 3 bytes and the English character 'a' takes one byte. My question is:
In Java, a char takes two bytes. Given char cnc = '龙'; and char enc = 'a';, will cnc take only two bytes instead of 3? And will 'a' take two bytes instead of one?
The codepoint value of 龙 is 40857. That fits inside the two bytes of a char.
It takes 3 bytes to encode in UTF-8 because UTF-8's two-byte sequences only cover code points up to U+07FF; anything above that needs at least three bytes.
UTF-8 is a variable-length character encoding, where characters take up 1 to 4 bytes.
A Java char is 16 bits. See 3.1 Unicode in the Java Language Specification to understand how exactly Java handles Unicode.
Internally, Strings/chars are UTF-16, so it'll be the same for both: each char will be 16 bits.
byte[] cnBytes = cnStr.getBytes("UTF-8");
UTF-8 is a variable-length encoding, so the Chinese char takes more bytes because it's outside the ASCII character range.
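To make the distinction concrete, here is a small sketch (class name is mine) comparing the in-memory char count with the encoded byte counts:

import java.nio.charset.StandardCharsets;

public class CharVsBytes {
    public static void main(String[] args) {
        String cn = "龙";
        String en = "a";

        // Every Java char is a 16-bit UTF-16 code unit, so both strings have length 1.
        System.out.println(cn.length() + " " + en.length()); // 1 1

        // The encoded size depends on the charset, not on the char type.
        System.out.println(cn.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(en.getBytes(StandardCharsets.UTF_8).length);    // 1
        System.out.println(cn.getBytes(StandardCharsets.UTF_16BE).length); // 2
        System.out.println(en.getBytes(StandardCharsets.UTF_16BE).length); // 2
    }
}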
I get a BufferUnderflowException from the following code.
int length = mBuf.remaining();
char[] charBuff = new char[length];
for (int i = 0; i < length; ++i) {
charBuff[i] = mBuf.getChar();
}
mBuf is a ByteBuffer. The line "charBuff[i] = mBuf.getChar();" is where it crashes.
What do you think about this problem?
You have mistakenly assumed that a char is one byte in size. In Java, a char is two bytes, so mBuf.getChar() consumes two bytes. The documentation even states that the method reads the next two bytes.
If you use the CharBuffer returned by mBuf.asCharBuffer(), that buffer's remaining() method will give you the number you expect.
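A minimal sketch of that approach (assuming mBuf really holds two-byte, UTF-16-style characters):

java.nio.CharBuffer chars = mBuf.asCharBuffer();
char[] charBuff = new char[chars.remaining()]; // counts 16-bit chars, not bytes
chars.get(charBuff);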
Update: Based on your comment, I understand now that your buffer does in fact contain one-byte characters. Since Java deals with the entire Unicode repertoire (which contains hundreds of thousands of characters), you must tell it which Charset (character-to-byte encoding) you are using:
// This is a pretty common one-byte charset.
Charset charset = StandardCharsets.ISO_8859_1;
// This is another common one-byte charset. You must use the same charset
// that was used to write the bytes in your Objective-C program.
//Charset charset = Charset.forName("windows-1252");
CharBuffer c = charset.newDecoder().decode(mBuf);
char[] charBuff = new char[c.remaining()];
c.get(charBuff);
How do I prepend the string "00" or the byte 0x00 to the beginning of a byte array? I tried to do it with a for loop, but when I convert the array to hex it doesn't show up at the front.
The string "00" is different than the number 0x00 when converted to Bytes. What is the data type you are trying to prepend to your byte array? Assuming it's a Byte representation of the string "00", try the following:
byte[] orig = <your byte array>;
String prepend = "00";
byte[] prependBytes = prepend.getBytes();
byte[] output = new byte[prependBytes.length + orig.length];
for (int i = 0; i < prependBytes.length; i++) {
    output[i] = prependBytes[i];
}
for (int i = 0; i < orig.length; i++) {
    output[prependBytes.length + i] = orig[i];
}
or you can use System.arraycopy(...) (or Arrays.copyOf(...)) instead of the two for loops to do the prepending, as shown below. See How to combine two byte arrays
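A sketch of that approach (the helper name combine is mine):

static byte[] combine(byte[] prependBytes, byte[] orig) {
    byte[] output = new byte[prependBytes.length + orig.length];
    // Copy the prefix first, then the original array right after it.
    System.arraycopy(prependBytes, 0, output, 0, prependBytes.length);
    System.arraycopy(orig, 0, output, prependBytes.length, orig.length);
    return output;
}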
Alternatively, if you are trying to literally prepend zero bytes to your byte array, declare prependBytes in the following way and use the same algorithm:
byte[] prependBytes = new byte[]{0,0};
Also, you say that you're converting your byte array to hex, and that conversion may drop leading zeros. To test this, try prepending the following instead and see whether the hex output changes:
byte[] prependBytes = new byte[]{1,1};
If the conversion is dropping leading zeros that you want to keep, format each byte as a two-digit hex string instead.
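A minimal sketch of such formatting with String.format (the sample array is mine):

byte[] output = {0x00, 0x00, 0x4A};
StringBuilder hex = new StringBuilder();
for (byte b : output) {
    hex.append(String.format("%02x", b)); // pads each byte to two hex digits, keeping leading zeros
}
System.out.println(hex); // 00004a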
If I have some binary data D and I convert it to a String S, I expect that converting S back to binary will give me D again. But that's not what happens.
public class A {
public static void main(String[] args) throws IOException {
final byte[] bytes = new byte[]{-114, 104, -35};// In hex: 8E 68 DD
System.out.println(bytes.length); //prints 3
System.out.println(new String(bytes, "UTF-8").getBytes("UTF-8").length); //prints 7
}
}
Why does this happen?
Converting a byte array to a String and back again is not a one-to-one mapping. Reading the docs, the String implementation uses a CharsetDecoder to convert the incoming byte array into Unicode. The first and last bytes in your input array do not form valid UTF-8 sequences, so each is replaced with the replacement character.
It's likely that the bytes you're converting to a String don't actually form valid encoded text. If Java can't figure out what you mean by each byte, it will attempt to fix them. This means that when you convert back to the byte array, it won't be the same as when you started. If you try with a valid set of bytes, you should have more success.
Your data can't be decoded into valid Unicode characters using the UTF-8 encoding. Look at the decoded string: it consists of 3 characters, 0xFFFD, 0x0068 and 0xFFFD. The first and last are "�", the Unicode replacement character. I think you need to choose another encoding; e.g. "CP866" produces a valid string and converts back into the same array.
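A minimal sketch of such a round trip using a single-byte charset; I'm using ISO-8859-1 here, which maps every byte value 0x00-0xFF to a character and back (CP866, suggested above, behaves similarly):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] bytes = {-114, 104, -35}; // 8E 68 DD
        String s = new String(bytes, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(bytes, back)); // true: every byte maps to exactly one char
    }
}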
If I use any ASCII characters from 33 to 127, the codePointAt method gives the correct decimal value, for example:
String s1 = new String("#");
int val = s1.codePointAt(0);
This returns 35 which is the correct value.
But if I try to use ASCII characters from 128 to 255 (extended ASCII/ISO-8859-1), this method gives the wrong value, for example:
String s1 = new String("ƒ"); // Latin small letter f with hook
int val = s1.codePointAt(0);
This should return 159 as per this reference table, but instead returns 402. Why is this?
But if I try to use ASCII characters from 128 to 255
ASCII doesn't have values in this range. It only uses 7 bits.
Java chars are UTF-16 (and nothing else!). If you want to represent ASCII using Java, you need to use a byte array.
The codePointAt method returns the 32-bit code point. 16-bit chars can't contain the entire Unicode range, so some code points must be split across two chars (as per the UTF-16 encoding scheme). The codePointAt method resolves such pairs of chars back into a single code point.
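For example, a small sketch with a supplementary character (U+1F600, which UTF-16 stores as a surrogate pair):

public class CodePoints {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600, stored as two 16-bit chars
        System.out.println(s.length());       // 2: two chars (a surrogate pair)
        System.out.println(s.codePointAt(0)); // 128512: the full code point 0x1F600
    }
}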
I wrote a rough guide to encoding in Java here.
Java chars are not encoded in ISO-8859-1. They use UTF-16, which has the same values as 7-bit ASCII for characters in the range 0-127.
To get the correct value for ISO-8859-1 you have to convert your string into a byte[] with String.getBytes("ISO-8859-1"); and look in the byte array.
Update
ISO-8859-1 is not the "extended ASCII" code page your reference table uses; use String.getBytes("Cp437"); to get the expected value.
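A small sketch of that lookup (assuming the JRE ships the Cp437 charset):

import java.io.UnsupportedEncodingException;

public class Cp437Lookup {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] b = "ƒ".getBytes("Cp437");
        System.out.println(b[0] & 0xFF); // 159, the value from the code page 437 table
    }
}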
In Unicode:
ƒ 0x0192 LATIN SMALL LETTER F WITH HOOK
String.codePointAt returns the Unicode code point at the specified index.
The Unicode code point of ƒ is 402 (decimal for 0x0192), see
http://www.decodeunicode.org/de/u+0192/properties
So
System.out.println("ƒ".codePointAt(0));
correctly prints 402.
If you are interested in the representation in other charsets, you can print out the byte representation of the character in each charset via getBytes(String charsetName):
final String s = "ƒ";
for (final String csName : Charset.availableCharsets().keySet()) {
    try {
        final Charset cs = Charset.forName(csName);
        final CharsetEncoder encode = cs.newEncoder();
        if (encode.canEncode(s)) {
            System.out.println(csName + ": " + Arrays.toString(s.getBytes(csName)));
        }
    } catch (final UnsupportedOperationException uoe) {
        // this charset does not support encoding; skip it
    } catch (final UnsupportedEncodingException e) {
        // getBytes(String) declares this checked exception; it should not occur for an available charset
    }
}