In Java, how to interpret the UTF-8 bytes of a String?

In Java, how to interpret the UTF-8 bytes of a String? - java

The Java byte type is signed, with the scope from -128 to 127 (inclusive). What a terrible design it is!
Now, I want to get the UTF-8 representation of a Java String. As I understand, the UTF-8 representation is a sequence of unsigned bytes (with the scope from 0 to 255, inclusively). The String class in Java provides the following method, which seems to be able to provide the UTF-8 representation of a String:
byte[] getBytes(String charsetName)
However, as you can see, this method returns an array of the Java byte type. So, how should I interpret this array?
For example, if s is a String, and bArray is the returned array of s.getBytes("UTF-8"), then:
If bArray[0] is -100, then what is the first unsigned byte (in the scope of 0 to 255) of this UTF-8 representation?
If the first unsigned byte (in the scope of 0 to 255) of this UTF-8 representation is 200, then what is bArray[0]?

From int to signed byte
int i = 200; // some value between 0 and 255
byte b = (byte) i; // 8 bits representing that value
From signed byte to int
byte b = -100; // 8 bits representing a value between -128 and 127
int i = b & 0xFF; // an int representing the value but in range [0..255]

Related

Convert from hex to int and vice versa in java

I have next method for converting to int from hex:
public static byte[] convertsHexStringToByteArray2(String hexString) {
int hexLength = hexString.length() / 2;
byte[] bytes = new byte[hexLength];
for (int i = 0; i < hexLength; i++) {
int hex = Integer.parseInt(hexString.substring(2 * i, 2 * i + 2), 16);
bytes[i] = (byte) hex;
}
return bytes;
}
when I use it for "80" hex string I get strange, not expected result:
public static void main(String[] args) {
System.out.println(Arrays.toString(convertsHexStringToByteArray2("80"))); // gives -128
System.out.println(Integer.toHexString(-128));
System.out.println(Integer.toHexString(128));
}
the output:
[-128]
ffffff80
80
I expect that "80" will be 128. What is wrong with my method?

I have next method for converting to int from hex
The method you posted converts a hex String to a byte array, and not to an int. That's why it is messing with its sign.
Converting from hex to int is easy:
Integer.parseInt("80", 16)
$1 ==> 128
But if you want to get a byte array for further processing by just casting:
(byte) Integer.parseInt("80", 16)
$2 ==> -128
It "changes" its sign. For further information on primitives and signed variable types take a look at Primitive Data Types, where it says:
The byte data type is an 8-bit signed two's complement integer. It has a minimum value of -128 and a maximum value of 127 (inclusive). The byte data type can be useful for saving memory in large arrays, where the memory savings actually matters. They can also be used in place of int where their limits help to clarify your code; the fact that a variable's range is limited can serve as a form of documentation.
One could easily invert the sign by just increasing the value to convert:
(byte) Integer.parseInt("80", 16) & 0xFF
$3 ==> 128
That gets you a byte with the value you expect. Technically that result isn't correct and you must to switch the sign again, if you want to get an int or a hex string back again. I'd suggest you to don't use a byte array if you only want to convert between hex and dec.

A byte in Java stores numbers from -128 to 127. 80 in hex is 128 as an integer, which is too large to be stored in a byte. So, the value wraps around. Use a different type to store your value (such as a short).

Java integer to fixed length byte array

How can I convert an integer between 0 and 65k to fixed length of two bytes? as an example
2 would be 00000000 00000010
~65k would be 11111111 11111111
and all that in a byte array

java has the short data type, this is a 2 bytes integer. You cast the integer to short.
int a = 1;
short b = (short)a;
If you want the bytes of the integer you can use ByteBuffer
byte[] bytes = ByteBuffer.allocate(2).putShort((short)intnumber).array();
Or if you want the binary format you can just use toBinaryString method of the Integer.
int x = 2;
System.out.println(Integer.toBinaryString(x));

When x is your number between 0 and 65,535 just use
new byte[] { (byte) (x >> 8), (byte) x }
to create a byte array containing the value as two bytes in big-endian format.

How to convert binary string to the byte array of 2 bytes in java

I have binary string String A = "1000000110101110". I want to convert this string into byte array of length 2 in java
I have taken the help of this link
I have tried to convert it into byte by various ways
I have converted that string into decimal first and then apply the code to store into the byte array
int aInt = Integer.parseInt(A, 2);
byte[] xByte = new byte[2];
xByte[0] = (byte) ((aInt >> 8) & 0XFF);
xByte[1] = (byte) (aInt & 0XFF);
System.arraycopy(xByte, 0, record, 0,
xByte.length);
But the values get store into the byte array are negative
xByte[0] :-127
xByte[1] :-82
Which are wrong values.
2.I have also tried using
byte[] xByte = ByteBuffer.allocate(2).order(ByteOrder.BIG_ENDIAN).putInt(aInt).array();
But it throws the exception at the above line like
java.nio.Buffer.nextPutIndex(Buffer.java:519) at
java.nio.HeapByteBuffer.putInt(HeapByteBuffer.java:366) at
org.com.app.convert.generateTemplate(convert.java:266)
What should i do now to convert the binary string to byte array of 2 bytes?Is there any inbuilt function in java to get the byte array

The answer you are getting
xByte[0] :-127
xByte[1] :-82
is right.
This is called 2's compliment Represantation.
1st bit is used as signed bit.
0 for +ve
1 for -ve
if 1st bit is 0 than it calculates as regular.
but if 1st bit is 1 than it deduct the values of 7 bit from 128 and what ever the answer is presented in -ve form.
In your case
1st value is10000001
so 1(1st bit) for -ve and 128 - 1(last seven bits) = 127
so value is -127
For more detail read 2's complement representation.

Use putShort for putting a two byte value. int has four bytes.
// big endian is the default order
byte[] xByte = ByteBuffer.allocate(2).putShort((short)aInt).array();
By the way, your first attempt is perfect. You can’t change the negative sign of the bytes as the most significant bit of these bytes is set. That’s always interpreted as negative value.
10000001₂ == -127
10101110₂ == -82

try this
String s = "1000000110101110";
int i = Integer.parseInt(s, 2);
byte[] a = {(byte) ( i >> 8), (byte) i};
System.out.println(Arrays.toString(a));
System.out.print(Integer.toBinaryString(0xFF & a[0]) + " " + Integer.toBinaryString(0xFF & a[1]));
output
[-127, -82]
10000001 10101110
that is -127 == 0xb10000001 and -82 == 0xb10101110

Bytes are signed 8 bit integers. As such your result is completely correct.
That is: 01111111 is 127, but 10000000 is -128. If you want to get numbers in 0-255 range you need to use a bigger variable type like short.
You can print byte as unsigned like this:
public static String toString(byte b) {
return String.valueOf(((short)b) & 0xFF);
}

How to Convert Int to Unsigned Byte and Back

I need to convert a number into an unsigned byte. The number is always less than or equal to 255, and so it will fit in one byte.
I also need to convert that byte back into that number. How would I do that in Java? I've tried several ways and none work. Here's what I'm trying to do now:
int size = 5;
// Convert size int to binary
String sizeStr = Integer.toString(size);
byte binaryByte = Byte.valueOf(sizeStr);
and now to convert that byte back into the number:
Byte test = new Byte(binaryByte);
int msgSize = test.intValue();
Clearly, this does not work. For some reason, it always converts the number into 65. Any suggestions?

A byte is always signed in Java. You may get its unsigned value by binary-anding it with 0xFF, though:
int i = 234;
byte b = (byte) i;
System.out.println(b); // -22
int i2 = b & 0xFF;
System.out.println(i2); // 234

Java 8 provides Byte.toUnsignedInt to convert byte to int by unsigned conversion. In Oracle's JDK this is simply implemented as return ((int) x) & 0xff; because HotSpot already understands how to optimize this pattern, but it could be intrinsified on other VMs. More importantly, no prior knowledge is needed to understand what a call to toUnsignedInt(foo) does.
In total, Java 8 provides methods to convert byte and short to unsigned int and long, and int to unsigned long. A method to convert byte to unsigned short was deliberately omitted because the JVM only provides arithmetic on int and long anyway.
To convert an int back to a byte, just use a cast: (byte)someInt. The resulting narrowing primitive conversion will discard all but the last 8 bits.

If you just need to convert an expected 8-bit value from a signed int to an unsigned value, you can use simple bit shifting:
int signed = -119; // 11111111 11111111 11111111 10001001
/**
* Use unsigned right shift operator to drop unset bits in positions 8-31
*/
int psuedoUnsigned = (signed << 24) >>> 24; // 00000000 00000000 00000000 10001001 -> 137 base 10
/**
* Convert back to signed by using the sign-extension properties of the right shift operator
*/
int backToSigned = (psuedoUnsigned << 24) >> 24; // back to original bit pattern
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/op3.html
If using something other than int as the base type, you'll obviously need to adjust the shift amount: http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
Also, bear in mind that you can't use byte type, doing so will result in a signed value as mentioned by other answerers. The smallest primitive type you could use to represent an 8-bit unsigned value would be a short.

Except char, every other numerical data type in Java are signed.
As said in a previous answer, you can get the unsigned value by performing an and operation with 0xFF. In this answer, I'm going to explain how it happens.
int i = 234;
byte b = (byte) i;
System.out.println(b); // -22
int i2 = b & 0xFF;
// This is like casting b to int and perform and operation with 0xFF
System.out.println(i2); // 234
If your machine is 32-bit, then the int data type needs 32-bits to store values. byte needs only 8-bits.
The int variable i is represented in the memory as follows (as a 32-bit integer).
0{24}11101010
Then the byte variable b is represented as:
11101010
As bytes are signed, this value represent -22. (Search for 2's complement to learn more about how to represent negative integers in memory)
Then if you cast is to int it will still be -22 because casting preserves the sign of a number.
1{24}11101010
The the casted 32-bit value of b perform and operation with 0xFF.
1{24}11101010 & 0{24}11111111
=0{24}11101010
Then you get 234 as the answer.

The solution works fine (thanks!), but if you want to avoid casting and leave the low level work to the JDK, you can use a DataOutputStream to write your int's and a DataInputStream to read them back in. They are automatically treated as unsigned bytes then:
For converting int's to binary bytes;
ByteArrayOutputStream bos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(bos);
int val = 250;
dos.write(byteVal);
...
dos.flush();
Reading them back in:
// important to use a (non-Unicode!) encoding like US_ASCII or ISO-8859-1,
// i.e., one that uses one byte per character
ByteArrayInputStream bis = new ByteArrayInputStream(
bos.toString("ISO-8859-1").getBytes("ISO-8859-1"));
DataInputStream dis = new DataInputStream(bis);
int byteVal = dis.readUnsignedByte();
Esp. useful for handling binary data formats (e.g. flat message formats, etc.)

The Integer.toString(size) call converts into the char representation of your integer, i.e. the char '5'. The ASCII representation of that character is the value 65.
You need to parse the string back to an integer value first, e.g. by using Integer.parseInt, to get back the original int value.
As a bottom line, for a signed/unsigned conversion, it is best to leave String out of the picture and use bit manipulation as #JB suggests.

Even though it's too late, I'd like to give my input on this as it might clarify why the solution given by JB Nizet works. I stumbled upon this little problem working on a byte parser and to string conversion myself.
When you copy from a bigger size integral type to a smaller size integral type as this java doc says this happens:
https://docs.oracle.com/javase/specs/jls/se7/html/jls-5.html#jls-5.1.3
A narrowing conversion of a signed integer to an integral type T simply discards all but the n lowest order bits, where n is the number of bits used to represent type T. In addition to a possible loss of information about the magnitude of the numeric value, this may cause the sign of the resulting value to differ from the sign of the input value.
You can be sure that a byte is an integral type as this java doc says
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
byte: The byte data type is an 8-bit signed two's complement integer.
So in the case of casting an integer(32 bits) to a byte(8 bits), you just copy the last (least significant 8 bits) of that integer to the given byte variable.
int a = 128;
byte b = (byte)a; // Last 8 bits gets copied
System.out.println(b); // -128
Second part of the story involves how Java unary and binary operators promote operands.
https://docs.oracle.com/javase/specs/jls/se7/html/jls-5.html#jls-5.6.2
Widening primitive conversion (§5.1.2) is applied to convert either or both operands as specified by the following rules:
If either operand is of type double, the other is converted to double.
Otherwise, if either operand is of type float, the other is converted to float.
Otherwise, if either operand is of type long, the other is converted to long.
Otherwise, both operands are converted to type int.
Rest assured, if you are working with integral type int and/or lower it'll be promoted to an int.
// byte b(0x80) gets promoted to int (0xFF80) by the & operator and then
// 0xFF80 & 0xFF (0xFF translates to 0x00FF) bitwise operation yields
// 0x0080
a = b & 0xFF;
System.out.println(a); // 128
I scratched my head around this too :). There is a good answer for this here by rgettman.
Bitwise operators in java only for integer and long?

If you want to use the primitive wrapper classes, this will work, but all java types are signed by default.
public static void main(String[] args) {
Integer i=5;
Byte b = Byte.valueOf(i+""); //converts i to String and calls Byte.valueOf()
System.out.println(b);
System.out.println(Integer.valueOf(b));
}

In terms of readability, I favor Guava's:
UnsignedBytes.checkedCast(long) to convert a signed number to an unsigned byte.
UnsignedBytes.toInt(byte) to convert an unsigned byte to a signed int.

Handling bytes and unsigned integers with BigInteger:
byte[] b = ... // your integer in big-endian
BigInteger ui = new BigInteger(b) // let BigInteger do the work
int i = ui.intValue() // unsigned value assigned to i

in java 7
public class Main {
public static void main(String[] args) {
byte b = -2;
int i = 0 ;
i = ( b & 0b1111_1111 ) ;
System.err.println(i);
}
}
result : 254

I have tested it and understood it.
In Java, the byte is signed, so 234 in one signed byte is -22, in binary, it is "11101010", signed bit has a "1", so with negative's presentation 2's complement, it becomes -22.
And operate with 0xFF, cast 234 to 2 byte signed(32 bit), keep all bit unchanged.
I use String to solve this:
int a = 14206;
byte[] b = String.valueOf(a).getBytes();
String c = new String(b);
System.out.println(Integer.valueOf(c));
and output is 14206.

How are integers cast to bytes in Java?

I know Java doesn't allow unsigned types, so I was wondering how it casts an integer to a byte. Say I have an integer a with a value of 255 and I cast the integer to a byte. Is the value represented in the byte 11111111? In other words, is the value treated more as a signed 8 bit integer, or does it just directly copy the last 8 bits of the integer?

This is called a narrowing primitive conversion. According to the spec:
A narrowing conversion of a signed integer to an integral type T simply discards all but the n lowest order bits, where n is the number of bits used to represent type T. In addition to a possible loss of information about the magnitude of the numeric value, this may cause the sign of the resulting value to differ from the sign of the input value.
So it's the second option you listed (directly copying the last 8 bits).
I am unsure from your question whether or not you are aware of how signed integral values are represented, so just to be safe I'll point out that the byte value 1111 1111 is equal to -1 in the two's complement system (which Java uses).

int i = 255;
byte b = (byte)i;
So the value of be in hex is 0xFF but the decimal value will be -1.
int i = 0xff00;
byte b = (byte)i;
The value of b now is 0x00. This shows that java takes the last byte of the integer. ie. the last 8 bits but this is signed.

or does it just directly copy the last
8 bits of the integer
yes, this is the way this casting works

The following fragment casts an int to a byte. If the integer’s value is larger than the range of a byte, it will be reduced modulo (the remainder of an integer division by the) byte’s range.
int a;
byte b;
// …
b = (byte) a;

Just a thought on what is said: Always mask your integer when converting to bytes with 0xFF (for ints). (Assuming myInt was assigned values from 0 to 255).
e.g.
char myByte = (char)(myInt & 0xFF);
why? if myInt is bigger than 255, just typecasting to byte returns a negative value (2's complement) which you don't want.

Byte is 8 bit. 8 bit can represent 256 numbers.(2 raise to 8=256)
Now first bit is used for sign. [if positive then first bit=0, if negative first bit= 1]
let's say you want to convert integer 1099 to byte. just devide 1099 by 256. remainder is your byte representation of int
examples
1099/256 => remainder= 75
-1099/256 =>remainder=-75
2049/256 => remainder= 1
reason why? look at this image http://i.stack.imgur.com/FYwqr.png

According to my understanding, you meant
Integer i=new Integer(2);
byte b=i; //will not work
final int i=2;
byte b=i; //fine
At last
Byte b=new Byte(2);
int a=b; //fine

for (int i=0; i <= 255; i++) {
byte b = (byte) i; // cast int values 0 to 255 to corresponding byte values
int neg = b; // neg will take on values 0..127, -128, -127, ..., -1
int pos = (int) (b & 0xFF); // pos will take on values 0..255
}
The conversion of a byte that contains a value bigger than 127 (i.e,. values 0x80 through 0xFF) to an int results in sign extension of the high-order bit of the byte value (i.e., bit 0x80). To remove the 'extra' one bits, use x & 0xFF; this forces bits higher than 0x80 (i.e., bits 0x100, 0x200, 0x400, ...) to zero but leaves the lower 8 bits as is.
You can also write these; they are all equivalent:
int pos = ((int) b) & 0xFF; // convert b to int first, then strip high bits
int pos = b & 0xFF; // done as int arithmetic -- the cast is not needed
Java automatically 'promotes' integer types whose size (in # of bits) is smaller than int to an int value when doing arithmetic. This is done to provide a more deterministic result (than say C, which is less constrained in its specification).
You may want to have a look at this question on casting a 'short'.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.