This question already has answers here:
What is the difference between the float and integer data type when the size is the same?
(3 answers)
Closed 3 years ago.
Looking at Java (but probably similar or the same in other languages), a long and a double both use 8 bytes to store a value.
A long uses 8 bytes to store long integers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
A double uses 8 bytes to store double-precision, floating-point numbers from -1.7E308 to 1.7E308 with up to 16 significant digits.
My question is, if both use the same number of bytes (8 bytes, i.e. 64 bits or 2^64 possible bit patterns), how can a double store a much larger number? 1.7E308 is a hell of a lot larger than 9,223,372,036,854,775,807.
The absolute quantity of information that you can store in 64 bit is of course the same.
What changes is the meaning you assign to the bits.
In an int or long variable, the encoding is the same positional notation you use for decimal numbers in everyday life, just in base 2. The one exception is that two's complement is used for negative values, but this doesn't change much: it is only a trick to gain one additional number (by storing a single zero instead of both a positive and a negative zero).
In a float or double variable, the bits are split into two kinds: the mantissa and the exponent. This means that every double is shaped like XXXXYYYY, where its numerical value is something like XXXX*2^YYYY. Basically you decide to encode the bits in a different way; what you obtain is the same number of values, but distributed differently over the whole set of real numbers.
The fact that the largest/smallest value of a floating-point number is larger/smaller than the largest/smallest value of an integer doesn't imply anything about the amount of data actually stored.
A double can store a larger number by having larger intervals between the numbers it can store, essentially. Not every integer in the range of a double is representable by that double.
More specifically, a double has one bit (S) to store sign, 11 bits to store an exponent E, and 52 bits of precision, in what is called the mantissa (M).
For most numbers (There are some special cases), a double stores the number (-1)^S * (1 + (M * 2^{-52})) * 2^{E - 1023}, and as such, when E is large, changing M by one will make a much larger change in the size of the resulting number than one. These large gaps are what give doubles a larger range than longs.
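As a rough sketch of the formula above, the three fields can be pulled out of a double with Double.doubleToLongBits and plugged back in. The class and method names here are made up for illustration, and rebuild only handles normal numbers (not zero, subnormals, infinities or NaN).

```java
final class DoubleFields {
    // Sign bit S: the top bit of the 64-bit pattern.
    static int sign(double d) {
        return (int) (Double.doubleToLongBits(d) >>> 63);
    }

    // Biased exponent E: the next 11 bits.
    static int biasedExponent(double d) {
        return (int) ((Double.doubleToLongBits(d) >>> 52) & 0x7FFL);
    }

    // Mantissa M: the low 52 bits.
    static long mantissa(double d) {
        return Double.doubleToLongBits(d) & ((1L << 52) - 1);
    }

    // Rebuilds a normal double: (-1)^S * (1 + M * 2^-52) * 2^(E - 1023).
    static double rebuild(int s, int e, long m) {
        double significand = 1.0 + m * 0x1.0p-52;
        return (s == 0 ? 1.0 : -1.0) * Math.scalb(significand, e - 1023);
    }

    public static void main(String[] args) {
        double d = 1.0e18;
        System.out.println(sign(d) + " " + biasedExponent(d) + " " + mantissa(d));
        System.out.println(rebuild(sign(d), biasedExponent(d), mantissa(d)) == d); // prints true
        // The gap to the next representable double grows with the exponent.
        System.out.println(Math.ulp(1.0));
        System.out.println(Math.ulp(1.0e18)); // prints 128.0
    }
}
```

Math.ulp reports the gap between neighbouring doubles, which is what "changing M by one" amounts to at a given exponent.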
Long or int are signed entities which can be positive or negative but never have any decimal part.
Float or double types are used in computers to represent numbers having decimal parts.
Both long and double are 64 bits.
Long has 1 bit for the sign (to determine positive or negative), and the remaining 63 bits make up the number. So its range is -2^63 to 2^63-1.
Doubles are represented in a different way, specified by the IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754), devised to store very large numbers in computers.
64 bits of double represented as - [1 bit][11 bits exponent][52 bit mantissa]
Let's see this with an example: converting 100.25 into the binary form stored as a double.
Decimal 100.25 converted into binary is 1100100.01
Binary 1100100.01 is then normalized as 1.10010001 * 2^6
6 is the exponent component. A bias (offset) of 1023 is added to it so that both negative and positive exponents can be represented. So 6+1023=1029 is the biased exponent. 10000000101 is the 11-bit binary representation of the exponent.
To calculate the mantissa from 1.10010001, we ignore the 1 to the left of the binary point and use only the digits to its right (i.e. 10010001), right-padding the rest of the 52 bits with zeros.
So, now mantissa will be 1001 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
So, finally the 64 bits are represented as
sign bit   exponent      mantissa
0          10000000101   1001 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
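The worked example can be checked against what the JVM actually stores. This is a small sketch; the class and method names are only illustrative.

```java
final class DoubleBits {
    // Returns the 64-bit IEEE 754 pattern of d as "sign exponent mantissa".
    static String layout(double d) {
        String bits = Long.toBinaryString(Double.doubleToLongBits(d));
        // toBinaryString drops leading zeros, so pad back to 64 characters.
        String padded = String.format("%64s", bits).replace(' ', '0');
        return padded.substring(0, 1) + " " + padded.substring(1, 12) + " " + padded.substring(12);
    }

    public static void main(String[] args) {
        // Expect sign 0, exponent 10000000101 (1029), mantissa 10010001 followed by zeros.
        System.out.println(layout(100.25));
    }
}
```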
Reading Delimanolis's statement above, I tested the loss of precision in a long-to-double conversion: a difference as big as 500 is lost (see below).
long L;
double D;
L = 922_3372_0368_5477_5807L;
L -= 500;
D = L;
L = (long)D;
System.out.println("D and L: " + D + " " + L);
output
D and L: 9.223372036854776E18 9223372036854775807
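To see which long values survive the trip through a double, and how wide the gaps are near the top of the long range, something like the following can be used. The helper names are hypothetical, not from any library.

```java
final class RoundTrip {
    // True when converting v to double and back yields v again.
    static boolean survives(long v) {
        return (long) (double) v == v;
    }

    // Distance from d down to the next smaller representable double.
    static double gapBelow(double d) {
        return d - Math.nextDown(d);
    }

    public static void main(String[] args) {
        System.out.println(survives(1L << 53));              // prints true
        System.out.println(survives((1L << 53) + 1));        // prints false
        System.out.println(survives(Long.MAX_VALUE - 500));  // prints false
        System.out.println(gapBelow(0x1.0p63));              // prints 1024.0
    }
}
```

One caveat: Long.MAX_VALUE itself appears to survive, but only because it rounds up to 2^63 as a double and the cast back to long then clamps to Long.MAX_VALUE.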
This question already has answers here:
How are integers cast to bytes in Java?
(8 answers)
Type casting into byte in Java
(6 answers)
Explicit conversion from int to byte in Java
(4 answers)
Closed 4 months ago.
For example, if this is my input:
byte x=(byte) 200;
This will be the output:
-56
If this is my input:
short x=(short) 250000;
This will be the output:
-12144
I realize that the output is off because the number does not fit into the datatype, but how can I predict what the output will be in this case? In my computer science exam this may be one of the questions, and I do not understand why exactly 200 changes to -56 and so on.
The relevant aspects are what overflow looks like, and how the bits that represent the underlying data are treated.
Computers are all bits, grouped together in groups of 8; a group of 8 bits is called a byte.
byte b = 5; for example, is stored in memory as 0000 0101.
Bits can be 0. Or 1. That's it. That's where it ends. And everything is, in the end, bits. This means: That - is not a thing. Computers do not know what - is and cannot store it. We need to write code and agree on some sort of meaning to represent them.
2's complement
So what's -5 in bits? It's 1111 1011. Which seems bizarre. But it's how it works. If you write: byte b = -5;, then b will contain 1111 1011. It is because javac made that happen. Similarly, if you then call System.out.println(b), then the println method gets the bit sequence 1111 1011. Why does the println method decide to print a - symbol and then a 5 symbol? Because it's programmed that way: We all are in agreement that 1111 1011 is -5. So why is that?
Because of a really cool property - signed/unsigned irrelevancy.
The rule is 2's complement: To switch the sign (i.e. turn 5, which is 0000 0101 into -5 which is 1111 1011), you flip every bit, and then add 1 to the end result. Try it with 0000 0101 - and you'll see it's 1111 1011. This algorithm is reversible - apply the same algorithm (flip every bit, then add 1) and you can turn -5 into 5.
This 2's complement thing has 2 great advantages:
There is only one 0 value. If we just flipped all bits, we'd have both 1111 1111 and 0000 0000 both representing some form of 0. In basic math, there's no such thing as 'negative 0' - it's the same as positive 0. Similarly if we just decided the first bit is the sign and the remaining 7 bits are the number, then we'd have both 1000 0000 and 0000 0000 both being 0, which is annoying and inefficient, why waste 2 different bit sequences on the same number?
plus and minus are sign-mode independent. The computer doesn't have to KNOW whether we are doing the 2's complement thing or not. Take the bit sequence 1111 1011. If we treat that as unsigned bits, then that is 251 (it's 128 + 64 + 32 + 16 + 8 + 2 + 1). If we treat that as a signed number, then the first bit is 1, so the thing is negative: We apply 2's complement and figure out that it is -5. So, is it -5 or 251? It's both, at once! Depends on the human/code that interprets this bit sequence which one it is. So how could the computer possibly do a + b given this? The weird answer is: It doesn't matter - because the math works out the same way. 251 - 10 is 241. -5 - 10 is -15. -15 and 241 are the exact same bit sequence.
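A small sketch of both points, restricted to 8 bits. The helper names negate8 and bin8 are invented for this example.

```java
final class TwosComplement {
    // Negates an 8-bit pattern: flip every bit, add 1, keep only the low 8 bits.
    static int negate8(int bits) {
        return (~bits + 1) & 0xFF;
    }

    // Formats the low 8 bits as a zero-padded binary string.
    static String bin8(int bits) {
        return String.format("%8s", Integer.toBinaryString(bits & 0xFF)).replace(' ', '0');
    }

    public static void main(String[] args) {
        int minusFive = negate8(0b0000_0101);
        System.out.println(bin8(minusFive));    // prints 11111011
        System.out.println((byte) minusFive);   // signed view: prints -5
        System.out.println(minusFive);          // unsigned view: prints 251
        // Subtracting 10 gives the same bit pattern under either interpretation.
        System.out.println(bin8(251 - 10));     // prints 11110001
        System.out.println(bin8(-5 - 10));      // prints 11110001
    }
}
```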
Overflow
A byte is 8 bits, and there are 256 different sequences of 8 bits; once you have listed those, you have listed each and every possible variant. (2^8 = 256. Likewise, a 16-bit number can be used to convey 65536 different things, because 2^16 is 65536, and so on.) So, given that bytes are 8 bits and we decreed they are signed, and 2's complement signed, the smallest number you can store in one is -128, which in bits is 1000 0000 (use 2's complement to check my work), and the largest is +127, which in bits is 0111 1111. So what happens if you add 1 to 127? That'd seemingly be +128, except that's not storable in 8 bits if we decree that we interpret these bits as 2's complement signed (which java does). What happens? The bits 'roll over'. We just add 1 as normal, which turns 0111 1111 into 1000 0000, which is -128:
byte b = 127;
b = (byte)(b + 1);
System.out.println(b); // prints -128
Imagine the number line - stretching out into infinity on both ends, from -infinite to +infinite. That's the usual way math works. Computers (or rather, int, long, etc) do not work like that. Instead of a line, it is a circle. Take your infinite number line and take some scissors, and snip that number line at -128 (because a 2's comp signed byte cannot represent -129 or anything else below -128), and at +127 (because our byte cannot represent 128 or anything above it).
And now tape the 2 cut ends together.
That's the number line. What's 'to the right' of 125? 126 - that's what +1 means: Move one to the right on the number line.
What's 'to the right' of +127? Why, -128. Because we taped it together.
Similarly, -127 - 5 is +124. '-5' means 'move 5 places to the left on the number line (or rather, number circle)'. Going in decrements of 1:
-127 (we start here)
-128 (-127 -1)
+127 (-127 -2)
+126 (-127 -3)
+125 (-127 -4)
+124 (-127 -5)
Hence, 124.
Same math applies to short (-32768 to +32767), char (which is really a 16-bit unsigned number - so 0 to 65535), int (-2147483648 to +2147483647), and even long (-2^63 to +2^63-1 - those get a little large).
short x = 32765;
x += 5;
System.out.println(x); // prints -32766.
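The number-circle picture can be turned into a formula for predicting a narrowing cast by hand. The sketch below uses a made-up helper, wrap, and assumes bits is small enough (at most 62) that the intermediate sums don't themselves overflow a long.

```java
final class Wrap {
    // Wraps value into the signed two's-complement range that has the given number of bits.
    static long wrap(long value, int bits) {
        long size = 1L << bits;   // number of distinct bit patterns, e.g. 256 for a byte
        long half = size >> 1;    // e.g. 128: where the circle was taped together
        return Math.floorMod(value + half, size) - half;
    }

    public static void main(String[] args) {
        System.out.println(wrap(200, 8));        // prints -56, same as (byte) 200
        System.out.println(wrap(250000, 16));    // prints -12144, same as (short) 250000
        System.out.println(wrap(-127 - 5, 8));   // prints 124
        System.out.println(wrap(32765 + 5, 16)); // prints -32766
    }
}
```

For an exam, the same thing in words: add or subtract 256 (for byte) or 65536 (for short) until the value lands inside the type's range.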
Integer variables are 4-bytes or 32-bits, and 2^31 and -2^31 both in binary numbers are 32 bits. But when you put 2^31 = 2,147,483,648 in an integer variable it shows an error, but for -2^31 it is ok. Why?
Integer variables are 4-bytes or 32-bits, and 2^31 and -2^31 both in binary numbers are 32 bits
No they are not.
In basic binary, negative numbers aren't a thing. We have zeroes and ones. There is no - sign.
In binary, 2^31 becomes:
1000 0000 0000 0000 0000 0000 0000 0000
In binary, -2^31 cannot be represented without first defining how negative numbers are to be stored.
Commonly (and java does this too), a system called 2's complement is used. 2's complement sounds real complicated: Take the number, say, 5. Represent it in binary (for this exercise, let's go with byte, i.e. 8 bits): 0000 0101. Now, flip all bits: 1111 1010, and then add 1: 1111 1011.
That is -5 in signed 2's complement binary.
This bizarre system has two amazing properties: Math continues to work as normal without needing to know if the number is signed or unsigned. Let's try it. -5 + 2 is -3, right? let's see.. what's 1111 1011 + 0000 0010? Without worrying about 2's complement at all, I get 1111 1101. Let's apply 2's complement conversion: first flip the bits: 0000 0010, then add 1: 0000 0011, which is... 3. So -5 + 2 is -3. Check. The other amazing property is that it doesn't 'waste' 2 of the 2^32 "slots" on zeroes. Let's try the 2's complement of 0: 0000 0000, then flip all bits: 1111 1111, then add 1: 0000 0000 (with a bit overflow that we ignore). That's nice: 0 is its own 2's complement. We can't tell 0 and -0 apart, but that's generally a good thing.
Another property of this system is that the first bit is the 'sign' bit. if it is 1, it is negative, if 0, it is not.
Let's try to 2's complement 1000 0000 0000 0000 0000 0000 0000 0000. First, flip the bits: 0111 1111 1111 1111 1111 1111 1111 1111. Then add 1: 1000 0000 0000 0000 0000 0000 0000 0000. Wait. That's... what we had!!
Yup. And because the first bit is 1, 1000 0000 0000 0000 0000 0000 0000 0000 is NEGATIVE.
Perhaps you are forgetting that 0 is a thing, and 0 is neither positive nor negative.
So, if 0 needs to be representable, and gets a 0 sign bit (zero in bits is 0000000... of course), that means the 'space' in the half of all representable numbers that start with a 0 is now one smaller, because 0 has eaten one slot. That means there is one more negative number representable vs. the positive numbers. (Or, alternatively, 0 'counts' as positive, therefore 0 is the first positive number, but -1 is the first negative number.) Therefore, there must be at least 1 negative number that has no positive equivalent in 2's complement. That number is... -2^31. -2^31 fits in 32-bit signed. +2^31 doesn't.
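The "negative number with no positive equivalent" shows up directly in Java. A short sketch (flipAndAddOne is just an illustrative name for the flip-then-add-1 rule):

```java
final class MinValueNegation {
    // Two's-complement negation written out: flip all bits, then add 1.
    static int flipAndAddOne(int x) {
        return ~x + 1;
    }

    public static void main(String[] args) {
        System.out.println(flipAndAddOne(5));                                      // prints -5
        System.out.println(flipAndAddOne(Integer.MIN_VALUE) == Integer.MIN_VALUE); // prints true
        System.out.println(Math.abs(Integer.MIN_VALUE));                           // prints -2147483648
        try {
            Math.negateExact(Integer.MIN_VALUE);
        } catch (ArithmeticException e) {
            // negateExact refuses to wrap around silently.
            System.out.println("overflow: " + e.getMessage());
        }
    }
}
```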
Let's imagine a 3-bit signed number, with 2's complement. We can list them all:
000 = 0
001 = 1
010 = 2
011 = 3
100 = -4
101 = -3
110 = -2
111 = -1
Note how -4 is in there, but +4 is not, and note how we covered 8 numbers. 2^3 = 8 - 3 bits can represent 8 numbers, not more than that.
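The same table can be generated for any small width. In this sketch signedValue is a hypothetical helper that reinterprets the low bits of an int as a two's-complement number (assuming 1 to 32 bits).

```java
final class SmallSigned {
    // Interprets the low `bits` bits of pattern as a two's-complement number.
    static int signedValue(int pattern, int bits) {
        int shift = 32 - bits;
        // Shift the field to the top of the int, then arithmetic-shift back
        // so the field's leading bit is copied down as the sign.
        return (pattern << shift) >> shift;
    }

    public static void main(String[] args) {
        for (int p = 0; p < 8; p++) {
            String bin = String.format("%3s", Integer.toBinaryString(p)).replace(' ', '0');
            System.out.println(bin + " = " + signedValue(p, 3));
        }
    }
}
```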
From the oracle documentation we got that:
int: By default, the int data type is a 32-bit signed two's complement integer, which has a minimum value of -2^31 and a maximum value of 2^31-1. In Java SE 8 and later, you can use the int data type to represent an unsigned 32-bit integer, which has a minimum value of 0 and a maximum value of 2^32-1. Use the Integer class to use int data type as an unsigned integer. See the section The Number Classes for more information. Static methods like compareUnsigned, divideUnsigned etc have been added to the Integer class to support the arithmetic operations for unsigned integers.
From another document(simple and quite understandable) we got:
When an integer is signed, one of its bits becomes the sign bit, meaning that the maximum magnitude of the number is halved. (So an unsigned 32-bit int can store up to 2^32-1, whereas its signed counterpart has a maximum positive value of 2^31-1.)
In Java, all integer types are signed (except char).
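As an illustration of the unsigned helpers the documentation mentions, here is a minimal sketch. The describe method is made up for the example; the Integer methods it calls exist since Java 8.

```java
final class UnsignedInt {
    // Shows the same 32 bits read as signed and as unsigned.
    static String describe(int x) {
        return "signed=" + x + " unsigned=" + Integer.toUnsignedString(x);
    }

    public static void main(String[] args) {
        int allOnes = -1; // 32 one-bits
        System.out.println(describe(allOnes));                     // prints signed=-1 unsigned=4294967295
        System.out.println(Integer.toUnsignedLong(allOnes));       // prints 4294967295
        System.out.println(Integer.compareUnsigned(allOnes, 1) > 0); // prints true
        System.out.println(Integer.divideUnsigned(allOnes, 2));    // prints 2147483647
        System.out.println(Integer.parseUnsignedInt("4294967295")); // prints -1
    }
}
```

The variable is still an ordinary signed int; only the methods interpret its bits as unsigned.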
It is because the first bit is the sign bit. The maximum positive value it can store is 2^31 - 1. There are many resources available on this.
Why the output of System.out.println((long)Math.pow(2,63)); and System.out.println((long)(Math.pow(2,63)-1)); is same in Java?
The output is the same because double does not have enough bits to represent 2^63 - 1 exactly.
The mantissa of a double has only 52 bits.
This gives you at most 17 decimal digits of precision. The value 2^63 - 1, on the other hand, is 9223372036854775807, so it needs 19 digits (63 bits) to be represented exactly. The nearest double is 2^63 itself, which is exactly representable because it is a power of two and prints as 9.223372036854776E18:
Mantissa is set to 1.0 (1 in front is implied)
Exponent is set to 1086 (1023 is implicitly subtracted to yield 63)
The mantissa of the representation of 1 is the same, while the exponent is 1023 for the effective value of zero, i.e. the exponents of the two numbers differ by 63, which is more than the size of the mantissa.
The subtraction of 1 happens while your number is represented as a double. Since the magnitude of the minuend is much larger than that of the subtrahend, the subtraction has no effect on the result.
You would get the same result after subtracting much larger numbers, all the way up to 512, which is 2^9. Beyond that, the subtrahend is large enough to change the low bits of the 52-bit mantissa, so you would start getting different results.
Math.pow(double, double) returns a double value.
double in java is a 64-bit IEEE 754 floating point.(https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
If you look here: https://en.wikipedia.org/wiki/Double-precision_floating-point_format you will find, that this format is composed of:
1 bit sign
11 bit exponent
53 bit significand precision (52 explicitly stored)
The result of subtracting 1 from the number returned by pow would need a higher precision (63 bits) to be stored exactly.
Basically the 1 you add is below this precision threshold.
In contrast long has 64 bit precision.
To make it clearer, let's assume we are working in decimal and not in base 2:
In some imaginary small float datatype with a precision of 2 decimal places, the value 1000 would be stored as 1.00e3. If you add 1, it would have to be stored as 1.001e3. But since we only have a precision of 2, it can only store 1.00e3 and nothing changes. So 1.00e3 + 1 == 1.00e3
The same happens in your example, only that we are dealing with larger numbers and base 2, of course.
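The absorbed subtraction is easy to observe directly. The following sketch (largestAbsorbed is a hypothetical helper) searches for the largest value that can be subtracted from 2^63 in double arithmetic without changing it.

```java
import java.math.BigDecimal;

final class Absorbed {
    // Largest k in 1..limit for which 2^63 - k still rounds back to 2^63 as a double.
    static long largestAbsorbed(long limit) {
        double big = Math.pow(2, 63);
        long last = 0;
        for (long k = 1; k <= limit; k++) {
            if (big - k == big) {
                last = k;
            }
        }
        return last;
    }

    public static void main(String[] args) {
        double big = Math.pow(2, 63);
        System.out.println(big - 1 == big);                       // prints true
        System.out.println(new BigDecimal(big).toPlainString());  // exact stored value: 9223372036854775808
        System.out.println(largestAbsorbed(2000));                // prints 512
    }
}
```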
You should use parentheses so that the result is cast to long first and 1 is subtracted afterwards, like this:
System.out.println((long)Math.pow(2,63));
System.out.println(((long)(Math.pow(2,63))-1));
Output:
9223372036854775807
9223372036854775806
For the long data type in Java, the maximum value is 9,223,372,036,854,775,807 (inclusive), i.e. 2^63 - 1.
So even if you try
System.out.println((long)Math.pow(2,65));
System.out.println((long)(Math.pow(2,63)-1));
output
9223372036854775807
9223372036854775807
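Both lines print Long.MAX_VALUE because, as far as the Java language rules for narrowing conversions go, casting a double to long clamps out-of-range values instead of wrapping. A small sketch of that behaviour (toLong is only a wrapper around the cast):

```java
final class SaturatingCast {
    // A plain narrowing cast from double to long.
    static long toLong(double d) {
        return (long) d;
    }

    public static void main(String[] args) {
        System.out.println(toLong(Math.pow(2, 65)));          // prints 9223372036854775807
        System.out.println(toLong(-Math.pow(2, 65)));         // prints -9223372036854775808
        System.out.println(toLong(Double.POSITIVE_INFINITY)); // prints 9223372036854775807
        System.out.println(toLong(Double.NaN));               // prints 0
    }
}
```

This differs from integer-to-integer narrowing such as (byte) 200, which discards high bits and wraps around.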
This question already has answers here:
java byte data type
(3 answers)
Closed 8 years ago.
Since English isn't my main language, I didn't find any pointers on how to manually calculate the new number.
Example:
byte b = (byte) 720;
System.out.println(b); // -48
I know that most primitive data types use two's complement.
A byte value ranges from -128 to 127, so I expected the number to be reduced to something that fits in a byte.
The question is: how do I manually calculate the -48 that results from typecasting 720? I've read that I have to convert it to two's complement, so taking the binary number, searching from right to left for the first one and then inverting all the others, and since a byte only has 8 bits, taking the first 8 bits. But that didn't quite work for me. It would be helpful if you could tell me how to calculate a typecasted number that doesn't fit into a byte. Thanks for reading! :)
The binary representation of 720 is
10 1101 0000
Get rid of everything except the last 8 bits (because that's what fits into a byte).
1101 0000
Because of the leading 1, get the complement
0010 1111
and add 1
0011 0000
and negate the value. Gives -48.
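The steps above can be written out as code and compared with the real cast. This is only a sketch of the by-hand procedure, with an invented method name.

```java
final class ManualByteCast {
    // Follows the by-hand recipe: keep 8 bits, and if the leading bit is 1,
    // complement, add 1, and negate.
    static int castByHand(int value) {
        int low8 = value & 0xFF;            // get rid of everything except the last 8 bits
        if ((low8 & 0x80) == 0) {
            return low8;                    // leading 0: the value is already non-negative
        }
        int magnitude = (~low8 & 0xFF) + 1; // complement the 8 bits and add 1
        return -magnitude;                  // negate
    }

    public static void main(String[] args) {
        System.out.println(castByHand(720)); // prints -48
        System.out.println((byte) 720);      // prints -48
        System.out.println(castByHand(200)); // prints -56
    }
}
```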
In Java, integral types are always 2's complement. Section 4.2 of the JLS states:
The integral types are byte, short, int, and long, whose values are 8-bit, 16-bit, 32-bit and 64-bit signed two's-complement integers
You can mask out the least significant 8 bits.
int value = 720;
value &= 0xFF;
But that leaves a number in the range 0-255. Numbers higher than 127 represent negative numbers for bytes.
if (value > Byte.MAX_VALUE)
value -= 1 << 8;
This manually yields the -48 that the (byte) cast yields.
First what happens is the value (in binary) is truncated.
720 binary is 0b1011010000
We can only fit 8 bits 0b11010000
first digit 1, so the value is negative.
2's complement gives you -48.
This is close to redundant with the answer posted by rgettman, but since you inquired about the two's complement, here is "manually" taking the 2's complement, rather than simply subtracting 256 as in that answer.
To mask the integer down to the bits that will be considered for a byte, combine it bitwise with 11111111.
int i = 720;
int x = i & 0xFF;
Then to take the two's complement:
if (x >> 7 == 1) {
x = -1 * ((x ^ 0xFF) + 1);
}
System.out.println(x);
Calculating this value for a long is easy:
It is simply 2 to the power of n-1, and then minus 1. n is the number of bits in the type. For a long this is defined as 64 bits. Because we must represent negative numbers as well, we use n-1 instead of n. Because 0 must be accounted for, we subtract 1. So the maximum value is:
MAX = 2^(n-1)-1
What is the equivalent thought process for a double?
Double.MAX_VALUE
comes to be
1.7976931348623157E308
The maximum finite value for a double is, in hexadecimal format, 0x1.fffffffffffffp1023, representing the product of a number just below 2 (1.ff… in hexadecimal notation) by 2^1023. When written this way, it is easy to see that it is made of the largest possible significand and the largest possible exponent, in a way very similar to the way you build the largest possible long in your question.
If you want a formula where all numbers are written in the decimal notation, here is one:
Double.MAX_VALUE = (2 - 1/2^52) * 2^1023
Or if you prefer a formula that makes it clear that Double.MAX_VALUE is an integer:
Double.MAX_VALUE = 2^1024 - 2^971
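Both formulas can be checked mechanically. In this sketch the method names are made up; Math.scalb is used to build exact powers of two, and BigInteger/BigDecimal give the exact integer value.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class MaxDouble {
    // (2 - 1/2^52) * 2^1023, computed in double arithmetic (each step is exact).
    static double fromFormula() {
        return (2.0 - Math.scalb(1.0, -52)) * Math.scalb(1.0, 1023);
    }

    // 2^1024 - 2^971 as an exact integer.
    static BigInteger asInteger() {
        return BigInteger.ONE.shiftLeft(1024).subtract(BigInteger.ONE.shiftLeft(971));
    }

    public static void main(String[] args) {
        System.out.println(fromFormula() == Double.MAX_VALUE);            // prints true
        System.out.println(0x1.fffffffffffffp1023 == Double.MAX_VALUE);   // prints true
        BigInteger exact = new BigDecimal(Double.MAX_VALUE).toBigIntegerExact();
        System.out.println(exact.equals(asInteger()));                    // prints true
    }
}
```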
If we look at the representation provided by Oracle:
0x1.fffffffffffffp1023
or
(2-2^-52)·2^1023
We can see that
fffffffffffff
is 13 hexadecimal digits that can be represented as 52 binary digits ( 13 * 4 ).
If each is set to 1 as it is ( F = 1111 ), we obtain the maximum fractional part.
The fractional part is always 52 bits as defined by
http://en.wikipedia.org/wiki/Double-precision_floating-point_format
1 bit is for sign
and the remaining 11 bits make up the exponent.
Because the exponent must be able to be both positive and negative, and it must also represent 0, it too can have a maximum value of:
2^10 - 1
or
1023
Doubles (and floats) are represented internally as binary fractions according to the IEEE standard 754
and can therefore not represent decimal fractions exactly:
http://mindprod.com/jgloss/floatingpoint.html
http://www.math.byu.edu/~schow/work/IEEEFloatingPoint.htm
http://en.wikipedia.org/wiki/Computer_numbering_formats
So there is no equivalent calculation.
Just take a look at the documentation. Basically, the MAX_VALUE computation for Double uses a different formula because of the finite number of real numbers that can be represented on 64 bits. For an extensive justification, you can consult this article about data representation.