Calculating this value for a long is easy:
It is simply 2 to the power of n-1, minus 1, where n is the number of bits in the type. For a long this is defined as 64 bits. Because we must represent negative numbers as well, we use n-1 instead of n. Because 0 must be accounted for, we subtract 1. So the maximum value is:
MAX = 2^(n-1)-1
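As a quick check of the formula, here is a minimal Python sketch. Python integers are unbounded, so the arithmetic is exact; the function name max_signed is just an illustrative choice.

```python
def max_signed(n):
    """Largest value of an n-bit two's-complement integer: 2^(n-1) - 1."""
    return 2 ** (n - 1) - 1

print(max_signed(64))  # 9223372036854775807, the value of Java's Long.MAX_VALUE
print(max_signed(32))  # 2147483647, the value of Java's Integer.MAX_VALUE
```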
What is the equivalent thought process for a double?
Double.MAX_VALUE
comes to be
1.7976931348623157E308
The maximum finite value for a double is, in hexadecimal format, 0x1.fffffffffffffp1023, representing the product of a number just below 2 (1.ff… in hexadecimal notation) by 2^1023. When written this way, it is easy to see that it is made of the largest possible significand and the largest possible exponent, in a way very similar to the way you build the largest possible long in your question.
If you want a formula where all numbers are written in the decimal notation, here is one:
Double.MAX_VALUE = (2 - 1/2^52) * 2^1023
Or if you prefer a formula that makes it clear that Double.MAX_VALUE is an integer:
Double.MAX_VALUE = 2^1024 - 2^971
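These three ways of writing the value can be cross-checked with a short Python sketch. It assumes only that Python's float is an IEEE 754 double, so sys.float_info.max stands in for Java's Double.MAX_VALUE.

```python
import sys

# Python's float is an IEEE 754 double, so this plays the role of Double.MAX_VALUE.
max_double = sys.float_info.max

from_hex = float.fromhex('0x1.fffffffffffffp1023')   # hexadecimal form
from_formula = (2 - 2.0 ** -52) * 2.0 ** 1023         # (2 - 1/2^52) * 2^1023
as_integer = 2 ** 1024 - 2 ** 971                     # exact integer form

print(max_double == from_hex == from_formula)  # True
print(int(max_double) == as_integer)           # True
```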
If we look at the representation provided by Oracle:
0x1.fffffffffffffp1023
or
(2-2^-52)·2^1023
We can see that
fffffffffffff
is 13 hexadecimal digits that can be represented as 52 binary digits ( 13 * 4 ).
If each is set to 1 as it is ( F = 1111 ), we obtain the maximum fractional part.
The fractional part is always 52 bits as defined by
http://en.wikipedia.org/wiki/Double-precision_floating-point_format
1 bit is for sign
and the remaining 11 bits make up the exponent.
Because the exponent must be both positive and negative and it must represent 0, it too can have a maximum value of:
2^10 - 1
or
1023
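To see these fields directly, here is a small Python sketch that splits the largest finite double into its sign, exponent and fraction bits. The helper name fields is made up for illustration. Note that the stored exponent field is biased by 1023, and the all-ones field (2047) is reserved for infinity and NaN, so the largest finite value stores 2046.

```python
import struct
import sys

def fields(x):
    """Split a double into its (sign, exponent field, fraction field) integers."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

sign, exponent_field, fraction_field = fields(sys.float_info.max)
print(sign)                    # 0
print(exponent_field - 1023)   # 1023, after removing the bias
print(hex(fraction_field))     # 0xfffffffffffff (13 hex digits, 52 one bits)
```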
Doubles (and floats) are represented internally as binary fractions according to the IEEE standard 754
and can therefore not represent decimal fractions exactly:
http://mindprod.com/jgloss/floatingpoint.html
http://www.math.byu.edu/~schow/work/IEEEFloatingPoint.htm
http://en.wikipedia.org/wiki/Computer_numbering_formats
So there is no equivalent calculation.
Just take a look at the documentation. Basically, the MAX_VALUE computation for Double uses a different formula because of the finite number of real numbers that can be represented on 64 bits. For an extensive justification, you can consult this article about data representation.
Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2^-49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2^n if n is limited to integer values.
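This claim can be checked with exact arithmetic. The sketch below uses Python's fractions and decimal modules, which convert a float to its exact stored value without rounding.

```python
from decimal import Decimal
from fractions import Fraction

stored = Fraction(9.2)   # exact value of the double nearest to 9.2

print(stored == Fraction(5179139571476070, 2 ** 49))  # True: an integer times a power of 2
print(stored == Fraction(92, 10))                     # False: not the decimal 9.2
print(Decimal(9.2))                                   # exact decimal expansion of the stored value
```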
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
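import struct
from itertools import islice

def float_to_bin_parts(number, bits=64):
    """Return the [sign, exponent, mantissa] bit strings of an IEEE 754 float."""
    if bits == 32:  # single precision
        int_pack = 'I'
        float_pack = 'f'
        exponent_bits = 8
        mantissa_bits = 23
        exponent_bias = 127
    elif bits == 64:  # double precision. all python floats are this
        int_pack = 'Q'
        float_pack = 'd'
        exponent_bits = 11
        mantissa_bits = 52
        exponent_bias = 1023
    else:
        raise ValueError('bits argument must be 32 or 64')
    # Reinterpret the float's bytes as an unsigned integer and render it as a zero-padded bit string
    bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
    # Slice off the sign bit, the exponent bits and the mantissa bits, in that order
    return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]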
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2^((# of bits) - 1) - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413 x 10^23
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 2^52 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2^((# of bits) - 1) - 1 to get the true exponent
Mantissa (last component): Divide by 2^(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 x 10^11
Which we can then convert from binary to decimal:
1.1499999999999999 x 2^3 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
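The whole decoding can be sketched in a few lines of Python. This is a minimal illustration for normal (non-zero, finite) doubles only; decode_double is a hypothetical helper, and Fraction is used so that no rounding sneaks in while the parts are recombined.

```python
import struct
from fractions import Fraction

def decode_double(x):
    """Return (sign, true exponent, true mantissa) of a normal double."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign = bits >> 63
    exponent = ((bits >> 52) & 0x7FF) - 1023                   # remove the bias
    mantissa = 1 + Fraction(bits & ((1 << 52) - 1), 2 ** 52)   # restore the implied 1
    return sign, exponent, mantissa

sign, exponent, mantissa = decode_double(9.2)
value = (-1) ** sign * mantissa * Fraction(2) ** exponent

print(exponent)             # 3
print(float(mantissa))      # the true mantissa, roughly 1.15
print(float(value) == 9.2)  # True: the parts rebuild the stored number
```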
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 x 10^11
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 x 10^(11-110100)
Convert to decimal:
5179139571476070 x 2^(3-52)
Subtract the exponent:
5179139571476070 x 2^-49
Turn negative exponent into division:
5179139571476070 / 2^49
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 x 10^11
Shift the decimal point:
10011 x 10^(11-100)
Subtract the exponent:
10011 x 10^-1
Binary to decimal:
19 x 2^-1
Negative exponent to division:
19 / 2^1
Multiply exponent:
19 / 2
Equals:
9.5
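The contrast between the two examples shows up directly in Python, whose float.hex method prints the mantissa and exponent, and whose Fraction type gives the exact stored ratio:

```python
from fractions import Fraction

print((9.5).hex())                        # 0x1.3000000000000p+3
print(Fraction(9.5) == Fraction(19, 2))   # True: 9.5 is stored exactly
print((9.2).hex())                        # 0x1.2666666666666p+3
print(Fraction(9.2) == Fraction(92, 10))  # False: 9.2 is only approximated
```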
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)_10 = 0.2_3
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)_10 = 0.5_10 = 0.1_2 = 0.1111..._3
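Here is a small Python sketch of the digit-by-digit expansion behind these examples. The function expand and its max_digits cut-off are illustrative choices, not a standard API; it works on fractions between 0 and 1.

```python
from fractions import Fraction

def expand(frac, base, max_digits=12):
    """Digits of a fraction in [0, 1) after the radix point, plus whether it terminated."""
    digits = []
    while frac != 0 and len(digits) < max_digits:
        frac *= base
        digit = int(frac)   # the integer part is the next digit
        digits.append(digit)
        frac -= digit
    return digits, frac == 0

print(expand(Fraction(2, 3), 3))      # ([2], True)
print(expand(Fraction(1, 2), 2))      # ([1], True)
print(expand(Fraction(1, 2), 3, 5))   # ([1, 1, 1, 1, 1], False)
print(expand(Fraction(1, 10), 2, 8))  # ([0, 0, 0, 1, 1, 0, 0, 1], False)
```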
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.
Although if you would limit your math needs to rational numbers only the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in highschool math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
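As a rough illustration in Python, whose standard fractions module stores exactly such a pair of integers: rational arithmetic stays exact, but a square root still forces an approximation.

```python
import math
from fractions import Fraction

print(0.1 + 0.2 == 0.3)                                      # False with binary floats
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True with exact fractions

a, b, c, d = 3, 4, 5, 6
print(Fraction(a, b) * Fraction(c, d) == Fraction(a * c, b * d))  # True: a/b * c/d = ac/bd

print(math.sqrt(2) ** 2 == 2)                                # False: sqrt(2) is irrational
```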
TL;DR
For hardware accelerated arithmetic only a limited amount of rational numbers can be represented. Every not-representable number is approximated. Some numbers (i.e. irrational) can never be represented no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer) so unavoidably many many many numbers are impossible to represent. In particular, 64 bits allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can be represented are of the form m·2^e for some integers m and e.
You might come up with a different numeration system, 10 based for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-point numbers are extremely accurate. They can represent any number in a very wide range with as many as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in its lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2^b * 5^c).
On the other hand the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2^b).
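The rule can be turned into a short test. The Python sketch below (terminates is a name made up for this example) strips from the reduced denominator every factor it shares with the base; the expansion is finite exactly when nothing is left.

```python
from fractions import Fraction
from math import gcd

def terminates(frac, base):
    """True if frac has a finite expansion in the given base."""
    denominator = frac.denominator   # Fraction is always kept in lowest terms
    common = gcd(denominator, base)
    while common > 1:
        denominator //= common       # remove prime factors shared with the base
        common = gcd(denominator, base)
    return denominator == 1

print(terminates(Fraction(92, 10), 10))  # True:  9.2 is finite in base 10
print(terminates(Fraction(92, 10), 2))   # False: the factor 5 is not a factor of 2
print(terminates(Fraction(19, 2), 2))    # True:  9.5 is finite in base 2
```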
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number then after each calculation you need to reduce the fraction to its lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator overflows, you need to find the closest representable result. This is non-trivial.
Some languages do offer fraction types, but usually they do it in combination with arbitrary precision. This avoids needing to worry about approximating fractions, but it creates its own problem: when a number passes through a large number of calculation steps, the size of the denominator, and hence the storage needed for the fraction, can explode.
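To get a feel for that growth, here is a small Python sketch using the arbitrary-precision fractions module. Summing 1/k for k up to 100 is just an arbitrary example workload; the printed numbers are the bit lengths of the denominator at a few checkpoints.

```python
from fractions import Fraction

total = Fraction(0)
for k in range(1, 101):
    total += Fraction(1, k)   # exact addition, reduced to lowest terms each time
    if k in (10, 50, 100):
        # The denominator keeps growing; by k = 100 it no longer fits in 64 bits.
        print(k, total.denominator.bit_length())
```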
Some languages also offer decimal floating point types. These are mainly used in scenarios where it is important that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). These are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.
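Python's decimal module is one such decimal floating point type (implemented in software). A minimal sketch of what it buys you, and of what it still cannot do:

```python
from decimal import Decimal

print(0.1 + 0.2 == 0.3)                                   # False in binary floating point
print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))  # True in decimal floating point
print(Decimal('9.20') * 3)                                # 27.60

# Fractions whose denominators have other prime factors are still rounded.
print(Decimal(1) / Decimal(3) * 3 == Decimal(1))          # False
```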
Why the output of System.out.println((long)Math.pow(2,63)); and System.out.println((long)(Math.pow(2,63)-1)); is same in Java?
The output is the same because double does not have enough bits to represent 2^63 - 1 exactly.
The mantissa of a double has only 52 bits (53 counting the implied leading 1).
This gives you at most 17 decimal digits of precision. The value 2^63 - 1, on the other hand, is 9223372036854775807, so it needs 19 decimal digits (63 bits) to be represented exactly. The nearest double is 2^63 itself, 9223372036854775808, which Java prints as 9.223372036854776E18 and which is represented as follows:
Mantissa is set to 1.0 (1 in front is implied)
Exponent is set to 1086 (1023 is implicitly subtracted to yield 63)
The mantissa of the representation of 1 is the same, while the exponent is 1023 for the effective value of zero, i.e. the exponents of the two numbers differ by 63, which is more than the size of the mantissa.
Subtraction of 1 happens while your number is represented as double. Since the magnitude of minuend is much larger than that of the subtrahend, the whole subtraction operation is ignored.
You would get the same result after subtracting much larger numbers - all the way to 512, which is 2^9 (demo). After that the difference in exponent would be less than 52, so you would start getting different results.
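The same rounding can be observed in Python, whose float is the same IEEE 754 double. This sketch only reproduces the double arithmetic, not Java's (long) cast:

```python
two_63 = 2.0 ** 63

print(int(two_63))             # 9223372036854775808: 2^63 itself is stored exactly
print(two_63 - 1 == two_63)    # True: the subtraction is rounded away
print(two_63 - 512 == two_63)  # True: still rounds back up to 2^63
print(two_63 - 513 == two_63)  # False: now rounds down to the next lower double
print(int(two_63 - 513))       # 9223372036854774784, which is 2^63 - 1024
```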
Math.pow(double, double) returns a double value.
double in Java is a 64-bit IEEE 754 floating point. (https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
If you look here: https://en.wikipedia.org/wiki/Double-precision_floating-point_format you will find, that this format is composed of:
1 bit sign
11 bit exponent
53 bit significand precision
The number returned by pow, minus 1, would need a higher precision (63 bits) to be stored exactly.
Basically the 1 you subtract is below this precision threshold.
In contrast long has 64 bit precision.
To make it more clear, let's assume we are working in decimal and not in base 2:
In some imaginary small float datatype with a precision of 2 decimal places, the value 1000 would be stored as 1.00e3. If you add 1 it would have to store it as 1.001e3. But since we only have a precision of 2 it can only store 1.00e3 and nothing changes. So 1.00e3 + 1 == 1.00e3
The same happens in your example, only that we are dealing with larger numbers and base 2 of course.
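That imaginary datatype can be imitated with Python's decimal module by limiting the context to 3 significant digits (one digit before the point plus two after it, which is my reading of "a precision of 2" above):

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 3                          # keep only 3 significant decimal digits
    result = Decimal(1000) + Decimal(1)   # the exact sum 1001 is rounded on the way out

print(result)                   # the rounded sum
print(result == Decimal(1000))  # True: adding 1 changed nothing
```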
You should use parentheses to cast the result to long first and then subtract 1, like this:
System.out.println((long)Math.pow(2,63));
System.out.println(((long)(Math.pow(2,63))-1));
Output:
9223372036854775807
9223372036854775806
For the long data type in Java, the maximum value is 9,223,372,036,854,775,807 (inclusive), i.e. 2^63 - 1.
So even if you try
System.out.println((long)Math.pow(2,65));
System.out.println((long)(Math.pow(2,63)-1));
output
9223372036854775807
9223372036854775807
I am trying to convert double to float in java.
Double d = 1234567.1234;
Float f = d.floatValue();
I see that the value of f is
1234567.1
I am not trying to print a string value of float. I just wonder what is the maximum number of digits not to lose any precision when converting double to float. Can I show more than 8 significant digits in Java?
float: 32 bits (4 bytes) where 23 bits are used for the mantissa (6 to 9 decimal digits, about 7 on average). 8 bits are used for the exponent, so a float can “move” the decimal point to the right or to the left using those 8 bits. Doing so avoids storing lots of zeros in the mantissa as in 0.0000003 (3 × 10^-7) or 3000000 (3 × 10^7). There is 1 bit used as the sign bit.
double: 64 bits (8 bytes) where 52 bits are used for the mantissa (15 to 17 decimal digits, about 16 on average). 11 bits are used for the exponent and 1 bit is the sign bit.
I believe you have hit this limit, and that is what causes the problem.
If you change
Double d = 123456789.1234;
Float f = d.floatValue();
You will see that float value will be 1.23456792E8
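The narrowing itself can be reproduced outside Java. The Python sketch below emulates a 32-bit float by packing a value with the struct module's 'f' format and reading it back; to_float32 is a helper name invented here. It prints the exact stored value, whereas Java's Float.toString shows only as many digits as are needed to identify the float.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float and widen it back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

print(to_float32(1234567.1234))    # 1234567.125
print(to_float32(123456789.1234))  # 123456792.0
```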
The precision of a float is about 7 decimals, but since floats are stored in binary, that's an approximation.
To illustrate the actual precision of the float value in question, try this:
double d = 1234567.1234;
float f = (float)d;
System.out.printf("%.9f%n", d);
System.out.printf("%.9f%n", Math.nextDown(f));
System.out.printf("%.9f%n", f);
System.out.printf("%.9f%n", Math.nextUp(f));
Output
1234567.123400000
1234567.000000000
1234567.125000000
1234567.250000000
As you can see, the effective decimal precision is about 1 decimal place for this number, or 8 digits, but if you ran the code with the number 9876543.9876, you get:
9876543.987600000
9876543.000000000
9876544.000000000
9876545.000000000
That's only 7 digits of precision.
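For readers without a JVM at hand, the same neighbours can be found with a Python sketch that steps the 32-bit pattern up and down by one. This is an emulation of Math.nextDown and Math.nextUp that only holds for positive, finite, non-zero values below the largest float; float32_neighbors is a hypothetical helper.

```python
import struct

def float32_neighbors(x):
    """Return (previous, nearest, next) 32-bit floats around a positive x."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]

    def as_float(pattern):
        return struct.unpack('<f', struct.pack('<I', pattern))[0]

    # For positive floats, adjacent bit patterns are adjacent values.
    return as_float(bits - 1), as_float(bits), as_float(bits + 1)

print(float32_neighbors(1234567.1234))  # (1234567.0, 1234567.125, 1234567.25)
print(float32_neighbors(9876543.9876))  # (9876543.0, 9876544.0, 9876545.0)
```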
This is a simple example in support of the view that there is no safe number of decimal digits.
Consider 0.1.
The closest IEEE 754 64-bit binary floating point number has exact value 0.1000000000000000055511151231257827021181583404541015625. It converts to 32-bit binary floating point as 0.100000001490116119384765625, which is considerably further from 0.1.
There can be loss of precision with even a single significant digit and single decimal place.
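Both exact values can be printed with a short Python sketch. Decimal converts a float without rounding, and to_float32 is an illustrative helper that emulates the conversion to a 32-bit float via the struct module.

```python
import struct
from decimal import Decimal

def to_float32(x):
    """Round a double to the nearest 32-bit float and widen it back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

as_double = Decimal(0.1)             # exact value of the 64-bit approximation
as_float = Decimal(to_float32(0.1))  # exact value of the 32-bit approximation

print(as_double)  # 0.1000000000000000055511151231257827021181583404541015625
print(as_float)   # 0.100000001490116119384765625
```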
Unless you really need the compactness of float, and have very relaxed precision requirements, it is generally better to just stick with double.
I just wonder what is the maximum number of digits not to lose any precision when converting double to float.
Maybe you don't realize it, but the concept of N digits of precision is already ambiguous. Doubtlessly you meant "N digits of precision in base 10". But unlike humans, our computers work with base 2.
It's not possible to convert every number from base X to base Y (with a limited number of retained digits) without loss of precision, e.g. the value of 1/3 is perfectly accurately representable in base 3 as "0.1". In base 10 it has an infinite number of digits: 0.3333333333333... Likewise, numbers that are perfectly representable in base 10, e.g. 0.1, need an infinite number of digits to be represented in base 2. On the other hand, 0.5 (base 10) is perfectly accurately representable as 0.1 (base 2).
So back to
I just wonder what is the maximum number of digits not to lose any precision when converting double to float.
The answer is "it depends on the value". The commonly cited rule of thumb "float has about 6 to 7 digits decimal precision" is just an approximation. It can be much more or much less depending on the value.
When dealing with floating point, the concept of relative accuracy is more useful: stop thinking about "digits" and think in terms of relative error instead. Any number N (in range) is representable with an error of (at most) N / accuracy, where the accuracy is 2 raised to the number of mantissa bits in the chosen format (e.g. 23 (+1) for float, 52 (+1) for double). So a decimal number represented as a float has a maximum approximation error of N / pow(2, 24). The error may be less, even zero, but it is never greater.
The 23+1 comes from the convention that floating point numbers are organized with the exponent chosen such that the first mantissa bit is always a 1 (whenever possible), so it doesn't need to be explicitly stored. The number of physically stored bits, e.g. 23, thus allows for one extra bit of accuracy. (There is an exceptional case where "whenever possible" does not apply, but let's ignore that here.)
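The bound is easy to spot-check. In this Python sketch the 32-bit float is emulated with struct, the inputs are doubles picked from elsewhere in this thread, and Fraction keeps the error computation exact; to_float32 and within_bound are names invented for the example, and only values in the normal float range are considered.

```python
import struct
from fractions import Fraction

def to_float32(x):
    """Round a double to the nearest 32-bit float and widen it back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def within_bound(x):
    """Check |float(x) - x| <= |x| / 2^24 using exact rational arithmetic."""
    error = abs(Fraction(to_float32(x)) - Fraction(x))
    return error <= abs(Fraction(x)) / 2 ** 24

for x in (0.1, 9.2, 1234567.1234, 123456789.1234, 63005505.0):
    print(x, within_bound(x))  # True for every value listed
```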
TL;DR: There is no fixed number of decimal digits accuracy in float or double.
EDIT.
No you cannot get any more precise with a float in Java because floats can only contain 32 bits ( 4 bytes). If you want more precision, then continue to use the Double. This might also be helpful
When I assign from an int to a float I thought float allows more precision, so would not lose or change the value assigned, but what I am seeing is something quite different. What is going on here?
for(int i = 63000000; i < 63005515; i++) {
int a = i;
float f = 0;
f=a;
System.out.print(java.text.NumberFormat.getInstance().format(a) + " : " );
System.out.println(java.text.NumberFormat.getInstance().format(f));
}
some of the output :
...
63,005,504 : 63,005,504
63,005,505 : 63,005,504
63,005,506 : 63,005,504
63,005,507 : 63,005,508
63,005,508 : 63,005,508
Thanks!
A float has the same number of bits as an int -- 32 bits. But it allows for a greater range of values, far greater than the range of int values. However, the precision is fixed at 24 bits (23 "mantissa" bits, plus 1 implied 1 bit). At a value of about 63,000,000, the spacing between consecutive float values is greater than 1. You can verify this with the Math.ulp method, which gives the difference between 2 consecutive values.
The code
System.out.println(Math.ulp(63000000.0f));
prints
4.0
You can use double values for a far greater (yet still limited) precision:
System.out.println(Math.ulp(63000000.0));
prints
7.450580596923828E-9
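Where those two numbers come from can be sketched with a little arithmetic: the gap is 2 raised to (binary exponent of the value minus the number of precision bits). The Python function below is an illustrative re-derivation for normal numbers, not a replacement for Math.ulp; spacing is a made-up name.

```python
import math

def spacing(x, precision_bits):
    """Gap between consecutive values near x for a binary format with the given precision."""
    _, exponent = math.frexp(x)   # x = m * 2**exponent with 0.5 <= |m| < 1
    return 2.0 ** (exponent - precision_bits)

print(spacing(63000000.0, 24))  # 4.0 (float: 23 stored bits + 1 implied)
print(spacing(63000000.0, 53))  # 7.450580596923828e-09 (double: 52 stored bits + 1 implied)
```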
However, you can just use ints here, because your values, at about 63 million, are still well below the maximum possible int value, which is about 2 billion.
A float in Java uses the IEEE 754 floating point representation. Even though it can represent values from ±1.40129846432481707e-45 to ±3.40282346638528860e+38, it has only 6 or 7 significant decimal digits.
A simple solution would be to use a double, which has at least 15 significant digits and can cover all the values of an int without any issue.
However, if accuracy is what you're looking for, stay away from native floating point representations and go for classes like BigInteger and BigDecimal.
No, they are not necessarily the same value. An int and a float are each 32 bits but in a float some of those bits are used for the floating point part of the number so there are fewer whole numbers which can be represented in a float than in an int. Depending on what your application is doing with these numbers you may not care about these differences or maybe you want to look at using something like BigDecimal.
Floats don't allow more precision, floats allow wider range of numbers.
We've got 2^32 possible values for integers in range (approximately) -2 * 10^9 to 2 * 10^9. Floats are also 32bit, so the number of possible values is at most the same as for integers.
Out of these 32 bits, some are reserved for the mantissa and the rest are for the exponent. The resulting number represented by the float is then calculated (for simplicity I'll use base 10) as mantissa * 10^exponent.
Obviously, the maximum precision is limited by the number of bits assigned to the mantissa. So some values can be represented exactly as integers but won't fit in the mantissa, and their least significant bits are thrown away, as in your case.
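A quick way to watch the low bits being dropped is to push integers through an emulated 32-bit float. This Python sketch uses the struct module for the emulation, with to_float32 as an illustrative helper:

```python
import struct

def to_float32(x):
    """Round a double to the nearest 32-bit float and widen it back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Every integer up to 2^24 fits in the 24-bit mantissa; 2^24 + 1 is the first that does not.
print(to_float32(float(2 ** 24)) == 2 ** 24)          # True
print(to_float32(float(2 ** 24 + 1)) == 2 ** 24 + 1)  # False

for i in (63005504, 63005505, 63005506, 63005507, 63005508):
    print(i, int(to_float32(float(i))))  # results are always multiples of 4
```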
Float have a greater range of values but lower precision.
Int have a lower range of values but higher precision.
Near 63 million, an int is exact to within 1, while a float is only exact to within 4.
So if you are dealing with tens of millions but don't care about +/- 4, then use float. But if you need the last digit to be precise, you need to use int.
How can i get 64 bits of the fractional part of the square root of a number in java?
You will have to find a library that calculates the value to 64 fractional bits, or study the algorithms and write it yourself. A double does not store 64 fractional bits; it stores 64 total bits. Only 52 of those are used to store the fraction ("mantissa", technically speaking; a number like 1.xxxxxxx in binary, except the leading 1 is always 1, so there is no need to record it, which gives 53 bits of precision), 11 are used for an exponent (so that the double can represent very large numbers as well as numbers that are very close to zero), and 1 is used for a sign (so that it can represent negative numbers).
If you need 64 bits, you can use double precision to get up to 53 bits (note that for values of 1 or more, you will get less fractional precision) and then use BigDecimal and successive approximation to get more bits of precision.
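As an alternative sketch that swaps successive approximation for exact integer arithmetic: scale the input by 2^128, take the integer square root, and keep the low 64 bits. The example below uses Python's math.isqrt (available since Python 3.8); the function name sqrt_fraction_bits is made up here, and a Java version would need an equivalent integer square root such as BigInteger.sqrt (Java 9 and later).

```python
from math import isqrt

def sqrt_fraction_bits(n, bits=64):
    """First `bits` bits of the fractional part of sqrt(n), truncated, as an integer."""
    scaled_root = isqrt(n << (2 * bits))    # floor(sqrt(n) * 2**bits), computed exactly
    return scaled_root & ((1 << bits) - 1)  # drop the integer part of the root

print(hex(sqrt_fraction_bits(2)))  # 0x6a09e667f3bcc908
print(hex(sqrt_fraction_bits(3)))  # 0xbb67ae8584caa73b
```

These two values match the first two initial hash constants of SHA-512, which are defined as exactly these bits.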
Is this homework? I can't see a practical use for this.