When I run:
System.out.println(1f - 0.9f);
I get:
0.100000024
This is because 0.1 has no representation in binary.
Then why when I print this:
System.out.println(0.1f);
I get this:
0.1
0.1 can be represented better in floating point than 0.9. Loosely speaking that's because 0.1 is smaller and closer to its nearest dyadic rational.
So the error when subtracting from 1.0 is larger.
Hence the two values differ.
The embedded formatting heuristics in println do a better job with the 0.1
The value of 0.1f
Java’s float uses IEEE-754 basic 32-bit binary floating-point. In binary
floating-point, every representable number is an integer multiple of some
power of two. (This includes non-negative powers 1, 2, 4, 8, 16,…, and it
includes negative powers ½, ¼, ⅛, 1/16, 1/32,…)
For numbers from 1 to 2, the representable numbers are multiples of
2−23, which is 0.00000011920928955078125:
1.00000000000000000000000
1.00000011920928955078125
1.00000023841857910156250
1.00000035762786865234375
1.00000047683715820312500
…
For numbers near 0.1, the representable numbers are multiples of 2−27, which is 0.000000007450580596923828125:
…
0.099999979138374328613281250 (a)
0.099999986588954925537109375 (b)
0.099999994039535522460937500 (c)
0.100000001490116119384765625 (d)
0.100000008940696716308593750 (e)
0.100000016391277313232421875 (f)
…
(I labeled the numbers (a) to (f) to refer to them in text below.)
As we can see, the closest of these to 0.1 is (d), 0.100000001490116119384765625.
Thus, when 0.1 appears in source code, it is converted to this value,
0.100000001490116119384765625.
This is a general rule—any numeral in source code is converted to the nearest representable number.
(Note that 0.1 is not “represented by” 0.100000001490116119384765625, and
0.100000001490116119384765625 does not “represent” 0.1. The float
0.100000001490116119384765625 is exactly that. The 0.1f in source text was
converted to 0.100000001490116119384765625 and is now just that value.)
If 0.1f is not 0.1, why is “0.1” printed?
Java’s default formatting for floating-point numbers uses the fewest
significant decimal digits needed to distinguish the number from nearby
representable numbers.
The rule for Java SE 10 can be found in the documentation for java.lang.float, in
the toString(float d) section. I quote the passage below1. The
critical part says:
How many digits must be printed for the fractional part of m or a? There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values of type float. That is, suppose that x is the exact mathematical value represented by the decimal representation produced by this method for a finite nonzero argument f. Then f must be the float value nearest to x; or, if two float values are equally close to x, then f must be one of them and the least significant bit of the significand of f must be 0.
Let us see how this applies while formatting 0.100000001490116119384765625 and
0.099999994039535522460937500.
For 0.100000001490116119384765625, which is (d), we will first consider formatting one digit
after the decimal point: “0.1”. This represents the number 0.1, of course.
Now, we ask: Is this good enough? If we take 0.1 and ask which number in the
list of nearby numbers above is closest, what is the answer? The nearest number
in the list is (d), 0.100000001490116119384765625. That is the number we are
formatting, so we are done, and the result is “0.1”. This does not mean the float is 0.1, just that, when it is converted to a string with default options, the result is the string “0.1”.
Now consider 0.099999994039535522460937500, which is (c). Again, if we consider using just
one digit, the number rounds to 0.1. When we ask which number in the list is
closest to that, the answer is (d), 0.100000001490116119384765625. That is not the
number we are formatting, so we need more digits. If we consider two digits,
rounding would give us 0.10, and that clearly is also not enough. Considering
more and more digits gives us 0.100, 0.1000, and so on, until we get to eight
digits. With eight digits, 0.099999994039535522460937500 rounds to
0.09999999. Now, when we check the list, we see the nearest number is
(b), 0.099999986588954925537109375. (Adding about 0.0000000035 to that produces
0.09999999, whereas the number we are formatting is about 0.0000000040 away,
which is farther.) So we try nine digits, which gives us 0.099999994. Finally,
the closest number in the list is (c), 0.099999994039535522460937500, which is the
number we are formatting, so we are done, and the result is “0.099999994”.
Footnote
1 The documentation for toString(float d) says:
Returns a string representation of the float argument. All characters mentioned below are ASCII characters.
If the argument is NaN, the result is the string "NaN".
Otherwise, the result is a string that represents the sign and magnitude (absolute value) of the argument. If the sign is negative, the first character of the result is '-' ('\u002D'); if the sign is positive, no sign character appears in the result. As for the magnitude m:
If m is infinity, it is represented by the characters "Infinity"; thus, positive infinity produces the result "Infinity" and negative infinity produces the result "-Infinity".
If m is zero, it is represented by the characters "0.0"; thus, negative zero produces the result "-0.0" and positive zero produces the result "0.0".
If m is greater than or equal to 10-3 but less than 107, then it is represented as the integer part of m, in decimal form with no leading zeroes, followed by '.' ('\u002E'), followed by one or more decimal digits representing the fractional part of m.
If m is less than 10-3 or greater than or equal to 107, then it is represented in so-called "computerized scientific notation." Let n be the unique integer such that 10n ≤ m < 10n+1; then let a be the mathematically exact quotient of m and 10n so that 1 ≤ a < 10. The magnitude is then represented as the integer part of a, as a single decimal digit, followed by '.' ('\u002E'), followed by decimal digits representing the fractional part of a, followed by the letter 'E' ('\u0045'), followed by a representation of n as a decimal integer, as produced by the method Integer.toString(int).
How many digits must be printed for the fractional part of m or a? There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values of type float. That is, suppose that x is the exact mathematical value represented by the decimal representation produced by this method for a finite nonzero argument f. Then f must be the float value nearest to x; or, if two float values are equally close to x, then f must be one of them and the least significant bit of the significand of f must be 0.
System.out.println, and most methods of converting floating-point numbers to strings, operate using the following rule: they use exactly as many digits are necessary so that the true value of the double is the closest representable number to the printed value.
That is, it only prints out the digits 0.1 because the true value of the double, 0.1000000000000000055511151231257827021181583404541015625, is the closest double to the displayed value.
Related
Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2 -49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2n if n is limited to integer values.
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
def float_to_bin_parts(number, bits=64):
if bits == 32: # single precision
int_pack = 'I'
float_pack = 'f'
exponent_bits = 8
mantissa_bits = 23
exponent_bias = 127
elif bits == 64: # double precision. all python floats are this
int_pack = 'Q'
float_pack = 'd'
exponent_bits = 11
mantissa_bits = 52
exponent_bias = 1023
else:
raise ValueError, 'bits argument must be 32 or 64'
bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2(# of bits) - 1 - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413x1023
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 252 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2(# of bits) - 1 - 1 to get the true exponent
Mantissa (last component): Divide by 2(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 x 1011
Which we can then convert from binary to decimal:
1.1499999999999999 x 23 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 x 1011
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 x 1011-110100
Convert to decimal:
5179139571476070 x 23-52
Subtract the exponent:
5179139571476070 x 2-49
Turn negative exponent into division:
5179139571476070 / 249
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 x 1011
Shift the decimal point:
10011 x 1011-100
Subtract the exponent:
10011 x 10-1
Binary to decimal:
19 x 2-1
Negative exponent to division:
19 / 21
Multiply exponent:
19 / 2
Equals:
9.5
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)10 = 0.23
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)10 = 0.510 = 0.12 = 0.1111...3
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.
Although if you would limit your math needs to rational numbers only the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in highschool math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
TL;DR
For hardware accelerated arithmetic only a limited amount of rational numbers can be represented. Every not-representable number is approximated. Some numbers (i.e. irrational) can never be represented no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer) so unavoidably many many many numbers are impossible to represent. In particular, 64 bits only allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can are of the form m.2^e for some integers m and e.
You might come up with a different numeration system, 10 based for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-points numbers are extremely accurate. They can represent any number in a very wide range with as much as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in it's lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2b5c).
On the other hand the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2b)
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number then after each calculation you need to reduce the fraction to it's lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator you need to find the closest representable result. This is non-trivil.
Some Languages do offer fraction types, but usually they do it in combination with arbitary precision, this avoids needing to worry about approximating fractions but it creates it's own problem, when a number passes through a large number of calculation steps the size of the denominator and hence the storage needed for the fraction can explode.
Some languages also offer decimal floating point types, these are mainly used in scenarios where it is imporant that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). These are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.
Double.toString(0.1) produces "0.1", but 0.1 is a floating point number.
Floating point number can't represent exactly in program language, but Double.toString produces the exact result (0.1), how does it do that, is it always produces the result that mathematically equal to the double literal?
Assume that the literal is in double precision.
Here is the problem I have see:
When use Apache POI to read excel file, XSSFCell.getNumericCellValue can only return double, if I use BigDecimal.valueOf to convert it to BigDecimal, is that always safe, and why?
Double.toString produces the exact result (0.1), how does it do that, is it always produces the result that mathematically equal to the double literal?
Double.toString(XXX) will always produce a numeral equal to XXX if XXX is a decimal numeral with 15 or fewer significant digits and it is within the range of the Double format.
There are two reasons for this:
The Double format (IEEE-754 binary64) has enough precision so that 15-digit decimal numerals can always be distinguished.
Double.toString does not display the exact Double value but instead produces the fewest significant digits needed to distinguish the number from nearby Double values.
For example, the literal 0.1 in source text is converted to the Double value 0.1000000000000000055511151231257827021181583404541015625. But Double.toString will not produce all those digits by default. The algorithm it uses produces “0.1” because that is enough to uniquely distinguish 0.1000000000000000055511151231257827021181583404541015625 from its two neighbors, which are 0.09999999999999999167332731531132594682276248931884765625 and 0.10000000000000001942890293094023945741355419158935546875. Both of those are farther from 0.1.
Thus, Double.toString(1.234), Double.toString(123.4e-2), and Double.toString(.0001234e4) will all produce “1.234”—a numeral whose value equals all of the original decimal numerals (before they are converted to Double), although it differs in form from some of them.
When use Apache POI to read excel file, XSSFCell.getNumericCellValue can only return double, if I use BigDecimal.valueOf to convert it to BigDecimal, is that always safe, and why?
If the cell value being retrieved is not representable as a Double, then XSSFCell.getNumericCellValue must change it. From memory, I think BigDecimal.valueOf will produce the exact value of the Double returned, but I cannot speak authoritatively to this. That is a separate question from how Double and Double.toString behave, so you might ask it as a separate question.
10e-5d is a double literal equivalent to 10^-5
Double.toString(10e-5d) returns "1.0E-4"
Well, double type has limited precision, so if you add enough digits after the floating point, some of them will be truncated/rounded.
For example:
System.out.println (Double.toString(0.123456789123456789))
prints
0.12345678912345678
I agree with Eric Postpischil's answer, but another explanation may help.
For each double number there is a range of real numbers that round to it under round-half-even rules. For 0.1000000000000000055511151231257827021181583404541015625, the result of rounding 0.1 to a double, the range is [0.099999999999999998612221219218554324470460414886474609375,0.100000000000000012490009027033011079765856266021728515625].
Any double literal whose real number arithmetic value is in that range has the same double value as 0.1.
Double.toString(x) returns the String representation of the real number in the range that converts to x and has the fewest decimal places. Picking any real number in that range ensures that the round trip converting a double to a String using Double.toString and then converting the String back to a double using round-half-even rules recovers the original value.
System.out.println(0.100000000000000005); prints "0.1" because 0.100000000000000005 is in the range that rounds to the same double as 0.1, and 0.1 is the real number in that range with the fewest decimal places.
This effect is rarely visible because literals other than "0.1" with real number value in the range are rare. It is more noticeable for float because of the lesser precision. System.out.println(0.100000001f); prints "0.1".
Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2 -49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2n if n is limited to integer values.
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
def float_to_bin_parts(number, bits=64):
if bits == 32: # single precision
int_pack = 'I'
float_pack = 'f'
exponent_bits = 8
mantissa_bits = 23
exponent_bias = 127
elif bits == 64: # double precision. all python floats are this
int_pack = 'Q'
float_pack = 'd'
exponent_bits = 11
mantissa_bits = 52
exponent_bias = 1023
else:
raise ValueError, 'bits argument must be 32 or 64'
bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2(# of bits) - 1 - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413x1023
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 252 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2(# of bits) - 1 - 1 to get the true exponent
Mantissa (last component): Divide by 2(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 x 1011
Which we can then convert from binary to decimal:
1.1499999999999999 x 23 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 x 1011
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 x 1011-110100
Convert to decimal:
5179139571476070 x 23-52
Subtract the exponent:
5179139571476070 x 2-49
Turn negative exponent into division:
5179139571476070 / 249
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 x 1011
Shift the decimal point:
10011 x 1011-100
Subtract the exponent:
10011 x 10-1
Binary to decimal:
19 x 2-1
Negative exponent to division:
19 / 21
Multiply exponent:
19 / 2
Equals:
9.5
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)10 = 0.23
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)10 = 0.510 = 0.12 = 0.1111...3
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.
Although if you would limit your math needs to rational numbers only the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in highschool math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
TL;DR
For hardware accelerated arithmetic only a limited amount of rational numbers can be represented. Every not-representable number is approximated. Some numbers (i.e. irrational) can never be represented no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer) so unavoidably many many many numbers are impossible to represent. In particular, 64 bits only allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can are of the form m.2^e for some integers m and e.
You might come up with a different numeration system, 10 based for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-points numbers are extremely accurate. They can represent any number in a very wide range with as much as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in it's lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2b5c).
On the other hand the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2b)
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number then after each calculation you need to reduce the fraction to it's lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator you need to find the closest representable result. This is non-trivil.
Some Languages do offer fraction types, but usually they do it in combination with arbitary precision, this avoids needing to worry about approximating fractions but it creates it's own problem, when a number passes through a large number of calculation steps the size of the denominator and hence the storage needed for the fraction can explode.
Some languages also offer decimal floating point types, these are mainly used in scenarios where it is imporant that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). These are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.
I'm working on a method that translates a string into an appropriate Number type, depending upon the format of the number. If the number appears to be a floating point value, then I need to return the smallest type I can use without sacrificing precision (Float, Double or BigDecimal).
Based on How many significant digits have floats and doubles in java? (and other resources), I've learned than Float values have 23 bits for the mantissa. Based on this, I used the following method to return the bit length for a given value:
private static int getBitLengthOfSignificand(String integerPart,
String fractionalPart) {
return new BigInteger(integerPart + fractionalPart).bitLength();
}
If the result of this test is below 24, I return a Float. If below 53 I return a Double, otherwise a BigDecimal.
However, I'm confused by the result when I consider Float.MAX_VALUE, which is 3.4028235E38. The bit length of the significand is 26 according to my method (where integerPart = 3 and fractionalPart = 4028235. This triggers my method to return a Double, when clearly Float would suffice.
Can someone highlight the flaw in my thinking or implementation? Another idea I had was to convert the string to a BigDecimal and scale down using floatValue() and doubleValue(), testing for overflow (which is represented by infinite values). But that loses precision, so isn't appropriate for me.
The significand is stored in binary, and you can think of it as a number in its decimal representation only if you don't let it confuse you.
The exponent is a binary exponent that does not represent a multiplication by a power of ten but by a power of two. For this reason, the E38 in the number you used as example is only a convenience: the real significand is in binary and should be multiplied by a power of two to obtain the actual number. Powers of two and powers of ten aren't the same, so “3.4028235” is not the real significand.
The real significand of Float.MAX_VALUE is in hexadecimal notation, 0x1.fffffe, and its associated exponent is 127, meaning that Float.MAX_VALUE is actually 0x1.fffffe * 2127.
Looking at the decimal representation to choose a binary floating-point type to put the value in, as you are trying to do, doesn't work. For one thing, the number of decimal digits that one is sure to recover from a float is different from the number of decimal digits one may need to write to distinguish a float from its neighbors (6 and 9 respectively). You chose to write “3.4028235E38” but you could have written 3.40282E38, which for your algorithm, looks easier to represent, when it isn't, really. When people write that “3.4028235E38” is the largest finite value of the float type, they mean that if you round this decimal number to float, you will arrive to the largest float. If you parse “3.4028235E38” as a double-precision number it won't even be equal to Float.MAX_VALUE.
To put it differently: another way to write Float.MAX_VALUE is 3.4028234663852885981170418348451692544E38. It is still representable as a float (it represents the exact same value as 3.4028235E38). It looks like it has many digits because these are decimal digits that appear for a decimal exponent, when in fact the number is represented internally with a binary exponent.
(By the way, your approach does not check that the exponent is in range to represent a number in the chosen type, which is another condition for a type to be able to represent the number from a string.)
I would work in terms of the difference between the actual value and the nearest float. BigDecimal can store any finite length decimal fraction exactly and do arithmetic on it:
Convert the String to the nearest float x. If x is infinite, but the value has a finite double representation use that.
Convert the String exactly to BigDecimal y.
If y is zero, use float, which can represent zero exactly.
If not, convert the float x to BigDecimal, z.
Calculate, in BigDecimal to a reasonable number of decimal places, the absolute value of (y-z)/z. That is the relative rounding error due to using float. If it is small enough for your purposes, less than some value you pick, use float. If not, use double.
If you literally want no sacrifice in precision, it is much simpler. Convert to both float and double. Compare them for equality. The comparison will be done in double. If they compare equal, go with the float. If not, go with the double.
My experimentation suggests a bound of 24, which is reached by -Double.MIN_NORMAL, which results in
-2.2250738585072014E-308
...but I can't prove it, nor come up with a conclusive reason why no other value should beat -MIN_NORMAL.
It's a 64-bit IEEE-754 float.
The most decimal numbers that can be stored in a 52-bit mantissa is 17 (see page 4: ceil( 1 + N Log10(2) )), so that's 19 characters with the decimal point and negative sign.
The bias is 1023, so the smallest base-2 exponent is 2^-1022, which is around 10^-308, so the longest exponent is 5 characters with the 'E' and negative sign.
19 + 5 == 24
26 seems to be an upper bound, for certain, as follows.
According to GrepCode's version of FloatingDecimal.getChars, OpenJDK7 asserts that the value nDigits is at most 19. Looking at the code, nDigits appears to refer to the digits (not the decimal point) of the mantissa: in the above example, 22250738585072014. Additional characters, then, include
a - sign on the value as a whole
the . decimal point
the E for the exponent
a - sign on the exponent
at most three decimal digits on the exponent
... which makes 19 + 7 = 26.
(Arguments for tighter bounds are still welcome.)
I think this bound is correct. The javadoc says
How many digits must be printed for the fractional part of m or a? There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values of type double.
A double has an implied integer part (1) and 52 mantissa bits. Thus, if the base-2 exponent represented by the double is 0, so that x is a double in the range [1,2), then the next higher adjacent double is x + 2-52. 2-52 is about 2.2204 * 10-16. This suggests that the number of fractional digits needed to distinguish a value from the next adjacent value is 16, i.e. the double adjacent to 1 would be represented as 1.0000000000000002 (15 zeroes). Since that matches the number of fractional digits in your experiment, it seems very likely that it's indeed the maximum number that would ever be required. This isn't a rigorous proof, of course; that would take a little more work.