If you try to run the following code
public class Main {
public static void main(String[] args) {
long a = (long)Math.pow(13, 15);
System.out.println(a + " " + a%13);
}
}
You will get "51185893014090752 8"
The correct value of 13^15 is 51185893014090757, i.e. greater than the result returned by Math.pow by 5. Any ideas of what may cause it?
You've exceeded the number of significant digits available (~15 to 16) in double-precision floating-point values. Once you do that, you can't expect the least significant digit(s) of your result to actually be meaningful/precise.
If you need arbitrarily precise arithmetic in Java, consider using BigInteger and BigDecimal.
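As a minimal sketch of the BigInteger suggestion (the class name ExactPower and its helper pow are made up for illustration), the exact value of 13^15 can be computed like this:

```java
import java.math.BigInteger;

class ExactPower {
    // Computes base^exponent exactly with arbitrary-precision integers.
    static BigInteger pow(long base, int exponent) {
        return BigInteger.valueOf(base).pow(exponent);
    }

    public static void main(String[] args) {
        BigInteger exact = pow(13, 15);
        System.out.println(exact);                             // 51185893014090757
        System.out.println(exact.mod(BigInteger.valueOf(13))); // 0
    }
}
```

The result stays exact however large the exponent gets, at the cost of being slower than primitive arithmetic.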
The problem is that as you get to higher and higher double values, the gap between consecutive values increases - a double can't represent every integer value within its range, and that's what's going wrong here. It's returning the closest double value to the exact result.
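The size of that gap can be checked directly. The following is only a sketch (DoubleGap, gapAt and fitsInDouble are illustrative names, not library API); it uses Math.ulp for the spacing and new BigDecimal(double) to test whether a long survives conversion to double unchanged:

```java
import java.math.BigDecimal;

class DoubleGap {
    // Spacing between adjacent doubles at the magnitude of value.
    static double gapAt(double value) {
        return Math.ulp(value);
    }

    // True if the long value is exactly representable as a double.
    static boolean fitsInDouble(long value) {
        return new BigDecimal((double) value).compareTo(BigDecimal.valueOf(value)) == 0;
    }

    public static void main(String[] args) {
        long exact = 51185893014090757L; // 13^15
        System.out.println(gapAt((double) exact));            // 8.0
        System.out.println(fitsInDouble(exact));              // false
        System.out.println(fitsInDouble(51185893014090752L)); // true
    }
}
```

At this magnitude only multiples of 8 are representable, so no double can hold 13^15 exactly.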
This is not a problem of precision. The Math.pow method performs an approximation of the result. To get the correct result use the following code.
long b = 13;
for(int i = 0; i != 14; i ++) {
b = b * 13;
}
System.out.println(b);
The output is the expected result 51185893014090757.
More generally, the Math.pow method usage should be avoided when the exponent is an integer. First, the result is an approximation, and second it is more costly to compute.
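One possible way to follow that advice for long results is exponentiation by squaring. This is a sketch under the assumption that the result fits in a long; LongPower and powExact are hypothetical names, while Math.multiplyExact is the standard method that throws ArithmeticException on overflow:

```java
class LongPower {
    // Integer power by repeated squaring; throws ArithmeticException on overflow.
    static long powExact(long base, int exponent) {
        if (exponent < 0) {
            throw new IllegalArgumentException("negative exponent");
        }
        long result = 1;
        long b = base;
        int e = exponent;
        while (e > 0) {
            if ((e & 1) == 1) {
                result = Math.multiplyExact(result, b);
            }
            e >>= 1;
            if (e > 0) {
                b = Math.multiplyExact(b, b); // square only if another round follows
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(powExact(13, 15)); // 51185893014090757
    }
}
```

This needs about log2(exponent) multiplications instead of exponent - 1, and it fails loudly rather than silently wrapping around.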
The implementation of Math.pow (and most other methods in the Math class) is based on the network library netlib as the package "Freely Distributable Math Library" (see StrictMath javadoc). The implementation in C is available at e_pow.c.
A double has finite precision, its mantissa is 52 bits, which roughly equals 15 to 16 decimals. So the number you're trying to calculate can't be represented (exactly) by a double any more.
The correct answer is to provide the closest number which can be represented by a double
Have you checked whether this is the case or not?
This is because of the limited number of digits a double can hold before the cast to long. With double or float you can do the calculation, but the result will have some error. Otherwise you have to handle the digits of the calculation yourself by storing them in an array, which is not an easy way to do it.
In the Python programming language, however, you can have a result of any length. It is so powerful!
be successful!!!
Related
I'm wondering what the best way to fix precision errors is in Java. As you can see in the following example, there are precision errors:
class FloatTest
{
public static void main(String[] args)
{
Float number1 = 1.89f;
for(int i = 11; i < 800; i*=2)
{
System.out.println("loop value: " + i);
System.out.println(i*number1);
System.out.println("");
}
}
}
The result displayed is:
loop value: 11
20.789999
loop value: 22
41.579998
loop value: 44
83.159996
loop value: 88
166.31999
loop value: 176
332.63998
loop value: 352
665.27997
loop value: 704
1330.5599
Also, if someone can explain why it only does it starting at 11 and doubling the value every time. I think all other values (or many of them at least) displayed the correct result.
Problems like this have caused me headache in the past and I usually use number formatters or put them into a String.
Edit: As people have mentioned, I could use a double, but after trying it, it seems that 1.89 as a double times 792 still outputs an error (the output is 1496.8799999999999).
I guess I'll try the other solutions such as BigDecimal
If you really care about precision, you should use BigDecimal
https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/math/BigDecimal.html
The problem is not with Java but with the good old standard floats (http://en.wikipedia.org/wiki/IEEE_floating-point_standard).
You can either:
use Double and have a bit more precision (but not perfect of course, it also has limited precision)
use an arbitrary-precision library
use numerically stable algorithms and truncate/round digits of which you are not sure they are correct (you can calculate numeric precision of operations)
When you print the result of a double operation you need to use appropriate rounding.
System.out.printf("%.2f%n", 1.89 * 792);
prints
1496.88
If you want to round the result to a precision, you can use rounding.
double d = 1.89 * 792;
d = Math.round(d * 100) / 100.0;
System.out.println(d);
prints
1496.88
However if you see below, this prints as expected, as there is a small amount of implied rounding.
It is worth noting that (double) 1.89 is not exactly 1.89. It is a close approximation.
new BigDecimal(double) converts the exact value of double without any implied rounding. It can be useful in finding the exact value of a double.
System.out.println(new BigDecimal(1.89));
System.out.println(new BigDecimal(1496.88));
prints
1.8899999999999999023003738329862244427204132080078125
1496.8800000000001091393642127513885498046875
Most of your question has been pretty well covered, though you might still benefit from reading the [floating-point] tag wiki to understand why the other answers work.
However, nobody has addressed "why it only does it starting at 11 and doubling the value every time," so here's the answer to that:
for(int i = 11; i < 800; i*=2)
    ╚═══╤════╝           ╚╤═╝
        │                 └───── "double the value every time"
        │
        └───── "start at 11"
You could use doubles instead of floats
If you really need arbitrary precision, use BigDecimal.
First off, Float is the wrapper class for the primitive float,
and doubles have more precision.
But if you only want to calculate down to the second digit (for monetary purposes, for example), use an integer (as if you were using cents as the unit) and add some scaling logic when you are multiplying/dividing.
Or, if you need arbitrary precision, use BigDecimal.
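The integer-cents idea could look roughly like the sketch below. The class Cents and its parse/format helpers are invented for this example; BigDecimal is used only at the text boundaries, while the arithmetic itself is plain long arithmetic:

```java
import java.math.BigDecimal;

class Cents {
    // Converts a decimal string such as "1355.65" into a whole number of cents.
    // Throws ArithmeticException if the input has more than two fraction digits.
    static long parse(String amount) {
        return new BigDecimal(amount).movePointRight(2).longValueExact();
    }

    // Renders a number of cents with exactly two digits after the point.
    static String format(long cents) {
        return BigDecimal.valueOf(cents, 2).toPlainString();
    }

    public static void main(String[] args) {
        long balance = parse("1355.65") - parse("1000");
        System.out.println(format(balance)); // 355.65
    }
}
```

Addition and subtraction are exact this way; multiplication and division still need an explicit rounding rule chosen by the application.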
If precision is vital, you should use BigDecimal to make sure that the required precision remains. When you instantiate the calculation, remember to use strings to instantiate the values instead of doubles.
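To illustrate why the string constructor matters, here is a small sketch (DecimalProduct, fromStrings and fromDoubles are illustrative names) that multiplies 1.89 by 792 both ways:

```java
import java.math.BigDecimal;

class DecimalProduct {
    // Exact decimal product: the operands are taken literally from their text form.
    static BigDecimal fromStrings(String a, String b) {
        return new BigDecimal(a).multiply(new BigDecimal(b));
    }

    // Product of the exact binary values the doubles actually hold.
    static BigDecimal fromDoubles(double a, double b) {
        return new BigDecimal(a).multiply(new BigDecimal(b));
    }

    public static void main(String[] args) {
        System.out.println(fromStrings("1.89", "792")); // 1496.88
        // Carries the representation error of the double 1.89 into the result:
        System.out.println(fromDoubles(1.89, 792));
    }
}
```

BigDecimal.valueOf(double) is another option; it goes through the double's shortest decimal string rather than its exact binary value.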
I never had a problem with simple arithmetic precision in either Basic, Visual Basic, FORTRAN, ALGOL or other "primitive" languages. It is beyond comprehension that JAVA can't do simple arithmetic without introducing errors. I need just two digits to the right of the decimal point for doing some accounting. Using Float subtracting 1000 from 1355.65 I get 355.650002! In order to get around this ridiculous error I have implemented a simple solution. I process my input by separating the values on each side of the decimal point as character, convert each to integers, multiply each by 1000 and add the two back together as integers. Ridiculous but there are no errors introduced by the poor JAVA algorithms.
I was just messing around with the Math.nextUp method to see what it does. I created a variable with the value 3.14 just because it came to my mind at that instant.
double n = 3.14;
System.out.println(Math.nextUp(n));
The preceding displayed 3.1400000000000006.
Tried with 3.1400000000000001, displayed the same.
Tried with 333.33, displayed 333.33000000000004.
With many other values, it displays the appropriate value for example 73.6 results with 73.60000000000001.
What happens to the values in between 3.1400000000000000 and 3.1400000000000006? Why does it skip some values? I know about the hardware-related problems, but sometimes it works right. Also, even though it is known that precise operations cannot be done, why is such a method included in the library? It looks pretty useless given that it doesn't always work right.
One useful trick in Java is to use the exactness of new BigDecimal(double) and of BigDecimal's toString to show the exact value of a double:
import java.math.BigDecimal;
public class Test {
public static void main(String[] args) {
System.out.println(new BigDecimal(3.14));
System.out.println(new BigDecimal(3.1400000000000001));
System.out.println(new BigDecimal(3.1400000000000006));
}
}
Output:
3.140000000000000124344978758017532527446746826171875
3.140000000000000124344978758017532527446746826171875
3.1400000000000005684341886080801486968994140625
There are a finite number of doubles, so only a specific subset of the real numbers are the exact value of a double. When you create a double literal, the decimal number you type is represented by the nearest of those values. When you output a double, by default, it is shown as the shortest decimal fraction that would round to it on input. You need to do something like the BigDecimal technique I used in the program to see the exact value.
In this case, both 3.14 and 3.1400000000000001 are closer to 3.140000000000000124344978758017532527446746826171875 than to any other double. The next exactly representable number above that is 3.1400000000000005684341886080801486968994140625
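The same point can be checked without printing long expansions. In this sketch (NearestDouble, sameDouble and stepsBetween are hypothetical helper names) the raw bit patterns are compared; for positive finite doubles, adjacent values have bit patterns that differ by exactly 1:

```java
class NearestDouble {
    // True if both literals were rounded to the very same double.
    static boolean sameDouble(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    // Number of representable steps from lo up to hi (positive finite values only).
    static long stepsBetween(double lo, double hi) {
        return Double.doubleToLongBits(hi) - Double.doubleToLongBits(lo);
    }

    public static void main(String[] args) {
        System.out.println(sameDouble(3.14, 3.1400000000000001));   // true
        System.out.println(stepsBetween(3.14, 3.1400000000000006)); // 1
        System.out.println(Math.nextUp(3.14) == 3.1400000000000006); // true
    }
}
```

So Math.nextUp does not skip anything: there simply is no double between those two values.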
Floating point numbers are stored in binary: the decimal representation is just for human consumption.
Using Rick Regan's decimal to floating point converter 3.14 converts to:
11.001000111101011100001010001111010111000010100011111
and 3.1400000000000006 converts to
11.0010001111010111000010100011110101110000101001
which is indeed the next binary number to 53 significant bits.
As @jgreve mentions, this is due to the use of the float and double primitive types in Java, which leads to so-called rounding error. The primitive type int, on the other hand, is a fixed-point number, meaning that it is able to "fit" within 32 bits. Doubles are not fixed-point, meaning that the result of double calculations must often be rounded in order to fit back into its finite representation, which sometimes leads (as in your case) to inconsistent values.
See the following two links for more info.
https://stackoverflow.com/a/322875/6012392
https://en.wikipedia.org/wiki/Double-precision_floating-point_format
A workaround could be either of the following two, which give a "direction" to the first double.
double n = 1.4;
double x = 1.5;
System.out.println(Math.nextAfter(n, x));
or
double n = 1.4;
double next = n + Math.ulp(n);
System.out.println(next);
But to handle floating point values it is recommended to use the BigDecimal class
I'm developing a time critical algorithm in Java and therefore am not using BigDecimal. To handle the rounding errors, I set an upper error bound instead, below which different floating point numbers are considered to be exactly the same. Now the problem is what should that bound be? Or in other words, what's the biggest possible rounding error that can occur, when performing computational operations with floating-point numbers (floating-point addition, subtraction, multiplication and division)?
With an experiment I've done, it seems that a bound of 1e-11 is enough.
PS: This problem is language independent.
EDIT: I'm using double data type. The numbers are generated with Random's nextDouble() method.
EDIT 2: It seems I need to calculate the error based on how the floating-point numbers I'm using are generated. The nextDouble() method looks like this:
public double nextDouble() {
return (((long)(next(26)) << 27) + next(27))
/ (double)(1L << 53); }
Based on the constants in this method, I should be able to calculate the biggest possible error that can occur for floating-point numbers generated with this method specifically (its machine epsilon?). I would be glad if someone could post the calculation.
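Reading the constants in that method, every result is an integer below 2^53 divided by 2^53, i.e. a multiple of 2^-53 (about 1.11e-16). A quick way to convince oneself is the sketch below; RandomGrid, onGrid and allOnGrid are names invented here, and the seed is arbitrary:

```java
import java.util.Random;

class RandomGrid {
    // True if d is an integer multiple of 2^-53 (scaling by 2^53 is exact for d in [0, 1)).
    static boolean onGrid(double d) {
        double scaled = d * 0x1.0p53;
        return scaled == Math.rint(scaled);
    }

    // Checks a number of nextDouble() samples from a seeded generator.
    static boolean allOnGrid(long seed, int samples) {
        Random random = new Random(seed);
        for (int i = 0; i < samples; i++) {
            if (!onGrid(random.nextDouble())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allOnGrid(42L, 1_000_000)); // true
        System.out.println(0x1.0p-53);                 // the grid spacing
    }
}
```

This only describes how the inputs are generated; the error of later arithmetic on those inputs depends on the operations performed.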
The worst case rounding error on a single simple operation is half the gap between the pair of doubles that bracket the real number result of the operation. Results from Random's nextDouble method are "from the range 0.0d (inclusive) to 1.0d (exclusive)". For those numbers, the largest gap is about 1e-16 and the worst case rounding error is about 5e-17.
Here is a program that prints the gap for some sample numbers, including the largest result of Random's nextDouble:
public class Test {
public static void main(String[] args) {
System.out.println("Max random result gap: "
+ Math.ulp(Math.nextAfter(1.0, Double.NEGATIVE_INFINITY)));
System.out.println("1e6 gap: "
+ Math.ulp(1e6));
System.out.println("1e30 gap: "
+ Math.ulp(1e30));
}
}
Output:
Max random result gap: 1.1102230246251565E-16
1e6 gap: 1.1641532182693481E-10
1e30 gap: 1.40737488355328E14
Depending on the calculation you are doing, errors can accumulate across multiple operations, giving bigger total rounding error than you would predict from this simplistic single-operation approach. As Mark Dickinson said in a comment, "Numerical analysis is a bit more complicated than that."
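A tiny example of accumulation, together with one common way to compare with a tolerance scaled to the operands, might look like this. It is only a sketch: Accumulate, sumRepeated and withinUlps are hypothetical names, and the tolerance of 4 ulps is an arbitrary choice for the demo, not a recommendation:

```java
class Accumulate {
    // Adds term to a running total the given number of times.
    static double sumRepeated(double term, int times) {
        double sum = 0.0;
        for (int i = 0; i < times; i++) {
            sum += term;
        }
        return sum;
    }

    // True if a and b differ by at most maxUlps gaps at their own magnitude.
    static boolean withinUlps(double a, double b, int maxUlps) {
        return Math.abs(a - b) <= maxUlps * Math.max(Math.ulp(a), Math.ulp(b));
    }

    public static void main(String[] args) {
        double sum = sumRepeated(0.1, 10);
        System.out.println(sum == 1.0);              // false
        System.out.println(withinUlps(sum, 1.0, 4)); // true
    }
}
```

Unlike a fixed bound such as 1e-11, a tolerance expressed in ulps grows and shrinks with the magnitude of the numbers being compared.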
This depends on:
Your algorithm
the magnitude of involved numbers
For example, consider the function f(x) = a * ( b - ( c+ d))
No big deal, or is it?
It turns out it is when d << c, b = c and a whatever, but let's just say it's big.
Let's say:
a = 10e200
b = c = 5
d = 10e-90
This is totally made up, but you get the point. The point is, the difference of magnitude between c and d mean that
c + d = c (small rounding error because d << c)
b - (c + d) = 0 (should be -10e-90)
a * (b - (c + d)) = 0 (where it really should be -100e110, i.e. about -1e112)
Long story short, some operations (notably subtractions) (can) kill you. Also, it is not so much the generating function that you need to look at, it is the operations that you do with the numbers (your algorithm).
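The made-up numbers above can be run as they are. In this sketch (Cancellation, naive and reordered are illustrative names) the second method merely regroups the subtraction so that the tiny term is not absorbed first:

```java
class Cancellation {
    // Evaluates a * (b - (c + d)) exactly as written.
    static double naive(double a, double b, double c, double d) {
        return a * (b - (c + d));
    }

    // Same formula, regrouped so that b - c is computed before d is subtracted.
    static double reordered(double a, double b, double c, double d) {
        return a * ((b - c) - d);
    }

    public static void main(String[] args) {
        double a = 10e200, b = 5, c = 5, d = 10e-90;
        System.out.println(c + d == c);               // true: d is absorbed completely
        System.out.println(naive(a, b, c, d));        // 0.0
        System.out.println(reordered(a, b, c, d));    // a huge negative number, roughly -a * d
    }
}
```

Regrouping helps here only because b and c happen to cancel exactly; in general the right rearrangement depends on the algorithm.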
class Test{
public static void main(String[] args){
float f1=3.2f;
float f2=6.5f;
if(f1==3.2){
System.out.println("same");
}else{
System.out.println("different");
}
if(f2==6.5){
System.out.println("same");
}else{
System.out.println("different");
}
}
}
output:
different
same
Why is the output like that? I expected same as the result in first case.
The difference is that 6.5 can be represented exactly in both float and double, whereas 3.2 can't be represented exactly in either type, and the two closest approximations are different.
An equality comparison between float and double first converts the float to a double and then compares the two. That conversion cannot restore the precision already lost in the float, so the comparison fails.
You shouldn't ever compare floats or doubles for equality; because you can't really guarantee that the number you assign to the float or double is exact.
This rounding error is a characteristic feature of floating-point computation.
Squeezing infinitely many real numbers into a finite number of bits
requires an approximate representation. Although there are infinitely
many integers, in most programs the result of integer computations can
be stored in 32 bits.
In contrast, given any fixed number of bits,
most calculations with real numbers will produce quantities that
cannot be exactly represented using that many bits. Therefore the
result of a floating-point calculation must often be rounded in order
to fit back into its finite representation. This rounding error is the
characteristic feature of floating-point computation.
Check What Every Computer Scientist Should Know About Floating-Point Arithmetic for more!
They're both implementations of different parts of the IEEE floating point standard. A float is 4 bytes wide, whereas a double is 8 bytes wide.
As a rule of thumb, you should probably prefer to use double in most cases, and only use float when you have a good reason to. (An example of a good reason to use float as opposed to a double is "I know I don't need that much precision and I need to store a million of them in memory.") It's also worth mentioning that it's hard to prove you don't need double precision.
Also, when comparing floating point values for equality, you'll typically want to use something like Math.abs(a-b) < EPSILON where a and b are the floating point values being compared and EPSILON is a small floating point value like 1e-5. The reason for this is that floating point values rarely encode the exact value they "should" -- rather, they usually encode a value very close -- so you have to "squint" when you determine if two values are the same.
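A minimal sketch of that comparison, applied to the 3.2 case from the question (the class ApproxEquals and the method approxEqual are made-up names, and 1e-5 is just the example tolerance mentioned above):

```java
class ApproxEquals {
    static final double EPSILON = 1e-5;

    // True if a and b are closer together than EPSILON.
    static boolean approxEqual(double a, double b) {
        return Math.abs(a - b) < EPSILON;
    }

    public static void main(String[] args) {
        float f1 = 3.2f;
        System.out.println(f1 == 3.2);            // false
        System.out.println(approxEqual(f1, 3.2)); // true
    }
}
```

A fixed EPSILON only makes sense when the typical magnitude of the values is known; for very large or very small numbers it is either too strict or too loose.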
EDIT: Everyone should read the link @Kugathasan Abimaran posted below: What Every Computer Scientist Should Know About Floating-Point Arithmetic.
To see what you're dealing with, you can use Float and Double's toHexString method:
class Test {
public static void main(String[] args) {
System.out.println("3.2F is: "+Float.toHexString(3.2F));
System.out.println("3.2 is: "+Double.toHexString(3.2));
System.out.println("6.5F is: "+Float.toHexString(6.5F));
System.out.println("6.5 is: "+Double.toHexString(6.5));
}
}
$ java Test
3.2F is: 0x1.99999ap1
3.2 is: 0x1.999999999999ap1
6.5F is: 0x1.ap2
6.5 is: 0x1.ap2
Generally, a number has an exact representation if it equals A * 2^B, where A and B are integers whose allowed values are set by the language specification (and double has more allowed values).
In this case,
6.5 = 13/2 = (1+10/16)*4 = (1+a/16)*2^2 == 0x1.ap2, while
3.2 = 16/5 = ( 1 + 9/16 + 9/16^2 + 9/16^3 + . . . ) * 2^1 == 0x1.999. . . p1.
But Java can only hold a finite number of digits, so it cuts the .999. . . off at some point. (You may remember from math that 0.999. . .=1. That's in base 10. In base 16, it would be 0.fff. . .=1.)
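Whether a given decimal has such an exact representation can also be tested programmatically. The sketch below assumes the decimal is supplied as a string; ExactCheck, isExactDouble and isExactFloat are hypothetical names, and the check relies on new BigDecimal(double) giving the exact stored value:

```java
import java.math.BigDecimal;

class ExactCheck {
    // True if the decimal text is stored in a double without any rounding.
    static boolean isExactDouble(String decimal) {
        BigDecimal stored = new BigDecimal(Double.parseDouble(decimal));
        return new BigDecimal(decimal).compareTo(stored) == 0;
    }

    // Same check for float; widening the float to double is exact.
    static boolean isExactFloat(String decimal) {
        BigDecimal stored = new BigDecimal((double) Float.parseFloat(decimal));
        return new BigDecimal(decimal).compareTo(stored) == 0;
    }

    public static void main(String[] args) {
        System.out.println(isExactDouble("6.5")); // true
        System.out.println(isExactDouble("3.2")); // false
        System.out.println(isExactFloat("6.5"));  // true
        System.out.println(isExactFloat("3.2"));  // false
    }
}
```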
class Test {
public static void main(String[] args) {
float f1=3.2f;
float f2=6.5f;
if(f1==3.2f)
System.out.println("same");
else
System.out.println("different");
if(f2==6.5f)
System.out.println("same");
else
System.out.println("different");
}
}
Try it like this and it will work. Without the 'f' suffix you are comparing a float with a double literal, a floating-point type of different precision, which may cause unexpected results as in your case.
It is not possible to compare values of type float and double directly. Before the values can be compared, it is necessary to either convert the double to float, or convert the float to double. If one does the former conversion, the comparison will ask "Does the float hold the best possible float representation of the double's value?" If one does the latter conversion, the question will be "Does the float hold a perfect representation of the double's value?" In many contexts, the former question is the more meaningful one, but Java assumes that all comparisons between float and double are intended to ask the latter question.
I would suggest that regardless of what a language is willing to tolerate, one's coding standards should absolutely positively forbid direct comparisons between operands of type float and double. Given code like:
float f = function1();
double d = function2();
...
if (d==f) ...
it's impossible to tell what behavior is intended in cases where d represents a value which is not precisely representable in float. If the intention is that f be converted to a double, and the result of that conversion compared with d, one should write the comparison as
if (d==(double)f) ...
Although the typecast doesn't change the code's behavior, it makes clear that the code's behavior is intentional. If the intention was that the comparison indicate whether f holds the best float representation of d, it should be:
if ((float)d==f)
Note that the behavior of this is very different from what would happen without the cast. Had your original code cast the double operand of each comparison to float, then both equality tests would have passed.
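The two explicit forms can be wrapped in named helpers so the intent is visible at the call site. This is just one way to write it; MixedCompare, exactlyEqual and isBestFloatFor are names chosen for the sketch:

```java
class MixedCompare {
    // "Does the float hold a perfect representation of the double's value?"
    static boolean exactlyEqual(float f, double d) {
        return d == (double) f;
    }

    // "Does the float hold the best possible float representation of the double's value?"
    static boolean isBestFloatFor(float f, double d) {
        return (float) d == f;
    }

    public static void main(String[] args) {
        System.out.println(exactlyEqual(3.2f, 3.2));   // false
        System.out.println(isBestFloatFor(3.2f, 3.2)); // true
        System.out.println(exactlyEqual(6.5f, 6.5));   // true
        System.out.println(isBestFloatFor(6.5f, 6.5)); // true
    }
}
```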
In general it is not good practice to use the == operator with floating-point numbers, due to approximation issues.
6.5 can be represented exactly in binary, whereas 3.2 can't. That's why the difference in precision doesn't matter for 6.5, so 6.5 == 6.5f.
To quickly refresh how binary numbers work:
100 -> 4
10 -> 2
1 -> 1
0.1 -> 0.5 (or 1/2)
0.01 -> 0.25 (or 1/4)
etc.
6.5 in binary: 110.1 (exact result, the rest of the digits are just zeroes)
3.2 in binary: 11.001100110011001100110011001100110011001100110011001101... (here precision matters!)
A float only has 24 bits precision (the rest is used for sign and exponent), so:
3.2f in binary: 11.0011001100110011001101 (not equal to the double precision approximation)
Basically it's the same as when you're writing 1/5 and 1/7 in decimal numbers:
1/5 = 0.2
1/7 = 0.14285714285714285714285714285714...
Float has less precision than double because float uses 32 bits: 1 for the sign, 23 for precision, and 8 for the exponent, whereas double uses 64 bits: 52 for precision, 11 for the exponent, and 1 for the sign. Precision is what matters here. A decimal number represented as float and as double can be equal or unequal depending on the precision it needs (i.e. how many digits follow the decimal point). Regards, S. ZAKIR
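The 1/8/23 split can be inspected with Float.floatToIntBits. In this sketch the class FloatFields and its three accessors are illustrative names; the exponent is returned as the raw stored field, which for normal numbers is the real exponent plus a bias of 127:

```java
class FloatFields {
    // Top bit: 0 for positive, 1 for negative.
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Next 8 bits: the biased exponent field.
    static int exponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Low 23 bits: the stored fraction (the leading 1 is implicit).
    static int fraction(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        for (float f : new float[] {3.2f, 6.5f}) {
            System.out.println(f + ": sign=" + sign(f)
                    + " exponent=" + (exponent(f) - 127)
                    + " fraction=" + Integer.toBinaryString(fraction(f)));
        }
    }
}
```

For 6.5f the fraction is a short pattern followed by zeros, while for 3.2f the repeating 1001 pattern runs until the field is full.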
Can every possible value of a float variable be represented exactly in a double variable?
In other words, for all possible values X will the following be successful:
float f1 = X;
double d = f1;
float f2 = (float)d;
if(f1 == f2)
System.out.println("Success!");
else
System.out.println("Failure!");
My suspicion is that there is no exception, or if there is it is only for an edge case (like +/- infinity or NaN).
Edit: Original wording of question was confusing (stated two ways, one which would be answered "no" the other would be answered "yes" for the same answer). I've reworded it so that it matches the question title.
Yes.
Proof by enumeration of all possible cases:
public class TestDoubleFloat {
    public static void main(String[] args) {
        // Walk through every 32-bit pattern; the long counter avoids int overflow.
        for (long i = Integer.MIN_VALUE; i <= Integer.MAX_VALUE; i++) {
            float f1 = Float.intBitsToFloat((int) i);
            double d = (double) f1;
            float f2 = (float) d;
            if (f1 != f2) {
                if (Float.isNaN(f1) && Float.isNaN(f2)) {
                    continue; // ok, NaN never compares equal to anything
                }
                // Abort on the first genuine mismatch.
                throw new AssertionError("oops: " + f1 + " != " + f2);
            }
        }
    }
}
finishes in 12 seconds on my machine. 32 bits are small.
In theory, there is not such a value, so "yes", every float should be representable as a double.. Converting from a float to a double should involve just tacking four bytes of 00 on the end -- they are stored using the same format, just with different sized fields.
Yes, floats are a subset of doubles. Both floats and doubles have the form (sign * a * 2^b). The difference between floats and doubles is the number of bits in a & b. Since doubles have more bits available, assigning a float value to a double effectively means inserting extra 0 bits.
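For normal (non-zero, non-denormal, finite) floats the widening can be spelled out at the bit level: the sign is copied, the exponent field is re-biased from 127 to 1023, and the 23 fraction bits are followed by 29 zero bits. The sketch below (FloatToDoubleBits and widenBits are invented names) rebuilds the double's bit pattern by hand so it can be compared with what the cast produces:

```java
class FloatToDoubleBits {
    // Builds the 64-bit pattern of (double) f by hand; normal floats only.
    static long widenBits(float f) {
        int bits = Float.floatToIntBits(f);
        long sign = (bits >>> 31) & 1L;
        long exponent = (bits >>> 23) & 0xFF;
        long fraction = bits & 0x7FFFFF;
        if (exponent == 0 || exponent == 0xFF) {
            throw new IllegalArgumentException("zero, denormal, infinity or NaN");
        }
        // Re-bias the exponent and pad the fraction with 29 zero bits.
        return (sign << 63) | ((exponent - 127 + 1023) << 52) | (fraction << 29);
    }

    public static void main(String[] args) {
        float f = 0.1f;
        System.out.println(widenBits(f) == Double.doubleToLongBits((double) f)); // true
    }
}
```

Zeros, denormals, infinities and NaNs need separate handling, which is why the sketch rejects them.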
As everyone has already said, "no". But that's actually a "yes" to the question itself, i.e. every float can be exactly expressed as a double. Confusing. :)
If I'm reading the language specification correctly (and as everyone else is confirming), there is no such value.
That is, each claims only to hold only IEEE 754 standard values, so casts between the two should incur no change except in memory given.
(clarification: There would be no change as long as the value was small enough to be held in a float; obviously if the value was too many bits to be held in a float to begin with, casting from double to float would result in a loss of precision.)
@KenG: This code:
float a = 0.1F
println "a=${a}"
double d = a
println "d=${d}"
fails not because 0.1f can't be exactly represented. The question was "is there a float value that cannot be represented as a double", which this code doesn't prove. Although 0.1f can't be stored exactly, the value that a is given (which isn't 0.1f exactly) can be stored as a double (which also won't be 0.1f exactly). Assuming an Intel FPU, the bit pattern for a is:
0 01111011 10011001100110011001101
and the bit pattern for d is:
0 01111111011 100110011001100110011010 (followed by lots more zeros)
which has the same sign, exponent (-4 in both cases) and the same fractional part (separated by spaces above). The difference in the output is due to the position of the second non-zero digit in the number (the first is the 1 after the point) which can only be represented with a double. The code that outputs the string format stores intermediate values in memory and is specific to floats and doubles (i.e. there is a function double-to-string and another float-to-string). If the to-string function was optimised to use the FPU stack to store the intermediate results of the to-string process, the output would be the same for float and double since the FPU uses the same, larger format (80bits) for both float and double.
There are no float values that can't be stored identically in a double, i.e. the set of float values is a sub-set of the set of double values.
Snark: NaNs will compare differently after (or indeed before) conversion.
This does not, however, invalidate the answers already given.
I took the code you listed and decided to try it in C++ since I thought it might execute a little faster and it is significantly easier to do unsafe casting. :-D
I found out that for valid numbers, the conversion works and you get the exact bitwise representation after the cast. However, for non-numbers, e.g. 1.#QNAN0, etc., the result will use a simplified representation of the non-number rather than the exact bits of the source. For example:
**** FAILURE **** 2140188725 | 1.#QNAN0 -- 0xa0000000 0x7ffa1606
I cast an unsigned int to float then to double and back to float. The number 2140188725 (0x7F90B035) results in a NAN and converting to double and back is still a NAN but not the exact same NAN.
Here is the simple C++ code:
typedef unsigned int uint;
for (uint i = 0; i < 0xFFFFFFFF; ++i)
{
float f1 = *(float *)&i;
double d = f1;
float f2 = (float)d;
if(f1 != f2)
printf("**** FAILURE **** %u | %f -- 0x%08x 0x%08x\n", i, f1, f1, f2);
if ((i % 1000000) == 0)
printf("Iteration: %d\n", i);
}
The answer to the first question is yes, the answer to the 'in other words', however is no. If you change the test in the code to be if (!(f1 != f2)) the answer to the second question becomes yes -- it will print 'Success' for all float values.
In theory every normal single can have the exponent and mantissa padded to create a double and then remove the padding and you return to the original single.
When you go from theory to reality is when you will have problems. I don't know if you were interested in theory or implementation. If it is implementation then you can rapidly get into trouble.
IEEE is a horrible format; my understanding is that it was intentionally designed to be so tough that nobody could meet it, allowing the market to catch up to Intel (this was a while back) and allowing for more competition. If that is true it failed; either way we are stuck with this dreadful spec. Something like the TI format is far superior for the real world in so many ways. I have no connection to either company or any of these formats.
Thanks to this spec there are very few if any fpus that actually meet it (in hardware or even in hardware plus the operating system), and those that do often fail on the next generation. (google: TestFloat). The problems these days tend to lie in the int to float and float to int and not single to double and double to single as you have specified above. Of course what operation is the fpu going to perform to do that conversion? Add 0? Multiply by 1? Depends on the fpu and the compiler.
The problem with IEEE related to your question above is that there is more than one way that a number (not every number, but many numbers) can be represented. If I wanted to break your code I would start with minus zero in the hope that one of the two operations would convert it to a plus zero. Then I would try denormals. And it should fail with a signaling NaN, but you called that out as a known exception.
The problem is that equal sign, here is rule number one about floating point, never use an equal sign. Equals is a bit comparison not a value comparison, if you have two values represented in different ways (plus zero and minus zero for example) the bit comparison will fail even though its the same number. Greater than and less than are done in the fpu, equals is done with the integer alu.
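For Java specifically, it is worth checking how == behaves on primitives, because the language defines it as an IEEE 754 value comparison and a bitwise comparison has to be requested explicitly. The sketch below (ZeroAndNaN, valueEqual and bitsEqual are hypothetical names) shows the two cases where value equality and bit equality disagree:

```java
class ZeroAndNaN {
    // The == operator on primitive doubles: compares values.
    static boolean valueEqual(double a, double b) {
        return a == b;
    }

    // Explicit comparison of the 64-bit patterns (NaNs are canonicalized).
    static boolean bitsEqual(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        System.out.println(valueEqual(0.0, -0.0));             // true
        System.out.println(bitsEqual(0.0, -0.0));              // false
        System.out.println(valueEqual(Double.NaN, Double.NaN)); // false
        System.out.println(bitsEqual(Double.NaN, Double.NaN));  // true
    }
}
```

So in Java, plus zero and minus zero pass an == test, and NaN is the one value that does not equal itself.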
I realize that you probably used the equal to explain the problem and not necessarily the code you wanted to succeed or fail.
If a floating-point type is viewed as representing a precise value, then as other posters have noted, every float value is representable as a double, but only a few values of double can be represented by float. On the other hand, if one recognizes that floating-point values are approximations, one will realize the real situation is reversed. If one uses a very precise instrument to measure something which is 3.437mm, one may correctly describe its size as 3.4mm. If one uses a ruler to measure the object as 3.4mm, it would be incorrect to describe its size as 3.400mm.
Even bigger problems exist at the top of the range. There is a float value that represents: "computed value exceeded 2^127 by an unknown amount", but there's no double value that indicates such a thing. Casting an "infinity" from single to double will yield a value "computed value exceeded 2^1023 by an unknown amount" which is off by a factor of over a googol.