Biggest possible rounding error when computing floating-point numbers - java

I'm developing a time-critical algorithm in Java and therefore am not using BigDecimal. To handle the rounding errors, I set an upper error bound instead, below which different floating-point numbers are considered to be exactly the same. Now the problem is what that bound should be. Or in other words, what's the biggest possible rounding error that can occur when performing computational operations with floating-point numbers (floating-point addition, subtraction, multiplication and division)?
With an experiment I've done, it seems that a bound of 1e-11 is enough.
PS: This problem is language independent.
EDIT: I'm using double data type. The numbers are generated with Random's nextDouble() method.
EDIT 2: It seems I need to calculate the error based on how the floating-point numbers I'm using are generated. The nextDouble() method looks like this:
public double nextDouble() {
    return (((long) next(26) << 27) + next(27)) / (double) (1L << 53);
}
Based on the constants in this method, I should be able to calculate the biggest possible error that can occur for floating-point numbers generated with this method specifically (its machine epsilon?). I would be glad if someone could post the calculation.

The worst case rounding error on a single simple operation is half the gap between the pair of doubles that bracket the real number result of the operation. Results from Random's nextDouble method are "from the range 0.0d (inclusive) to 1.0d (exclusive)". For those numbers, the largest gap is about 1e-16 and the worst case rounding error is about 5e-17.
Here is a program that prints the gap for some sample numbers, including the largest result of Random's nextDouble:
public class Test {
    public static void main(String[] args) {
        System.out.println("Max random result gap: "
                + Math.ulp(Math.nextAfter(1.0, Double.NEGATIVE_INFINITY)));
        System.out.println("1e6 gap: "
                + Math.ulp(1e6));
        System.out.println("1e30 gap: "
                + Math.ulp(1e30));
    }
}
Output:
Max random result gap: 1.1102230246251565E-16
1e6 gap: 1.1641532182693481E-10
1e30 gap: 1.40737488355328E14
Depending on the calculation you are doing, errors can accumulate across multiple operations, giving bigger total rounding error than you would predict from this simplistic single-operation approach. As Mark Dickinson said in a comment, "Numerical analysis is a bit more complicated than that."
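As a rough illustration of such accumulation, here is a minimal sketch (the class name, seed, and iteration count are arbitrary choices of mine) that compares a plain double sum of random values against an exact BigDecimal sum of the same values:
import java.math.BigDecimal;
import java.util.Random;

public class AccumulationDemo {
    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed, purely illustrative
        double sum = 0.0;
        BigDecimal exact = BigDecimal.ZERO;
        for (int i = 0; i < 1_000_000; i++) {
            double x = rnd.nextDouble();
            sum += x;                             // rounds on every addition
            exact = exact.add(new BigDecimal(x)); // exact decimal value of x
        }
        // The printed difference is typically far larger than the
        // worst-case error of any single addition.
        System.out.println(exact.subtract(new BigDecimal(sum)));
    }
}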

This depends on:
your algorithm
the magnitude of the numbers involved
For example, consider the function f = a * (b - (c + d))
No big deal, or is it?
It turns out it is when d << c, b = c, and a is anything; let's just say it's big.
Let's say:
a = 1e200
b = c = 5
d = 1e-90
This is totally made up, but you get the point. The point is that the difference in magnitude between c and d means that
c + d = c (small rounding error because d << c)
b - (c + d) = 0 (where it really should be -1e-90)
a * (b - (c + d)) = 0 (where it really should be about -1e110)
Long story short, some operations (notably subtractions) can kill you. Also, it is not so much the generating function that you need to look at; it is the operations that you perform on the numbers (your algorithm).
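Here is a minimal sketch of that scenario in Java (the class name is mine; the values are the made-up ones from above):
public class Cancellation {
    public static void main(String[] args) {
        double a = 1e200, b = 5, c = 5, d = 1e-90;
        double cd = c + d;            // d is absorbed: c + d rounds to exactly c
        double diff = b - cd;         // 0.0, although mathematically it is -1e-90
        System.out.println(a * diff); // prints 0.0 instead of about -1e110
    }
}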

Related

Number of blocks to split a range of values over 64

I have the following piece of code:
long[] blocks = new long[(someClass.getMemberArray().length - 1) / 64 + 1];
Basically, someClass.getMemberArray() can return an array that could be much larger than 64, and the code tries to determine how many blocks of length 64 are needed for subsequent processing.
I am confused about the logic and how it works. It seems to me that just doing:
long[] blocks = new long[(int) Math.ceil(someClass.getMemberArray().length / 64.0)];
should work too and looks simpler.
Can someone help me understand the -1 and +1 reasoning in the original snippet, how it works, and whether the ceil version would fail in some cases?
As you correctly commented, the -1/+1 is required to get the correct number of blocks, including only partially filled ones. It effectively rounds up.
(But it has something that could be considered a bug: if the array has length 0, which would require 0 blocks, it returns 1. This is because integer division in Java truncates toward zero, i.e. rounds up for negative numbers, so (0 - 1)/64 yields 0. However, this may be a feature if zero blocks are not allowed for some reason. It definitely deserves a comment, though.)
The reasoning for the first, original line is that it uses only integer arithmetic, which should translate to just a few basic and fast machine instructions on most computers.
The second solution involves floating-point arithmetic and casting. Traditionally, floating-point arithmetic was much slower on most processors, which probably was the reasoning for the first solution. However, on modern CPUs with integrated floating-point support, the performance depends more on other things, like cache lines and pipelining.
Personally, I don't really like either solution, as it's not really obvious what they do. So I would suggest the following solution:
int arrayLength = someClass.getMemberArray().length;
int blockCount = ceilDiv(arrayLength, 64);
long[] blocks = new long[blockCount];
// ...

/**
 * Integer division, rounding up.
 * @return the quotient a/b, rounded up.
 */
static int ceilDiv(int a, int b) {
    assert b > 0 : b; // Doesn't work for non-positive divisors.
    // Divide; truncation toward zero already rounds up for negative a.
    int quotient = a / b;
    // If a is positive and not a multiple of b, round up.
    if (a > 0 && a % b != 0) {
        quotient++;
    }
    return quotient;
}
It's wordy, but at least it's clear what should happen, and it provides a general solution which works for all dividends (as long as the divisor is positive). Unfortunately, most languages don't provide an elegant "round up integer division" out of the box (Java only gained Math.ceilDiv in version 18).
I don't see why the -1 would be necessary, but the +1 is likely there to compensate for the division result being rounded down (truncated), which happens in every case except when the result is a whole number.
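A quick brute-force comparison (a sketch of mine, not from the original question) suggests the two formulas agree for every non-negative length except 0:
public class BlockCountCheck {
    public static void main(String[] args) {
        for (int n = 0; n <= 1000; n++) {
            int viaInts = (n - 1) / 64 + 1;
            int viaCeil = (int) Math.ceil(n / 64.0);
            if (viaInts != viaCeil) {
                System.out.println(n + ": " + viaInts + " vs " + viaCeil);
            }
        }
        // Only "0: 1 vs 0" is printed, matching the length-0 caveat above.
    }
}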

Rounding double strangeness

It might be too late in the night, but I can't understand the behavior of this code:
public class DT {
    static void theTest(double d) {
        double e = Math.floor(d / 1630) * 1630;
        System.out.println(e - d);
    }

    public static void main(String[] args) {
        theTest(2 * 1630);
        theTest(2 * 1631);
        theTest(2 * 1629);
        theTest(8.989779443802325E18);
    }
}
In my understanding, all 4 cases should be NON-positive, i.e. "e" should always be <= "d",
but I do get following output:
0.0
-2.0
-1628.0
1024.0
Why?
As this is the same with FastMath, I suspect something double-specific, but could anyone explain this to me?
When you get up into huge numbers, the doubles are spaced more widely than integers. When you do a division in this range, the result can be rounded up or down. So in your fourth test case, the result of the division d/1630 is actually rounded up to the nearest available double. Since this is a whole number, the call to floor does not change it. Multiplying it by 1630 then gives a result that is larger than d.
Edit
This effect kicks in at 2^52. Once you get past 2^52, there are no more non-integer doubles. Between 2^52 and 2^53, the doubles are just the integers. Above 2^53, the doubles are spaced more widely than the integers.
The result of the division in the question is 5515202112762162.576... which is between 2^52 and 2^53. It is rounded to the nearest double, which is the same as the nearest integer, which is 5515202112762163. Now, the floor does not change this number, and the multiplication by 1630 gives a result that is larger than d.
In summary, I guess the first sentence of the answer was a little misleading - you don't need the doubles to be spaced more widely than the integers for this effect to occur; you only need them to be spaced at least as widely as the integers.
With a value of d between 0 and 2^52 * 1630, the program in the question will never output a positive number.
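To make the rounding visible, here is a sketch (mine, using BigDecimal only to show the exact quotient) for the fourth test case:
import java.math.BigDecimal;
import java.math.MathContext;

public class FloorDemo {
    public static void main(String[] args) {
        double d = 8.989779443802325E18;
        double q = d / 1630; // rounded up to the nearest double, per the explanation above
        BigDecimal exact = new BigDecimal(d)
                .divide(new BigDecimal(1630), MathContext.DECIMAL128);
        System.out.println("double quotient: " + new BigDecimal(q));
        System.out.println("exact quotient:  " + exact);
        System.out.println("floor changes nothing: " + (Math.floor(q) == q));
    }
}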
NOTE: I think you are looking for the operation called fmod in other languages and % in Java. The e - d that you wish to compute could be computed, always with the correct sign and always smaller than 1630 in magnitude, as -(d % 1630.0).
all 4 cases should be NON-positive
For an arbitrary double d, it is likely that Math.floor(d/1630)*1630 is less than d, but that is not guaranteed.
In order:
d/1630 is the double nearest to the real d / 1630. It can be up to one half ULP above the real d / 1630, and it can be arbitrarily close to an integer.
When d is large enough, d/1630 is always an integer, because any large enough double is an integer. In other words, when d is large enough, Math.floor(d/1630) is identical to d/1630. This applies to your last example.
d / 1630 * 1630 is the double nearest to the real multiplication of d / 1630 by 1630. It is within one half ULP of the real result.
The two rounding operations in d / 1630 * 1630 can both round up, and in this case, d / 1630 * 1630 is larger than d. It wouldn't be expected to be larger than d by more than one ULP(d).
If you want to compute a number that is guaranteed to be below the real d / 1630, you should either change the rounding mode to downward (not sure if Java lets you do this), or subtract one ULP from the result of d / 1630 computed in the default round-to-nearest rounding mode. You can do the latter with the function nextAfter.
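A sketch of the subtract-one-ULP approach (assuming Java 8's Math.nextDown, which is shorthand for nextAfter toward negative infinity):
public class FloorLower {
    public static void main(String[] args) {
        double d = 8.989779443802325E18;
        double q = d / 1630;            // round-to-nearest: within half an ULP of the real quotient
        double qLow = Math.nextDown(q); // one ULP lower: guaranteed not above the real quotient
        double e = Math.floor(qLow) * 1630;
        System.out.println(e - d);      // non-positive for this input
    }
}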

Wrong result by Java Math.pow

If you try to run the following code
public class Main {
    public static void main(String[] args) {
        long a = (long) Math.pow(13, 15);
        System.out.println(a + " " + a % 13);
    }
}
You will get "51185893014090752 8"
The correct value of 13^15 is 51185893014090757, i.e. greater than the result returned by Math.pow by 5. Any ideas of what may cause it?
You've exceeded the number of significant digits available (~15 to 16) in double-precision floating-point values. Once you do that, you can't expect the least significant digit(s) of your result to actually be meaningful/precise.
If you need arbitrarily precise arithmetic in Java, consider using BigInteger and BigDecimal.
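For example, a minimal sketch with BigInteger, which computes the power exactly:
import java.math.BigInteger;

public class ExactPow {
    public static void main(String[] args) {
        // pow works in exact integer arithmetic, so no digits are lost
        System.out.println(BigInteger.valueOf(13).pow(15)); // 51185893014090757
    }
}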
The problem is that as you get to higher and higher double values, the gap between consecutive values increases - a double can't represent every integer value within its range, and that's what's going wrong here. It's returning the closest double value to the exact result.
This is not only a problem of representable precision: the Math.pow method itself computes an approximation of the result. To get the correct result, use the following code.
long b = 13;
for (int i = 0; i != 14; i++) {
    b = b * 13;
}
System.out.println(b);
The output is the expected result 51185893014090757L.
More generally, Math.pow should be avoided when the exponent is an integer: first, the result is an approximation, and second, it is more costly to compute.
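If the plain loop feels too ad hoc, exponentiation by squaring needs only O(log n) multiplications. The helper below is a sketch of mine (powExact is not a JDK method); it uses Math.multiplyExact so that overflow throws instead of silently wrapping:
/** Hypothetical helper: base^exp by squaring, in exact long arithmetic. */
static long powExact(long base, int exp) {
    long result = 1;
    while (exp > 0) {
        if ((exp & 1) == 1) {
            result = Math.multiplyExact(result, base); // throws ArithmeticException on overflow
        }
        exp >>= 1;
        if (exp > 0) {
            base = Math.multiplyExact(base, base);
        }
    }
    return result;
}
With this, powExact(13, 15) returns 51185893014090757 exactly.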
The implementation of Math.pow (and of most other methods in the Math class) is based on fdlibm, the "Freely Distributable Math Library" from netlib (see the StrictMath javadoc). The C implementation is available at e_pow.c.
A double has finite precision; its mantissa is 52 bits, which roughly equals 15 to 16 decimal digits. So the number you're trying to calculate can no longer be represented exactly by a double.
The correct behavior is to return the closest number that can be represented by a double.
Have you checked whether this is the case or not?
It is because of the limited number of digits a double can hold: casting the result to long cannot recover digits that were never stored. With double or float you may get close, but the result will have some error. You would have to handle the digits of the calculation yourself, for example by saving them in an array, which is not an easy way. In the Python programming language, by contrast, integers can be of any length, so this computation works out of the box.

Java: Why should we use BigDecimal instead of Double in the real world? [duplicate]

This question already has answers here:
Double vs. BigDecimal?
(7 answers)
Closed 7 years ago.
When dealing with real-world monetary values, I am advised to use BigDecimal instead of Double. But I have not got a convincing explanation, except "It is normally done that way".
Can you please throw light on this question?
I think this describes the solution to your problem: Java Traps: Big Decimal and the problem with double here
From the original blog which appears to be down now.
Java Traps: double
Many traps lie before the apprentice programmer as he walks the path of software development. This article illustrates, through a series of practical examples, the main traps of using Java's simple types double and float. Note, however, that to fully embrace precision in numerical calculations, a textbook (or two) on the topic is required. Consequently, we can only scratch the surface of the topic. That being said, the knowledge conveyed here should give you the fundamental knowledge required to spot or identify bugs in your code. It is knowledge I think any professional software developer should be aware of.
Decimal numbers are approximations
While all natural numbers between 0 and 255 can be precisely described using 8 bits, describing all real numbers between 0.0 and 255.0 requires an infinite number of bits. Firstly, there exist infinitely many numbers to describe in that range (even in the range 0.0 - 0.1), and secondly, certain irrational numbers cannot be described numerically at all, for example e and π. In other words, the numbers 2 and 0.2 are represented very differently in the computer.
Integers are represented by bits representing values 2^n, where n is the position of the bit. Thus the value 6 is represented as 2^3 * 0 + 2^2 * 1 + 2^1 * 1 + 2^0 * 0, corresponding to the bit sequence 0110. Fractional parts, on the other hand, are described by bits representing 2^-n, that is, the fractions 1/2, 1/4, 1/8, ... The number 0.75 corresponds to 2^-1 * 1 + 2^-2 * 1 + 2^-3 * 0 + 2^-4 * 0, yielding the bit sequence 1100 (1/2 + 1/4).
Equipped with this knowledge, we can formulate the following rule of thumb: Any decimal number is represented by an approximated value.
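You can observe this directly; the following sketch (mine, anticipating the BigDecimal trick used further down) prints the exact value a double actually stores:
System.out.println( new java.math.BigDecimal(0.75) ); // exact: 1/2 + 1/4
System.out.println( new java.math.BigDecimal(0.1) );  // only an approximation
0.75
0.1000000000000000055511151231257827021181583404541015625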
Let us investigate the practical consequences of this by performing a series of trivial additions.
System.out.println( 0.2 + 0.2 + 0.2 + 0.2 + 0.2 );
1.0
1.0 is printed. While this is indeed correct, it may give us a false sense of security: 0.2 itself is not exactly representable in binary; the rounding errors in this particular sum merely happen to cancel out. Let's challenge Java again with another trivial arithmetic problem, adding the number 0.1 ten times.
System.out.println( 0.1f + 0.1f + 0.1f + 0.1f + 0.1f + 0.1f + 0.1f + 0.1f + 0.1f + 0.1f );
System.out.println( 0.1d + 0.1d + 0.1d + 0.1d + 0.1d + 0.1d + 0.1d + 0.1d + 0.1d + 0.1d );
1.0000001
0.9999999999999999
According to slides from Joseph D. Darcy's blog, the values actually stored for 0.1f and 0.1d are 0.100000001490116119384765625 and 0.1000000000000000055511151231... respectively. These results are correct only to a limited number of digits: floats have a precision of about 8 leading digits, while doubles have about 17 leading digits of precision. Now, if the conceptual mismatch between the expected result 1.0 and the results printed on the screen were not enough to get your alarm bells going, then notice how the numbers from Mr. Darcy's slides do not seem to correspond to the printed numbers! That's another trap. More on this further down.
Having been made aware of miscalculations in the seemingly simplest possible scenarios, it is reasonable to contemplate just how quickly the imprecision may kick in. Let us simplify the problem to adding only three numbers.
System.out.println( 0.3 == 0.1d + 0.1d + 0.1d );
false
Shockingly, the imprecision already kicks in at three additions!
Doubles overflow
As with any other simple type in Java, a double is represented by a finite set of bits. Consequently, adding to or multiplying a double can yield surprising results. Admittedly, numbers have to be pretty big in order to overflow, but it happens. Let's try multiplying and then dividing a big number. Mathematical intuition says that the result is the original number. In Java we may get a different result.
double big = 1.0e307 * 2000 / 2000;
System.out.println( big == 1.0e307 );
false
The problem here is that big is first multiplied, overflowing, and then the overflowed number is divided. Worse, no exceptions or other kinds of warnings are raised to the programmer. Basically, this renders the expression x * y completely unreliable, as no indication or guarantee is made in the general case for all double values represented by x and y.
Large and small are not friends!
Laurel and Hardy often disagreed about a lot of things. Similarly in computing, large and small are not friends. A consequence of using a fixed number of bits to represent numbers is that operating on really large and really small numbers in the same calculation will not work as expected. Let's try adding something small to something large.
System.out.println( 1234.0d + 1.0e-13d == 1234.0d );
true
The addition has no effect! This contradicts any (sane) mathematical intuition of addition, which says that given two positive numbers d and f, d + f > d.
Decimal numbers cannot be directly compared
What we have learned so far is that we must throw away all intuition we have gained in math class and when programming with integers. Use decimal numbers cautiously. For example, the statement for(double d = 0.1; d != 0.3; d += 0.1) is in effect a disguised never-ending loop! The mistake is to compare decimal numbers directly with each other. You should adhere to the following guidelines.
Avoid equality tests between two decimal numbers. Refrain from if(a == b) {..}, use if(Math.abs(a-b) < tolerance) {..} where tolerance could be a constant defined as e.g. public static final double tolerance = 0.01
Consider as an alternative to use the operators <, > as they may more naturally describe what you want to express. For example, I prefer the form
for(double d = 0; d <= 10.0; d+= 0.1) over the more clumsy
for(double d = 0; Math.abs(10.0-d) < tolerance; d+= 0.1)
Both forms have their merits depending on the situation, though: when unit testing, I prefer assertEquals(2.5, d, tolerance) over assertTrue(d > 2.5); not only does the first form read better, it is often the check you want to be doing (i.e. that d is not too large).
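As a sketch of the tolerance guideline above (the helper name approxEquals is mine, not a JDK method):
/** Hypothetical helper: true when a and b differ by less than tolerance. */
static boolean approxEquals(double a, double b, double tolerance) {
    return Math.abs(a - b) < tolerance;
}
Note that an absolute tolerance like this only suits values of known magnitude; for values spanning many orders of magnitude, a relative tolerance is usually the better fit.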
WYSINWYG - What You See Is Not What You Get
WYSIWYG is an expression typically used in graphical user interface applications. It means, "What You See Is What You Get", and is used in computing to describe a system in which content displayed during editing appears very similar to the final output, which might be a printed document, a web page, etc. The phrase was originally a popular catch phrase originated by Flip Wilson's drag persona "Geraldine", who would often say "What you see is what you get" to excuse her quirky behavior (from wikipedia).
Another serious trap programmers often fall into, is thinking that decimal numbers are WYSIWYG. It is imperative to realize, that when printing or writing a decimal number, it is not the approximated value that gets printed/written. Phrased differently, Java is doing a lot of approximations behind the scenes, and persistently tries to shield you from ever knowing it. There is just one problem. You need to know about these approximations, otherwise you may face all sorts of mysterious bugs in your code.
With a bit of ingenuity, however, we can investigate what really goes on behind the scene. By now we know that the number 0.1 is represented with some approximation.
System.out.println( 0.1d );
0.1
We know the stored value is not exactly 0.1, yet 0.1 is printed on the screen. Conclusion: Java is WYSINWYG!
For the sake of variety, let's pick another innocent looking number, say 2.3. Like 0.1, 2.3 is an approximated value. Unsurprisingly when printing the number Java hides the approximation.
System.out.println( 2.3d );
2.3
To investigate what the internal approximated value of 2.3 may be, we can compare the number to other numbers in a close range.
double d1 = 2.2999999999999996d;
double d2 = 2.2999999999999997d;
System.out.println( d1 + " " + (2.3d == d1) );
System.out.println( d2 + " " + (2.3d == d2) );
2.2999999999999994 false
2.3 true
So 2.2999999999999997 is just as much 2.3 as the value 2.3! Also notice that, due to the approximation, the pivoting point is at ..99997 and not ..99995, where you would ordinarily round up in math. Another way to get to grips with the approximated value is to call upon the services of BigDecimal.
System.out.println( new BigDecimal(2.3d) );
2.29999999999999982236431605997495353221893310546875
Now, don't rest on your laurels thinking you can just jump ship and only use BigDecimal. BigDecimal has its own collection of traps documented here.
Nothing is easy, and rarely does anything come for free. And "naturally", floats and doubles yield different results when printed/written.
System.out.println( Float.toString(0.1f) );
System.out.println( Double.toString(0.1f) );
System.out.println( Double.toString(0.1d) );
0.1
0.10000000149011612
0.1
According to the slides from Joseph D. Darcy's blog, a float approximation has 24 significant bits while a double approximation has 53 significant bits. The moral is that in order to preserve values, you must read and write decimal numbers in the same format.
Division by 0
Many developers know from experience that dividing a number by zero yields abrupt termination of their applications. Similar behaviour is found in Java when operating on ints but, quite surprisingly, not when operating on doubles. Any number, with the exception of zero, divided by zero yields ∞ or -∞ respectively. Dividing zero by zero results in the special NaN, the Not a Number value.
System.out.println(22.0 / 0.0);
System.out.println(-13.0 / 0.0);
System.out.println(0.0 / 0.0);
Infinity
-Infinity
NaN
Dividing a positive number by a negative number yields a negative result, while dividing a negative number by a negative number yields a positive result. Since division by zero is possible, you'll get different results depending on whether you divide a number by 0.0 or by -0.0. Yes, it's true! Java has a negative zero! Don't be fooled though; the two zero values are equal, as shown below.
System.out.println(22.0 / 0.0);
System.out.println(22.0 / -0.0);
System.out.println(0.0 == -0.0);
Infinity
-Infinity
true
Infinity is weird
In the world of mathematics, infinity was a concept I found hard to grasp. For example, I never acquired an intuition for when one infinity is infinitely larger than another. Surely R > N, the set of real numbers being infinitely larger than the set of natural numbers, but that was about the limit of my intuition in this regard!
Fortunately, infinity in Java is about as unpredictable as infinity in the mathematical world. You can perform the usual suspects (+, -, *, /) on an infinite value, but you cannot apply an infinity to an infinity.
double infinity = 1.0 / 0.0;
System.out.println(infinity + 1);
System.out.println(infinity / 1e300);
System.out.println(infinity / infinity);
System.out.println(infinity - infinity);
Infinity
Infinity
NaN
NaN
The main problem here is that the NaN value is returned without any warnings. Hence, should you foolishly investigate whether a particular double is even or odd, you can really get into a hairy situation. Maybe a run-time exception would have been more appropriate?
double d = 2.0, d2 = d - 2.0;
System.out.println("even: " + (d % 2 == 0) + " odd: " + (d % 2 == 1));
d = d / d2;
System.out.println("even: " + (d % 2 == 0) + " odd: " + (d % 2 == 1));
even: true odd: false
even: false odd: false
Suddenly, your variable is neither odd nor even!
NaN is even weirder than Infinity
An infinite value is different from the maximum value of a double and NaN is different again from the infinite value.
double nan = 0.0 / 0.0, infinity = 1.0 / 0.0;
System.out.println( Double.MAX_VALUE != infinity );
System.out.println( Double.MAX_VALUE != nan );
System.out.println( infinity != nan );
true
true
true
Generally, when a double has acquired the value NaN, any operation on it results in NaN.
System.out.println( nan + 1.0 );
NaN
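Since NaN is not even equal to itself, the reliable way to detect it is Double.isNaN, as this small sketch shows:
System.out.println( nan == nan );
System.out.println( Double.isNaN(nan) );
false
true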
Conclusions
Decimal numbers are approximations, not the value you assign. Any intuition gained in math-world no longer applies. Expect that a + b may equal a, and that a may not equal a/3 + a/3 + a/3.
Avoid using ==; compare against some tolerance, or use the >= or <= operators.
Java is WYSINWYG! Never believe that the value you print/write is the value that is actually stored; hence, always read/write decimal numbers in the same format.
Be careful not to overflow your double and not to get your double into a state of ±Infinity or NaN. In either case, your calculations may not turn out as you'd expect. You may find it a good idea to always check against those values before returning a value in your methods.
It's called loss of precision and is very noticeable when working with either very big numbers or very small numbers. The binary representation of numbers with a fractional part is in many cases an approximation, not an exact value. To understand why, you need to read up on floating-point representation in binary. Here is a link: http://en.wikipedia.org/wiki/IEEE_754-2008. Here is a quick demonstration:
in bc (An arbitrary precision calculator language) with precision=10:
(1/3+1/12+1/8+1/15) = 0.6083333332
(1/3+1/12+1/8) = 0.541666666666666
(1/3+1/12) = 0.416666666666666
Java double:
0.6083333333333333
0.5416666666666666
0.41666666666666663
Java float:
0.60833335
0.5416667
0.4166667
If you are a bank and are responsible for thousands of transactions every day, even though they are not to and from one and same account (or maybe they are) you have to have reliable numbers. Binary floats are not reliable - not unless you understand how they work and their limitations.
While BigDecimal can store more precision than double, this is usually not required. The real reason it is used is that it makes clear how rounding is performed, including a number of different rounding strategies. You can achieve the same results with double in most cases, but unless you know the techniques required, BigDecimal is the way to go in these cases.
A common example, is money. Even though money is not going to be large enough to need the precision of BigDecimal in 99% of use cases, it is often considered best practice to use BigDecimal because the control of rounding is in the software which avoids the risk that the developer will make a mistake in handling rounding. Even if you are confident you can handle rounding with double I suggest you use helper methods to perform the rounding which you test thoroughly.
This is primarily done for reasons of precision. BigDecimal stores decimal numbers with arbitrary precision. You can take a look at this page that explains it well: http://blogs.oracle.com/CoreJavaTechTips/entry/the_need_for_bigdecimal
When BigDecimal is used, it can store a lot more digits than Double, which makes it more accurate, and just an all-around better choice for the real world.
Although it is a lot slower and more verbose, it's worth it.
Bet you wouldn't want to give your boss inaccurate info, huh?
Another idea: keep track of the number of cents in a long. This is simpler, and avoids the cumbersome syntax and slow performance of BigDecimal.
Precision in financial calculations is extra important because people get very irate when their money disappears due to rounding errors, which is why double is a terrible choice for dealing with money.
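A minimal sketch of the long-cents idea (the class name and tax rate are made up); the design point is that rounding happens at exactly one explicit place, and everything after it is exact integer arithmetic:
public class Cents {
    public static void main(String[] args) {
        long priceCents = 1999; // $19.99
        long taxCents = Math.round(priceCents * 0.0825); // the one explicit rounding step
        long totalCents = priceCents + taxCents;         // exact from here on
        System.out.printf("total: $%d.%02d%n", totalCents / 100, totalCents % 100);
    }
}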

Can every float be expressed exactly as a double?

Can every possible value of a float variable be represented exactly in a double variable?
In other words, for all possible values X will the following be successful:
float f1 = X;
double d = f1;
float f2 = (float)d;
if (f1 == f2)
    System.out.println("Success!");
else
    System.out.println("Failure!");
My suspicion is that there is no exception, or if there is, it is only for an edge case (like +/- infinity or NaN).
Edit: Original wording of question was confusing (stated two ways, one which would be answered "no" the other would be answered "yes" for the same answer). I've reworded it so that it matches the question title.
Yes.
Proof by enumeration of all possible cases:
public class TestDoubleFloat {
    public static void main(String[] args) {
        for (long i = Integer.MIN_VALUE; i <= Integer.MAX_VALUE; i++) {
            float f1 = Float.intBitsToFloat((int) i);
            double d = (double) f1;
            float f2 = (float) d;
            if (f1 != f2) {
                if (Float.isNaN(f1) && Float.isNaN(f2)) {
                    continue; // ok, NaN
                }
                // was fail(...) in the original; throw so the class runs stand-alone
                throw new AssertionError("oops: " + f1 + " != " + f2);
            }
        }
    }
}
finishes in 12 seconds on my machine. 32 bits are small.
In theory, there is no such value, so "yes", every float should be representable as a double. Converting from a float to a double involves re-biasing the 8-bit exponent into the 11-bit exponent field and padding the 23-bit significand with zero bits; the two types use the same IEEE 754 scheme, just with different sized fields.
Yes, floats are a subset of doubles. Both floats and doubles have the form (sign * a * 2^b). The difference between floats and doubles is the number of bits in a & b. Since doubles have more bits available, assigning a float value to a double effectively means inserting extra 0 bits.
As everyone has already said, "no". But that's actually a "yes" to the question itself, i.e. every float can be exactly expressed as a double. Confusing. :)
If I'm reading the language specification correctly (and as everyone else is confirming), there is no such value.
That is, each type claims only to hold IEEE 754 standard values, so casts between the two should incur no change except in memory used.
(Clarification: there would be no change as long as the value is small enough to be held in a float; obviously, if the value has too many bits of precision or too large an exponent to be held in a float to begin with, casting from double to float would result in a loss of precision.)
@KenG: This code:
float a = 0.1F
println "a=${a}"
double d = a
println "d=${d}"
fails not because 0.1f can't be exactly represented. The question was "is there a float value that cannot be represented as a double", which this code doesn't prove. Although 0.1f can't be stored exactly, the value that a is given (which isn't 0.1f exactly) can be stored as a double (which also won't be 0.1f exactly). Assuming an Intel FPU, the bit pattern for a is:
0 01111011 10011001100110011001101
and the bit pattern for d is:
0 01111111011 100110011001100110011010 (followed by lots more zeros)
which has the same sign, exponent (-4 in both cases) and the same fractional part (separated by spaces above). The difference in the output is due to the position of the second non-zero digit in the number (the first is the 1 after the point), which can only be represented with a double. The code that outputs the string format stores intermediate values in memory and is specific to floats and doubles (i.e. there is one double-to-string function and another float-to-string function). If the to-string function were optimised to use the FPU stack to store the intermediate results of the to-string process, the output would be the same for float and double, since the FPU uses the same, larger format (80 bits) for both float and double.
There are no float values that can't be stored identically in a double, i.e. the set of float values is a subset of the set of double values.
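A small sketch (mine) that makes the widened bit pattern and the lossless round trip visible; note that toBinaryString drops leading zero bits:
public class Widening {
    public static void main(String[] args) {
        float f = 0.1f;
        double d = f; // widening conversion: exact per the JLS
        System.out.println(Integer.toBinaryString(Float.floatToIntBits(f)));
        System.out.println(Long.toBinaryString(Double.doubleToLongBits(d)));
        System.out.println(f == (float) d); // true: the round trip is lossless
    }
}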
Snark: NaNs will compare differently after (or indeed before) conversion.
This does not, however, invalidate the answers already given.
I took the code you listed and decided to try it in C++ since I thought it might execute a little faster and it is significantly easier to do unsafe casting. :-D
I found out that for valid numbers, the conversion works and you get the exact bitwise representation back after the cast. However, for non-numbers, e.g. 1.#QNAN0, etc., the result uses a simplified representation of the non-number rather than the exact bits of the source. For example:
**** FAILURE **** 2140188725 | 1.#QNAN0 -- 0xa0000000 0x7ffa1606
I cast an unsigned int to float then to double and back to float. The number 2140188725 (0x7F90B035) results in a NAN and converting to double and back is still a NAN but not the exact same NAN.
Here is the simple C++ code:
#include <cstdio>

typedef unsigned int uint;

int main() {
    for (uint i = 0; i < 0xFFFFFFFF; ++i) { // note: stops one short of the last bit pattern
        float f1 = *(float *) &i;  // unsafe cast: reinterpret the bits as a float
        double d = f1;
        float f2 = (float) d;
        if (f1 != f2)
            printf("**** FAILURE **** %u | %f -- 0x%08x 0x%08x\n",
                   i, f1, *(uint *) &f1, *(uint *) &f2); // print the raw bits, not the floats
        if ((i % 1000000) == 0)
            printf("Iteration: %u\n", i);
    }
    return 0;
}
The answer to the first question is yes; the answer to the "in other words" version, however, is no. If you change the test in the code to if (!(f1 != f2)), the answer to the second question becomes yes: it will print "Success" for all float values.
In theory, every normal single can have its exponent and mantissa padded to create a double, and then the padding can be removed to return to the original single.
When you go from theory to reality is when you will have problems. I don't know if you were interested in theory or implementation. If it is implementation, then you can rapidly get into trouble.
IEEE is a horrible format; my understanding is that it was intentionally designed to be so tough that nobody could meet it, allowing the market to catch up to Intel (this was a while back) and allowing for more competition. If that is true, it failed; either way, we are stuck with this dreadful spec. Something like the TI format is far superior for the real world in so many ways. I have no connection to either company or any of these formats.
Thanks to this spec there are very few, if any, FPUs that actually meet it (in hardware, or even in hardware plus the operating system), and those that do often fail on the next generation. (Google: TestFloat.) The problems these days tend to lie in the int-to-float and float-to-int conversions, not in single-to-double and double-to-single as you have specified above. Of course, what operation is the FPU going to perform to do that conversion? Add 0? Multiply by 1? Depends on the FPU and the compiler.
The problem with IEEE related to your question above is that many numbers (not every number, but many) can be represented in more than one way. If I wanted to break your code, I would start with minus zero, in the hope that one of the two operations would convert it to a plus zero. Then I would try denormals. And it should fail with a signaling NaN, but you called that out as a known exception.
The problem is that equals sign. Here is rule number one about floating point: never use an equals sign. Comparing raw bit patterns is not a value comparison: two encodings of the same value (plus zero and minus zero, for example) differ bitwise even though IEEE == treats them as equal, and NaN compares unequal even to itself. Greater than and less than are done in the FPU; bit-pattern equality is done with the integer ALU.
I realize that you probably used the equals sign to explain the problem, and it is not necessarily the code you wanted to succeed or fail.
If a floating-point type is viewed as representing a precise value, then as other posters have noted, every float value is representable as a double, but only a few values of double can be represented by a float. On the other hand, if one recognizes that floating-point values are approximations, one will realize the real situation is reversed. If one uses a very precise instrument to measure something which is 3.437 mm, one may correctly describe its size as 3.4 mm. But if one uses a ruler to measure the object as 3.4 mm, it would be incorrect to describe its size as 3.400 mm.
Even bigger problems exist at the top of the range. There is a float value that represents: "computed value exceeded 2^127 by an unknown amount", but there's no double value that indicates such a thing. Casting an "infinity" from single to double will yield a value "computed value exceeded 2^1023 by an unknown amount" which is off by a factor of over a googol.
