Need help translating code from C to Java

From this article. Here's the code:
float InvSqrt(float x){ // line 0
    float xhalf = 0.5f * x;
    int i = *(int*)&x; // line 2: store floating-point bits in integer
    i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
    x = *(float*)&i; // line 4: convert new bits into float
    x = x*(1.5f - xhalf*x*x); // one round of Newton's method
    return x;
}
...I can't even tell if that's C or C++. [okay apparently it's C, thanks] Could someone translate it to Java for me, please? It's (only, I hope) lines 2 and 4 that are confusing me.

You want to use these methods:
Float.floatToIntBits
Float.intBitsToFloat
And there may be issues with strictfp, etc.
It's roughly something like this: (CAUTION: this is untested!)
float InvSqrt(float x){ // line 0
    float xhalf = 0.5f * x;
    int i = Float.floatToIntBits(x); // store floating-point bits in integer
    i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
    x = Float.intBitsToFloat(i); // convert new bits into float
    x = x*(1.5f - xhalf*x*x); // one round of Newton's method
    return x;
}

Those lines are used to convert between float and int as bit patterns. Java has static methods in java.lang.Float for that - everything else is identical.
static float InvSqrt(float x) { // line 0
    float xhalf = 0.5f * x;
    int i = Float.floatToIntBits(x); // store floating-point bits in integer
    i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
    x = Float.intBitsToFloat(i); // convert new bits into float
    x = x * (1.5f - xhalf * x * x); // one round of Newton's method
    return x;
}

The code you quote is C, although the comments are C++-style.
What the code is doing involves knowledge of the way floating-point values are stored at the bit level. The "magic number" 0x5f3759d5 has to do with that bit layout: it is chosen so that the subtraction yields a good first approximation of the bits of 1/sqrt(x).
The bits of the floating-point value x are accessed when i is initialized, because the address of x is cast to an int pointer and dereferenced. So i is loaded with the 32 bits of the floating-point representation. On line 4, x is written back with the contents of i, updating the working approximation value.
I have read that this code became popular when John Carmack released it with Id's open source Quake engine. The purpose of the code is to quickly calculate 1/Sqrt(x), which is used in lighting calculations of graphic engines.
I would not have been able to translate this code directly to Java because it uses "type punning" as described above -- when it accesses the float in memory as if it were an int. Java prevents that sort of activity, but as others have pointed out, the Float object provides methods around it.
The purpose of using this strange implementation in C was for it to be very fast. At the time it was written, I imagine a large improvement came from this method. I wonder if the difference is worth it today, when floating point operations have gotten faster.
Using the Java methods to convert the float to integer bits and back may be slower than simply calculating the inverse square root directly with Java's Math.sqrt.
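For reference, the direct version is a one-liner (the method name here is mine):
static float invSqrtDirect(float x) {
    // Plain 1/sqrt(x); on modern JVMs this may well be competitive with the bit hack.
    return 1.0f / (float) Math.sqrt(x);
}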

Ok I'm going out on a limb here, because I know C but I don't know Java.
Literally rewriting this C code in Java is begging for trouble.
Even in C, the code is unportable.
Among other things it relies on:
The size of floating point numbers.
The size of integers.
The internal representation of floating point numbers.
The byte alignment of both floating point numbers and integers.
Right shift (i.e. i >> 1) being implemented as a logical right shift, as opposed to an arithmetic right shift (which would shift in a 1 on integers with the high-order bit set, and thus no longer equate to dividing by 2).
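For what it's worth, Java nails this last point down where C leaves it open: >> is always arithmetic (sign-extending) and >>> is always logical. A tiny illustration (note the hack only ever shifts the bits of positive floats, whose sign bit is clear):
int i = Float.floatToIntBits(-1.0f); // 0xBF800000: sign bit set
System.out.println(i >> 1);          // arithmetic shift: result stays negative
System.out.println(i >>> 1);         // logical shift: a 0 is shifted in, result is positive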
I understand Java compiles to a bytecode rather than directly to machine code. Implementers of bytecode interpreters tune using assumptions based on the spec for the bytecode and an understanding of what is output by the compiler from sensible input source code. Hacks like this don't fall under the umbrella of "sensible input source". There is no reason to expect the interpreter will perform faster with your C hack; in fact, there is a good chance it will be slower.
My advice is: IGNORE the C code.
Look for a gain in efficiency that is Java-centric.
The concept of the C hack is: approximate 1/sqrt(x) by leveraging the knowledge that the internal representation of floating-point numbers already has the exponent broken out of the number; exponent(x)/2 is faster to compute than sqrt(x) if you already have exponent(x). The hack then performs one iteration of Newton's method to reduce the error in the approximation. I presume one iteration reduces the error to something tolerable.
Perhaps the concept warrants investigation in Java, but the details will depend on intimate knowledge of how Java is implemented, not how C is implemented.

The lines you care about are pretty simple. Line 2 takes the bits of float x, which are in the standard floating-point representation (IEEE 754), and stores them in an integer, exactly as they are. This results in a totally different number, since integers and floats are represented differently at the bit level. Line 4 does the opposite, transferring the bits of that int back into the float.
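A quick round trip shows what the two Java methods do (the bit pattern shown is the standard IEEE 754 encoding):
float x = 2.0f;
int bits = Float.floatToIntBits(x);       // the same 32 bits, viewed as an int
System.out.printf("0x%08X%n", bits);      // prints 0x40000000
float back = Float.intBitsToFloat(bits);  // the same 32 bits, viewed as a float again
System.out.println(back);                 // prints 2.0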

Related

Exact conversion from float to int

I want to convert float value to int value or throw an exception if this conversion is not exact.
I've found the following suggestion: use Math.round to convert and then use == to check whether those values are equal. If they're equal, then conversion is exact, otherwise it is not.
But I've found an example which does not work. Here's code demonstrating this example:
String s = "2147483648";
float f = Float.parseFloat(s);
System.out.printf("f=%f\n", f);
int i = Math.round(f);
System.out.printf("i=%d\n", i);
System.out.printf("(f == i)=%s\n", (f == i));
It outputs:
f=2147483648.000000
i=2147483647
(f == i)=true
I understand that 2147483648 does not fit into integer range, but I'm surprised that == returns true for those values. Is there better way to compare float and int? I guess it's possible to convert both values to strings, but that would be extremely slow for such a primitive function.
floats are a rather inexact concept. They are also mostly pointless unless you're running on (at this point) rather old hardware, or interacting with systems and/or protocols that work in floats or have "use a float" hardcoded in their spec. That may be true here, but if it isn't, stop using float and start using double: unless you have a fairly large float[], there is zero memory or performance difference, and floats are just less accurate.
Your algorithm cannot fail when using int vs double - all ints are perfectly representable as double.
Let's first explain your code snippet
The underlying error here is the notion of "silent casting" and how Java took some intentional liberties there.
In computer systems in general, you can only compare like with like. It's easy to put in exact terms of bits and machine code what it means to determine whether a == b is true or false if a and b are of the exact same type. It is not at all clear when a and b are different things. Same thing applies to pretty much any operator; a + b, if both are e.g. an int, is a clear and easily understood operation. But if a is a char and b is, say, a double, that's not clear at all.
Hence, in Java, all binary operators that involve different types are illegal. Fundamentally, there is no bytecode to directly compare a float and a double, for example, or to add a string to an int.
However, there is syntax sugar: When you write a == b where a and b are different types, and java determines that one of two types is 'a subset' of the other, then java will simply silently convert the 'smaller' type to the 'larger' type, so that the operation can then succeed. For example:
int x = 5;
long y = 5;
System.out.println(x == y);
This works - because java realizes that converting an int to a long value is not ever going to fail, so it doesn't bother you with explicitly specifying that you intended the code to do this. In JLS terms, this is called a widening conversion. In contrast, any attempt to convert a 'larger' type to a 'smaller' type isn't legal, you have to explicitly cast:
long x = 5;
int y = x; // does not compile
int y = (int) x; // but this does.
The point is simply this: When you write the reverse of the above (int x = 5; long y = x;), the code is identical, it's just that the compiler silently injects the (long) cast for you, on the basis that no loss will occur. The same thing happens here:
int x = 5;
long y = 10;
long z = x + y;
That compiles because javac adds some syntax sugar for you, specifically, that is compiled as if it says: long z = ((long) x) + y;. The 'type' of the expression x + y there is long.
Here's the key trick: Java considers converting an int to a float, as well as an int or long to a double - a widening conversion.
As in, javac will just assume it can do that safely without any loss and therefore will not require that the programmer explicitly acknowledge it by manually adding the cast. However, int -> float, as well as long -> double, are not actually entirely safe.
floats can represent every integral value between -2^23 and +2^23, and doubles every integral value between -2^52 and +2^52 (source). But int can represent every integral value between -2^31 and +2^31-1, and long every value between -2^63 and +2^63-1. That means at the edges (very large negative/positive numbers), integral values exist that are representable in an int but not in a float, or in a long but not in a double (all ints are representable in a double, fortunately; int -> double conversion is entirely safe). But Java doesn't "acknowledge" this, which means silent widening conversions can nevertheless toss out data (introduce rounding) silently.
That is what happens here: (f == i) is syntax sugared into (f == ((float) i)) and the conversion from int to float introduces the rounding.
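The silent widening is easy to see in action, and reproduces the asker's surprise:
int i = Integer.MAX_VALUE;   // 2147483647, not exactly representable as a float
float f = i;                 // silent widening conversion: rounds to 2.14748365E9 (= 2^31)
System.out.println(f == i);  // true: i is widened (and rounded the same way) before comparing
System.out.println((int) f); // 2147483647: the explicit narrowing cast clamps at Integer.MAX_VALUE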
The solution
Mostly, when you are using doubles and floats and nevertheless wishing for exact numbers, you've already messed up. These concepts fundamentally just aren't exact, and that exactness cannot be bolted back on by attempting to account for error bands, as the errors introduced by the rounding behaviour of float and double cannot be tracked (not easily, at any rate). You should not be using float/double as a consequence. Either find an atomic unit and represent amounts in terms of int/long, or use BigDecimal. (Example: to write bookkeeping software, do not store finance amounts as a double; store them as cents (or satoshis or yen or pennies, whatever the atomic unit is in that currency) in a long, or use BigDecimal if you really know what you are doing.)
I want an answer anyway
If you're absolutely positive that using float (or even double) here is acceptable and you still want exactness, we have a few solutions.
Option 1 is to employ the power of BigDecimal:
new BigDecimal(someDouble).intValueExact()
This works, is 100% reliable (unless float-to-double conversion could knock a non-exact value into an exact one somehow; I don't think that can happen), and throws ArithmeticException when the conversion is not exact. It's also very slow.
An alternative is to employ our knowledge of how the IEEE floating point standard works.
A really simple answer is to run your algorithm as you wrote it, but add an additional check: if the value your int gets is below -2^23 or above +2^23, it probably isn't correct. However, there is still a smattering of numbers below -2^23 and above +2^23 that are perfectly representable in both float and int; it's just no longer every number at that point. If you want an algorithm that accepts those exact numbers as well, it gets much more complicated. My advice is not to delve into that cesspool: if you have a process that ends up with a float anywhere near such extremes, and you want to turn it into an int but only if that is possible without loss, you've arrived at a crazy question and you need to rewire the parts that got you there instead!
If you really need that, instead of trying to numbercrunch the float, I suggest using the BigDecimal().intValueExact() trick if you truly must have this.
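Putting the BigDecimal trick into a helper, as a minimal sketch (the method name is mine):
import java.math.BigDecimal;

static int floatToIntExact(float f) {
    // new BigDecimal(double) captures the binary value exactly (float -> double is lossless);
    // intValueExact() throws ArithmeticException on any fractional part or overflow.
    return new BigDecimal((double) f).intValueExact();
}
For the asker's example, floatToIntExact(Float.parseFloat("2147483648")) throws instead of silently comparing equal to 2147483647.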

Number of blocks to split range of values over 64

I have the following piece of code:
long[] blocks = new long[(someClass.getMemberArray().length - 1) / 64 + 1];
Basically, someClass.getMemberArray() can return an array that could be much larger than 64, and the code tries to determine how many blocks of length 64 are needed for subsequent processing.
I am confused about the logic and how it works. It seems to me that just doing:
long[] blocks = new long[(int) Math.ceil(someClass.getMemberArray().length / 64.0)];
should work too and looks simpler.
Can someone help me understand the -1 and +1 reasoning in the original snippet, how it works, and whether the ceil version would fail in some cases?
As you correctly commented, the -1/+1 is required to get the correct number of blocks, including only partially filled ones. It effectively rounds up.
(But it has something that could be considered a bug: if the array has length 0, which would require 0 blocks, it returns 1. This is because integer division in Java truncates toward zero, i.e. rounds UP for negative numbers, so (0 - 1)/64 yields 0. However, this may be a feature if zero blocks for some reason are not allowed. It definitely requires a comment, though.)
The reasoning for the first, original line is that it only uses integer arithmetic, which should translate into just a few basic and fast machine instructions on most computers.
The second solution involves floating-point arithmetic and casting. Traditionally, floating-point arithmetic was MUCH slower on most processors, which probably was the reasoning for the first solution. However, on modern CPUs with integrated floating-point support, the performance depends more on other things like cache lines and pipelining.
Personally, I don't really like either solution, as it's not really obvious what they do. So I would suggest the following solution:
int arrayLength = someClass.getMemberArray().length;
int blockCount = ceilDiv(arrayLength, 64);
long[] blocks = new long[blockCount];
//...
/**
 * Integer division, rounding up.
 * @return the quotient a/b, rounded up.
 */
static int ceilDiv(int a, int b) {
    assert b > 0 : b; // only valid for a positive divisor
    int quotient = a / b;
    // Java's / truncates toward zero, so round up only when the
    // division was inexact and the true quotient is positive.
    if (a % b != 0 && a > 0) {
        quotient++;
    }
    return quotient;
}
It's wordy, but at least it's clear what should happen, and it provides a general solution which works for all integers (given a positive divisor; the a > 0 check keeps negative dividends from being rounded the wrong way). Unfortunately, most languages don't provide an elegant "round up integer division" solution.
I don't see why the -1 alone would be necessary, but the +1 is there to rectify the case where the result of the division gets rounded down to the nearest whole number (which is every case except those where the length is an exact multiple of 64); the -1 is what stops those exact multiples from being overcounted by the +1.
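If in doubt, the equivalence of the two formulations for positive lengths is cheap to check exhaustively over a reasonable range:
for (int n = 1; n <= 1_000_000; n++) {
    int a = (n - 1) / 64 + 1;          // integer-only version
    int b = (int) Math.ceil(n / 64.0); // floating-point version
    if (a != b) throw new AssertionError("disagree at n = " + n);
}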

Why is Math.pow not natively able to deal with ints? (floor/ceil, too)

I know that in Java (and probably other languages), Math.pow is defined on doubles and returns a double. I'm wondering why on earth the folks who wrote Java didn't also write an int-returning pow(int, int) method, which seems to this mathematician-turned-novice-programmer like a forehead-slapping (though obviously easily fixable) omission. I can't help but think that there's some behind-the-scenes reason based on the intricacies of CS that I just don't know, because otherwise... huh?
On a similar topic, ceil and floor by definition return integers, so how come they don't return ints?
Thanks to all for helping me understand this. It's totally minor, but has been bugging me for years.
java.lang.Math is just a port of what the C math library does.
For C, I think it comes down to the fact that CPUs have special instructions to do Math.pow for floating-point numbers (but not for integers).
Of course, the language could still add an int implementation. BigInteger has one, in fact. It makes sense there, too, because pow tends to result in rather big numbers.
ceil and floor by definition return integers, so how come they don't return ints
Floating point numbers can represent integers outside of the range of int. So if you take a double argument that is too big to fit into an int, there is no good way for floor to deal with it.
From a mathematical perspective, you're going to overflow your int if the result is larger than 2^31 - 1, and overflow your long if it's larger than 2^63 - 1. It doesn't take much to overflow them, either.
Doubles are nice in that they can represent numbers from ~10^-308 to ~10^308 with 53 bits of precision. There may be some fringe conversion issues (such as the next full integer in a double not being exactly representable), but by and large you're going to get a much larger range of numbers than you would if you strictly dealt with integers or longs.
On a similar topic, ceil and floor by definition return integers, so how come they don't return ints?
For the same reason outlined above - overflow. If I have an integral value that's larger than what I can represent in a long, I'd have to use something that could represent it. A similar thing occurs when I have an integral value that's smaller than what I can represent in a long.
Optimal implementations of integer pow() and floating-point pow() are very different. And C's math library was probably developed around the time when floating-point coprocessors were a consideration. The optimal floating-point implementation is to shift the number closer to 1 (to force quicker convergence of the power series) and then shift the result back. For integer powers, a more accurate result can be had in O(log(p)) time by doing something like this:
// p is a positive integer power, n is the number to raise to the power p
static int ipow(int n, int p) {
    int result = 1;
    while (p != 0) {
        if ((p & 1) != 0) { // low bit of p set: fold this power of n into the result
            result *= n;
        }
        n = n * n; // square the base for the next bit of the exponent
        p = p >> 1;
    }
    return result;
}
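For example, ipow(3, 5) returns 243 after just three loop iterations. Bear in mind the result silently wraps around on int overflow, like any other int multiplication.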
Because all ints can be upcast to a double without loss, and the pow function on a double is no less efficient than that on an int.
The reason lies behind the implementation of Math.pow() (JNI of default implementation). The CPU has an exponentiation module which works with doubles as input and output. Why should Java convert that for you when you have much better control over this yourself?
For floor and ceil the reasons are the same, but note that:
(int) Math.floor(d) == (int) d; // d > 0
(int) Math.ceil(d) == -(int)(-d); // d < 0
These hold for most cases (no guarantee at or beyond Integer.MAX_VALUE or Integer.MIN_VALUE).
Java leaves you with
(int) Math.pow(a,b)
because the result of Math.pow may even be NaN or Infinity depending on input.
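Concretely:
System.out.println(Math.pow(-1, 0.5));       // NaN: negative base, non-integer exponent
System.out.println(Math.pow(2, 10000));      // Infinity: exceeds double's range
System.out.println((int) Math.pow(-1, 0.5)); // 0: casting NaN to int yields 0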

Best way to avoid number polarity reversal if multiplying two big numbers in Java

My question is related to this
How can I check if multiplying two numbers in Java will cause an overflow?
In my application, x and y are calculated on the fly and somewhere in my formula I have to multiply x and y.
int x=64371;
int y=64635;
System.out.println((x*y));
I get the wrong output: -134347711
I can quickly fix the above by changing the variables x and y from type int to long and get the correct answer for this case. But there is no guarantee that x and y won't grow beyond the max capacity of long as well.
Question
Why do I get a negative number here, even though I am not storing the final result in any variable? (for curiosity's sake)
Since I won't know the values of x and y in advance, is there any quicker way to avoid this overflow? Maybe by dividing all x and y by a certain large constant for the entire run of the application, or should I take the log of x and y before multiplying them? (actual question)
EDIT:
Clarification
The application runs on a big data set, which takes hours to complete. It would be nicer to have a solution which is not too slow.
The final result is used only for comparison (it just needs to be somewhat proportional to the original result), and it is acceptable to have a +-5% error in the final value if that gives a huge performance gain.
If you know that the numbers are likely to be large, use BigInteger instead. This is guaranteed not to overflow, and then you can either check whether the result is too large to fit into an int or long, or you can just use the BigInteger value directly.
BigInteger is an arbitrary-precision class, so it's going to be slower than using a direct primitive value (which can probably be stored in a processor register), so figure out whether you're realistically going to be overflowing a long (an int times an int will always fit in a long), and choose BigInteger if your domain really requires it.
You get a negative number because of an integer overflow: using two's complement representation, Java interprets any integer with the most significant bit set to 1 as negative.
There are very clever methods involving bit manipulation for detecting situations when an addition or subtraction would result in an overflow or an underflow. If you do not know how big your results are going to be, it is best to switch to BigInteger. Your code would look very different, though, because Java lacks operator overloading facilities that would make mathematical operations on BigInteger objects look familiar. The code will be somewhat slower, too. However, you will be guaranteed against overflows and underflows.
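A sketch of what the overflowing snippet looks like with BigInteger (longValueExact requires Java 8+):
import java.math.BigInteger;

BigInteger product = BigInteger.valueOf(64371).multiply(BigInteger.valueOf(64635));
System.out.println(product);            // 4160619585: no overflow possible
long asLong = product.longValueExact(); // throws ArithmeticException if it didn't fit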
EDIT :
it is acceptable to have +-5% error in the final value if that gives huge performance gain.
+-5% error is a huge allowance for error! If this is indeed acceptable in your system, then using double or even float could work. These types are imprecise, but their range is enormously larger than that of an int, and they do not overflow so easily. You have to be extremely careful, though, because floating-point data types are inherently inexact. You need to always keep in mind the way the data is represented to avoid common precision problems.
Why do I get a negative number here, even though I am not storing the final result in any variable? (for curiosity's sake)
x and y are int types. When you multiply them, the result is put into a piece of memory temporarily, and its type is determined by the types of the operands: int * int will always yield an int, even if it overflows. If you cast one of them to a long, then the multiplication is done as a long, and you will not get an overflow.
Since I won't know the values of x and y, is there any quicker way to avoid this overflow? Maybe by dividing all x and y by a certain large constant for the entire run of the application, or should I take the log of x and y before multiplying them? (actual question)
If x and y are positive then you can check
if (x*y < 0) {
    // overflow
} else {
    // do something with x*y
}
Unfortunately this is not fool-proof; you may overrun right into positive numbers again. For example: System.out.println(Integer.MAX_VALUE * 3); will output 2147483645.
However, this technique will always work for adding 2 integers.
As others have said, BigInteger is sure not to overflow.
The negative value is just (64371 * 64635) - 2^32. Java does not perform a widening primitive conversion at run time here.
Multiplication of ints always results in an int, even if it's not stored in a variable. Your product is 4160619585, which requires an unsigned 32-bit integer (which Java does not have) or a larger word size (or BigInteger, as someone seems to have mentioned already).
You could add logs instead, but the moment you try to exp the result, you would get a number that won't round correctly into a signed 32-bit integer.
Since both multiplicands are int, doing the multiplication using long via casting would avoid an overflow in your specific case:
System.out.println(x * (long) y);
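On Java 8 and later there is also a checked multiply in the standard library, if throwing on overflow is acceptable:
long wide = Math.multiplyExact((long) x, (long) y); // widen first: an int times an int always fits in a long
int narrow = Math.multiplyExact(x, y);              // throws ArithmeticException if x * y overflows int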
You don't want to use logarithms because transcendental functions are slow and floating point arithmetic is imprecise - the result is likely to not be equal to the correct integer answer.

Can every float be expressed exactly as a double?

Can every possible value of a float variable be represented exactly in a double variable?
In other words, for all possible values X will the following be successful:
float f1 = X;
double d = f1;
float f2 = (float)d;
if(f1 == f2)
System.out.println("Success!");
else
System.out.println("Failure!");
My suspicion is that there is no exception, or if there is it is only for an edge case (like +/- infinity or NaN).
Edit: Original wording of question was confusing (stated two ways, one which would be answered "no" the other would be answered "yes" for the same answer). I've reworded it so that it matches the question title.
Yes.
Proof by enumeration of all possible cases:
public class TestDoubleFloat {
    public static void main(String[] args) {
        for (long i = Integer.MIN_VALUE; i <= Integer.MAX_VALUE; i++) {
            float f1 = Float.intBitsToFloat((int) i);
            double d = (double) f1;
            float f2 = (float) d;
            if (f1 != f2) {
                if (Float.isNaN(f1) && Float.isNaN(f2)) {
                    continue; // ok: NaN never compares equal, even to itself
                }
                throw new AssertionError("oops: " + f1 + " != " + f2);
            }
        }
    }
}
finishes in 12 seconds on my machine. 32 bits are small.
In theory, there is no such value, so "yes", every float should be representable as a double. Converting from a float to a double pads the significand with zero bits and re-biases the exponent into the wider exponent field; both are IEEE 754 binary formats, just with different-sized fields, so no information is lost.
Yes, floats are a subset of doubles. Both floats and doubles have the form (sign * a * 2^b). The difference between floats and doubles is the number of bits in a & b. Since doubles have more bits available, assigning a float value to a double effectively means inserting extra 0 bits.
As everyone has already said, "no". But that's actually a "yes" to the question itself, i.e. every float can be exactly expressed as a double. Confusing. :)
If I'm reading the language specification correctly (and as everyone else is confirming), there is no such value.
That is, each is specified to hold only IEEE 754 standard values, so casts between the two should incur no change except in the memory used.
(clarification: There would be no change as long as the value was small enough to be held in a float; obviously if the value was too many bits to be held in a float to begin with, casting from double to float would result in a loss of precision.)
@KenG: This code:
float a = 0.1F
println "a=${a}"
double d = a
println "d=${d}"
fails not because 0.1f can't be exactly represented. The question was "is there a float value that cannot be represented as a double", which this code doesn't prove. Although 0.1f can't be stored exactly, the value that a is given (which isn't 0.1f exactly) can be stored as a double (which also won't be 0.1f exactly). Assuming an Intel FPU, the bit pattern for a is:
0 01111011 10011001100110011001101
and the bit pattern for d is:
0 01111111011 100110011001100110011010 (followed by lots more zeros)
which has the same sign, exponent (-4 in both cases) and the same fractional part (separated by spaces above). The difference in the output is due to the position of the second non-zero digit in the number (the first is the 1 after the point), which can only be represented with a double. The code that outputs the string format stores intermediate values in memory and is specific to floats and doubles (i.e. there is one double-to-string function and another float-to-string function). If the to-string function were optimised to use the FPU stack to store the intermediate results, the output would be the same for float and double, since the FPU uses the same, larger format (80 bits) for both float and double.
There are no float values that can't be stored identically in a double, i.e. the set of float values is a subset of the set of double values.
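The 0.1f example is easy to reproduce directly in Java:
float f = 0.1f;                     // actually 0.100000001490116119384765625
double d = f;                       // widening is exact: d holds that same value
System.out.println(f == (float) d); // true: the round trip is lossless
System.out.println(d == 0.1d);      // false: 0.1d is a different (closer) approximation of 0.1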
Snark: NaNs will compare differently after (or indeed before) conversion.
This does not, however, invalidate the answers already given.
I took the code you listed and decided to try it in C++ since I thought it might execute a little faster and it is significantly easier to do unsafe casting. :-D
I found out that for valid numbers, the conversion works and you get the exact bitwise representation after the cast. However, for non-numbers, e.g. 1.#QNAN0, etc., the result will use a simplified representation of the non-number rather than the exact bits of the source. For example:
**** FAILURE **** 2140188725 | 1.#QNAN0 -- 0xa0000000 0x7ffa1606
I cast an unsigned int to float, then to double, and back to float. The number 2140188725 (0x7F90B035) results in a NaN, and converting to double and back is still a NaN, but not the exact same NaN.
Here is the simple C++ code:
typedef unsigned int uint;
for (uint i = 0; i < 0xFFFFFFFF; ++i) // note: stops one short of 0xFFFFFFFF (a NaN pattern anyway)
{
    float f1 = *(float *)&i; // deliberate unsafe type punning
    double d = f1;
    float f2 = (float)d;
    if (f1 != f2)
        printf("**** FAILURE **** %u | %f -- 0x%08x 0x%08x\n",
               i, f1, *(uint *)&f1, *(uint *)&f2); // pass the raw bits to %x, not the floats
    if ((i % 1000000) == 0)
        printf("Iteration: %u\n", i);
}
The answer to the first question is yes, the answer to the 'in other words', however is no. If you change the test in the code to be if (!(f1 != f2)) the answer to the second question becomes yes -- it will print 'Success' for all float values.
In theory every normal single can have the exponent and mantissa padded to create a double and then remove the padding and you return to the original single.
When you go from theory to reality is when you will have problems. I don't know if you were interested in theory or implementation. If it is implementation, then you can rapidly get into trouble.
IEEE is a horrible format; my understanding is that it was intentionally designed to be so tough that nobody could meet it, allowing the market to catch up to Intel (this was a while back) and allowing for more competition. If that is true it failed; either way, we are stuck with this dreadful spec. Something like the TI format is far superior for the real world in so many ways. I have no connection to either company or any of these formats.
Thanks to this spec there are very few if any FPUs that actually meet it (in hardware, or even in hardware plus the operating system), and those that do often fail on the next generation (google: TestFloat). The problems these days tend to lie in the int-to-float and float-to-int conversions, not the single-to-double and double-to-single conversions you have specified above. Of course, what operation is the FPU going to perform to do that conversion? Add 0? Multiply by 1? Depends on the FPU and the compiler.
The problem with IEEE related to your question above is that there is more than one way some numbers can be represented - not every number, but many of them. If I wanted to break your code I would start with minus zero, in the hope that one of the two operations would convert it to a plus zero. Then I would try denormals. And it should fail with a signaling NaN, but you called that out as a known exception.
The problem is that equals sign. Here is rule number one about floating point: be very wary of the equality operator. In IEEE arithmetic, == is a value comparison with its own traps: plus zero and minus zero compare equal despite having different bit patterns, and a NaN never compares equal to anything, including itself. A raw bit comparison (Float.floatToIntBits in Java) disagrees with == in exactly those cases, so two "equal" numbers can have different bits and two identical NaN bit patterns can compare unequal.
I realize that you probably used the equal to explain the problem and not necessarily the code you wanted to succeed or fail.
If a floating-point type is viewed as representing a precise value then, as other posters have noted, every float value is representable as a double, but only a few values of double can be represented by a float. On the other hand, if one recognizes that floating-point values are approximations, one will realize the real situation is reversed. If one uses a very precise instrument to measure something which is 3.437mm, one may correctly describe its size as 3.4mm. If one uses a ruler to measure the object as 3.4mm, it would be incorrect to describe its size as 3.400mm.
Even bigger problems exist at the top of the range. There is a float value that represents: "computed value exceeded 2^127 by an unknown amount", but there's no double value that indicates such a thing. Casting an "infinity" from single to double will yield a value "computed value exceeded 2^1023 by an unknown amount" which is off by a factor of over a googol.
