This is the snippet of code in Java:
int i = 1234567890;
float f = i;
System.out.println(i - (int)f);
Why is the output not equal to 0? It performs widening, so it is not supposed to lose data. Then you just truncate the value.
Best regards
See the difference
int i = 1234567890;
float f = i;
System.out.println(i - f);
System.out.println((int)f);
System.out.println(f);
System.out.println(i-(int)f);
Output:
0.0
1234567936
1.23456794E9
-46
Your mistake is here:
It performs widening, so it is not supposed to lose data.
This statement is wrong. Widening does not mean that you do not lose data.
From the Java specification:
Widening primitive conversions do not lose information about the overall magnitude of a numeric value.
Conversion of an int or a long value to float, or of a long value to double, may result in loss of precision - that is, the result may lose some of the least significant bits of the value. In this case, the resulting floating-point value will be a correctly rounded version of the integer value, using IEEE 754 round-to-nearest mode (§4.2.4).
Emphasis mine.
The specification clearly states that the magnitude is not lost, but precision can be lost.
The word widening refers not to the precision of a data type, but to its range. Floats are wider than ints because they have a larger range.
int
4 bytes, signed (two's complement). -2,147,483,648 to 2,147,483,647.
float
4 bytes, IEEE 754. Covers a range from 1.40129846432481707e-45 to 3.40282346638528860e+38 (positive or negative).
As you can see, float has a larger range. However some integers cannot be represented exactly as floats. This representation error is what causes your result to differ from 0. The actual value stored in f is 1234567936.
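To see where 1234567936 comes from, here is a minimal sketch that asks Java for the spacing between neighbouring floats at that magnitude. The class and method names (`FloatGapDemo`, `gapAt`) are made up for illustration; `Math.ulp` is the standard library call.

```java
class FloatGapDemo {
    // Distance between the float nearest to 'value' and the next larger float.
    static float gapAt(int value) {
        return Math.ulp((float) value);
    }

    public static void main(String[] args) {
        int i = 1234567890;
        float f = i;                    // implicit widening, rounds to the nearest float
        System.out.println(gapAt(i));   // spacing of representable floats near i
        System.out.println((long) f);   // the integer value actually stored in f
        System.out.println(i - (int) f);
    }
}
```

Near 1.2 billion the representable floats are 128 apart, so i is rounded to the nearest multiple of 128, which is 1234567936.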
You'd be shocked to find out that even this can be true:
(f==(f+1))==true
given f is a float large enough...
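A small sketch of that claim, assuming nothing beyond plain float arithmetic (the names `FloatPlusOne` and `absorbsOne` are hypothetical). The smallest positive float for which it holds is 2^24 = 16777216, because from there on neighbouring floats are 2 apart.

```java
class FloatPlusOne {
    // True when adding 1 does not change the float, because f + 1 rounds back to f.
    static boolean absorbsOne(float f) {
        return f == f + 1;
    }

    public static void main(String[] args) {
        System.out.println(absorbsOne(16_777_215f)); // just below 2^24
        System.out.println(absorbsOne(16_777_216f)); // exactly 2^24
        System.out.println(absorbsOne(1.0e9f));      // far above 2^24
    }
}
```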
int i = 1234567890;
float f = i;
System.out.println((int)f);
System.out.println(i);
System.out.println(i - (int)f);
output is:
1234567936
1234567890
-46
See the difference.
Compared with int, float covers a larger range of values, but it cannot represent every int exactly.
Widening can lose precision. An int has 32 bits of precision, whereas a float has only 24 bits of precision (a 23-bit stored mantissa plus an implied top bit). Java considers float to be wider than even the 64-bit long, which is a bit mad IMHO.
With an 8-bit difference in width (one bit of the int being its sign), floats near Integer.MAX_VALUE are spaced 128 apart, so the error can be up to about +/-64 as the int value is rounded to the nearest float representation.
int err = 0;
for (int i = Integer.MAX_VALUE; i > Integer.MAX_VALUE / 2; i--) {
int err2 = (int) (float) i - i;
if (Math.abs(err2) > err) {
System.out.println(i + ": " + err2);
err = Math.abs(err2);
}
}
prints
... deleted ...
2147483584: 63
2147483456: -64
Related
How does the output turn out to be '1'?
long number = 499_999_999_000_000_001L;
double converted = (double) number;
System.out.println(number - (long) converted);
TL;DR: It's because of limited precision, not overflow.
A double has a 53-bit significand, so it can represent every integer exactly only up to 2^53 ≅ 9 * 10^15. Your value (about 4.99999999 * 10^17) is far below Double.MAX_VALUE, so it is within the range of double, but it is larger than 2^53, so the conversion rounds it to the nearest representable double. For better understanding run this code.
public class Main
{
public static void main(String[] args) {
long longNumber = 499_999_999_000_000_001L;
double doubleNumber = (double) longNumber;
long longConverted = (long)doubleNumber;
System.out.println(longNumber+" "+doubleNumber+" "+longConverted);
}
}
Its output will be:
499999999000000001 4.99999999E17 499999999000000000
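As a rough check that the limit here is precision rather than range, the following sketch (class and method names are illustrative only) compares two neighbouring long values around 2^53 and prints the spacing of doubles at the magnitude of the value in question.

```java
class DoublePrecisionLimit {
    // True when two different long values become the same double.
    static boolean collide(long a, long b) {
        return (double) a == (double) b;
    }

    public static void main(String[] args) {
        long limit = 1L << 53;                                // 9007199254740992
        System.out.println(collide(limit, limit + 1));        // first long that is not exact
        System.out.println(Math.ulp(4.99999999E17));          // gap between doubles near the value
        System.out.println(4.99999999E17 < Double.MAX_VALUE); // still well inside the range
    }
}
```

At this magnitude neighbouring doubles are 64 apart, and 499999999000000000 happens to be a multiple of 64, so the trailing 1 is rounded away.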
All you need to do is a little "debugging".
Try running this code...
long number = 499_999_999_000_000_001L;
System.out.println(number);
double converted = (double) number;
System.out.println(converted);
System.out.println((long) converted);
System.out.println(number - (long) converted);
This is what it displays...
499999999000000001
4.99999999E17
499999999000000000
1
Do you want to know why the conversion back to long from double drops the 1?
If you do then I refer you to the java specifications.
EDIT
To be precise, refer to section Widening Primitive Conversion
A widening primitive conversion from ... long to double, may result in loss of precision - that is, the result may lose some of the least significant bits of the value. In this case, the resulting floating-point value will be a correctly rounded version of the integer value, using IEEE 754 round-to-nearest mode (§4.2.4).
if you call the following method of Java
void processIt(long a) {
float b = a; /*do I have loss here*/
}
do I have information loss when I assign the long variable to the float variable?
The Java Language Specification says that the float type is a supertype of long.
Do I have information loss when I assign the long variable to the float variable?
Potentially, yes. That should be fairly clear from the fact that long has 64 bits of information, whereas float has only 32.
More specifically, as float values get bigger, the gap between successive values becomes more than 1 - whereas with long, the gap between successive values is always 1.
As an example:
long x = 100000000L;
float f1 = (float) x;
float f2 = (float) (x + 1);
System.out.println(f1 == f2); // true
In other words, two different long values have the same nearest representation in float.
This isn't just true of float though - it can happen with double too. In that case the numbers have to be bigger (as double has more precision) but it's still potentially lossy.
Again, it's reasonably easy to see that it has to be lossy - even though both long and double are represented in 64 bits, there are obviously double values which can't be represented as long values (trivially, 0.5 is one such) which means there must be some long values which aren't exactly representable as double values.
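If you need to know whether a particular long survives the conversion, one possible check is to compare exact values via BigDecimal. This is only a sketch; the class and helper names are invented for illustration. It relies on `new BigDecimal(double)` giving the exact binary value of its argument. A simple round trip such as `(long) (double) v == v` is avoided here because the cast back to long saturates at Long.MAX_VALUE and would report a false match for that value.

```java
import java.math.BigDecimal;

class ExactnessCheck {
    // True when v is represented exactly after widening to double.
    static boolean isExactAsDouble(long v) {
        return new BigDecimal((double) v).compareTo(BigDecimal.valueOf(v)) == 0;
    }

    // True when v is represented exactly after widening to float.
    static boolean isExactAsFloat(long v) {
        return new BigDecimal((float) v).compareTo(BigDecimal.valueOf(v)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(isExactAsFloat(100_000_000L));      // a multiple of the float gap
        System.out.println(isExactAsFloat(100_000_001L));      // falls between two floats
        System.out.println(isExactAsDouble(100_000_001L));     // small enough for double
        System.out.println(isExactAsDouble((1L << 53) + 1));   // too many significant bits
        System.out.println(isExactAsDouble(Long.MAX_VALUE));   // rounds up to 2^63
    }
}
```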
Yes, this is possible: if only for the reason that float has too few (typically 6-7) significant digits to deal with all possible numbers that long can represent (19 significant digits). This is in part due to the fact that float has only 32 bits of storage, and long has 64 (the other part is float's storage format † ). As per the JLS:
A widening conversion of an int or a long value to float, or of a long value to double, may result in loss of precision - that is, the result may lose some of the least significant bits of the value. In this case, the resulting floating-point value will be a correctly rounded version of the integer value, using IEEE 754 round-to-nearest mode (§4.2.4).
By example:
long i = 1000000001; // 10 significant digits
float f = i;
System.out.printf(" %d %n %.1f", i, f);
This prints (with the difference highlighted):
1000000001
1000000000.0
~ ← lost the number 1
It is worth noting this is also the case with int to float and long to double (as per that quote). In fact, of the conversions from int or long to a floating-point type, the only one that never loses precision is int to double.
~~~~~~
† I say in part as this is also true for int widening to float which can also lose precision, despite both int and float having 32-bits. The same sample above but with int i has the same result as printed. This is unsurprising once you consider the way that float is structured; it uses some of the 32-bits to store the mantissa, or significand, so cannot represent all integer numbers in the same range as that of int.
Yes you will, for example...
public static void main(String[] args) {
long g = 2;
g <<= 48;
g++;
System.out.println(g);
float f = (float) g;
System.out.println(f);
long a = (long) f;
System.out.println(a);
}
... prints...
562949953421313
5.6294995E14
562949953421312
In Java, we can convert an int to float implicitly, which may result in loss of precision as shown in the example code below.
public class Test {
public static void main(String [] args) {
int intVal = 2147483647;
System.out.println("integer value is " + intVal);
double doubleVal = intVal;
System.out.println("double value is " + doubleVal);
float floatVal = intVal;
System.out.println("float value is " + floatVal);
}
}
The output is
integer value is 2147483647
double value is 2.147483647E9
float value is 2.14748365E9
What is the reason behind allowing implicit conversion of int to float, when there is a loss of precision?
You are probably wondering:
Why is this an implicit conversion when there is a loss of information? Shouldn't this be an explicit conversion?
And you of course have a good point. But the language designers decided that if the target type has a range large enough then an implicit conversion is allowed, even though there may be a loss of precision. Note that it is the range that is important, not the precision. A float has a greater range than an int, so it is an implicit conversion.
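To make the range-versus-precision distinction concrete, here is a short sketch (the names `RangeVersusPrecision` and `storedValue` are hypothetical). The assignment compiles without a cast because every int fits within the range of float, yet the stored value can differ from the original.

```java
class RangeVersusPrecision {
    // The integer value that a float actually holds after implicit widening.
    static long storedValue(int i) {
        float f = i;       // allowed implicitly: float's range covers every int
        return (long) f;
    }

    public static void main(String[] args) {
        System.out.println(Float.MAX_VALUE > Integer.MAX_VALUE); // range is not the issue
        System.out.println(storedValue(16_777_216));             // 2^24 is still exact
        System.out.println(storedValue(123_456_789));            // rounded to a multiple of 8
        // int back = 1.0f;  // would not compile: narrowing needs an explicit cast
    }
}
```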
The Java specification says the following:
A widening conversion of an int or a long value to float, or of a long value to double, may result in loss of precision - that is, the result may lose some of the least significant bits of the value. In this case, the resulting floating-point value will be a correctly rounded version of the integer value, using IEEE 754 round-to-nearest mode.
Converting an integer type to a floating point type that uses the same number of bits may result in a loss of precision, but will be done automatically.
"Loss of precision" means that some of the less significant digits may change, but the most important digits and the size of the number will remain. Recall that float has only about seven decimal digits of precision. For example, converting the int 123456789 to a float gives 123456792.0, which shows a loss of precision.
I came across a strange corner of Java.(It seems strange to me)
double dd = 3.5;
float ff = 3.5f;
System.out.println(dd==ff);
o/p: true
double dd = 3.2;
float ff = 3.2f;
System.out.println(dd==ff);
o/p: false
I observed that if we compare two values (a float and a double, as in the example) whose fractional part is .5 or .0, such as 3.5, 234.5 or 645.0,
the output is true, i.e. the two values are equal. Otherwise the output is false, even though the values look equal.
Even I tried to make method strictfp but no luck.
Am I missing out on something.
Take a look at What every computer scientist should know about floating point numbers.
Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation....
--- Edit to show what the above quote means ---
You shouldn't ever compare floats or doubles for equality, because you can't really guarantee that the number you assign to the float or double is exact.
So
float x = 3.2f;
doesn't result in a float with a value of 3.2. It results in a float with a value of 3.2 plus or minus some very small error. Say 3.19999999997f. Now it should be obvious why the comparison won't work.
To compare floats for equality sanely, you need to check if the value is "close enough" to the same value, like so
float error = 0.000001f * Math.abs(second); // tolerance relative to the magnitude of second
if ((first >= second - error) && (first <= second + error)) {
// close enough that we'll consider the two equal
...
}
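A self-contained variant of the same idea might look like the sketch below. The helper name `nearlyEqual` and the tolerance of 1e-6 are assumptions chosen for illustration, not a standard API, and a suitable tolerance depends on the application.

```java
class NearlyEqual {
    // Relative comparison: the allowed difference scales with the larger magnitude.
    static boolean nearlyEqual(double a, double b, double relativeTolerance) {
        if (a == b) {
            return true; // covers exact matches, including equal infinities
        }
        double scale = Math.max(Math.abs(a), Math.abs(b));
        return Math.abs(a - b) <= relativeTolerance * scale;
    }

    public static void main(String[] args) {
        double dd = 3.2;
        float ff = 3.2f;
        System.out.println(dd == ff);                   // exact comparison
        System.out.println(nearlyEqual(dd, ff, 1e-6));  // tolerant comparison
        System.out.println(nearlyEqual(3.2, 3.3, 1e-6));
    }
}
```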
The difference is that 3.5 can be represented exactly in both float and double - whereas 3.2 can't be represented exactly in either type... and the two closest approximations are different.
Imagine we had two fixed-precision decimal types, one of which stored 4 significant digits and one of which stored 8 significant digits, and we asked each of them to store the number closest to "a third" (however we might do that). Then one would have the value 0.3333 and one would have the value 0.33333333.
An equality comparison between float and double first converts the float to a double and then compares the two - which would be equivalent to converting 0.3333 in our "small decimal" type to 0.33330000. It would then compare 0.33330000 and 0.33333333 for equality, and give a result of false.
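The same effect can be observed directly in Java. This sketch (class and method names are illustrative) prints the exact stored values using `new BigDecimal(double)` and compares the two numbers once as doubles and once as floats.

```java
import java.math.BigDecimal;

class FloatDoubleEquality {
    // What == does: the float is widened to double, then the doubles are compared.
    static boolean equalAsDouble(double d, float f) {
        return d == f;
    }

    // Comparing at float precision instead: the double is narrowed first.
    static boolean equalAsFloat(double d, float f) {
        return (float) d == f;
    }

    public static void main(String[] args) {
        double dd = 3.2;
        float ff = 3.2f;
        System.out.println(new BigDecimal(ff)); // exact value held by the float
        System.out.println(new BigDecimal(dd)); // exact value held by the double
        System.out.println(equalAsDouble(dd, ff));
        System.out.println(equalAsFloat(dd, ff));
        System.out.println(equalAsDouble(3.5, 3.5f));
    }
}
```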
Floating point is a binary format, and it represents numbers as a sum of powers of 2, e.g. 3.5 is 2 + 1 + 1/2.
float 3.2f as an approximation of 3.2 is
2 + 1 + 1/8+ 1/16+ 1/128+ 1/256+ 1/2048+ 1/4096+ 1/32768+ 1/65536+ 1/524288+ 1/1048576+ 1/4194304 + a small error
However double 3.2d as an approximation of 3.2 is
2 + 1 + 1/8+ 1/16+ 1/128+ 1/256+ 1/2048+ 1/4096+ 1/32768+ 1/65536+ 1/524288+ 1/1048576+ 1/8388608+ 1/16777216+ 1/134217728+ 1/268435456+ 1/2147483648+ 1/4294967296+ 1/34359738368+ 1/68719476736+ 1/549755813888+ 1/1099511627776+ 1/8796093022208+ 1/17592186044416+ 1/140737488355328+ 1/281474976710656+ 1/1125899906842624 + a smaller error
When you use floating point, you need to use appropriate rounding. If you use BigDecimal instead (and many people do) it has rounding built in.
double dd = 3.2;
float ff = 3.2f;
// compare the difference with the accuracy of float.
System.out.println(Math.abs(dd - ff) < 1e-7 * Math.abs(ff));
BTW the code I used to print the fractions for double.
double f = 3.2d;
double f2 = f - 3;
System.out.print("2+ 1");
for (long i = 2; i < 1L << 54; i <<= 1) {
f2 *= 2;
if (f2 >= 1) {
System.out.print("+ 1/" + i);
f2 -= 1;
}
}
System.out.println();
The common implementation of floating point numbers, IEEE754, allows for the precise representation of only those numbers which have a short, finite binary expansion, i.e. which are a sum of finitely many (nearby) powers of two. All other numbers cannot be precisely represented.
Since float and double have different sizes, the representation in both types for a non-representable value are different, and thus they compare as unequal.
(The length of the binary string is the size of the mantissa, so that's 24 for float, 53 for double and 64 for the 80-bit extended-precision float (not in Java). The scale is determined by the exponent.)
This should work:
BigDecimal ddBD = new BigDecimal(""+dd);
BigDecimal ffBD = new BigDecimal(""+ff);
// test for equality
ddBD.equals(ffBD);
Always work with BigDecimal when you want to compare floats or doubles
and always use the BigDecimal constructor with the String parameter!
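Two details of this approach are easy to trip over, shown in the sketch below (the class and helper names are made up). `BigDecimal.equals` also compares the scale, so 3.5 and 3.50 are not equal under it, while `compareTo` compares only the numeric value. And the constructor that takes a double keeps the binary rounding error, which is why the String constructor is recommended above.

```java
import java.math.BigDecimal;

class BigDecimalComparison {
    // Numeric equality that ignores trailing zeros, unlike equals().
    static boolean sameNumber(BigDecimal a, BigDecimal b) {
        return a.compareTo(b) == 0;
    }

    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("3.5");
        BigDecimal b = new BigDecimal("3.50");
        System.out.println(a.equals(b));      // scale differs
        System.out.println(sameNumber(a, b)); // same numeric value

        double dd = 3.2;
        float ff = 3.2f;
        // Going through the decimal string drops the binary rounding error.
        System.out.println(sameNumber(new BigDecimal("" + dd), new BigDecimal("" + ff)));
        // Passing the float directly keeps its exact binary value instead.
        System.out.println(sameNumber(new BigDecimal(ff), new BigDecimal("3.2")));
    }
}
```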
I have to convert a floating-point number to 32-bit fixed point in Java.
I am not able to understand what a 32-bit fixed-point number is.
Can anybody help with an algorithm?
A fixed-point number is a representation of a real number using a certain number of bits of a type for the integer part, and the remaining bits of the type for the fractional part. The number of bits representing each part is fixed (hence the name, fixed-point). An integer type is usually used to store fixed-point values.
Fixed-point numbers are usually used in systems which don't have floating point support, or need more speed than floating point can provide. Fixed-point calculations can be performed using the CPU's integer instructions.
A 32-bit fixed-point number would be stored in a 32-bit type such as int.
Normally each bit in an (unsigned in this case) integer type would represent an integer value 2^n as follows:
1 0 1 1 0 0 1 0 = 2^7 + 2^5 + 2^4 + 2^1 = 178
2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0
But if the type is used to store a fixed-point value, the bits are interpreted slightly differently:
1 0 1 1 0 0 1 0 = 2^3 + 2^1 + 2^0 + 2^-3 = 11.125
2^3 2^2 2^1 2^0 2^-1 2^-2 2^-3 2^-4
The fixed point number in the example above is called a 4.4 fixed-point number, since there are 4 bits in the integer part and 4 bits in the fractional part of the number. In a 32 bit type the fixed-point value would typically be in 16.16 format, but also could be 24.8, 28.4 or any other combination.
Converting from a floating-point value to a fixed-point value involves the following steps:
1. Multiply the float by 2^(number of fractional bits for the type), e.g. 2^8 for 24.8.
2. Round the result if necessary (add 0.5 and floor it; note that a plain cast to an integer type truncates toward zero, which differs from floor for negative values), leaving an integer value.
3. Assign this value into the fixed-point type.
Obviously you can lose some precision in the fractional part of the number. If the precision of the fractional part is important, the choice of fixed-point format can reflect this - eg. use 16.16 or 8.24 instead of 24.8.
Negative values can also be handled in the same way if your fixed-point number needs to be signed.
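The steps above could be sketched in Java roughly as follows, assuming a signed 16.16 format stored in an int. The class name, the constant and the choice to throw on out-of-range input are assumptions made for this example, and other formats only need a different number of fractional bits.

```java
class FixedPoint {
    // Assumed format: 16 integer bits (including sign) and 16 fractional bits.
    static final int FRACTIONAL_BITS = 16;

    // Step 1: scale by 2^FRACTIONAL_BITS. Step 2: round. Step 3: store in an int.
    static int toFixed(double value) {
        long rounded = Math.round(value * (1 << FRACTIONAL_BITS));
        if (rounded > Integer.MAX_VALUE || rounded < Integer.MIN_VALUE) {
            throw new ArithmeticException("value does not fit in 16.16: " + value);
        }
        return (int) rounded;
    }

    // Reverse conversion: divide by the same scale factor.
    static double toDouble(int fixed) {
        return fixed / (double) (1 << FRACTIONAL_BITS);
    }

    public static void main(String[] args) {
        int a = toFixed(11.125);
        int b = toFixed(-1.5);
        System.out.println(a + " -> " + toDouble(a));
        System.out.println(b + " -> " + toDouble(b));
    }
}
```

`Math.round` handles negative inputs as well, which a bare cast after adding 0.5 would not.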
If my Java were stronger I'd attempt some code, but I usually write such things in C, so I won't attempt a Java version. Besides, stacker's version looks good to me, with the minor exception that it doesn't offer the possibility of rounding. He even shows you how to perform a multiplication (the shift is important!)
Here is a very simple example of converting to fixed point: it shows how to convert and multiplies PI by 2. The result is converted back to double to demonstrate that the fractional part wasn't lost during the calculation with integers.
You could expand that easily with sin() and cos() lookup tables etc.
If you plan to use fixed point, I would recommend looking for a Java fixed-point library.
public class Fix {
public static final int FIXED_POINT = 16;
public static final int ONE = 1 << FIXED_POINT;
public static int mul(int a, int b) {
return (int) ((long) a * (long) b >> FIXED_POINT);
}
public static int toFix( double val ) {
return (int) (val * ONE);
}
public static int intVal( int fix ) {
return fix >> FIXED_POINT;
}
public static double doubleVal( int fix ) {
return ((double) fix) / ONE;
}
public static void main(String[] args) {
int f1 = toFix( Math.PI );
int f2 = toFix( 2 );
int result = mul( f1, f2 );
System.out.println( "f1:" + f1 + "," + intVal( f1 ) );
System.out.println( "f2:" + f2 + "," + intVal( f2 ) );
System.out.println( "r:" + result +"," + intVal( result));
System.out.println( "double: " + doubleVal( result ));
}
}
OUTPUT
f1:205887,3
f2:131072,2
r:411774,6
double: 6.283172607421875
A fixed-point type is one that has a fixed number of decimal/binary places after the radix point. Or more generally, a type that can store multiples of 1/N for some positive integer N.
Internally, fixed-point numbers are stored as the value multiplied by the scaling factor. For example, 123.45 with a scaling factor of 100 is stored as if it were the integer 12345.
To convert the internal value of a fixed-point number to floating point, simply divide by the scaling factor. To convert the other way, multiply by the scaling factor and round to the nearest integer.
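With a decimal scaling factor the same conversion is a one-liner in each direction. The following is a sketch using the factor 100 from the example above; the class and method names are illustrative.

```java
class DecimalFixedPoint {
    // Scaling factor from the example: two decimal places.
    static final long SCALE = 100;

    // Multiply by the scaling factor and round to the nearest integer.
    static long toInternal(double value) {
        return Math.round(value * SCALE);
    }

    // Divide by the scaling factor to get back a floating-point value.
    static double toDouble(long internal) {
        return (double) internal / SCALE;
    }

    public static void main(String[] args) {
        long stored = toInternal(123.45);
        System.out.println(stored);
        System.out.println(toDouble(stored));
    }
}
```

The rounding step matters because 123.45 itself is not exactly representable in binary floating point, so the product may land slightly above or below 12345.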
The definition of 32-bit fixed point could vary. The general idea of fixed point is that you have some fixed number of bits before and another fixed number of bits after the decimal point (or binary point). For a 32-bit one, the most common split is probably even (16 before, 16 after), but depending on the purpose there's no guarantee of that.
As far as the conversion goes, again it's open to some variation -- for example, if the input number is outside the range of the target, you might want to do any number of different things (e.g., in some cases wraparound could make sense, but in others saturation might be preferred).
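The two out-of-range policies mentioned here could look like this in Java, again assuming a 16.16 format in an int. All names are hypothetical, and which policy is appropriate depends on the application.

```java
class OverflowPolicies {
    static final int FRACTIONAL_BITS = 16; // assumed 16.16 format

    // Scaled and rounded value, kept in a long so it cannot overflow an int yet.
    static long scale(double value) {
        return Math.round(value * (1 << FRACTIONAL_BITS));
    }

    // Saturation: clamp to the largest or smallest representable fixed-point value.
    static int toFixedSaturating(double value) {
        long scaled = scale(value);
        return (int) Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, scaled));
    }

    // Wraparound: keep only the low 32 bits, as integer overflow would.
    static int toFixedWrapping(double value) {
        return (int) scale(value);
    }

    public static void main(String[] args) {
        double tooBig = 40000.0; // the 16.16 range ends just below 32768
        System.out.println(toFixedSaturating(tooBig));
        System.out.println(toFixedWrapping(tooBig));
        System.out.println(toFixedSaturating(1.5) == toFixedWrapping(1.5)); // in range: same result
    }
}
```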