I have to convert a floating-point number to 32-bit fixed point in Java.
I'm not able to understand what a 32-bit fixed point is.
Can anybody help with an algorithm?
A fixed-point number is a representation of a real number using a certain number of bits of a type for the integer part, and the remaining bits of the type for the fractional part. The number of bits representing each part is fixed (hence the name, fixed-point). An integer type is usually used to store fixed-point values.
Fixed-point numbers are usually used in systems which don't have floating point support, or need more speed than floating point can provide. Fixed-point calculations can be performed using the CPU's integer instructions.
A 32-bit fixed-point number would be stored in a 32-bit type such as int.
Normally each bit in an (unsigned in this case) integer type would represent an integer value 2^n as follows:
1 0 1 1 0 0 1 0 = 2^7 + 2^5 + 2^4 + 2^1 = 178
2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0
But if the type is used to store a fixed-point value, the bits are interpreted slightly differently:
1 0 1 1 0 0 1 0 = 2^3 + 2^1 + 2^0 + 2^-3 = 11.125
2^3 2^2 2^1 2^0 2^-1 2^-2 2^-3 2^-4
The fixed-point number in the example above is called a 4.4 fixed-point number, since there are 4 bits in the integer part and 4 bits in the fractional part of the number. In a 32-bit type the fixed-point value would typically be in 16.16 format, but could also be 24.8, 28.4 or any other combination.
Converting from a floating-point value to a fixed-point value involves the following steps:
Multiply the float by 2^(number of fractional bits for the type), e.g. 2^8 for 24.8
Round the result if necessary (for non-negative values, just add 0.5), then floor it (or cast to an integer type), leaving an integer value.
Assign this value into the fixed-point type.
Obviously you can lose some precision in the fractional part of the number. If the precision of the fractional part is important, the choice of fixed-point format can reflect this, e.g. use 16.16 or 8.24 instead of 24.8.
Negative values can also be handled in the same way if your fixed-point number needs to be signed.
If my Java were stronger I'd attempt some code, but I usually write such things in C, so I won't attempt a Java version. Besides, stacker's version looks good to me, with the minor exception that it doesn't offer the possibility of rounding. He even shows you how to perform a multiplication (the shift is important!)
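In outline, though, the steps above might look like this in Java (a minimal sketch, assuming a signed 24.8 format; the method names and the use of Math.round are illustrative choices, not from either answer, and there is no overflow handling):

// Convert a double to signed 24.8 fixed point, rounding to nearest.
static int toFixed24_8(double val) {
    return (int) Math.round(val * (1 << 8)); // scale by 2^8, then round
}

// Convert back by dividing by the same scale factor.
static double fromFixed24_8(int fix) {
    return fix / 256.0;
}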
A very simple example of converting to fixed point; it shows how to convert, and multiplies PI by 2. The result is converted back to double to demonstrate that the fractional part wasn't lost during the calculation with integers.
You could expand that easily with sin() and cos() lookup tables etc.
I would recommend if you plan to use fixed point to look for a java fixed point library.
public class Fix {
    // 16.16 fixed point: 16 integer bits, 16 fractional bits
    public static final int FIXED_POINT = 16;
    public static final int ONE = 1 << FIXED_POINT;

    // Multiply in 64 bits, then shift back down to 16.16 (the shift is the important part)
    public static int mul(int a, int b) {
        return (int) ((long) a * (long) b >> FIXED_POINT);
    }

    public static int toFix(double val) {
        return (int) (val * ONE);
    }

    // Integer part only; the fractional bits are shifted away
    public static int intVal(int fix) {
        return fix >> FIXED_POINT;
    }

    public static double doubleVal(int fix) {
        return ((double) fix) / ONE;
    }

    public static void main(String[] args) {
        int f1 = toFix(Math.PI);
        int f2 = toFix(2);
        int result = mul(f1, f2);
        System.out.println("f1:" + f1 + "," + intVal(f1));
        System.out.println("f2:" + f2 + "," + intVal(f2));
        System.out.println("r:" + result + "," + intVal(result));
        System.out.println("double: " + doubleVal(result));
    }
}
OUTPUT
f1:205887,3
f2:131072,2
r:411774,6
double: 6.283172607421875
A fixed-point type is one that has a fixed number of decimal/binary places after the radix point. Or more generally, a type that can store multiples of 1/N for some positive integer N.
Internally, fixed-point numbers are stored as the value multiplied by the scaling factor. For example, 123.45 with a scaling factor of 100 is stored as if it were the integer 12345.
To convert the internal value of a fixed-point number to floating point, simply divide by the scaling factor. To convert the other way, multiply by the scaling factor and round to the nearest integer.
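For example (a sketch using a decimal scaling factor of 100, matching the 123.45 example above):

static final long SCALE = 100;

// 123.45 -> 12345 (round to the nearest multiple of 1/100)
static long toFixed(double value) {
    return Math.round(value * SCALE);
}

// 12345 -> 123.45
static double toFloating(long fixed) {
    return (double) fixed / SCALE;
}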
The definition of 32-bit fixed point could vary. The general idea of fixed point is that you have some fixed number of bits before and another fixed number of bits after the decimal point (or binary point). For a 32-bit one, the most common split is probably even (16 before, 16 after), but depending on the purpose there's no guarantee of that.
As far as the conversion goes, again it's open to some variation -- for example, if the input number is outside the range of the target, you might want to do any number of different things (e.g., in some cases wraparound could make sense, but in others saturation might be preferred).
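For illustration, a saturating conversion to 16.16 might look like this (a sketch; clamping is just one of the options mentioned above, not the only correct choice):

static int toFixed16_16Saturating(double val) {
    double scaled = val * 65536.0; // 2^16 fractional scale
    if (scaled >= Integer.MAX_VALUE) return Integer.MAX_VALUE; // clamp high
    if (scaled <= Integer.MIN_VALUE) return Integer.MIN_VALUE; // clamp low
    return (int) Math.round(scaled); // (NaN falls through to 0 here; handle separately if needed)
}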
Related
Why, if I multiply int num = 2,147,483,647 by the same int num, is the result 1? Note that I am at the limit of the possible int values.
I already tried to catch an exception, but it still gives 1 as the result.
Java ints are 32-bit binary numbers, so you are actually multiplying 01111111111111111111111111111111 by 01111111111111111111111111111111. The mathematical result of this is
0011111111111111111111111111111100000000000000000000000000000001. An int can hold just 32 bits, so in fact you get 00000000000000000000000000000001, which is 1 in decimal.
In integer arithmetic, Java doesn't throw an exception when an overflow occurs. Instead, it just keeps the 32 least significant bits of the outcome, or equivalently, it "wraps around". That is, if you calculate 2147483647 + 1, the outcome is -2147483648.
2,147,483,647 squared happens to be, in binary:
11111111111111111111111111111100000000000000000000000000000001
The least significant 32 bits of the outcome are equal to the value 1.
If you want to calculate with values which don't fit in 32 bits, you have to use either long (if 64 bits are sufficient) or java.math.BigInteger (if not).
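For example (a sketch; Math.multiplyExact is available from Java 8):

int num = 2147483647;

// Widen to long before multiplying, so the product fits in 64 bits:
long product = (long) num * num;
System.out.println(product); // 4611686014132420609

// Or detect the overflow explicitly instead of silently wrapping:
try {
    int checked = Math.multiplyExact(num, num);
} catch (ArithmeticException e) {
    System.out.println("int overflow");
}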
An int cannot hold just any large value. Look here. In Java you have a dedicated class for this problem, which comes in quite handy:
import java.math.BigInteger;

public class BigIntegerDemo {
    public static void main(String[] args) {
        BigInteger b1 = new BigInteger("987654321987654321000000000"); // change it to your number
        BigInteger b2 = new BigInteger("987654321987654321000000000"); // change it to your number
        BigInteger product = b1.multiply(b2);
        BigInteger division = b1.divide(b2);
        System.out.println("product = " + product);
        System.out.println("division = " + division);
    }
}
Source: Using BigInteger In JAVA
The Java Language Specification spells out exactly what happens in this case.
If an integer multiplication overflows, then the result is the low-order bits of the mathematical product as represented in some sufficiently large two's-complement format. As a result, if overflow occurs, then the sign of the result may not be the same as the sign of the mathematical product of the two operand values.
It means that when you multiply two ints, the mathematical product is conceptually computed in some format wide enough to hold it (a long would suffice here), and then only the low-order 32 bits are kept, because the result type of int * int is int.
The JLS also says:
Despite the fact that overflow, underflow, or loss of information may occur, evaluation of a multiplication operator * never throws a run-time exception.
That's why you never get an exception.
My guess: store the result in a long, widening before the multiplication, and check whether it survives the downcast to int. For example:

int num = 2147483647;
long result = (long) num * num; // widen first, or the product wraps to 1 before the assignment
if (result != (long) ((int) result)) {
    // overflow happened
}
To really follow the arithmetic, let's expand the calculation:

((2^n)-1) * ((2^n)-1) =
2^(2n) - 2^n - 2^n + 1 =
2^(2n) - 2^(n+1) + 1

In your case, n=31 (your number is 2^31 - 1). The result is 2^62 - 2^32 + 1. In bits it looks like this (split at the 32-bit boundary):

00111111111111111111111111111111 00000000000000000000000000000001

From this number, you get the rightmost part, which equals 1.
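You can check this in Java itself (a quick sketch):

long full = 2147483647L * 2147483647L; // the long literal forces 64-bit arithmetic
System.out.println(Long.toBinaryString(full)); // 30 ones, 31 zeros, then a single 1
System.out.println((int) full); // keeping only the low 32 bits gives 1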
It seems that the issue is that an int cannot handle such a large value. Based on this link from Oracle regarding the primitive types, the maximum value allowed is 2^31 - 1 (2,147,483,647), which is exactly the value that you want to multiply.
So in this case it is recommended to use the next primitive type with greater capacity; for example, you could change your "int" variables to "long", which has a bigger range, from -2^63 to 2^63 - 1 (-9223372036854775808 to 9223372036854775807).
For example:
public static void main(String[] args) {
    long num = 2147483647L;
    long total = num * num;
    System.out.println("total: " + total);
}
And the output is:
total: 4611686014132420609
According to this link, the maximum value of a signed Java int is 2^31 - 1, which is equal to 2,147,483,647.
So if you are already at the max for int and you multiply it by anything greater than 1, the result overflows; rather than raising an error, Java silently wraps around.
The below algorithm works to identify a factor of a small number but fails completely when using a large one such as 7534534523.0
double result = 7; // 7534534523.0;
double divisor = 1;
for (int i = 2; i < result; i++) {
    double r = result / (double) i;
    if (Math.floor(r) == r) {
        divisor = i;
        break;
    }
}
System.out.println(result + "/" + divisor + "=" + (result / divisor));
Dividing a number like 7534534523.0 by 2 on a calculator can give a fractional part, or the double arithmetic can round it away (losing the 0.5). How can I perform such a divisibility check on large numbers? Do I have to use BigDecimal for this? Or is there another way?
If your goal is to represent a number with exactly n significant figures to the right of the decimal, BigDecimal is the class to use.
Immutable, arbitrary-precision signed decimal numbers. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale. If zero or positive, the scale is the number of digits to the right of the decimal point. If negative, the unscaled value of the number is multiplied by ten to the power of the negation of the scale. The value of the number represented by the BigDecimal is therefore (unscaledValue × 10^-scale).
Additionally, you get better control over scale manipulation, rounding, and format conversion.
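For instance, in the divisibility question above, BigDecimal lets you demand an exact result or pick a rounding mode explicitly (a sketch; both divide signatures are standard BigDecimal API):

import java.math.BigDecimal;
import java.math.RoundingMode;

BigDecimal n = new BigDecimal("7534534523");
System.out.println(n.divide(new BigDecimal("2"))); // exact: 3767267261.5
System.out.println(n.divide(new BigDecimal("3"), 2, RoundingMode.HALF_UP)); // 2511511507.67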
I don't see what the problem is in your code. It works exactly like it should.
When I run your code I get this output:
7.534534523E9/77359.0=97397.0
That may have confused you, but it's perfectly fine. It's just using scientific notation, and there is nothing wrong with that.
7.534534523E9 = 7.534534523 * 10^9 = 7,534,534,523
If you want to see it in normal notation, you can use System.out.format to print the result:
System.out.format("%.0f/%.0f=%.0f\n", result, divisor, result / divisor);
Shows:
7534534523/77359=97397
But you don't need double or BigDecimal to check whether one number is divisible by another; the modulo operator on integral types does that directly. As long as your numbers fit in a long, this works; otherwise you can move on to BigInteger:
long result = 7534534523L;
long divisor = 1;
for (int i = 2; i < result; i++) {
    if (result % i == 0) {
        divisor = i;
        break;
    }
}
System.out.println(result + "/" + divisor + "=" + (result / divisor));
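If the number doesn't fit in a long, the same loop translates directly to BigInteger (a sketch; slower, but with no size limit):

import java.math.BigInteger;

BigInteger result = new BigInteger("7534534523");
BigInteger divisor = BigInteger.ONE;
for (BigInteger i = BigInteger.valueOf(2); i.compareTo(result) < 0; i = i.add(BigInteger.ONE)) {
    if (result.mod(i).signum() == 0) { // remainder is zero: i divides result
        divisor = i;
        break;
    }
}
System.out.println(result + "/" + divisor + "=" + result.divide(divisor));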
BigDecimal is the way to go for preserving high precision in numbers.
DO NOT use the constructor BigDecimal(double val), as rounding is performed and the output is not always the same. The same is mentioned in the documentation as well. According to it:
The results of this constructor can be somewhat unpredictable. One might assume that writing new BigDecimal(0.1) in Java creates a BigDecimal which is exactly equal to 0.1 (an unscaled value of 1, with a scale of 1), but it is actually equal to 0.1000000000000000055511151231257827021181583404541015625. This is because 0.1 cannot be represented exactly as a double (or, for that matter, as a binary fraction of any finite length). Thus, the value that is being passed in to the constructor is not exactly equal to 0.1, appearances notwithstanding.
ALWAYS try to use the constructor BigDecimal(String val), as it preserves precision and gives the same output each time.
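A quick demonstration of the difference (a sketch):

import java.math.BigDecimal;

System.out.println(new BigDecimal(0.1));
// 0.1000000000000000055511151231257827021181583404541015625
System.out.println(new BigDecimal("0.1"));
// 0.1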
A decimal number like 0.1 is represented in binary as 0.00011001100110011..., which is an infinitely repeating fraction.
When I write code like this:
float f = 0.1f;
the program will round it to the binary 0 01111011 1001 1001 1001 1001 1001 101, which is not the original number 0.1.
But when I print this variable like this:
System.out.print(f);
I get the original number 0.1, rather than 0.100000001 or some other number. I think the program can't exactly represent "0.1", but it can display "0.1" exactly. How does it do that?
If I recover the decimal number by adding up each bit of the binary representation, it looks weird:
float f = (float) (Math.pow(2, -4) + Math.pow(2, -5) + Math.pow(2, -8) + Math.pow(2, -9) + Math.pow(2, -12) + Math.pow(2, -13) + Math.pow(2, -16) + Math.pow(2, -17) + Math.pow(2, -20) + Math.pow(2, -21) + Math.pow(2, -24) + Math.pow(2, -25));
float f2 = (float) Math.pow(2, -27);
System.out.println(f);
System.out.println(f2);
System.out.println(f + f2);
Output:
0.099999994
7.4505806E-9
0.1
In exact math, f + f2 = 0.100000001145..., which does not equal 0.1. Why doesn't the program produce a result like 0.100000001? I think that would be more accurate.
Java's System.out.print prints just enough decimals that the resulting representation, if parsed as a double or float, converts to the original double or float value.
This is a good idea because it means that in a sense, no information is lost in this kind of conversion to decimal. On the other hand, it can give an impression of exactness which, as you make clear in your question, is wrong.
In other languages, you can print the exact decimal representation of the float or double being considered:
#include <stdio.h>

int main() {
    printf("%.60f", 0.1);
}
result: 0.100000000000000005551115123125782702118158340454101562500000
In Java, in order to emulate the above behavior, you need to convert the float or double to BigDecimal (this conversion is exact) and then print the BigDecimal with enough digits. Java's attitude to floating-point-to-string-representing-a-decimal conversion is pervasive, so that even System.out.format is affected. The linked Java program, the important line of which is System.out.format("%.60f\n", 0.1);, shows 0.100000000000000000000000000000000000000000000000000000000000, although the value of 0.1d is not 0.10000000000000000000…, and a Java programmer could have been excused for expecting the same output as the C program.
To convert a double to a string that represents the exact value of the double, you can also consider the hexadecimal format, which Java supports both for literals and for printing.
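For example, both of these print the exact value of the double nearest to 0.1 (a sketch; new BigDecimal(double) converts exactly, and Double.toHexString prints the exact binary value):

import java.math.BigDecimal;

System.out.println(new BigDecimal(0.1));
// 0.1000000000000000055511151231257827021181583404541015625
System.out.println(Double.toHexString(0.1));
// 0x1.999999999999ap-4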
I believe this is covered by Double.toString(double) (and similarly in Float#toString(float)):
How many digits must be printed for the fractional part of m or a? There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values of type double. That is, suppose that x is the exact mathematical value represented by the decimal representation produced by this method for a finite nonzero argument d. Then d must be the double value nearest to x; or if two double values are equally close to x, then d must be one of them and the least significant bit of the significand of d must be 0.
(my emphasis)
Why does changing the sum order returns a different result?
23.53 + 5.88 + 17.64 = 47.05
23.53 + 17.64 + 5.88 = 47.050000000000004
Both Java and JavaScript return the same results.
I understand that, due to the way floating point numbers are represented in binary, some rational numbers (like 1/3 - 0.333333...) cannot be represented precisely.
Why does simply changing the order of the elements affect the result?
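A minimal reproduction in Java:

System.out.println(23.53 + 5.88 + 17.64); // 47.05
System.out.println(23.53 + 17.64 + 5.88); // 47.050000000000004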
Maybe this question is stupid, but why does simply changing the order of the elements affect the result?
It will change the points at which the values are rounded, based on their magnitude. As an example of the kind of thing that we're seeing, let's pretend that instead of binary floating point, we were using a decimal floating point type with 4 significant digits, where each addition is performed at "infinite" precision and then rounded to the nearest representable number. Here are two sums:
1/3 + 2/3 + 2/3 = (0.3333 + 0.6667) + 0.6667
= 1.000 + 0.6667 (no rounding needed!)
= 1.667 (where 1.6667 is rounded to 1.667)
2/3 + 2/3 + 1/3 = (0.6667 + 0.6667) + 0.3333
= 1.333 + 0.3333 (where 1.3334 is rounded to 1.333)
= 1.666 (where 1.6663 is rounded to 1.666)
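(Incidentally, you can reproduce this four-significant-digit model in Java itself using BigDecimal with a MathContext, as a sketch:)

import java.math.BigDecimal;
import java.math.MathContext;

MathContext mc = new MathContext(4); // 4 significant digits, HALF_UP rounding
BigDecimal third = BigDecimal.ONE.divide(new BigDecimal(3), mc); // 0.3333
BigDecimal twoThirds = new BigDecimal(2).divide(new BigDecimal(3), mc); // 0.6667
System.out.println(third.add(twoThirds, mc).add(twoThirds, mc)); // 1.667
System.out.println(twoThirds.add(twoThirds, mc).add(third, mc)); // 1.666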
We don't even need non-integers for this to be a problem:
10000 + 1 - 10000 = (10000 + 1) - 10000
= 10000 - 10000 (where 10001 is rounded to 10000)
= 0
10000 - 10000 + 1 = (10000 - 10000) + 1
= 0 + 1
= 1
This demonstrates possibly more clearly that the important part is that we have a limited number of significant digits - not a limited number of decimal places. If we could always keep the same number of decimal places, then with addition and subtraction at least, we'd be fine (so long as the values didn't overflow). The problem is that when you get to bigger numbers, smaller information is lost - the 10001 being rounded to 10000 in this case. (This is an example of the problem that Eric Lippert noted in his answer.)
It's important to note that the values on the first line of the right hand side are the same in all cases - so although it's important to understand that your decimal numbers (23.53, 5.88, 17.64) won't be represented exactly as double values, that's only a problem because of the problems shown above.
Here's what's going on in binary. As we know, some floating-point values cannot be represented exactly in binary, even if they can be represented exactly in decimal. These 3 numbers are just examples of that fact.
With this program I output the hexadecimal representations of each number and the results of each addition.
public class Main {
    public static void main(String[] args) {
        double x = 23.53; // Inexact representation
        double y = 5.88;  // Inexact representation
        double z = 17.64; // Inexact representation
        double s = 47.05; // What math tells us the sum should be; still inexact

        printValueAndInHex(x);
        printValueAndInHex(y);
        printValueAndInHex(z);
        printValueAndInHex(s);

        System.out.println("--------");

        double t1 = x + y;
        printValueAndInHex(t1);
        t1 = t1 + z;
        printValueAndInHex(t1);

        System.out.println("--------");

        double t2 = x + z;
        printValueAndInHex(t2);
        t2 = t2 + y;
        printValueAndInHex(t2);
    }

    private static void printValueAndInHex(double d) {
        System.out.println(Long.toHexString(Double.doubleToLongBits(d)) + ": " + d);
    }
}
The printValueAndInHex method is just a hex-printer helper.
The output is as follows:
403787ae147ae148: 23.53
4017851eb851eb85: 5.88
4031a3d70a3d70a4: 17.64
4047866666666666: 47.05
--------
403d68f5c28f5c29: 29.41
4047866666666666: 47.05
--------
404495c28f5c28f6: 41.17
4047866666666667: 47.050000000000004
The first 4 numbers are the hexadecimal representations of x, y, z, and s. In IEEE floating-point representation, bits 2-12 represent the binary exponent, that is, the scale of the number. (The first bit is the sign bit, and the remaining 52 bits are the mantissa.) The exponent represented is actually the binary value minus 1023.
The exponents for the first 4 numbers are extracted:
sign|exponent
403 => 0|100 0000 0011| => 1027 - 1023 = 4
401 => 0|100 0000 0001| => 1025 - 1023 = 2
403 => 0|100 0000 0011| => 1027 - 1023 = 4
404 => 0|100 0000 0100| => 1028 - 1023 = 5
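These fields can be extracted programmatically (a sketch for double, which has 1 sign bit, 11 exponent bits, and 52 mantissa bits):

long bits = Double.doubleToLongBits(23.53);
long sign = bits >>> 63; // 1 bit
long exponent = (bits >>> 52) & 0x7FFL; // 11 bits, biased by 1023
long mantissa = bits & 0xFFFFFFFFFFFFFL; // 52 bits
System.out.println(sign + " " + (exponent - 1023) + " 0x" + Long.toHexString(mantissa));
// prints: 0 4 0x787ae147ae148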
First set of additions
The second number (y) is of smaller magnitude. When adding these two numbers to get x + y, the last 2 bits of the second number (01) are shifted out of range and do not figure into the calculation.
The second addition adds x + y and z and adds two numbers of the same scale.
Second set of additions
Here, x + z occurs first. They are of the same scale, but they yield a number that is higher up in scale:
404 => 0|100 0000 0100| => 1028 - 1023 = 5
The second addition adds x + z and y, and now 3 bits are dropped from y to add the numbers (101). Here, there must be a round upwards, because the result is the next floating point number up: 4047866666666666 for the first set of additions vs. 4047866666666667 for the second set of additions. That error is significant enough to show in the printout of the total.
In conclusion, be careful when performing mathematical operations on IEEE numbers. Some representations are inexact, and they become even more inexact when the scales are different. Add and subtract numbers of similar scale if you can.
Jon's answer is of course correct. In your case the error is no larger than the error you would accumulate doing any simple floating point operation. You've got a scenario where in one case you get zero error and in another you get a tiny error; that's not actually that interesting a scenario. A good question is: are there scenarios where changing the order of calculations goes from a tiny error to a (relatively) enormous error? The answer is unambiguously yes.
Consider for example:
x1 = (a - b) + (c - d) + (e - f) + (g - h);
vs
x2 = (a + c + e + g) - (b + d + f + h);
vs
x3 = a - b + c - d + e - f + g - h;
Obviously in exact arithmetic they would be the same. It is entertaining to try to find values for a, b, c, d, e, f, g, h such that the values of x1 and x2 and x3 differ by a large quantity. See if you can do so!
This actually covers much more than just Java and Javascript, and would likely affect any programming language using floats or doubles.
In memory, floating points use a special format along the lines of IEEE 754 (the converter provides much better explanation than I can).
Anyways, here's the float converter.
http://www.h-schmidt.net/FloatConverter/
The thing about the order of operations is the "fineness" of the operation.
Your first line yields 29.41 from the first two values, which gives us 2^4 as the exponent.
Your second line yields 41.17 which gives us 2^5 as the exponent.
We're losing a significant figure by increasing the exponent, which is likely to change the outcome.
Try ticking the last bit on the far right on and off for 41.17, and you can see that something as "insignificant" as 2^-23 of the scale is enough to cause this floating-point difference.
Edit: For those of you who remember significant figures, this falls under that category. 10^4 + 4999 kept to one significant figure is going to be 10^4. In this case the lost part is much smaller, but we can see its effect in the .00000000004 attached to the result.
Floating point numbers are represented using the IEEE 754 format, which provides a specific size of bits for the mantissa (significand). Unfortunately this gives you a specific number of 'fractional building blocks' to play with, and certain fractional values cannot be represented precisely.
What is happening in your case is that in the second case, the addition is probably running into some precision issue because of the order the additions are evaluated. I haven't calculated the values, but it could be for example that 23.53 + 17.64 cannot be precisely represented, while 23.53 + 5.88 can.
Unfortunately it is a known problem that you just have to deal with.
I believe it has to do with the order of evaluation. While the sum is naturally the same in a math world, in the binary world instead of A + B + C = D, it's

A + B = E
E + C = D1

So there's a secondary step where floating-point numbers can lose precision.

When you change the order:

A + C = F
F + B = D2
To add a different angle to the other answers here, this SO answer shows that there are ways of doing floating-point math where all summation orders return exactly the same value at the bit level.
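One well-known technique in this area (named here for illustration; the linked answer may use a different method) is Kahan compensated summation, which carries a running correction term so that low-order bits lost in each addition are fed back into the next one:

static double kahanSum(double[] values) {
    double sum = 0.0;
    double c = 0.0; // running compensation for lost low-order bits
    for (double v : values) {
        double y = v - c;
        double t = sum + y;
        c = (t - sum) - y; // recovers what (sum + y) rounded away
        sum = t;
    }
    return sum;
}

Note that compensated summation reduces the order-dependence of the error; making every ordering bit-identical requires stronger techniques, such as the exact summation approach described in that linked answer.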
Alternative wording: When will adding Double.MIN_VALUE to a double in Java not result in a different Double value? (See Jon Skeet's comment below)
This SO question about the minimum Double value in Java has some answers which seem to me to be equivalent. Jon Skeet's answer no doubt works but his explanation hasn't convinced me how it is different from Richard's answer.
Jon's answer uses the following:
double d = // your existing value;
long bits = Double.doubleToLongBits(d);
bits++;
d = Double.longBitsToDouble(bits);
Richard's answer mentions the JavaDoc for Double.MIN_VALUE:
A constant holding the smallest positive nonzero value of type double, 2^-1074. It is equal to the hexadecimal floating-point literal 0x0.0000000000001P-1022 and also equal to Double.longBitsToDouble(0x1L).
My question is: how is Double.longBitsToDouble(0x1L) different from Jon's bits++?
Jon's comment focuses on the basic floating point issue.
There's a difference between adding Double.MIN_VALUE to a double value, and incrementing the bit pattern representing a double. They're entirely different operations, due to the way that floating point numbers are stored. If you try to add a very little number to a very big number, the difference may well be so small that the closest result is the same as the original. Adding 1 to the current bit pattern, however, will always change the corresponding floating point value, by the smallest possible value which is visible at that scale.
I don't see any difference between Jon's approach of incrementing the bits ("bits++") and adding Double.MIN_VALUE. When will they produce different results?
I wrote the following code to test the differences. Maybe someone could provide more/better sample double numbers or use a loop to find a number where there is a difference.
double d = 3.14159269123456789; // sample double
long bits = Double.doubleToLongBits(d);
long bitsBefore = bits;
bits++;
long bitsAfter = bits;
long bitsDiff = bitsAfter - bitsBefore;
long bitsMinValue = Double.doubleToLongBits(Double.MIN_VALUE);
long bitsSmallValue = Double.doubleToLongBits(Double.longBitsToDouble(0x1L));

if (bitsMinValue == bitsSmallValue) {
    System.out.println("Double.doubleToLongBits(Double.longBitsToDouble(0x1L)) is same as Double.doubleToLongBits(Double.MIN_VALUE)");
}
if (bitsDiff == bitsMinValue) {
    System.out.println("bits++ increments the same amount as Double.MIN_VALUE");
}
if (bitsDiff != bitsMinValue) {
    d = d + Double.MIN_VALUE;
    System.out.println("Using Double.MIN_VALUE");
} else {
    d = Double.longBitsToDouble(bits);
    System.out.println("Using doubleToLongBits/bits++");
}

System.out.println("bits before: " + bitsBefore);
System.out.println("bits after: " + bitsAfter);
System.out.println("bits diff: " + bitsDiff);
System.out.println("bits Min value: " + bitsMinValue);
System.out.println("bits Small value: " + bitsSmallValue);
OUTPUT:
Double.doubleToLongBits(Double.longBitsToDouble(0x1L)) is same as Double.doubleToLongBits(Double.MIN_VALUE)
bits++ increments the same amount as Double.MIN_VALUE
Using doubleToLongBits/bits++
bits before: 4614256656636814345
bits after: 4614256656636814346
bits diff: 1
bits Min value: 1
bits Small value: 1
Okay, let's imagine it this way, sticking with decimal numbers. Suppose you have a floating decimal point type which allows you to represent 5 decimal digits, and a number between 0 and 3 for the exponent, to multiply the result by 1, 10, 100 or 1000.
So the smallest non-zero value is just 1 (i.e. mantissa=00001, exponent=0). The largest value is 99999000 (mantissa=99999, exponent=3).
Now, what happens when you add 1 to 50000000? You can't represent 50000001; the next representable number after 50000000 is 50001000. So if you try to add them together, the result is just going to be the closest value to the "true" result, which is still 50000000. That's like adding Double.MIN_VALUE to a large double.
My version (converting to bits, incrementing and then converting back) is like taking that 50000000, splitting into mantissa and exponent (m=50000, e=3) then incrementing it the smallest amount, to (m=50001, e=3) and then reassembling to 50001000.
Do you see how they're different?
Now here's a concrete example:
public class Test {
    public static void main(String[] args) {
        double before = 100000000000000d;
        double after = before + Double.MIN_VALUE;
        System.out.println(before == after);

        long bits = Double.doubleToLongBits(before);
        bits++;
        double afterBits = Double.longBitsToDouble(bits);
        System.out.println(before == afterBits);
        System.out.println(afterBits - before);
    }
}
This tries both approaches with a large number. The output is:
true
false
0.015625
Going through the output, that means:
Adding Double.MIN_VALUE didn't have any effect
Incrementing the bit did have an effect
The difference between afterBits and before is 0.015625, which is much bigger than Double.MIN_VALUE. No wonder the simple addition had no effect!
It's exactly as Jon said: "If you try to add a very little number to a very big number, the difference may well be so small that the closest result is the same as the original."
For example:
// True:
(Double.MAX_VALUE + Double.MIN_VALUE) == Double.MAX_VALUE
// False:
Double.longBitsToDouble(Double.doubleToLongBits(Double.MAX_VALUE) + 1) == Double.MAX_VALUE
MIN_VALUE is the smallest representable positive double, but that certainly does not imply that adding it to an arbitrary double results in an unequal one.
In contrast, adding 1 to the underlying bits results in a new bit pattern, and thus does result in an unequal double.