How to set to zero N least significant bits of a double in Java and C++?
In my computations, the "...002" in 1699.3000000000002 is caused by numerical error, so I would like to eliminate it.
I'd guess that you are actually doing currency calculations, in which case using a binary type like double is probably the root cause of your problems. Switch to a decimal type and you should be able to side-step such issues.
In Java, 1e-12*Math.rint(1e12*x) will round a double and return a double as the result.
In C++, you can write 1e-12*floor(1e12*x + 0.5).
Note, however, that these behave differently if 10^12 · x is exactly between two integers. The Java version will round towards an even number, whereas the C++ version will always round upwards.
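For illustration, a minimal Java sketch of the scaling approach (the value and the 10^12 factor are just examples):

double x = 1699.3000000000002;              // value carrying accumulated error
double cleaned = 1e-12 * Math.rint(1e12 * x);
System.out.println(cleaned);                // 1699.3
System.out.println(Math.rint(2.5));         // 2.0 (tie rounds to even)
System.out.println(Math.floor(2.5 + 0.5));  // 3.0 (tie always rounds up)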
Related
Let's say, using java, I type
double number;
If I need to use very big or very small values, how accurate can they be?
I tried to read how doubles and floats work, but I don't really get it.
For my term project in intro to programming, I might need to use different numbers with big ranges of value (many orders of magnitude).
Let's say I create a while loop,
while (number[i-1] - number[i] > ERROR) {
//does stuff
}
Does the limitation of ERROR depend on the size of number[i]? If so, how can I determine how small can ERROR be in order to quit the loop?
I know my teacher explained it at some point, but I can't seem to find it in my notes.
Does the limitation of ERROR depend on the size of number[i]?
Yes.
If so, how can I determine how small can ERROR be in order to quit the loop?
You can get the "next largest" double using Math.nextUp (or the "next smallest" using Math.nextDown), e.g.
double nextLargest = Math.nextUp(number[i-1]);
double difference = nextLargest - number[i-1];
As Radiodef points out, you can also get the difference directly using Math.ulp:
double difference = Math.ulp(number[i-1]);
(though I don't think there's an ulp-style equivalent for the "next smallest" direction)
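For example, a sketch of deriving a scale-aware tolerance with Math.ulp (the value here is arbitrary):

double a = 1234.5678;        // hypothetical loop value
double error = Math.ulp(a);  // gap to the next double at this magnitude
System.out.println(error);   // about 2.27e-13 for values near 1234.5678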
If you don't tell us what you want to use it for, then we cannot answer anything more than what is standard knowledge: a double in Java has about 16 significant digits (that's digits of the decimal numbering system), and the smallest possible positive value is 4.9 × 10^-324. That's in all likelihood far higher precision than you will need.
The epsilon value (what you call "ERROR") in your question varies depending on your calculations, so there is no standard answer for it, but if you are using doubles for simple stuff as opposed to highly demanding scientific stuff, just use something like 1 × 10^-9 and you will be fine.
Both the float and double primitive types are limited in terms of the amount of data they can store. However, if you want to know the maximum values of the two types, then run the code below with your favourite IDE.
System.out.println(Float.MAX_VALUE);
System.out.println(Double.MAX_VALUE);
double is a double-precision 64-bit IEEE 754 floating point type (roughly 15 to 17 decimal digits of precision).
float is a single-precision 32-bit IEEE 754 floating point type (roughly 6 to 9 decimal digits of precision).
After running the code above, if you're not satisfied with their ranges, then I would recommend using BigDecimal, as this type has no fixed limit (your RAM is the practical limit).
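For example, a small BigDecimal sketch (the values are arbitrary; note the String constructor, which avoids inheriting binary rounding error from a double literal):

import java.math.BigDecimal;

BigDecimal a = new BigDecimal("0.1");
BigDecimal b = new BigDecimal("0.2");
System.out.println(a.add(b));   // 0.3, exactly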
double r = 11.631;
double theta = 21.4;
In the debugger, these are shown as 11.631000000000000 and 21.399999618530273.
How can I avoid this?
These accuracy problems are due to the internal representation of floating point numbers and there's not much you can do to avoid it.
By the way, printing these values at run-time often still leads to the correct results, at least using modern C++ compilers. For most operations, this isn't much of an issue.
I liked Joel's explanation, which deals with a similar binary floating point precision issue in Excel 2007:
See how there's a lot of 0110 0110 0110 there at the end? That's because 0.1 has no exact representation in binary... it's a repeating binary number. It's sort of like how 1/3 has no representation in decimal. 1/3 is 0.33333333 and you have to keep writing 3's forever. If you lose patience, you get something inexact.
So you can imagine how, in decimal, if you tried to do 3*1/3, and you didn't have time to write 3's forever, the result you would get would be 0.99999999, not 1, and people would get angry with you for being wrong.
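In Java you can see this exact stored value directly, since the BigDecimal(double) constructor preserves every digit of the binary value (a quick sketch):

System.out.println(new java.math.BigDecimal(0.1));
// 0.1000000000000000055511151231257827021181583404541015625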
If you have a value like:
double theta = 21.4;
And you want to do:
if (theta == 21.4)
{
}
You have to be a bit clever: check whether the value of theta is really close to 21.4, rather than exactly equal to it.
if (fabs(theta - 21.4) <= 1e-6)
{
}
This is partly platform-specific - and we don't know what platform you're using.
It's also partly a case of knowing what you actually want to see. The debugger is showing you - to some extent, anyway - the precise value stored in your variable. In my article on binary floating point numbers in .NET, there's a C# class which lets you see the absolutely exact number stored in a double. The online version isn't working at the moment - I'll try to put one up on another site.
Given that the debugger sees the "actual" value, it's got to make a judgement call about what to display - it could show you the value rounded to a few decimal places, or a more precise value. Some debuggers do a better job than others at reading developers' minds, but it's a fundamental problem with binary floating point numbers.
Use the fixed-point decimal type if you want stability at the limits of precision. There are overheads, and you must explicitly cast if you wish to convert to floating point. If you do convert to floating point you will reintroduce the instabilities that seem to bother you.
Alternately you can get over it and learn to work with the limited precision of floating point arithmetic. For example you can use rounding to make values converge, or you can use epsilon comparisons to describe a tolerance. "Epsilon" is a constant you set up that defines a tolerance. For example, you may choose to regard two values as being equal if they are within 0.0001 of each other.
It occurs to me that you could use operator overloading to make epsilon comparisons transparent. That would be very cool.
For mantissa-exponent representations, EPSILON must be computed to remain within the representable precision: for a number N, Epsilon = N / 10^14.
System.Double.Epsilon is the smallest representable positive value for the Double type; it is far too small for this purpose. Read Microsoft's advice on equality testing.
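In Java, which has no operator overloading, the transparent-comparison idea above becomes a small helper method instead. A minimal sketch, where both the method name and the tolerance are arbitrary, application-specific choices:

static final double EPSILON = 1e-4;   // tolerance: pick per application

static boolean approxEquals(double a, double b) {
    return Math.abs(a - b) <= EPSILON;
}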
I've come across this before (on my blog) - I think the surprise tends to be that the 'irrational' numbers are different.
By 'irrational' here I'm just referring to the fact that they can't be accurately represented in this format. Real irrational numbers (like π) can't be accurately represented at all.
Most people are familiar with 1/3 not working in decimal: 0.3333333333333...
The odd thing is that 1.1 doesn't work in floats. People expect decimal values to work in floating point numbers because of how they think of them:
1.1 is 11 x 10^-1
When actually they're stored in base-2:
1.1 is 2476979795053773 x 2^-51
You can't avoid it; you just have to get used to the fact that some floats are 'irrational', in the same way that 1/3 is.
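You can inspect that base-2 representation in Java, for instance:

// IEEE 754 bit pattern of the double nearest to 1.1:
System.out.println(Long.toHexString(Double.doubleToLongBits(1.1)));
// 3ff199999999999a -- the repeating 9s (binary 1001) are the repeating cycle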
One way you can avoid this is to use a library that uses an alternative method of representing decimal numbers, such as BCD.
If you are using Java and you need accuracy, use the BigDecimal class for floating point calculations. It is slower but safer.
Seems to me that 21.399999618530273 is the single precision (float) representation of 21.4. Looks like the debugger is casting down from double to float somewhere.
You can't avoid this, as you're using floating point numbers with a fixed number of bytes: there is simply no one-to-one mapping between the real numbers and such a limited representation.
But most of the time you can simply ignore it. 21.4 == 21.4 would still be true, because both sides are the same number with the same error. But 21.4f == 21.4 may not be true, because the representation errors for float and double are different.
If you need fixed precision, perhaps you should try fixed point numbers, or even integers. For example, I often use int(1000*x) when passing values to debug output.
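A tiny Java sketch of that scaled-integer idea (the factor of 1000 is arbitrary):

double x = 21.399999618530273;
long thousandths = Math.round(1000 * x);   // fixed point: three decimal digits
System.out.println(thousandths);           // 21400
System.out.println(thousandths / 1000.0);  // 21.4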
Dangers of computer arithmetic
If it bothers you, you can customize the way some values are displayed during debugging. Use it with care :-)
Enhancing Debugging with the Debugger Display Attributes
Refer to General Decimal Arithmetic
Also take note when comparing floats; see this answer for more information.
According to the Java Language Specification:
"If at least one of the operands to a numerical operator is of type double, then the operation is carried out using 64-bit floating-point arithmetic, and the result of the numerical operator is a value of type double. If the other operand is not a double, it is first widened (§5.1.5) to type double by numeric promotion (§5.6)."
Here is the Source
My question is related to this
How can I check if multiplying two numbers in Java will cause an overflow?
In my application, x and y are calculated on the fly and somewhere in my formula I have to multiply x and y.
int x=64371;
int y=64635;
System.out.println((x*y));
I get the wrong output: -134347711
I can quickly fix the above by changing the variables x and y from type int to long, which gives the correct answer for this case. But there is no guarantee that x and y won't grow beyond the maximum capacity of long as well.
Question
Why do I get a negative number here, even though I am not storing the final result in any variable? (for curiosity's sake)
Since I won't know the values of x and y in advance, is there any quick way to avoid this overflow? Maybe by dividing all x and y values by a certain large constant for the entire run of the application, or should I take the log of x and y before multiplying them? (actual question)
EDIT:
Clarification
The application runs on a big data set and takes hours to complete, so it would be nice to have a solution that is not too slow.
The final result is used only for comparison (the values just need to be somewhat proportional to the exact result), and it is acceptable to have a ±5% error in the final value if that gives a huge performance gain.
If you know that the numbers are likely to be large, use BigInteger instead. This is guaranteed not to overflow, and then you can either check whether the result is too large to fit into an int or long, or you can just use the BigInteger value directly.
BigInteger is an arbitrary-precision class, so it's going to be slower than using a direct primitive value (which can probably be stored in a processor register), so figure out whether you're realistically going to be overflowing a long (an int times an int will always fit in a long), and choose BigInteger if your domain really requires it.
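A minimal sketch of that route, using the values from the question (longValueExact assumes Java 8 or later):

import java.math.BigInteger;

BigInteger product = BigInteger.valueOf(64371).multiply(BigInteger.valueOf(64635));
System.out.println(product);      // 4160619585 -- no overflow
if (product.bitLength() < 64) {   // small enough to fit in a long?
    System.out.println(product.longValueExact());
}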
You get a negative number because of an integer overflow: using two's complement representation, Java interprets any integer with the most significant bit set to 1 as negative.
There are very clever methods involving bit manipulation for detecting situations when an addition or subtraction would result in an overflow or an underflow. If you do not know how big your results are going to be, it is best to switch to BigInteger. Your code would look very different, though, because Java lacks operator overloading facilities that would make mathematical operations on BigInteger objects look familiar. The code will be somewhat slower, too. However, you will be guaranteed against overflows and underflows.
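If you can assume Java 8 or later, those bit-manipulation checks are already packaged up in Math.multiplyExact, which throws instead of silently wrapping:

try {
    int product = Math.multiplyExact(64371, 64635);   // throws on int overflow
    System.out.println(product);
} catch (ArithmeticException e) {
    System.out.println("overflow detected");          // taken for this pair
}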
EDIT:
it is acceptable to have a ±5% error in the final value if that gives a huge performance gain.
A ±5% error is a huge allowance! If this is indeed acceptable in your system, then using double or even float could work. These types are imprecise, but their range is enormously larger than that of an int, and they do not overflow so easily. You have to be extremely careful, though, because floating-point data types are inherently inexact. You need to always keep in mind the way the data is represented to avoid common precision problems.
Why do I get a negative number here, even though I am not storing the final result in any variable? (for curiosity's sake)
x and y are int types. When you multiply them, the intermediate result is held temporarily, and its type is determined by the types of the operands: int * int always yields an int, even if it overflows. If you cast one of them to a long, the multiplication is carried out in long arithmetic, and you will not get an overflow.
Since I won't know the values of x and y in advance, is there any quick way to avoid this overflow? Maybe by dividing all x and y values by a certain large constant for the entire run of the application, or should I take the log of x and y before multiplying them? (actual question)
If x and y are positive then you can check
if(x*y<0)
{
//overflow
}
else
{
//do something with x*y
}
Unfortunately, this is not fool-proof: the result may wrap right past the negative range back into positive numbers. For example, System.out.println(Integer.MAX_VALUE * 3); will output 2147483645.
However, this technique does always work for adding two integers: an overflowed sum of two same-signed ints always lands in the opposite sign's range, so the sign check catches it.
As others have said, BigInteger is sure not to overflow.
The negative value is just (64371 * 64635) - 2^32. Java does not perform a widening primitive conversion here: since both operands are ints, the multiplication is carried out in 32-bit arithmetic.
Multiplication of ints always results in an int, even if it is not stored in a variable. Your product is 4160619585, which requires either an unsigned 32-bit integer (which Java does not have) or a larger word size (or BigInteger, as mentioned already).
You could add logs instead, but the moment you try to exp the result, you would get a number that won't round correctly back into a signed 32-bit integer.
Since both multiplicands are int, doing the multiplication using long via casting would avoid an overflow in your specific case:
System.out.println(x * (long) y);
You don't want to use logarithms because transcendental functions are slow and floating point arithmetic is imprecise - the result is likely to not be equal to the correct integer answer.
I have very big numbers and complex string equations to solve in Java. For that I use BeanShell. These equations can also contain bitwise binary operations, e.g.
(log(100)*7-9) & (30/3-7)
should be 1.
As I said, I need to handle huge numbers, so I append an L to each number, which works fine so far. But then I have the problem that computing something like 3/2 just yields 1 and not 1.5. I then tried appending a D to each number to force double values, which gives me the 1.5, but then I receive an error on the binary operations (and, or, xor, etc.) because of course they can only be applied to integer values.
Is there a way to receive double values when needed and still perform the binary operations (of course only when I have integer values)?
I would declare the operands as doubles at the outset. It seems to me that leaving BeanShell to promote/demote types based on the answer introduces uncertainty into your code.
You can cast the double to an int and perform the binary operation every time. As this is a narrowing primitive conversion, you should familiarize yourself with the Java Language Specification.
(Another possibly helpful answer)
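A hedged sketch of that cast-when-integral idea in plain Java (BeanShell aside; this assumes log means log10, since the question's example log(100)*7-9 only evaluates to 5 in base 10):

double a = Math.log10(100) * 7 - 9;  // 5.0
double b = 30.0 / 3 - 7;             // 3.0
if (a == Math.rint(a) && b == Math.rint(b)) {  // both exactly integral?
    System.out.println((long) a & (long) b);   // 5 & 3 = 1
} else {
    System.out.println("non-integral operand; bitwise op not applicable");
}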
Why the inconsistency?
There is no inconsistency: the methods are simply designed to follow different specifications.
long round(double a)
Returns the closest long to the argument.
double floor(double a)
Returns the largest (closest to positive infinity) double value that is less than or equal to the argument and is equal to a mathematical integer.
Compare with double ceil(double a)
double rint(double a)
Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
So by design round rounds to a long and rint rounds to a double. This has always been the case since JDK 1.0.
Other methods were added in JDK 1.2 (e.g. toRadians, toDegrees); others were added in 1.5 (e.g. log10, ulp, signum, etc.), and yet more were added in 1.6 (e.g. copySign, getExponent, nextUp, etc.) (look for the Since: metadata in the documentation); but round and rint have always been the way they are now, since the beginning.
Arguably, perhaps instead of long round and double rint, it'd be more "consistent" to name them double round and long rlong, but this is debatable. That said, if you insist on categorically calling this an "inconsistency", then the reason may be as unsatisfying as "because it's inevitable".
Here's a quote from Effective Java 2nd Edition, Item 40: Design method signatures carefully:
When in doubt, look to the Java library APIs for guidance. While there are plenty of inconsistencies -- inevitable, given the size and scope of these libraries -- there is also a fair amount of consensus.
Distantly related questions
Why does int num = Integer.getInteger("123") throw NullPointerException?
Most awkward/misleading method in Java Base API ?
Most Astonishing Violation of the Principle of Least Astonishment
floor would have been chosen to match the standard C routine in math.h (rint, mentioned in another answer, is also present in that library, and returns a double, as in Java).
But round was not a standard function in C at that time (it's not mentioned in C89 - c identifiers and standards; C99 does define round, and it returns a double, as you would expect). It's normal for language designers to "borrow" ideas, so maybe it comes from some other language? Fortran 77 doesn't have a function of that name, and I am not sure what else would have been used back then as a reference. Perhaps VB - that does have Round but, unfortunately for this theory, it returns a double (PHP too). Interestingly, Perl deliberately avoids defining round.
[Update: hmmm, it looks like Smalltalk returns integers. I don't know enough about Smalltalk to know whether that is correct and/or general, and the method is called rounded, but it might be the source. Smalltalk did influence Java in some ways (although more conceptually than in details).]
If it's not Smalltalk, then we're left with the hypothesis that someone simply chose poorly (given the implicit conversions possible in Java, it seems to me that returning a double would have been more useful, since then it can be used both when converting types and when doing floating point calculations).
In other words: functions common to Java and C tend to be consistent with the C library standard at the time; the rest seem to be arbitrary, but this particular wrinkle may have come from Smalltalk.
I agree that it is odd that Math.round(double) returns long. If large double values are cast to long (which is what Math.round implicitly does), Long.MAX_VALUE is returned. An alternative is using Math.rint() in order to avoid that. However, Math.rint() has a somewhat surprising rounding behavior: ties are settled by rounding to the even integer, i.e. 4.5 is rounded down to 4.0 but 5.5 is rounded up to 6.0. Another alternative is to use Math.floor(x + 0.5). But be aware that 1.5 is rounded to 2 while -1.5 is rounded to -1, not -2. Yet another alternative is to use Math.round, but only if the number is in the range between Long.MIN_VALUE and Long.MAX_VALUE. Double precision floating point values outside this range are integers anyhow.
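The different behaviors are easy to see side by side, for example:

System.out.println(Math.rint(4.5));          // 4.0 -- ties go to even
System.out.println(Math.rint(5.5));          // 6.0 -- ties go to even
System.out.println(Math.round(4.5));         // 5   -- ties go toward positive infinity
System.out.println(Math.floor(-1.5 + 0.5));  // -1.0 -- not -2.0
System.out.println(Math.round(1e300));       // 9223372036854775807 (Long.MAX_VALUE)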
Unfortunately, why Math.round() returns long is unknown. Somebody made that decision, and they probably never gave an interview to tell us why. My guess is that Math.round was designed to provide a better way (i.e., with rounding) of converting doubles to longs.
Like everyone else here I also don't know the answer, but thought someone might find this useful. I noticed that if you want to round a double to an int without casting, you can use the two round implementations long round(double) and int round(float) together:
double d = something;
// inner call: long round(double); the long result then widens to float,
// so the outer call resolves to int round(float)
int i = Math.round(Math.round(d));