I am building a system to read tables from heterogeneous documents and would like to know the best way of managing (columns of) floating point numbers. Where the column can be represented as real numbers I will use List<Double> (I'm using Java but experience from other languages would be useful.) I also wish to serialize the table as a CSV file. Thus a table might look like:
"material", "mass (g)", "volume (cm3)",
"iron", 7.8, 1.0,
"aluminium", 27.3, 9.9,
and column 2 (1-based) would be represented by a List<Double>
{new Double(7.8), new Double(27.3)}
I may also wish to compute the density (mass/volume) and derive a new column ("density (g.cm-3)") as a List
{new Double(7.8), new Double(2.76)}
However the input values are sometimes missing, unusual or represented by fuzzy concepts. Some transformations may throw exceptions (which I would catch and replace by one of the above). Examples include:
1.0E+10000
>10
10 / 0.0 (i.e. divide by zero)
Math.sqrt(-1.)
Math.tan(Math.PI/2.0)
I have the following options in Java for unusual values of a list element
null reference
Double.NaN
Double.MAX_VALUE
Double.POSITIVE_INFINITY
Are there protocols for when the Java unusual values above should be used? I have read this question on how they behave (I would like to rely on chaining of their operations). And if there are protocols, can the values be serialized and read back in? (E.g., does Java parse "0x7ff0000000000000L" to a number equal to Double.POSITIVE_INFINITY?)
I am prepared for some loss of precision in specification (there are often errors in OCR, missing digits etc. so this is a "good enough" exercise).
You have three problems that you ought to separate to some extent:
What representation should you use for table entries, which might be numbers, numbered quantities of some units, or other things?
How might floating-point infinities and NaNs serve you?
How can floating-point objects be serialized (written to a file and read from a file)?
Regarding these:
You have not specified enough information here for good advice about how to represent table entries. From what you describe, there is no reason to use floating point at all. This is because you have not specified what operations you want to perform on the entries other than reading and writing them. If you do not need to do arithmetic, there is no reason to bother converting values to floating point, or to any other number-arithmetic system. You could simply maintain the entries as their original text. This makes serialization trivial.
Floating-point infinities act like mathematical infinity, by design. Infinity plus a number other than infinity remains infinity, et cetera. You should use floating-point infinities to represent mathematical infinities. You should avoid using floating-point infinities to represent overflows, unless you do not care about losing the values that overflow. Floating-point NaNs are intended to represent "not a number". A NaN is often used to represent something like "An error occurred, so we do not have a number here to give you. You should do something else in this place." It is then up to the application to supply the something else, perhaps by having supplementary information from another source or in a parallel data structure. Errors include things such as taking the square root of a negative number or failing to initialize some data. (E.g., some underlying software initializes floating-point data to NaNs, so that, if you do not initialize it yourself, NaNs remain.) You should generally treat NaNs as "empty places" that you must not use, rather than as tokens representing something.
When writing and reading floating-point values, you should take care to convert the values exactly or ensure that the errors you introduce in conversion are tolerable. If you must convert to text (human-readable numerals) rather than writing in “binary” (bytes with arbitrary values), then it may be preferable to write in a notation that uses a numeric base compatible with the native radix of the floating-point system (e.g., hexadecimal floating-point numerals for binary floating-point representations, such as 0x3.4p-2 for .8125). If this is not feasible, then you need to produce enough digits (when converting to decimal) to represent the floating-point value accurately enough to recover the original value when reading it, and you need to ensure the conversion software converts without introducing additional errors. You must also handle special values such as infinities and NaNs.
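For example, in Java (a minimal sketch; the specific values are illustrative), Double.toHexString and Double.parseDouble round-trip finite values exactly, the special values serialize as the strings "Infinity", "-Infinity" and "NaN", and a raw bit pattern such as 0x7ff0000000000000L is not parsed as a numeral but recovered with Double.longBitsToDouble:

double value = 0.1 + 0.2;                        // not exactly 0.3
String hex = Double.toHexString(value);          // "0x1.3333333333334p-2"
System.out.println(Double.parseDouble(hex) == value);  // true: exact round trip

System.out.println(Double.parseDouble("Infinity") == Double.POSITIVE_INFINITY); // true
System.out.println(Double.isNaN(Double.parseDouble("NaN")));                    // true

// Java does not parse "0x7ff0000000000000L" as a double; recover it from the bits:
System.out.println(Double.longBitsToDouble(0x7ff0000000000000L)
        == Double.POSITIVE_INFINITY);            // true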
(Note that Math.tan(Math.PI/2) does not yield infinity and does not cause an exception, because Math.PI/2 is not exactly π/2, so its tangent is finite.)
This question is kind of language-agnostic but the code is written in Java.
We have all heard that comparing floating-point numbers for equality is generally wrong. But what if I wanted to compare two exact same literal float values (or strings representing exact same literal values converted to floats)?
I'm quite sure that the numbers will be exactly equal (well, because they must be equal in binary—how can the exact same thing result in two different binary numbers?!) but I wanted to be sure.
Case 1:
void test1() {
    float f1 = 4.7f; // the 'f' suffix matters: a bare 4.7 is a double literal in Java
    float f2 = 4.7f;
    System.out.println(f1 == f2);
}
Case 2:
class Movie {
    String rating; // for some reason the type is String
}
void test2() {
    movie1.rating = "4.7";
    movie2.rating = "4.7";
    float f1 = Float.parseFloat(movie1.rating);
    float f2 = Float.parseFloat(movie2.rating);
    System.out.println(f1 == f2);
}
In both situations, the expression f1 == f2 should result in true. Am I right? Can I safely compare ratings for equality if they have the same literal float or string values?
There's a rule of thumb that you should apply to all programming rules of thumb (rule of thumbs?):
They are oversimplified, and will result in boneheaded decision-making if pushed too far. If you do not fully grok the intent behind the rule of thumb, you will mess up. Perhaps the rule of thumb remains a net positive (applying it without thought will improve things more than it will make them worse), but it will cause damage, and in any case it cannot be used as an argument in a debate.
So, with that in mind, clearly, there is no point in asking the question:
"Giving that the rule of thumb 'do not use == to compare floats' exists, is it ALWAYS bad?".
The answer is the extremely obvious: Duh, no. It's not ALWAYS bad, because rules of thumb pretty much by definition, if not by common sense, never ALWAYS apply.
So let's break it down then.
WHY is there a rule of thumb that you shouldn't == compare floats?
Your question suggests you already know this: it's because math on floating-point values as represented by IEEE 754 concepts, such as Java's double or float, is inexact (unlike concepts such as Java's BigDecimal, which is exact*).
Then do what you should always do when you grok why a rule of thumb exists and realize it does not apply to your scenario: completely ignore it.
Perhaps your question boils down to: I THINK I grok the rule of thumb, but perhaps I'm missing something; aside from the 'floating point math introduces small deviations which mess up == comparison', which does not apply to this case, are there any other reasons for this rule of thumb that I am not aware of?
In which case, my answer is: As far as I know, no.
*) But BigDecimal has its own equality problems, such as: are two BigDecimal objects that represent the same mathematical number precisely, but which are configured to render at a different scale, 'equal'? That depends on whether your viewpoint is that they are numbers, or that they are objects representing an exact decimal number along with some meta properties, including how to render it and how to round if explicitly asked to do so. For what it is worth, the equals implementation of BigDecimal, which has to make a Sophie's choice between two equally valid interpretations of what equality means, chooses 'I represent a number along with its metadata': two BigDecimals are equal only if they match in both value and scale, and compareTo is what compares numeric value alone. The same Sophie's choice exists in all JPA/Hibernate stacks: does a JPA object represent 'a row in the database' (equality then being defined solely by the primary key value, so that two unsaved objects cannot be equal, not even to themselves, except by reference identity), or does it represent the thing the row represents, e.g. a student rather than 'a row in the DB that represents a student', in which case the id is the one field that does NOT matter for identity, and all the others (name, birthdate, social security number, etc.) do. Equality is hard.
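A short Java illustration of the two notions of equality, using java.math.BigDecimal (values chosen for illustration):

BigDecimal a = new BigDecimal("2.0");   // value 2, scale 1
BigDecimal b = new BigDecimal("2.00");  // value 2, scale 2

System.out.println(a.equals(b));          // false: equal value, but different scale
System.out.println(a.compareTo(b) == 0);  // true: numerically equal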
Yes. Compile time constants that are the same are evaluated consistently.
If you think about it, they must be the same, because there’s only one compiler and it converts literals to their floating point representation deterministically.
Yes, you can compare floats like this. The thing is that even if 4.7 isn't exactly 4.7 once converted to a float, it will be converted consistently to the same value.
In general it is not wrong per se to compare floats like this. But for more complex math, you might want to use Math.round(), or define a "sameness" tolerance that two values must fall within to be counted as "the same".
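For instance (a sketch; the tolerance 1e-5f is an arbitrary choice for illustration):

float computed = 26.55f / 3f;   // arithmetic result, carries rounding error
float literal = 8.85f;          // consistent conversion of a literal

System.out.println(computed == literal);                   // false
System.out.println(Math.abs(computed - literal) <= 1e-5f); // true: equal within tolerance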
There is also an arbitrariness to fixed-point numbers. For instance,
1,000,000,001
is bigger than
1,000,000,000
Are these two numbers different? It depends on the precision you need; for most purposes, they are functionally the same.
This question is kind of language-agnostic…
Actually, there is no floating-point issue here, and the answer depends entirely on the language.
There is no floating-point issue because IEEE-754 is clear: Two floating-point datums (finite numbers, infinities, and/or NaNs) compare as equal if and only if they correspond to the same real number.
There are language issues because how literals are mapped to floating-point numbers and how source text is mapped to operations differs from language to language. For example, C 2018 6.4.4.2 5 says:
All floating constants of the same source form77) shall convert to the same internal format with the same value.
And footnote 77 says:
1.23, 1.230, 123e-2, 123e-02, and 1.23L are all different source forms and thus need not convert to the same internal format and value.
Thus the C standard permits 1.23 == 1.230 to evaluate to false. (There are historical reasons this was permitted, leaving it as a quality-of-implementation issue.) If by “same” literal float value, you mean the exact same source text, then this problem does not occur in C; the exact same source text must produce the same floating-point value each time in a particular C implementation. However, this example teaches us to be cautious.
C also allows implementations flexibility in how floating-point operations are performed: It allows an implementation to use more than the nominal precision in evaluating expressions, and it allows using different precisions in different parts of the same expression. So 1./3. == 1./3. could evaluate to false.
Some languages, like Python, do not have a good formal specification and are largely silent about how floating-point operations are performed. It is conceivable a Python implementation could use excess precision available in processor registers to convert the source text 1.3 to a long double or similar type, then save it somewhere as a double, then convert the source text 1.3 to a long double, then retrieve the double to compare it to the long double still in registers and get a result indicating inequality.
This sort of issue does not occur in implementations I am aware of, but, when asking a question like this, asking whether a rule always holds, regardless of language, leaves the door open for possible exceptions.
The Java™ Tutorials state that "this data type [double] should never be used for precise values, such as currency." Is the fact that an ORM / DSL is returning floating point numbers for database columns storing values to be used to calculate monetary amounts a problem? I'm using QueryDSL and I'm dealing with money. QueryDSL is returning a Double for any number with a precision up to 16 and a BigDecimal thereafter. This concerns me as I'm aware that floating point arithmetic isn't suitable for currency calculations.
From this QueryDSL issue I'm led to believe that Hibernate does the same thing; see OracleDialect. Why does it use a Double rather than a BigDecimal? Is it safe to retrieve the Double and construct a BigDecimal, or is there a chance that a number with a precision of less than 16 could be incorrectly represented? Is it only when performing arithmetic operations that a Double can have floating-point issues, or are there values to which it cannot be accurately initialised?
Using floating point numbers for storing money is a bad idea indeed. Floating point can approximate an operation's result, but that's not what you want when dealing with money.
The easiest way to fix it, in a database-portable way, is to simply store cents. This is the preferred way of dealing with currency in financial applications. Note that most databases use the half-away-from-zero rounding algorithm, so make sure that's appropriate in your context.
When it comes to money you should always ask a local accountant, especially for the rounding part. Better safe than sorry.
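A minimal sketch of the store-cents approach in Java (the tax rate is an illustrative assumption; note that Math.round rounds half up, which for negative amounts differs from the half-away-from-zero rule mentioned above):

long priceCents = 1999;                           // $19.99, stored exactly as a long
long taxCents = Math.round(priceCents * 0.0825);  // round once, explicitly
long totalCents = priceCents + taxCents;          // integer addition: no drift
System.out.printf("$%d.%02d%n", totalCents / 100, totalCents % 100); // $21.64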
Now back to your questions:
Is it safe to retrieve the Double and construct a BigDecimal, or is there a chance that a number with a precision of less than 16 could be incorrectly represented?
This is a safe operation as long as your database uses at most a 16-digit precision. If it uses a higher precision, you'd need to override the OracleDialect and adjust the type mapping yourself.
Is it only when performing arithmetic operations that a Double can have floating-point issues, or are there values to which it cannot be accurately initialised?
When performing arithmetic operations you must always take the monetary rounding into consideration anyway, and that applies to BigDecimal as well. So if you can guarantee that the database value doesn't lose any decimals when being cast to a Java Double, you are fine to create a BigDecimal from it. Using BigDecimal pays off when applying arithmetic operations to the database-loaded value.
As for the threshold of 16, according to Wikipedia:
The 11-bit width of the exponent allows the representation of numbers with a decimal exponent between 10^-308 and 10^308, with full 15-17 decimal digits of precision. By compromising precision, the subnormal representation allows values smaller than 10^-323.
There seem to be several concerns mentioned in the question, comments, and answers by Robert Bain. I've collected and paraphrased some of them.
Is it safe to use a double to store a precise value?
Yes, provided the number of significant digits (precision) is small enough.
From wikipedia
If a decimal string with at most 15 significant digits is converted to IEEE 754 double precision representation and then converted back to a string with the same number of significant digits, then the final string should match the original.
But new BigDecimal(1000.1d) has the value 1000.1000000000000227373675443232059478759765625; why not 1000.1?
Note the qualifier in the quote above: when converting from a double, the number of significant digits must be specified, e.g.
new BigDecimal(1000.1d, new MathContext(15))
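A quick demonstration of the difference (both classes are in java.math):

BigDecimal raw = new BigDecimal(1000.1d);                          // exact binary value
BigDecimal rounded = new BigDecimal(1000.1d, new MathContext(15)); // 15 significant digits

System.out.println(raw); // 1000.1000000000000227373675443232059478759765625
System.out.println(rounded.compareTo(new BigDecimal("1000.1")) == 0); // true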
Is it safe to use a double for arbitrary arithmetic on precise values?
No, each intermediate value used in the calculation could introduce additional error.
Using a double to store exact values should be seen as an optimization. It introduces risk that if care is not taken, precision could be lost. Using a BigDecimal is much less likely to have unexpected consequences and should be your default choice.
Is it correct that QueryDSL returns a double for precise value?
It is not necessarily incorrect, but is probably not desirable. I would suggest you engage with the QueryDSL developers... but I see you have already raised an issue and they intend to change this behavior.
After much deliberation, I must conclude that the answer to my own question:
Is the fact that an ORM / DSL is returning floating point numbers for database columns storing values to be used to calculate monetary amounts a problem?
put simply, is yes. Please read on.
Is it safe to retrieve the Double and construct a BigDecimal, or is there a chance that a number with a precision of less than 16 could be incorrectly represented?
A number with a precision of less than 16 decimal digits is incorrectly represented in the following example.
BigDecimal foo = new BigDecimal(1000.1d);
The BigDecimal value of foo is 1000.1000000000000227373675443232059478759765625. The intended value, 1000.1, has only 5 significant digits, yet the stored double diverges from it at the 17th significant digit.
Is it only when performing arithmetic operations that a Double can have floating-point issues, or are there values to which it cannot be accurately initialised?
As per the example above, there are values to which it cannot be accurately initialised. As The Java™ Tutorials clearly states, "This data type [float / double] should never be used for precise values, such as currency. For that, you will need to use the java.math.BigDecimal class instead."
Interestingly, calling BigDecimal.valueOf(someDouble) appeared at first to magically resolve things, but upon realising that it calls Double.toString(), and then reading Double's documentation, it became apparent that this is not appropriate for exact values either.
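To illustrate (a sketch): BigDecimal.valueOf goes through Double.toString, which produces just enough decimal digits to distinguish the double from its neighbours, not the exact stored value:

System.out.println(BigDecimal.valueOf(1000.1d)); // 1000.1 (shortest distinguishing decimal)
System.out.println(new BigDecimal(1000.1d));     // 1000.1000000000000227373675443232059478759765625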
In conclusion, when dealing with exact values, floating point numbers are never appropriate. As such, in my mind, ORMs / DSLs should be mapping to BigDecimal unless otherwise specified, given that most database use will involve the calculation of exact values.
Update:
Based on this conclusion, I've raised this issue with QueryDSL.
It is not only about arithmetic operations, but also about pure reads and writes.
Oracle NUMBER and BigDecimal both use a decimal base, so when you read a number from the database and then store it back, you can be sure the same number is written (unless it exceeds Oracle's limit of 38 digits).
If you convert a NUMBER into a binary base (Double) and then convert it back to decimal, you can expect problems, and the conversion is also much slower.
double r = 11.631;
double theta = 21.4;
In the debugger, these are shown as 11.631000000000000 and 21.399999618530273.
How can I avoid this?
These accuracy problems are due to the internal representation of floating point numbers and there's not much you can do to avoid it.
By the way, printing these values at run-time often still leads to the correct results, at least using modern C++ compilers. For most operations, this isn't much of an issue.
I liked Joel's explanation, which deals with a similar binary floating point precision issue in Excel 2007:
See how there's a lot of 0110 0110 0110 there at the end? That's because 0.1 has no exact representation in binary... it's a repeating binary number. It's sort of like how 1/3 has no representation in decimal. 1/3 is 0.33333333 and you have to keep writing 3's forever. If you lose patience, you get something inexact.
So you can imagine how, in decimal, if you tried to do 3*1/3, and you didn't have time to write 3's forever, the result you would get would be 0.99999999, not 1, and people would get angry with you for being wrong.
If you have a value like:
double theta = 21.4;
And you want to do:
if (theta == 21.4)
{
}
You have to be a bit clever: check whether the value of theta is really close to 21.4, rather than exactly equal to it.
if (fabs(theta - 21.4) <= 1e-6)
{
}
This is partly platform-specific - and we don't know what platform you're using.
It's also partly a case of knowing what you actually want to see. The debugger is showing you - to some extent, anyway - the precise value stored in your variable. In my article on binary floating point numbers in .NET, there's a C# class which lets you see the absolutely exact number stored in a double. The online version isn't working at the moment - I'll try to put one up on another site.
Given that the debugger sees the "actual" value, it's got to make a judgement call about what to display - it could show you the value rounded to a few decimal places, or a more precise value. Some debuggers do a better job than others at reading developers' minds, but it's a fundamental problem with binary floating point numbers.
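In Java, one analogue of that C# class (an assumption on my part; this uses the exact java.math.BigDecimal(double) constructor to expose the stored value) is:

System.out.println(new BigDecimal(21.4));
// 21.399999999999998578914528479799628257751464843750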
Use the fixed-point decimal type if you want stability at the limits of precision. There are overheads, and you must explicitly cast if you wish to convert to floating point. If you do convert to floating point you will reintroduce the instabilities that seem to bother you.
Alternately you can get over it and learn to work with the limited precision of floating point arithmetic. For example you can use rounding to make values converge, or you can use epsilon comparisons to describe a tolerance. "Epsilon" is a constant you set up that defines a tolerance. For example, you may choose to regard two values as being equal if they are within 0.0001 of each other.
It occurs to me that you could use operator overloading to make epsilon comparisons transparent. That would be very cool.
For mantissa-exponent representations, EPSILON must be computed to remain within the representable precision: for a number N, Epsilon = N / 10E+14.
System.Double.Epsilon is the smallest representable positive value for the Double type. It is too small for our purpose. Read Microsoft's advice on equality testing.
I've come across this before (on my blog) - I think the surprise tends to be that the 'irrational' numbers are different.
By 'irrational' here I'm just referring to the fact that they can't be accurately represented in this format. Real irrational numbers (like π) can't be accurately represented at all.
Most people are familiar with 1/3 not working in decimal: 0.3333333333333...
The odd thing is that 1.1 doesn't work in floats. People expect decimal values to work in floating point numbers because of how they think of them:
1.1 is 11 x 10^-1
When actually they're in base-2
1.1 is 154811237190861 x 2^-47
You can't avoid it, you just have to get used to the fact that some floats are 'irrational', in the same way that 1/3 is.
One way you can avoid this is to use a library that uses an alternative method of representing decimal numbers, such as BCD
If you are using Java and you need accuracy, use the BigDecimal class for floating point calculations. It is slower but safer.
Seems to me that 21.399999618530273 is the single precision (float) representation of 21.4. Looks like the debugger is casting down from double to float somewhere.
You can't avoid this, as you're using floating-point numbers with a fixed quantity of bytes. There's simply no isomorphism possible between real numbers and their limited notation.
But most of the time you can simply ignore it. 21.4 == 21.4 would still be true, because it is the same number with the same error. But 21.4f == 21.4 may not be true, because the errors for float and double are different.
If you need fixed precision, perhaps you should try fixed-point numbers, or even integers. For example, I often use int(1000*x) for passing values to a debug pager.
Dangers of computer arithmetic
If it bothers you, you can customize the way some values are displayed during debug. Use it with care :-)
Enhancing Debugging with the Debugger Display Attributes
Refer to General Decimal Arithmetic
Also take note when comparing floats, see this answer for more information.
According to the Java Language Specification:
"If at least one of the operands to a numerical operator is of type double, then the operation is carried out using 64-bit floating-point arithmetic, and the result of the numerical operator is a value of type double. If the other operand is not a double, it is first widened (§5.1.5) to type double by numeric promotion (§5.6)."
Here is the Source
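This widening is why the mixed comparison mentioned earlier can surprise (a small demonstration):

System.out.println(21.4f == 21.4);
// false: 21.4f is widened to the double 21.399999618530273...,
// while the double literal 21.4 is 21.399999999999998...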
I have something similar to a spreadsheet column in mind. A spreadsheet column has transparent data typing: text or any kinds of numbers.
But no matter how the typing is implemented internally, they allow round-off-safe operations, e.g. adding up a column of hundreds of numbers with decimal points, and other arithmetic operations. And they do it efficiently too.
What way of handling numbers can make them:
transparent to the user
round-off safe
support efficient arithmetic, aggregation, sorting
handled by datastores and applications with Java primitive types?
I have in mind using a 64-bit long datatype that is internally multiplied by 1000 to provide 3 decimal places. For example, 123.456 is internally stored as 123456, and 1 is stored as 1000. Reinventing floating point numbers seems clunky; I have to reinvent multiplication, for example.
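As a sketch of what that reinvented multiplication could look like (SCALE and the half-up rounding are illustrative assumptions; overflow is not handled):

static final long SCALE = 1000; // 3 decimal places

// a and b are already scaled by SCALE, so the raw product carries SCALE twice;
// divide once, with half-up rounding, to restore the scale.
static long multiply(long a, long b) {
    return (a * b + SCALE / 2) / SCALE; // valid for non-negative operands
}

// e.g. 123.456 * 2.000: multiply(123456, 2000) == 246912, i.e. 246.912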
Miscellany: I actually have in mind a document tagging system. A number tag is conceptually similar to a spreadsheet column that is used to store numbers.
I do want to know how spreadsheets handle it, and I would have titled the question as such.
I am using two datastores that use Java primitive types. Point #4 wasn't hypothetical.
Unless you really need to use primitives, BigDecimal should handle that for you.
Excel uses double precision floats internally, then rounds the display portion in each cell according to the formatting options. It uses the double values for any calculations (unless the Precision as Displayed option is enabled - in which case it uses the rounded displayed value) and then rounds the result when displayed.
You could certainly use a long normalized to the max number of decimals you want to support - but then you're stuck with fixed-precision. That may or may not be acceptable. If you can use BigDecimal, that could work - but I don't think that qualifies as a Java primitive type.
System.out.println((26.55f/3f));
or
System.out.println((float)( (float)26.55 / (float)3.0 ));
etc.
returns the result 8.849999, not 8.85 as it should.
Can anyone explain this or should we all avoid using floats?
What Every Programmer Should Know About Floating-Point Arithmetic:
Q: Why don't my numbers, like 0.1 + 0.2, add up to a nice round 0.3, and instead I get a weird result like 0.30000000000000004?
A: Because internally, computers use a format (binary floating-point) that cannot accurately represent a number like 0.1, 0.2 or 0.3 at all.
In-depth explanations at the linked-to site
Take a look at Wikipedia's article on Floating Point, specifically the Accuracy Problems section.
The fact that floating-point numbers cannot precisely represent all real numbers, and that floating-point operations cannot precisely represent true arithmetic operations, leads to many surprising situations. This is related to the finite precision with which computers generally represent numbers.
The article features a couple examples that should provide more clarity.
Explaining is easy: floating point is a binary format, and so it can only represent exactly those values that are an integer multiple of 1/2^N for some natural number N. 26.55 does not have this property, therefore it cannot be represented exactly.
If you need exact representation (e.g. your code is about accounting and money, where every fraction of a cent matters), then you must indeed avoid floats in favor of other types that do guarantee exact representation of the values you need; depending on your application, just doing all accounting in terms of integer numbers of cents might suffice. Floats (when used appropriately and advisedly!-) are perfectly fine for engineering and scientific computations, where the input values are never "infinitely precise" in any case, and therefore the computationally cumbersome burden of exact representation is absolutely not worth carrying.
Well, we should all avoid using floats wherever realistic, but that's a story for another day.
The issue is that floating point numbers cannot exactly represent most numbers we think of as trivial in presentation. 8.85 cannot be represented exactly by a float, nor by a double, because these are binary representations, not decimal numbers.