After becoming more engaged with training new engineers, as well as reading Jon Skeet's DevDays presentation, I have begun to recognize that many engineers aren't clear on when to use which numeric datatype. I appreciate the role a formal computer science degree plays in helping with this, but I see a lot of new engineers showing uncertainty because they have never worked with large data sets, financial software, physics or statistics problems, or complex datastore issues.
My experience is that people really grok concepts when they are explained in context. I am looking for good examples of real programming problems where certain data is best represented using a particular data type. Try to stay away from textbook examples if possible. I am tagging this with Java, but feel free to give examples in other languages and retag:
Integer, Long, Double, Float, BigInteger, etc...
I really don't think you need examples or anything complex. This is simple:
Is it a whole number?
    Can it be > 2^63? BigInteger
    Can it be > 2^31? long
    Otherwise int
Is it a decimal number?
    Is an approximate value OK?
        double
    Does it need to be exact? (example: monetary amounts!)
        BigDecimal
(When I say ">", I mean "greater in absolute value", of course.)
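To make each branch concrete, here is a small, hedged sketch (the values are invented purely for illustration):

import java.math.BigDecimal;
import java.math.BigInteger;

public class NumericTypeChoices {
    public static void main(String[] args) {
        // Fits comfortably within 2^31: int
        int itemCount = 1_500_000;

        // Larger than 2^31 but well under 2^63: long
        long worldPopulation = 8_000_000_000L;

        // Can exceed 2^63: BigInteger (30! is about 2.65e32)
        BigInteger factorialOf30 = BigInteger.ONE;
        for (int i = 2; i <= 30; i++) {
            factorialOf30 = factorialOf30.multiply(BigInteger.valueOf(i));
        }

        // An approximate decimal value is fine: double
        double averageTemperature = 21.37;

        // Must be exact (money): BigDecimal, constructed from a String
        BigDecimal price = new BigDecimal("19.99");

        System.out.println(itemCount + " " + worldPopulation);
        System.out.println("30! = " + factorialOf30);
        System.out.println(averageTemperature + " " + price);
    }
}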
I've never used a byte or char to represent a number, and I've never used a short, period. That's in 12 years of Java programming. Float? Meh. If you have a huge array and you are having memory problems, I guess.
Note that BigDecimal is somewhat misnamed; your values do not have to be large at all to need it.
BigDecimal is the best choice when it comes to maintaining exact decimal calculations and being able to specify the desired precision. I believe float (and to some extent double) offer performance benefits over BigDecimal, but at the cost of accuracy and usability.
One important point you might want to articulate is that it's almost always an error to compare floating-point numbers for equality. For example, the following code is very likely to fail:
double euros = convertToEuros(item.getCostInDollars());
if (euros == 10.0) {
    // this line will most likely never be reached
}
This is one of many reasons why you want to use discrete numbers to represent currency.
When you absolutely must compare floating-point numbers, you can only do so approximately; something to the effect of:
double euros = convertToEuros(item.getCostInDollars());
if (Math.abs(euros - 10.0) < EPSILON) {
    // this might work
}
As for practical examples, my usual rule of thumb is something like this:
double: think long and hard before using it; is the pain worth it?
float: don't use it
byte: most often used as byte[] to represent some raw binary data
int: this is your best friend; use it to represent most stuff
long: use this for timestamps and database IDs
BigDecimal and BigInteger: if you know about these, chances are you know what you're doing already, so you don't need my advice
I realize that these aren't terribly scientific rules of thumb, but if your target audience is not computer scientists, it might be best to stick to the basics.
Normally, if we're talking machine-independent (32/64-bit) sizes, the numeric data types are as follows:
integer: 4 bytes
long: 8 bytes
float: 4 bytes
double: 8 bytes
Signedness halves the positive range rather than the size (e.g. for 4 bytes, unsigned tops out around 4 billion, signed around 2 billion).
bigint (depending on the language implementation) can take up to 10 bytes or more.
For high-volume data archiving (such as a search engine) I would highly recommend byte and short to save space:
byte: 1 byte (0 to 255 unsigned, -128 to 127 signed)
short: 2 bytes (0 to 65,535 unsigned, -32,768 to 32,767 signed)
Let's say you want to store an AGE field. Since nobody lives past 150, a byte is plenty (see the sizes above); if you use an integer you have wasted 3 extra bytes per record, and seriously, who lives more than 4 billion years?
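As a rough sketch of that space argument (Java's byte is signed, so 0 to 127 still covers realistic ages; the element sizes below ignore object headers and alignment):

public class AgeStorageSketch {
    public static void main(String[] args) {
        int users = 10_000_000;

        byte[] agesCompact = new byte[users]; // ~10 MB of element data
        int[] agesWasteful = new int[users];  // ~40 MB of element data

        agesCompact[0] = 42;
        agesWasteful[0] = 42;

        System.out.println("byte[] element data: " + users + " bytes");
        System.out.println("int[]  element data: " + (4L * users) + " bytes");
    }
}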
VInt's in Lucene are the devil. The small benefit in size is outweighed hugely by the performance penalty in reading them byte-by-byte.
A good thing to talk about is the space versus time trade-off. Saving 200 MB was great in 1996, but in 2010, thrashing I/O buffers by reading a byte at a time is terrible.
The BigDecimal class is the standard way of dealing with monetary units in Java. However, when storing extremely large quantities of data (think millions ~ billions of entries per user) the extra storage space must be considered when compared to primitives such as int: according to this answer a single BigDecimal corresponds to about 36 + Ceiling(log2(n)/8.0) bytes (not including some metadata descriptors and such) whereas an int is most usually 4 bytes.
When storing millions of entries, this would of course result in a very noticeable increase not only in memory usage but also in storage space (e.g. using MongoDB with a type descriptor, or the PostgreSQL numeric type, which seems to take at least 8 bytes; I am not familiar with e.g. Cassandra, so I am not sure what the storage implications there would be).
An alternative to using the BigDecimal type would be to store the integer amount of cents (or whatever the smallest denomination is chosen to be, i.e. $1 == 10000 hundredths of a cent, as required for precision). This would not only reduce the strain on the program but also reduce the storage required for all but the greatest values in the dataset (which would be outliers anyway and would likely need to be handled separately).
Is this a viable alternative? Are there any pitfalls that must be avoided in this case? Is this approach compliant with current standards (e.g. external audit)?
Note: This only pertains to storing the data, the data would still be displayed with proper formatting to the user depending on various factors (i.e. Locale, e.g. $31,383.22 for US).
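For concreteness, here is a minimal sketch of the representation proposed above, storing hundredths of a cent in a long and formatting only for display; the class and method names are hypothetical:

import java.math.BigDecimal;
import java.text.NumberFormat;
import java.util.Locale;

public class StoredAmount {
    // 1 dollar == 10_000 units (hundredths of a cent), as proposed above
    private final long units;

    public StoredAmount(long units) {
        this.units = units;
    }

    public BigDecimal toDollars() {
        return BigDecimal.valueOf(units, 4); // shift the decimal point 4 places left
    }

    public String format(Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(toDollars());
    }

    public static void main(String[] args) {
        StoredAmount amount = new StoredAmount(313_832_200L);
        System.out.println(amount.format(Locale.US)); // $31,383.22
    }
}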
On the database side, DECIMAL poses no problem (though maybe in NoSQL databases it does).
On the Java side, BigDecimal is no problem either, as long as you do not keep mass data in memory. Also keep in mind that a BigDecimal for normal numbers in the int range is comparable in size to a String; those are acceptably small objects that Java deals with well.
Cents are feasible, but you do not entirely escape BigDecimal: financial calculations such as taxes are required in most countries to be carried out to some precision, like 6 decimal places.
Also, the standard Java components, from standard input/output and JSF to JasperReports and so on, do not provide a "virtual" decimal point.
It should be mentioned that BigDecimal usage is verbose too.
So I would start with BigDecimal to get a working system fast, and revert to cents only for massive "spreadsheet"-style work.
I'm working on a real-time application that deals with money in different currencies and exchange rates using BigDecimal; however, I'm facing some serious performance issues and I want to change the underlying representation.
I've read again and again that a good and fast way of representing money in Java is by storing cents (or whatever the required precision is) using a long. As one of the comments pointed out, there are some libs with wrappers that do just that, such as FastMoney from JavaMoney.
Two questions here.
Is it always safe to store money as a long (or inside a wrapper) and keep everything else (like exchange rates) as doubles? In other words, won't I run into basically the same issues as having everything in doubles if I do Math.round(money * rate) (money being cents and rate being a double)?
FastMoney and many other libs only support operations between themselves and primitive types. How am I supposed to get an accurate representation of, let's say, the return on an investment if I can't do profit.divide(investment) (both being FastMoney)? I guess the idea is that I convert both to doubles and then divide them, but that would be inaccurate, right?
The functionality you are looking for is already implemented in the JavaMoney Library.
It has a FastMoney class that does long arithmetic which is exactly what you have asked for.
For New Java Developers - Why long and not double?
Floating-point arithmetic in Java leads to some unexpected precision errors due to how it is implemented, hence it is not recommended for financial calculations.
Also note that this is different from the precision loss in long arithmetic, which is due to the fractional portion not being stored in the long. This can be prevented in the implementation by moving the fractional portion to another long (e.g. 1.05 -> 1 dollar and 5 cents).
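A minimal illustration of that last point, keeping the fractional portion in its own long (the names are made up, and a real implementation would also have to handle negative amounts):

public class DollarsAndCents {
    public static void main(String[] args) {
        // 1.05 stored as two longs instead of one double
        long dollars = 1;
        long cents = 5;

        // add 2.99, carrying from cents into dollars by hand
        dollars += 2;
        cents += 99;
        dollars += cents / 100;
        cents %= 100;

        System.out.printf("%d.%02d%n", dollars, cents); // 4.04
    }
}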
References
A Quick Tutorial
Project Website
Just a day ago I participated in the qualification round of Google Code Jam. This was my first experience of such an online coding contest. It was really fun.
There were three problems given, of which I was able to solve two. But in one of the problems I was asked to work with values that are really huge. I am a Java guy and I thought I would go for a double variable. Unfortunately, the precision of double was not enough either. Moreover, since I attempted this during the closing stage, I did not have the time to dig much into it (plus solving one is enough to qualify for the next stage).
My question is this: how do I get a precision mechanism that is greater than double? My coding experience is in Java, so it would be great if you could answer along those lines.
Thanks
Java has BigDecimal for arbitrary-precision arithmetic - but it's much, much slower than using double.
It's also possible that the problem in question was supposed to be solved by using algebraic transformations, e.g. working with logarithms.
If the problem requires integers you can use BigInteger.
Also, long is slightly better than double for integers, with 63 bits compared to 53 bits of precision (assuming positive numbers).
You can use arbitrary precision numbers, such as BigDecimal - it's slower but as precise as you specify.
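As a hedged sketch of both options (BigInteger for whole numbers beyond long, BigDecimal with an explicit MathContext for decimals):

import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

public class ArbitraryPrecisionDemo {
    public static void main(String[] args) {
        // 2^200 fits in neither a long nor an exactly-represented double
        BigInteger huge = BigInteger.valueOf(2).pow(200);
        System.out.println(huge);

        // 1/3 to 50 significant digits
        BigDecimal third = BigDecimal.ONE.divide(new BigDecimal(3), new MathContext(50));
        System.out.println(third);
    }
}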
Can you help me clarify the usages of the float primitive in Java?
My understanding is that converting a float value to double and vice versa can be problematic. I read (rather a long time ago, and I'm not sure it's still true with new JVMs) that float's performance is much worse than double's. And of course floats have less precision than doubles.
I also remember that when I worked with AWT and Swing I had some problems with using float or double (like using Point2D.Float or Point2D.Double).
So, I see only 2 advantages of float over double:
It needs only 4 bytes while double needs 8 bytes
The Java Memory Model (JMM) guarantees that the assignment operation is atomic for float variables, while it is not atomic for doubles.
Are there any other cases where float is better than double? Do you use float in your applications?
The reason for including the float type is to some extent historic: it represents a standard IEEE floating point representation from the days when shaving 4 bytes off the size of a floating point number in return for extremely poor precision was a tradeoff worth making.
Nowadays, uses for float are pretty limited. But, for example, having the data type can make it easier to write code that needs interoperability with older systems that do use float.
As far as performance is concerned, I think float and double are essentially identical except for the performance of divisions. Generally, whichever you use, the processor converts to its internal format, does the calculation, then converts back again, and the actual calculation effectively takes a fixed time. In the case of divisions, on Intel processors at least, as I recall the time taken to do a division is generally one clock cycle per 2 bits of precision, so whether you use float or double does make a difference.
Unless you really really have a strong reason for using it, in new code, I would generally avoid 'float'.
Those two reasons you just gave are huge.
If you have a 3D volume that's 1k by 1k by 64, and then have many timepoints of that data, and then want to make a movie of maximum intensity projections, the fact that float is half the size of double could be the difference between finishing quickly and thrashing because you ran out of memory.
Atomicity is also huge, from a threading standpoint.
There's always going to be a tradeoff between speed/performance and accuracy. If you have a number that's smaller than 2^31 and an integer, then an integer is always a better representation of that number than a float, just because of the precision loss. You'll have to evaluate your needs and use the appropriate types for your problems.
I think you nailed it when you mention storage, with floats being half the size.
Using floats may show improved performance over doubles for applications that process large arrays of floating point numbers such that memory bandwidth is the limiting factor. By switching from double[] to float[] and halving the data size, you effectively double the throughput, because twice as many values can be fetched in a given time. Although the CPU has a little more work to do converting the float to a double, this happens in parallel with the memory fetch, and the fetch takes longer.
For some applications the loss of precision might be worth trading for the gain in performance. Then again... :-)
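A rough sketch of what such a comparison might look like; this is a toy timing rather than a proper benchmark (no warm-up, no JMH), so treat the numbers as indicative at best:

public class FloatVsDoubleSum {
    public static void main(String[] args) {
        int n = 10_000_000;
        float[] floats = new float[n];    // ~40 MB of element data
        double[] doubles = new double[n]; // ~80 MB of element data
        for (int i = 0; i < n; i++) {
            floats[i] = i;
            doubles[i] = i;
        }

        long t0 = System.nanoTime();
        float floatSum = 0f;
        for (int i = 0; i < n; i++) {
            floatSum += floats[i];
        }
        long t1 = System.nanoTime();
        double doubleSum = 0d;
        for (int i = 0; i < n; i++) {
            doubleSum += doubles[i];
        }
        long t2 = System.nanoTime();

        System.out.println("float sum  " + floatSum + " took " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("double sum " + doubleSum + " took " + (t2 - t1) / 1_000_000 + " ms");
    }
}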
So yes, advantages of floats:
Only requires 4 bytes
Atomic assignment
Arithmetic should be faster, especially on 32-bit architectures, since there are specific float bytecodes.
Ways to mitigate these when using doubles:
Buy more RAM, it's really cheap.
Use volatile doubles if you need atomic assignment (see the snippet just after this list).
Do tests, verify the performance of each, if one really is faster there isn't a lot you can do about it.
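A minimal illustration of the volatile point; per the JLS, writes to non-volatile long and double fields are the one case that may be split into two 32-bit writes, and volatile removes that:

public class SharedPrice {
    // volatile guarantees that reads and writes of this double are atomic
    private volatile double latestPrice;

    public void update(double price) {
        latestPrice = price;
    }

    public double read() {
        return latestPrice;
    }
}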
Someone else mentioned that this is similar to the short vs int argument, but it is not: all integer types (including boolean) except long are stored as 4-byte values in the Java memory model unless they are stored in an array.
It is true that doubles might in some cases have faster operations than floats. However, this requires that everything fits in the L1 cache, and with floats you can fit twice as much in a cache line. This can make some programs run almost twice as fast.
SSE instructions can also work with 4 floats in parallel instead of 2, but I doubt that the JIT actually uses those. I might be wrong though.
Consider the basic data types like char, int, float, double, etc. in any standard language (C/C++, Java, etc.).
Is there anything to the claim that "operating on integers is faster than operating on characters"? By operating I mean assignment, arithmetic operations, comparison, etc.
Are data types slower than one another?
For almost anything you're doing this has almost no effect, but purely for informational purposes, it is usually fastest to work with data types whose size is the machine word size (i.e. 32 bits on x86 and 64 bits on amd64). Additionally, SSE/MMX instructions give you benefits as well if you can group values and work on them at the same time.
Rules for this are a bit like rules for English spelling and/or grammar. The rules are broken at least as often as they're followed.
Just for example, for years "everybody has known" that floating point operations are slower than integers, especially for more complex operations like multiply and divide. In reality, some processors do some integer operations (especially multiplication and division) by converting the operands to floating point, doing the operation in floating point, then converting the result back to an integer. As you'd expect from that, the floating point operation is actually faster (though only a little bit).
Most of the time, however, it doesn't matter much -- in a lot of cases, it's quite reasonable to think of the operations on the processor itself as free, and concern yourself primarily with optimizing your use of bandwidth to memory. Of course, doing that well is often even harder...
Yes, some data types are definitely slower than others. For example, floats are more complicated than ints and thus may incur additional penalties when doing divides and multiplies. It all depends on how your hardware is set up and what kind of instructions it supports.
Data types which are longer than the machine word size will also be slower, because it takes more cycles to perform operations on them.
Depending on what you do, the difference can be quite large, especially when working with float versus double versus long double.
In modern processors it comes down to SIMD instructions, which have a certain width, most commonly 128 bits, so four floats versus two doubles.
However, some processors only have 32-bit SIMD instructions (PPC), and GPU hardware can show a factor-of-eight performance difference between float and double.
When you add trigonometric, exponential, and square root functions into the mix, float numbers are going to have better performance overall, given a number of factors.
Almost all of the answers on this page are mostly right. The answer, however, varies wildly depending upon your hardware, language, compiler, and VM (in managed languages like Java). On most CPUs, your best performance will be to do the operations on a data type that fits the native operand size of your CPU. In some cases, some compilers will optimize this for you, however.
On most modern desktop CPUs the difference between floating point and integer operations has become pretty trivial. However, on older hardware and a lot of embedded systems the difference in all of these factors can still be really, really big.
The important thing is to know the specifics of your target architecture and your tools.
This answer relates to the Java case (only).
The literal answer is that the relative speed of the primitive types and operators depends on your processor hardware and your JVM implementation.
But a better answer is that it usually doesn't make a lot of difference to performance which representations you use. Indeed, any clever data type optimizations you make to get your code to run fast on your current machine / JVM may turn out to be anti-optimizations on a different machine / JVM combination.
In general, it is better to pick a data type that represents your data in a correct and natural way, and leave it to the compiler to sort out the details. However, if you are creating large arrays of a primitive type, it is worth knowing that Java uses compact representations for arrays of boolean, byte and short.