I'm somewhat new to programming and I've been curious about the many data types in Java. To start, I've been focusing mostly on the ones to do with numbers.
Specifically, I've been looking at int and long. I've noticed that longs have a much larger range of values than ints, so I've been wondering why we don't just use longs all the time instead of using ints most of the time.
Yes, more bits take up more memory. Also, some data types are faster for computers to do math on than others (e.g. integer math is faster than floating-point math).
Yes, that's essentially it; there are several flavors of "integer" type and several flavors of "decimal" (floating-point) type. It's hard for people just starting out in programming to believe, with hardware being so cheap, but there was a time when the difference between these types was the difference between fitting in the computer's memory or not. Nowadays the only places you still have those sorts of constraints are enterprise-scale systems or something small like an embedded system or minimal computer (though even the Raspberry Pi is outgrowing that classification).
Still, it's good practice to limit yourself to the smallest reasonable variant of the data type you're using. Memory is still an issue at the scale of, say, an older mobile device, or one running lots of apps. long is super-crazy-long for most common contexts, and the extra room is just wasted resources if you're not going to be dealing with numbers at that scale.
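To make the ranges concrete, here is a minimal sketch (the class name is just for illustration) that prints the range and size of each type; Integer.BYTES and Long.BYTES require Java 8 or later:

```java
public class IntVsLong {
    public static void main(String[] args) {
        // int is a 32-bit signed type, long is a 64-bit signed type
        System.out.println("int range:  " + Integer.MIN_VALUE + " .. " + Integer.MAX_VALUE);
        System.out.println("long range: " + Long.MIN_VALUE + " .. " + Long.MAX_VALUE);
        System.out.println("int size:  " + Integer.BYTES + " bytes");  // 4
        System.out.println("long size: " + Long.BYTES + " bytes");     // 8
    }
}
```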
Related
ArrayList(int initialCapacity)
and other collections in Java work on an int index.
Can't there be cases where an int is not enough, and there is a need for more than the range of an int?
UPDATE:
Wouldn't Java 10 or some later version have to develop a new collection framework for this, since using long with the present collections would break backward compatibility?
There can be, in theory, but at present such large arrays (arrays with indexes outside the range of an int) aren't supported by the JVM, and thus ArrayList doesn't support this either.
Is there a need for it? This isn't part of the question per se, but it seems to come up a lot, so I'll address it anyway. The short answer is: in most situations, no, but in certain ones, yes. The upper value of an int in Java is 2,147,483,647, a tad over 2 billion. If this were an array of bytes we were talking about, that puts the upper limit at slightly over 2GB in terms of the number of bytes we can store in an array. Back when Java was conceived and it wasn't unusual for a typical machine to have a thousand times less memory than that, it clearly wasn't too much of an issue - but now even a low-end (desktop/laptop) machine has more memory than that, let alone a big server, so it's clearly no longer a limit that no-one can ever reach. (Yes, we could pack the bytes into a wrapper object and make an array of those, but that's not the point we're addressing here.)

If we switch to the long data type, then that pushes the upper limit of a byte array to well over 9.2 exabytes (over 9 billion GB). That puts us firmly back into "we don't need to sensibly worry about that limit" territory for at least the foreseeable future.
So, is Java making this change? One of the plans for Java 10 is to tackle "big data", which may well include support for arrays with long-based indexes. Obviously this is a long way off, but Oracle is at least thinking about it:
On the table for JDK 9 is a move to make the Java Virtual Machine (JVM) hypervisor-aware as well as to improve its performance, while JDK 10 could move from 32-bit to 64-bit addressable arrays for larger data sets.
You could theoretically work around this limitation by using your own collection classes that use multiple arrays to store their data, thus bypassing the implicit limit of an int - so it is somewhat possible if you really need this functionality now, just rather messy at present.
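For illustration, a minimal sketch of such a class might look like the following (the class name and chunk size are arbitrary, and a real implementation would need bounds checks, resizing, iteration, and so on):

```java
// A long-indexed array of longs, backed by multiple int-indexed chunks
// so that the total length can exceed Integer.MAX_VALUE.
public class BigLongArray {
    private static final int CHUNK_BITS = 30;            // 2^30 elements per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final long[][] chunks;
    private final long length;

    public BigLongArray(long length) {
        this.length = length;
        int chunkCount = (int) ((length + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new long[chunkCount][];
        long remaining = length;
        for (int i = 0; i < chunkCount; i++) {
            chunks[i] = new long[(int) Math.min(remaining, CHUNK_SIZE)];
            remaining -= chunks[i].length;
        }
    }

    public long get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    public void set(long index, long value) {
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }

    public long length() {
        return length;
    }
}
```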
In terms of backwards compatibility if this feature comes in? Well, you obviously couldn't just change all the ints to longs; there would need to be some more boilerplate there and, depending on implementation choices, perhaps even new collection types for these large collections (considering I doubt they'd find their way into most Java code, this may well be the best option). Regardless, the point is that while backwards compatibility is of course a concern, there are a number of potential ways around this, so it's not a showstopper by any stretch of the imagination.
In fact you are right: collections such as ArrayList support only int indexes for the moment. If you would like to bypass this constraint, you may use Maps and Sets, where the key can be anything you want, and thus you can have as many entries as you like. Personally, I think int values are enough for structures like arrays, but if I needed more, I would use a Derby table; a database becomes more useful in such cases.
Currently, I am serializing some long data using DataOutput.writeLong(long). The issue with this is obvious: there are many, many cases where the longs will be quite small. I was wondering what the most performant varint implementation is. I've seen the strategy from protocol buffers, and testing on random long data (which probably isn't the right distribution to test against), I'm seeing a pretty big performance drop (about 3-4x slower). Is this to be expected? Are there any good strategies for serializing longs as quickly as possible while still saving space?
Thanks for your help!
How about using the standard DataOutput format for serialization and a generic compression stream such as GZIPOutputStream for compression?
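For example, a minimal sketch of that combination might look like this (the output file name is just a placeholder):

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipLongWriter {
    public static void main(String[] args) throws IOException {
        long[] values = {1L, 42L, 7L, 123456789L};   // sample data (mostly small longs)
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(new FileOutputStream("longs.bin.gz")))) {
            for (long v : values) {
                out.writeLong(v);                    // plain 8-byte encoding...
            }
        }                                            // ...compressed transparently by GZIP
    }
}
```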
The protocol buffer encoding is actually pretty good but isn't helpful with random longs - it is mostly useful if your longs are probably going to be small positive or negative numbers (let's say in the +/- 1000 range 95% of the time).
Numbers in this range will typically get encoded in 1, 2 or 3 bytes, compared with 8 for a normal long. Try it with this sort of input on a large set of longs; you can often get 50-70% space savings.
Of course calculating this encoding has some performance overhead, but if you are using this for serialisation then CPU time will not be your bottleneck anyway - so you can effectively ignore the encoding cost.
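For reference, a minimal sketch of the ZigZag-plus-varint scheme protocol buffers use for signed longs might look like this (the class and method names are just for illustration); values near zero take one or two bytes, while worst-case longs take ten:

```java
import java.io.DataOutput;
import java.io.IOException;

public class VarLong {
    // ZigZag-maps signed longs to unsigned so small negative values also stay short,
    // then writes 7 bits per byte with the high bit as a continuation flag.
    public static void writeVarLong(DataOutput out, long value) throws IOException {
        long v = (value << 1) ^ (value >> 63);        // ZigZag encode
        while ((v & ~0x7FL) != 0) {
            out.writeByte((int) ((v & 0x7F) | 0x80)); // more bytes follow
            v >>>= 7;
        }
        out.writeByte((int) v);                       // final byte, high bit clear
    }
}
```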
I'm working on a project (in Scala) where I need to manipulate some very large numbers, far too big to be represented by the integral types. Java provides the BigInteger and BigDecimal classes (and Scala provides a nice thin wrapper around them). However, I noticed that these libraries are substantially slower than other arbitrary-precision libraries that I've used in the past (e.g. http://www.ginac.de/CLN/), and the speed difference seems larger than what can be attributed to the language alone.
I did some profiling of my program, and 44% of the execution time is being spent in the BigInteger multiply method. I'd like to speed up my program a bit, so I'm looking for a faster and more efficient option than the BigInteger class (and its Scala wrapper). I've looked at LargeInteger (from JScience) and Apint (from Apfloat). However, both seem to perform more slowly than the standard BigInteger class.
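As a rough illustration of the kind of measurement involved (not a proper benchmark harness such as JMH, and the 5000-bit operand size is just an assumption), something like this times the raw multiply cost:

```java
import java.math.BigInteger;
import java.util.Random;

public class BigIntegerMultiplyTiming {
    public static void main(String[] args) {
        // Multiply two random ~5000-bit (~1500-digit) numbers repeatedly
        // and report the average cost per multiply.
        Random rnd = new Random(42);
        BigInteger a = new BigInteger(5000, rnd);
        BigInteger b = new BigInteger(5000, rnd);
        int iterations = 10_000;
        BigInteger checksum = BigInteger.ZERO;          // keep a data dependency alive
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            checksum = checksum.add(a.multiply(b));
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("avg multiply: " + (elapsed / iterations) + " ns"
                + " (checksum bits: " + checksum.bitLength() + ")");
    }
}
```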
Does anyone know of a Java (or available on the JVM) arbitrary precision math library with a focus on high performance integer multiplication and addition?
I'm a bit late... well, I only know of the Apfloat library, which is available in both C++ and Java.
Unfortunately, I think you are out of luck for a Java native library. I have not found one. I recommend wrapping GMP, which has excellent arbitrary precision performance, using JNI. There is JNI overhead, but if you're in the 1500 digit range, that should be small compared to the difference in algorithmic complexity. You can find various wrappings of GMP for Java (I believe the most popular one is here).
My question is about the performance of Java versus compiled code (for example, C++/Fortran/assembly) in high-performance numerical applications.
I know this is a contentious topic, but I am looking for specific answers/examples. Also community wiki. I have asked similar questions before, but I think I put them too broadly and did not get the answers I was looking for.
Double-precision matrix-matrix multiplication, commonly known as dgemm in the BLAS library, is able to achieve nearly 100 percent of peak CPU performance (in terms of floating-point operations per second).
There are several factors which allow achieving that performance:
cache blocking, to achieve maximum memory locality
loop unrolling to minimize control overhead
vector instructions, such as SSE
memory prefetching
guarantees of no memory aliasing
I have seen lots of benchmarks using assembly, C++, Fortran, ATLAS, and vendor BLAS (typical cases use matrices of dimension 512 and above).
On the other hand, I have heard that in principle byte-compiled languages/implementations such as Java can be as fast, or nearly as fast, as machine-compiled languages. However, I have not seen definitive benchmarks showing that this is so. On the contrary, it seems (from my own research) that byte-compiled languages are much slower.
Do you have good matrix-matrix multiplication benchmarks for Java/C#?
Is the just-in-time compiler (an actual implementation, not a hypothetical one) able to produce instructions which satisfy the points I have listed?
Thanks
with regards to performance:
every CPU has a peak performance, depending on the number of instructions the processor can execute per second. For example, a modern 2 GHz Intel CPU can achieve 8 billion double-precision adds/multiplies per second, resulting in 8 GFLOPS peak performance. Matrix-matrix multiplication is one of the algorithms which is able to achieve nearly full performance with regard to the number of operations per second, the main reason being the higher ratio of compute to memory operations (N^3/N^2). The sizes I am interested in are on the order of N > 500.
with regards to implementation: higher-level details such as blocking are done at the source-code level. Lower-level optimization is handled by the compiler, perhaps with compiler hints with regards to alignment/aliasing. A byte-compiled implementation can be written using a blocked approach as well, so in principle the source-code details of a decent implementation will be very similar.
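To make the blocking point concrete, here is a minimal sketch of a cache-blocked multiply in plain Java (the block size of 64 is just an assumption to tune per cache); the question is essentially whether the JIT can turn inner loops like this into vectorized code with the bounds checks hoisted out:

```java
// A cache-blocked (tiled) double-precision matrix multiply:
// C += A * B for square n x n matrices stored row-major in 1-D arrays.
public class BlockedMultiply {
    static void multiply(double[] a, double[] b, double[] c, int n) {
        final int bs = 64;                              // block size (assumed; tune per cache)
        for (int ii = 0; ii < n; ii += bs) {
            int iMax = Math.min(ii + bs, n);
            for (int kk = 0; kk < n; kk += bs) {
                int kMax = Math.min(kk + bs, n);
                for (int jj = 0; jj < n; jj += bs) {
                    int jMax = Math.min(jj + bs, n);
                    for (int i = ii; i < iMax; i++) {
                        for (int k = kk; k < kMax; k++) {
                            double aik = a[i * n + k];  // hoist the A element out of the inner loop
                            for (int j = jj; j < jMax; j++) {
                                c[i * n + j] += aik * b[k * n + j];
                            }
                        }
                    }
                }
            }
        }
    }
}
```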
A comparison of VC++/.NET 3.5/Mono 2.2 in a pure matrix multiplication scenario:
Source
Mono with Mono.Simd goes a long way towards closing the performance gap with the hand-optimized C++ here, but the C++ version is still clearly the fastest. Mono is at 2.6 now and might be closer, and I would expect that if .NET ever gets something like Mono.Simd, it could be very competitive, as there's not much difference between .NET and the sequential C++ here.
All the factors you specify are probably achieved by manual memory/code optimization for your specific task. But the JIT compiler doesn't have enough information about your domain to make the code as optimal as you would by hand, and can apply only general optimization rules. As a result, it will be slower than C/C++ matrix manipulation code (but it can utilize 100% of the CPU, if you want it to :)
Addressing the SSE issue: Java has been using SSE instructions since J2SE 1.4.2.
In a pure math scenario (calculating 3D coordinates for 25 types of algebraic surfaces), C++ beats Java by a ratio of 2.5.
Java cannot compete with C in matrix multiplication; one reason is that it checks on each array access whether the array bounds are exceeded. Furthermore, Java's math is slow; it does not use the processor's sin() and cos().
Currently, primary keys in our system are 10 digits long, just over the limit for Java integers. I want to avoid any maintenance problems down the road caused by numeric overflow in these keys, but at the same time I do not want to sacrifice much system performance to store infinitely large numbers that I will never need.
How do you handle managing the size of a primary key? Am I better off sticking with Java integers, for the performance benefit over the larger Long, and increasing the size when needed, or should I bite the bullet, go with Java Long for most of my PKs, and never have to worry about overflowing the sequence size?
I've always gone with long keys (number(18,0) in the database) because they simply remove the possibility of this situation happening in pretty much all cases (extreme data-hoarding-style applications aside). Having the same data type for the key across all tables means you can share that field across all of your model objects in a parent class, as well as having consistent code for your SQL getters, and so on.
It seems like the answer depends on how likely you are to overflow the Java integers with your data. And there's no way to know that without some idea of what your data is.
The performance benefit would be negligible, so my advice would be to go with the long keys. Having to deal with that down the road would likely be a major hassle.
It's a balance between the cost of storing and using long integers and the likelihood of overflowing a 32-bit integer.
Consider that an unsigned 32-bit integer stores over 4 billion values. If you think you are going to average more than 1 new row every second in this table for the next 136 years, then you need to use a Long.
32-bit integers in Java are signed, so that's only about 2 billion values. And if for some reason your SEQUENCE keeps jumping now and then, there will be gaps between your PKs.
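For reference, a quick sketch of the arithmetic behind those figures, at one new row per second:

```java
public class KeyOverflowEstimate {
    public static void main(String[] args) {
        // Rough arithmetic behind the "136 years" figure above.
        double secondsPerYear = 365.0 * 24 * 60 * 60;
        double unsigned32 = 4_294_967_296.0;            // 2^32
        double signed32 = Integer.MAX_VALUE;            // 2^31 - 1
        System.out.printf("Unsigned 32-bit at 1 row/sec: %.0f years%n", unsigned32 / secondsPerYear);
        System.out.printf("Signed 32-bit (Java int) at 1 row/sec: %.0f years%n", signed32 / secondsPerYear);
    }
}
```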
It does NOT hurt to have a long (remember that the Y2K problem happened because some COBOL developers thought they would save some bytes in dates?) :-)
Therefore, I always use Long.