ArrayList(int initialCapacity)
and other collections in Java work with an int index.
Can't there be cases where int is not enough and there is a need for more than the range of int?
UPDATE:
Java 10 or some later version would have to introduce a new collection framework for this, since using long with the present collections would break backward compatibility. Wouldn't it?
There can be in theory, but at present such large arrays (arrays with indexes outside the range of an integer) aren't supported by the JVM, and thus ArrayList doesn't support this either.
Is there a need for it? This isn't part of the question per se, but it seems to come up a lot, so I'll address it anyway. The short answer is: in most situations, no, but in certain ones, yes.
The upper value of an int in Java is 2,147,483,647, a tad over 2 billion. If this were an array of bytes we were talking about, that puts the upper limit at slightly over 2 GB in terms of the number of bytes we can store in an array. Back when Java was conceived and it wasn't unusual for a typical machine to have a thousand times less memory than that, this clearly wasn't much of an issue; but now even a low-end (desktop/laptop) machine has more memory than that, let alone a big server, so it's clearly no longer a limitation that no one can ever reach. (Yes, we could pack the bytes into a wrapper object and make an array of those, but that's not the point we're addressing here.)
If we switch to the long data type, that pushes the upper limit of a byte array to well over 9.2 exabytes (over 9 billion GB). That puts us firmly back into "we don't need to sensibly worry about that limit" territory for at least the foreseeable future.
So, is Java making this change? One of the plans for Java 10 is to tackle "big data", which may well include support for arrays with long-based indexes. Obviously this is a long way off, but Oracle is at least thinking about it:
On the table for JDK 9 is a move to make the Java Virtual Machine (JVM) hypervisor-aware as well as to improve its performance, while JDK 10 could move from 32-bit to 64-bit addressable arrays for larger data sets.
You could theoretically work around this limitation by using your own collection classes which used multiple arrays to store their data, thus bypassing the implicit limit of an int - so it is somewhat possible if you really need this functionality now, just rather messy at present.
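To make the idea concrete, here is a minimal sketch of a long-indexed byte array backed by several int-indexed chunks; the class name and chunk size are my own choices, purely for illustration:

public class BigByteArray {
    private static final int CHUNK_SIZE = 1 << 30; // elements per int-indexed chunk (illustrative choice)
    private final byte[][] chunks;
    private final long length;

    public BigByteArray(long length) {
        this.length = length;
        int chunkCount = (int) ((length + CHUNK_SIZE - 1) / CHUNK_SIZE);
        chunks = new byte[chunkCount][];
        long remaining = length;
        for (int i = 0; i < chunkCount; i++) {
            chunks[i] = new byte[(int) Math.min(remaining, CHUNK_SIZE)];
            remaining -= CHUNK_SIZE;
        }
    }

    public byte get(long index) {
        // split the long index into a chunk number and an in-chunk offset
        return chunks[(int) (index / CHUNK_SIZE)][(int) (index % CHUNK_SIZE)];
    }

    public void set(long index, byte value) {
        chunks[(int) (index / CHUNK_SIZE)][(int) (index % CHUNK_SIZE)] = value;
    }

    public long length() {
        return length;
    }
}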
In terms of backwards compatibility if this feature comes in? Well you obviously couldn't just change all the ints to longs, there would need to be some more boilerplate there and, depending on implementation choices, perhaps even new collection types for these large collections (considering I doubt they'd find their way into most Java code, this may well be the best option.) Regardless, the point is that while backwards compatibility is of course a concern, there are a number of potential ways around this so it's not a show stopper by any stretch of the imagination.
In fact you are right: collections such as ArrayList support only int indexes for the moment. But if you would like to bypass this constraint, you may use Maps and Sets, where the key can be anything you want, and thus you can have as many entries as you like. Personally, I think int indexes are enough for structures like arrays, but if I needed more, I think I would use a Derby table; a database becomes more useful in such cases.
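For example, a small sketch of the Map workaround (illustrative only; note that a single HashMap is still bounded by the available heap and reports its size as an int):

import java.util.HashMap;
import java.util.Map;

public class LongKeyedMapDemo {
    public static void main(String[] args) {
        // The logical "index" is a long, so it is not limited to the int range.
        // (The number of live entries is still limited by the heap, of course.)
        Map<Long, String> entries = new HashMap<>();
        entries.put(3_000_000_000L, "a value stored at a position beyond Integer.MAX_VALUE");
        System.out.println(entries.get(3_000_000_000L));
    }
}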
Related
Let's assume I want to store (integer) x/y-values. What is considered more efficient: storing them in a primitive value like long (which fits perfectly, since sizeof(long) = 2 * sizeof(int)) using bit operations like shift, or, and a mask, or creating a Point class?
Keep in mind that I want to create and store many(!) of these points (in a loop). Would there be a performance issue when using classes? The only reason I would prefer storing in primitives over storing in a class is the garbage collector. I guess generating new objects in a loop would trigger the GC way too much; is that correct?
Of course packing those as long[] is going to take less memory (though it will have to be contiguous). For each object (a Point) you will pay at least 12 bytes more for the two object headers.
On the other hand, if you are creating them in a loop and thus escape analysis can prove they don't escape, the JIT can apply an optimization called "scalar replacement" (though it is very fragile), where your objects will not be allocated at all; instead those objects will be "desugared" into fields.
The general rule is that you should write code in the way that is easiest to maintain and read. If and only if you see performance issues (via a profiler, say, or too many GC pauses), only then should you look at GC logs and potentially optimize the code.
As an addendum, the JDK code itself is full of such longs where each bit means something different, so they do pack them. But then, I am not a JDK developer, and I doubt you are either. There, such things matter; for us, I have serious doubts.
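For reference, here is a minimal sketch of the long-packing option from the question; the helper names are mine, purely for illustration:

public final class PackedPoints {
    // Pack an (x, y) pair into one long: x in the high 32 bits, y in the low 32 bits.
    static long pack(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    static int unpackX(long packed) {
        return (int) (packed >> 32); // high 32 bits
    }

    static int unpackY(long packed) {
        return (int) packed; // low 32 bits
    }

    public static void main(String[] args) {
        long[] points = new long[1_000_000]; // one contiguous allocation, no per-point object headers
        points[0] = pack(-7, 42);
        System.out.println(unpackX(points[0]) + "," + unpackY(points[0])); // prints -7,42
    }
}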
I'm programming something in Java; for context see this question: Markov Model decision process in Java
I have two options:
byte[][] mypatterns = new byte[MAX][4];
or
ArrayList<byte[]> mypatterns = new ArrayList<>();
I can use a Java ArrayList and append new arrays whenever I create them, or use a static array by calculating all possible data combinations, then looping through to see which indexes are 'on or off'.
Essentially, I'm wondering if I should allocate a large block that may contain uninitialized values, or use the dynamic array.
I'm running at a certain frames per second, so looping through 200 elements every frame could be very slow, especially because I will have multiple instances of this loop.
Based on theory and what I have heard, dynamic arrays are very inefficient
My question is: Would looping through an array of say, 200 elements be faster than appending an object to a dynamic array?
Edit>>>
More information:
I will know the max length of the array if it is static.
The items in the array will frequently change, but their sizes are constant, so I can easily change them.
Allocating it statically would be akin to a memory pool
Other instances may have more or less of the data initialized than others
You're right, really; I should use a profiler first, but I'm also just curious about the question 'in theory'.
The "theory" is too complicated. There are too many alternatives (different ways to implement this) to analyse. On top of that, the actual performance for each alternative will depend on the the hardware, JIT compiler, the dimensions of the data structure, and the access and update patterns in your (real) application on (real) inputs.
And the chances are that it really doesn't matter.
In short, nobody can give you an answer that is well founded in theory. The best we can give is recommendations that are based on intuition about performance, and / or based on software engineering common sense:
simpler code is easier to write and to maintain,
a compiler is a more consistent1 optimizer than a human being,
time spent on optimizing code that doesn't need to be optimized is wasted time.
1 - Certainly over a large code-base. Given enough time and patience, a human can do a better job for some problems, but that is not sustainable over a large code-base, and it doesn't take account of the facts that 1) compilers are always being improved, 2) optimal code can depend on things that a human cannot take into account, and 3) a compiler doesn't get tired and make mistakes.
The fastest way to iterate over bytes is as a single array. A faster way to process them is as int or long values, since processing 4-8 bytes at a time is faster than processing one byte at a time; however, it rather depends on what you are doing. Note: a byte[4] is actually 24 bytes on a 64-bit JVM, which means you are not making efficient use of your CPU cache. If you don't know the exact size you need, you might be better off creating a buffer larger than you need, even if you are not using all of it. I.e. in the case of the byte[][] you are already using 6x the memory you really need.
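One way to act on that (a sketch, not the answer's exact suggestion): store the byte[MAX][4] data in one flat byte[] and compute the offset per access; the class and constant names here are mine.

public class FlatPatterns {
    private static final int ROW_SIZE = 4;       // bytes per logical row
    private final byte[] data;                   // all rows stored back to back

    public FlatPatterns(int maxRows) {
        data = new byte[maxRows * ROW_SIZE];     // one allocation, one array header, cache friendly
    }

    public byte get(int row, int col) {
        return data[row * ROW_SIZE + col];
    }

    public void set(int row, int col, byte value) {
        data[row * ROW_SIZE + col] = value;
    }
}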
Any performance difference will not be visible when you set initialCapacity on the ArrayList. You say that your collection's size can never change, but what if this logic changes?
Using ArrayList you get access to a lot of methods such as contains.
As other people have said already, use ArrayList unless performance benchmarks say it is a bottleneck.
If I have to store 3 integer values and would like to just retrieve them later (no calculation is required), which of the following would be the better option?
int i,j,k;
or
int [] arr = new int[3];
An array would allocate 3 contiguous blocks of memory (after the JVM allocates space), whereas separate variables would be assigned to arbitrary memory locations (which I guess would take the JVM less time than the array).
Apologies if the question is too trivial.
The answer is: It depends.
You shouldn't think too much about the performance implications in this case. The performance difference between the two is not big enough to notice.
What you really need to be on the look out for is readability and maintainability.
If i, j, and k all essentially mean the same thing, and you're going to be using them the same way, and you feel like you might want to iterate over them, then it might make sense to use an array, so that you can iterate over them more easily.
If they're different values, with different meanings, and you're going to be using them differently, then it does not make sense to include them in an array. They should each have their own identity and their own descriptive variable name.
Choose whichever makes most sense semantically:
If there are three of these variables for a fundamental reason (maybe they are coordinates in the 3D space of a 3D game engine), then use three separate variables (because making, say, a 4D game engine is not a trivial change).
If there are three now but there could trivially be four tomorrow, it's reasonable to consider an array (or, better yet, a new type that contains them).
In terms of performance, traditionally local variables are faster than arrays. Under specific circumstances, the array may be allocated on the stack. Under specific circumstances, bound checks can be removed.
But don't make decisions based on performance, unless you have done everything else correctly first, you have thorough tests, this particular piece of code is a performance-critical hot spot, and you're sure that it is the bottleneck of your application at the moment.
It depends on how you would access them. An array is of course an overhead, because you will first calculate a reference to a value and then get it. So if these values are totally unrelated, an array is bad, and it may even count as code obfuscation. But naming variables i, j, k is a sort of obfuscation, too. Obfuscation is better done automatically at the build stage; there are tools like ProGuard™ which can do it.
The two are not the same at all and are for different purposes.
In the first example you gave, int i,j,k;, you are pushing the values onto the stack.
The stack is for short-term use and small data sizes, i.e. function call arguments and iterator state.
In the second example you gave, int [] arr = new int[3];, the new keyword is allocating actual memory on the heap that was given to the process by the operating system.
The stack is optimized for short-term use, and most CPUs have registers dedicated to pointing at the stack top and base, making the stack a great place for small, short-lived variables. The stack is also limited in size; it is typically only a small fraction of the process's memory (on the order of kilobytes to a few megabytes per thread).
The heap, on the other hand, is proper memory allocation for large data and proper memory management.
So, the two may be used for the same thing, but that does not mean it's right.
Arrays/Objects/Dicts go in memory allocated from the heap; function arguments (and iterator indexes, usually) go on the stack.
It depends, but most probably, using distinct variables is the way to go.
In general, don't do micro-optimizations. Nobody will ever notice any difference in performance. Readable and maintainable code is what really matters in high-level languages.
See this article on micro-optimizations.
I work at a small company building some banking software. Now, I have to build a data structure like:
Array [Int-Max] [2] // Large 2D array
I need to save that to disk and load it the next day for future work.
Now, as I only know Java (and a little bit of C), they always insist that I use C++ or C. As per their suggestion:
They have seen that Array[Int-Max][2] in Java takes nearly 1.5 times more memory than in C, and that C and C++ have a more reasonable memory footprint than Java.
C and C++ can handle arbitrarily large files, whereas Java can't.
As per their suggestion, as the database/data structure becomes large, Java just becomes infeasible. As we have to work on such large databases/data structures, C/C++ is always preferable.
Now my question is,
Why is C or C++ always preferable to Java for large databases/data structures? C may be, but C++ is also object-oriented, so how does it get an advantage over Java?
Should I stay with Java, or will their suggestion (switching to C++) be helpful in the future for large database/data-structure environments? Any suggestions?
Sorry, I have very little knowledge of all this and have just started to work on a project, so I am really confused. Until now I have only built some school projects and have no idea about relatively large projects.
Why is C/C++ always preferable to Java for large databases/data structures? Because, C may be, but C++ is also OOP. So how does it get an advantage over Java?
Remember that a Java array (of objects)1 is actually an array of references. For simplicity, let's look at a 1D array:
java:
[ref1,ref2,ref3,...,refN]
ref1 -> object1
ref2 -> object2
...
refN -> objectN
c++:
[object1,object2,...,objectN]
The overhead of the references is not needed in the C++ version: the array holds the objects themselves, not just references to them. If the objects are small, this overhead might indeed be significant.
Also, as I already stated in comments, there is another issue when allocating arrays of small objects in C++ vs Java. In C++, you allocate an array of objects and they are contiguous in memory, while in Java the objects themselves aren't. In some cases this can give the C++ program much better performance, because it is much more cache efficient than the Java program. I once addressed this issue in this thread
2) Should I stay with Java, or will their suggestion (switching to C++) be helpful in the future for large database/data-structure environments? Any suggestions?
I don't believe we can answer that for you. You should be aware of all the pros and cons (memory efficiency, libraries you can use, development time, ...) of each for your purpose and make a decision. Don't be afraid to get advice from senior developers in your company who have more information about the system than we do.
If there were a simple, easy, and generic answer to these questions, we engineers would not be needed, would we?
You can also build a stub of the algorithm with the expected array size before implementing the core, and profile it to see what the real difference is expected to be. (Assuming the array is indeed the expected main space consumer.)
1: The overhead I am describing here is not relevant for arrays of primitives. In that case the arrays are arrays of values, not of references, the same as in C++, with minor overhead for the array itself (the length field, for example).
It sounds like you are an inexperienced programmer in a new job. The chances are that "they" have been in the business a long time, and know (or at least think they know) more about the domain and its programming requirements than you do.
My advice is to just do what they insist that you do. If they want the code in C or C++, just write it in C or C++. If you think you are going to have difficulties because you don't know much C / C++ ... warn them up front. If they still insist, they can wear the responsibility for any problems and delays their insistence causes. Just make sure that you do your best ... and try not to be a "squeaky wheel".
1) They have seen that Array[Int-Max][Int-Max] in Java will take nearly 1.5 times more memory than C, and that C and C++ have a more reasonable memory footprint than Java.
That is feasible, though it depends on what is in the arrays.
Java can represent large arrays of most primitive types using close to optimal amounts of memory.
On the other hand, arrays of objects in Java can take considerably more space than in C / C++. In C++ for example, you would typically allocate a large array using new Foo[largeNumber] so that all of the Foo instances are part of the array instance. In Java, new Foo[largeNumber] is actually equivalent to new Foo*[largeNumber]; i.e. an array of pointers, where each pointer typically refers to a different object / heap node. It is easy to see how this can take a lot more space.
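To make that concrete, a tiny (purely illustrative) sketch of the difference:

class Foo {
    int value;
}

class ArrayOfObjectsDemo {
    public static void main(String[] args) {
        Foo[] foos = new Foo[1_000_000];   // allocates the references only; every slot is null
        for (int i = 0; i < foos.length; i++) {
            foos[i] = new Foo();           // each element is a separate heap object with its own header
        }

        int[] ints = new int[1_000_000];   // primitives, by contrast, are stored inline in the array
        System.out.println(foos.length + " separately allocated objects, " + ints.length + " inline ints");
    }
}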
2) C/C++ can handle arbitrarily large files whereas Java can't.
There is a hard limit to the number of elements in a single 1-D Java array ... 2^31 - 1. (You can work around this limit, but it will make your code more complicated.)
On the other hand if you are talking about simply reading and writing files, Java can handle individual files up to 2^63 bytes ... which is more than you could possibly ever want.
1) Why is C/C++ always preferable to Java for large databases/data structures? Because, C may be, but C++ is also OOP. So how does it get an advantage over Java?
Because of the hard limit. The limit is part of the JLS and the JVM specification. It has nothing to do with OOP per se.
2) Should I stay with Java, or will their suggestion (switching to C++) be helpful in the future for large database/data-structure environments? Any suggestions?
Go with their suggestion. If you are dealing with in-memory datasets that are that large, then their concerns are valid. And even if their concerns are (hypothetically) a bit overblown it is not a good thing to be battling your superiors / seniors ...
1) They have seen that Array[Int-Max][Int-Max] in Java will take nearly 1.5 times more memory than C, and that C and C++ have a more reasonable memory footprint than Java.
That depends on the situation. If you create a new int[1] or new int[1000], there is almost no difference between Java and C++. If you allocate data on the stack, there is a big relative difference, as Java doesn't use the stack for such data.
I would first make sure this is not micro-tuning the application. It's worth remembering that one day of your time (even assuming you are paid minimum wage) is worth about the cost of 2.5 GB of memory. So unless you are saving 2.5 GB per day by doing this, I suspect it's not worth chasing.
2) C/C++ can handle arbitrarily large files whereas Java can't.
I have memory mapped an 8 TB file in a pure Java program, so I have no idea what this is about.
There is a limit in that you cannot map more than 2 GB at a time or have more than about 2 billion elements in a single array. You can work around this by having more than one mapping or array (e.g. up to 2 billion of those).
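A hedged sketch of that workaround, mapping a file larger than 2 GB as several regions via FileChannel.map (the class name and the 1 GB region size are my own choices; error handling and unmapping are omitted):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedMapping {
    private static final long REGION_SIZE = 1L << 30; // 1 GB per mapping, comfortably under the 2 GB limit

    public static MappedByteBuffer[] mapWholeFile(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            int regions = (int) ((size + REGION_SIZE - 1) / REGION_SIZE);
            MappedByteBuffer[] buffers = new MappedByteBuffer[regions];
            for (int i = 0; i < regions; i++) {
                long offset = i * REGION_SIZE;
                // each region is mapped separately; a mapping stays valid even after the channel is closed
                buffers[i] = channel.map(FileChannel.MapMode.READ_ONLY, offset, Math.min(REGION_SIZE, size - offset));
            }
            return buffers;
        }
    }
}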
As we have to work on such large databases/data structures, C/C++ is always preferable.
I regularly load 200 - 800 GB of data with over 5 billion entries into a single Java process (sometimes more than one process at a time on the same machine).
1) Why is C/C++ always preferable to Java for large databases/data structures?
There is more experience on how to do this in C/C++ than there is in Java, and their experience of how to do this is only in C/C++.
Because, C may be, but C++ is also OOP. So how does it get an advantage over Java?
When using large datasets, it's more common to use a separate database in the Java world (embedded databases are relatively rare).
Java just makes the same system calls you can make in C, so there is no real difference in terms of what you can do.
2) Should I stay with Java, or will their suggestion (switching to C++) be helpful in the future for large database/data-structure environments? Any suggestions?
At the end of the day, they pay you and sometimes technical arguments are not really what matters. ;)
Hmmm. Is there a primer anywhere on memory usage in Java? I would have thought Sun or IBM would have had a good article on the subject but I can't find anything that looks really solid. I'm interested in knowing two things:
at runtime, figuring out how much memory the classes in my package are using at a given time
at design time, estimating general memory overhead requirements for various things like:
how much memory overhead is required for an empty object (in addition to the space required by its fields)
how much memory overhead is required when creating closures
how much memory overhead is required for collections like ArrayList
I may have hundreds of thousands of objects created and I want to be a "good neighbor" to not be overly wasteful of RAM. I mean I don't really care whether I'm using 10% more memory than the "optimal case" (whatever that is), but if I'm implementing something that uses 5x as much memory as I could if I made a simple change, I'd want to use less memory (or be able to create more objects for a fixed amount of memory available).
I found a few articles (the Java Specialists' Newsletter and something from JavaWorld) and the built-in java.lang.instrument.Instrumentation.getObjectSize() method, which claims to measure an "approximation" (??) of memory use, but these all seem kind of vague...
(and yes I realize that a JVM running on two different OS's may be likely to use different amounts of memory for different objects)
I used JProfiler a number of years ago and it did a good job, and you could break down memory usage to a fairly granular level.
As of Java 5, on Hotspot and other VMs that support it, you can use the Instrumentation interface to ask the VM the memory usage of a given object. It's fiddly but you can do it.
In case you want to try this method, I've added a page to my web site on querying the memory size of a Java object using the Instrumentation framework.
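As a minimal sketch of that approach (class and jar names are illustrative): a java agent captures the Instrumentation instance in premain and exposes getObjectSize, which returns a shallow, implementation-specific approximation.

import java.lang.instrument.Instrumentation;

public class SizeAgent {
    private static volatile Instrumentation instrumentation;

    // Called by the JVM because the agent jar's manifest names this class as Premain-Class.
    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    // Shallow size of a single object; it does not include anything the object references.
    public static long sizeOf(Object obj) {
        if (instrumentation == null) {
            throw new IllegalStateException("Agent not loaded; run with -javaagent:sizeagent.jar");
        }
        return instrumentation.getObjectSize(obj);
    }
}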
As a rough guide in Hotspot on 32 bit machines:
objects use 8 bytes for "housekeeping"
fields use what you'd expect them to use given their bit length (though booleans tend to be allocated an entire byte)
object references use 4 bytes
overall object size has a granularity of 8 bytes (i.e. if you have an object with 1 boolean field it will use 16 bytes; if you have an object with 8 booleans it will also use 16 bytes)
There's nothing special about collections in terms of how the VM treats them. Their memory usage is the total of their internal fields plus -- if you're counting this -- the usage of each object they contain. You need to factor in things like the default array size of an ArrayList, and the fact that that size increases by a factor of 1.5 whenever the list gets full. But either asking the VM, or taking the above metrics, looking at the source code of the collections and "working it through", will essentially get you to the answer.
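For example, a back-of-envelope sketch of "working it through" for an ArrayList<Integer>, using the rough 32-bit figures above; the capacity and header sizes are illustrative assumptions, not measured values:

public class ArrayListFootprintEstimate {
    public static void main(String[] args) {
        int size = 1000;        // elements actually stored
        int capacity = 1234;    // hypothetical backing-array capacity after repeated ~1.5x growth from 10

        long listObject = 8 + 4 + 4 + 4;        // object header + size field + modCount + reference to backing array
        long backingArray = 12 + 4L * capacity; // assumed ~12-byte array header + one 4-byte reference per slot
        long boxedValues = 16L * size;          // each Integer: 8-byte header + 4-byte value, rounded up to 16

        System.out.println("roughly " + (listObject + backingArray + boxedValues)
                + " bytes for " + size + " elements");
    }
}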
If by "closure" you mean something like a Runnable or Callable, well again it's just a boring old object like any other. (N.B. They aren't really closures!!)
You can use JMP, but it's only caught up to Java 1.5.
I've used the profiler that comes with newer versions of NetBeans a couple of times and it works very well, supplying you with a ton of information about memory usage and runtime of your programs. Definitely a good place to start.
If you are using a pre-1.5 VM, you can get the approximate size of objects by using serialization. Be warned though: this can require double the amount of memory for that object.
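A small sketch of that serialization trick (keep in mind the serialized length is only a rough proxy for heap size, since it includes class metadata and omits padding; the class name is illustrative):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializedSizeEstimate {
    public static int serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(obj);   // the whole object graph is buffered in memory here
        out.close();
        return bytes.size();    // serialized length, a rough proxy for heap size
    }

    public static void main(String[] args) throws IOException {
        System.out.println(serializedSize(new int[1000])); // arrays are Serializable, so this works
    }
}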
See if PerfAnal will give you what you are looking for.
This might not be the exact answer you are looking for, but the posts at the following link will give you very good pointers. Other Question about Memory
I believe the profiler included in NetBeans can monitor memory usage also; you can try that.