In C/C++ we have realloc, which can efficiently allocate additional space for an existing collection. I would guess it is sub-linear (or even constant) in complexity.
Is there a way to achieve the same in Java? Here are the items that I looked at,
Array resize is not possible.
Copying an array to another of bigger size is linear in complexity. I looked at both System.arraycopy and Arrays.copyOf.
An ArrayList must be the same as point 2 above.
Note: my requirement is to expand a possibly extremely large array even further.
realloc is likely to be O(n) in practice, since it sometimes/often involves memory copying. In that sense, it's equivalent in theoretical complexity to allocating a new array in Java.
Now Java always zeros newly allocated memory, which may take a bit longer, but OTOH the GC gives the JVM insanely fast memory allocation, so in some cases allocating a new array may even be faster than realloc. I'd expect a strategy that involves allocating new arrays in Java to be roughly comparable in speed to realloc overall. Possibly Java is better for smaller arrays and C/C++ has the edge for big arrays, but YMMV. You'd have to benchmark your particular implementation and workload to be sure.
So overall:
Don't worry about it, just reallocate new arrays in Java
If you do this a lot, be sure to recreate arrays with more space than you need, so that you don't have to reallocate on every single element added (this is what Java's ArrayList does internally); see the sketch below.
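A minimal sketch of that strategy, assuming a growth factor of ~1.5x like ArrayList (class and field names are illustrative):

import java.util.Arrays;

class GrowableIntArray {
    private int[] data = new int[10];
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            // Reallocate with ~50% headroom so most adds require no copy.
            data = Arrays.copyOf(data, data.length + (data.length >> 1));
        }
        data[size++] = value;
    }

    int get(int index) {
        return data[index];
    }
}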
Final but important point: unless you are writing very low level code, you probably shouldn't be worrying about this anyway. Just use one of the fine collection classes that already exist (Java Collections, Google Collections, Trove etc.) and let them handle all of this stuff for you.
A long time ago I watched a video lecture from the Princeton Coursera MOOC Introduction to Algorithms, which can be found here. It explains the cost of resizing an ArrayList-like structure while adding or removing elements. It turns out that if we add resizing to the data structure, add and remove go from worst-case O(n) per operation (when a resize happens) to amortized O(1): a sequence of n adds triggers at most O(n) copying in total, since each resize doubles the capacity.
I have been using Java ArrayLists for a couple of years. I had always assumed that they grow and shrink automatically. Only recently, to my great surprise, was I proven wrong in this post: Java ArrayLists do not shrink automatically (even though, of course, they do grow).
Here are my questions:
In my opinion, providing shrinking in ArrayLists would do no harm, as the cost per operation would remain amortized O(1). Why did the creators of Java not include this feature in the design?
I know that other data structures, like HashMap, also do not shrink automatically. Is there any other data structure in Java that is built on top of arrays and supports automatic shrinking?
What are the tendencies in other languages? What does automatic shrinking look like for lists, dictionaries, maps, and sets in Python, C#, etc.? If they go in the opposite direction to what Java does, then my question is: why?
The comments already cover most of what you are asking. Here are some thoughts on your questions:
When creating a structure like ArrayList in Java, the developers make certain decisions regarding runtime and performance. They obviously decided to exclude shrinking from the "normal" operations to avoid the additional runtime cost it would incur.
The question is why you would want to shrink automatically. The ArrayList does not grow that much (the factor is about 1.5; newCapacity = oldCapacity + (oldCapacity >> 1), to be exact). Maybe you also insert in the middle rather than just appending at the end; then a LinkedList (which is not based on an array, so no shrinking is needed) might be better. It really depends on your use case. If you think you really need everything an ArrayList does, but it has to shrink when removing elements (I doubt you really need this), just extend ArrayList and override the methods, or call trimToSize() yourself (see the sketch below). But be careful! If you shrink at every removal, you are back at O(n) per removal.
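Note that ArrayList already exposes trimToSize() for shrinking on demand, so explicit shrinking needs no subclass at all. A minimal sketch (the element counts are arbitrary):

import java.util.ArrayList;

class TrimDemo {
    public static void main(String[] args) {
        ArrayList<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) list.add(i);

        // Remove most elements; the backing array keeps its large capacity.
        list.subList(1_000, list.size()).clear();

        // Explicitly shrink the backing array to the current size.
        list.trimToSize();
    }
}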
The C# List and the C++ vector behave the same way concerning shrinking on removal of elements. But the factors for automatic growth vary; even some Java implementations use different factors.
Another issue with automatic shrinking is that you could get into really horrible 'pathological' situations in which every insert and delete on the list causes the backing array to grow or shrink.
For example, if the backing array's initial capacity is 10, such that the array would grow upon insertion of the 11th element (capacity would grow to 15), a natural implementation would be to shrink the backing array once the list size dropped below 11. If you had a list whose length kept oscillating between 10 and 11, you'd be constantly replacing the backing array. Not only would that add runtime overhead, but you could start putting lots of pressure on the garbage collector if every operation resulted in either 10 or 15 objects becoming garbage.
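The classic fix for this thrashing, and the approach taught in the Princeton lecture mentioned earlier, is hysteresis: grow when the array is full, but shrink only when it drops to a quarter full, so alternating single-element operations can never hit both thresholds. A minimal sketch:

import java.util.Arrays;

class ResizingStack {
    private Object[] items = new Object[4];
    private int size = 0;

    void push(Object item) {
        if (size == items.length) {
            items = Arrays.copyOf(items, 2 * items.length); // grow at 100% full
        }
        items[size++] = item;
    }

    Object pop() {
        Object item = items[--size];
        items[size] = null; // allow the GC to reclaim the element
        if (size > 0 && size == items.length / 4) {
            items = Arrays.copyOf(items, items.length / 2); // shrink at 25% full
        }
        return item;
    }
}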
Although an ArrayList with shrinking would still have amortized O(1) operations, each operation involves more work.
Shrinking only saves some memory space in exchange for extra computation, which is rarely a wise trade: Moore's law has memory capacity doubling roughly every two years, so in an algorithm time is usually more valuable than space.
I need to implement a large octree in Java. The tree will be very large so I will need to use a paging mechanism and split the tree into many small files for long-term storage.
My concern is that Java objects have a very high space overhead. If I were using C, I would be able to store the 8 reference pointers and any other data with only a byte or so of overhead to store the node type.
Is there any way that I can approach this level of overhead in Java?
I would be tempted to just use a single byte array per file. I could then use offsets in place of pointers (this is how I plan on storing the files). However, even when limiting the max file size, that would easily leave me with arrays too large to fit in contiguous memory, particularly if that memory becomes heavily fragmented. This would also lead to large time overheads for adding new nodes, as the entire space would need to be reallocated. A ByteBuffer might resolve the first problem (I am not entirely sure of this); however, it would not solve the second, since a ByteBuffer's size is fixed once allocated.
For the moment I will just stick to using node objects. If anyone knows a more space-efficient solution with a low time cost, please let me know.
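One way to approach C-like density, along the lines of the offset idea above (purely a sketch; all names are illustrative, and growing the pool still costs a copy): keep the nodes in a growable int[] pool and use indices instead of object references, so the per-node cost is just the raw field data.

import java.util.Arrays;

class OctreeNodePool {
    private static final int NODE_SIZE = 9; // 8 child slots + 1 type/flags word
    private int[] data = new int[NODE_SIZE * 1024];
    private int count = 0;

    // Allocates a node and returns its index; -1 marks an absent child.
    int allocate() {
        int base = count * NODE_SIZE;
        if (base + NODE_SIZE > data.length) {
            // Grow by ~1.5x to amortize the cost of copying.
            data = Arrays.copyOf(data, data.length + (data.length >> 1));
        }
        Arrays.fill(data, base, base + NODE_SIZE, -1); // no children yet
        data[base + 8] = 0;                            // clear the flags word
        return count++;
    }

    int child(int node, int octant) { return data[node * NODE_SIZE + octant]; }
    void setChild(int node, int octant, int child) { data[node * NODE_SIZE + octant] = child; }
    int flags(int node) { return data[node * NODE_SIZE + 8]; }
    void setFlags(int node, int f) { data[node * NODE_SIZE + 8] = f; }
}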
I'm programming something in Java; for context, see this question: Markov Model descision process in Java
I have two options:
byte[][] mypatterns = new byte[MAX][4];
or
ArrayList<byte[]> mypatterns = new ArrayList<>();
I can use a Java ArrayList and append new arrays whenever I create them, or use a static array by calculating all possible data combinations, then looping through to see which indexes are 'on or off'.
Essentially, I'm wondering if I should allocate a large block that may contain uninitialized values, or use the dynamic array.
My code runs once per frame, so looping through 200 elements every frame could be very slow, especially because I will have multiple instances of this loop.
Based on theory and what I have heard, dynamic arrays are very inefficient.
My question is: Would looping through an array of say, 200 elements be faster than appending an object to a dynamic array?
Edit>>>
More information:
I will know the max length of the array if it is static.
The items in the array will change frequently, but their sizes are constant, so I can easily change them in place.
Allocating it statically would effectively make it a memory pool.
Some instances may have more of the data initialized than others.
You're right, really; I should use a profiler first, but I'm also just curious about the question 'in theory'.
The "theory" is too complicated. There are too many alternatives (different ways to implement this) to analyse. On top of that, the actual performance for each alternative will depend on the the hardware, JIT compiler, the dimensions of the data structure, and the access and update patterns in your (real) application on (real) inputs.
And the chances are that it really doesn't matter.
In short, nobody can give you an answer that is well founded in theory. The best we can give is recommendations that are based on intuition about performance, and / or based on software engineering common sense:
simpler code is easier to write and to maintain,
a compiler is a more consistent1 optimizer than a human being,
time spent on optimizing code that doesn't need to be optimized is wasted time.
1 - Certainly over a large code-base. Given enough time and patience, a human can do a better job on some problems, but that is not sustainable over a large code-base, and it ignores the facts that 1) compilers are always being improved, 2) optimal code can depend on things a human cannot take into account, and 3) a compiler doesn't get tired and make mistakes.
The fastest way to iterate over bytes is as a single array. It can be even faster to process them as int or long values, since handling 4-8 bytes at a time beats handling one byte at a time, though it rather depends on what you are doing. Note: a byte[4] actually occupies 24 bytes on a 64-bit JVM (object header plus padding), which means you are not making efficient use of your CPU cache. If you don't know the exact size you need, you might be better off creating a buffer larger than you need, even if you don't use all of it; in the case of the byte[MAX][4], you are already using 6x the memory you really need.
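To illustrate the point about per-array overhead, each 4-byte pattern can be packed into a single int, so the data lives in one contiguous array with no per-pattern object headers. A sketch with illustrative names:

class PackedPatterns {
    static int pack(byte a, byte b, byte c, byte d) {
        return ((a & 0xFF) << 24) | ((b & 0xFF) << 16) | ((c & 0xFF) << 8) | (d & 0xFF);
    }

    static byte unpack(int pattern, int index) {
        // index 0..3 selects a byte, highest-order first
        return (byte) (pattern >>> (24 - 8 * index));
    }

    public static void main(String[] args) {
        int[] patterns = new int[200]; // one int per pattern instead of one byte[4] object
        patterns[0] = pack((byte) 1, (byte) 2, (byte) 3, (byte) 4);
        System.out.println(unpack(patterns[0], 2)); // prints 3
    }
}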
Any performance difference will not be visible when you set the initialCapacity on the ArrayList. You say that your collection's size can never change, but what if this logic changes?
Using an ArrayList you get access to a lot of methods, such as contains.
As other people have said already, use ArrayList unless performance benchmarks say it is a bottleneck.
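For example, passing the known size as the initial capacity means the backing array is allocated once, with no intermediate reallocations (MAX as in the question):

ArrayList<byte[]> mypatterns = new ArrayList<>(MAX); // backing array sized once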
I have an ArrayList filled with 1.5 million objects of some class. When I sort this list using the Collections.sort method, the allocated memory of the JVM increases dramatically.
So my questions are:
Is that normal? What could be the reasons for it? Is this a matter of the garbage collector working too slowly or not being run often enough? Do the objects in the list have to fulfill certain specifications to consume less memory during sorting (besides not containing that much data)?
Thx!
In order to sort a List, the default sorting implementation first creates an array copy of all elements to be sorted. This causes the additional heap consumption you observe while sorting. The copying is necessary since a generic sorting algorithm has no knowledge of the list's structure, for example whether it offers random access or not.
With Java 8, however, the sorting implementation was changed to be delegated to each List implementation; this became possible with default methods. For an ArrayList, the additional overhead could be removed by sorting the backing array directly. An upgrade to Java 8 would therefore most likely resolve your problem.
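Roughly, the pre-Java-8 path looks like this; a simplified sketch of the documented behavior, not the actual JDK source:

import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

class LegacyListSort {
    // Sort via a full copy; the toArray() call is where the extra heap goes.
    @SuppressWarnings("unchecked")
    static <T extends Comparable<? super T>> void sort(List<T> list) {
        Object[] a = list.toArray();          // copies all element references
        Arrays.sort(a);                       // TimSort on the copy
        ListIterator<T> it = list.listIterator();
        for (Object e : a) {                  // write the sorted order back
            it.next();
            it.set((T) e);
        }
    }
}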
There is nothing wrong with garbage collection for your problem. Large arrays are unfortunately heavy to handle because they probably do not fit into the young generation and can eventually trigger a full collection.
Furthermore, as mentioned in the comments, the actual sorting has been performed via TimSort by the Arrays::sort implementation since Java 7. TimSort requires additional heap space. From the javadoc:
Temporary storage requirements vary from a small constant for nearly sorted
input arrays to n/2 object references for randomly ordered input arrays.
If this is not acceptable for your use case, you can switch back to the previous merge-sort implementation by setting the system property java.util.Arrays.useLegacyMergeSort to true.
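For example (the jar name is a placeholder):

java -Djava.util.Arrays.useLegacyMergeSort=true -jar yourApp.jar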
Even so, TimSort is still more efficient than the legacy merge sort, as merge sort requires another full array copy.
If I have to store 3 integer values and would just like to retrieve them (no calculation is required), which of the following would be the better option?
int i, j, k;
or
int[] arr = new int[3];
An array would be allocated as 3 contiguous blocks of memory (after the JVM allocates the space), whereas the separate variables could be placed at arbitrary memory locations (which I guess would take the JVM less time than allocating an array).
Apologies if the question is too trivial.
The answer is: It depends.
You shouldn't think too much about the performance implications in this case. The performance difference between the two is not big enough to notice.
What you really need to be on the look out for is readability and maintainability.
If i, j, and k all essentially mean the same thing, you're going to use them the same way, and you feel like you might want to iterate over them, then it might make sense to use an array, so that you can iterate over them more easily.
If they're different values with different meanings, and you're going to use them differently, then it does not make sense to put them in an array. They should each have their own identity and their own descriptive variable name.
Choose whichever makes most sense semantically:
If these variables are three for a fundamental reason (maybe they are coordinates in the 3D space of a 3D game engine), then use three separate variables (because making, say, a 4D game engine is not a trivial change).
If these variables are three now but they could be trivially changed to be four tomorrow, it's reasonable to consider an array (or, better yet, a new type that contains them).
In terms of performance, local variables are traditionally faster than arrays. Under specific circumstances the array may be allocated on the stack (via escape analysis), and under specific circumstances the bounds checks can be removed.
But don't make decisions based on performance unless you have done everything else correctly first, you have thorough tests, this particular piece of code is a performance-critical hot spot, and you're sure it is the bottleneck of your application at the moment.
It depends on how you would access them. An array is of course an overhead, because you first compute a reference to a value and then fetch it. So if these values are totally unrelated, an array is bad, and it may even count as code obfuscation. But naming variables i, j, k is a sort of obfuscation too. Obfuscation is better done automatically at the build stage; there are tools like ProGuard that can do it.
The two are not the same at all and serve different purposes.
In the first example you gave, int i, j, k;, the values live on the stack.
The stack is for short-term use and small data sizes, i.e. function call arguments and iterator states.
In the second example you gave, int[] arr = new int[3];, the new keyword allocates actual memory on the heap that was given to the process by the operating system.
The stack is optimized for short-term use, and (most) all CPUs have registers dedicated to pointing at the stack top and base, making the stack a great place for small short-lived variables. The stack is also limited in size; a JVM thread stack is typically only around 512 KB to 1 MB by default.
The heap, on the other hand, is proper memory allocation for large data types and proper memory management.
So the two may be used for the same thing, but that does not mean it's right.
Arrays/objects/dictionaries go in memory allocated from the heap; function arguments (and usually iterator indexes) go on the stack.
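As an aside, the per-thread stack size of the JVM is configurable via the -Xss flag (the class name here is a placeholder):

java -Xss2m MyApp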
It depends, but most probably, using distinct variables is the way to go.
In general, don't do micro-optimizations. Nobody will ever notice any difference in performance. Readable and maintainable code is what really matters in high-level languages.
See this article on micro-optimizations.