I have an ArrayList filled with 1.5 million objects of some class. When I sort this list using the Collections.sort method, the allocated memory of the JVM increases dramatically.
So my questions are:
Is that normal? What could be the reasons for it? Is this a matter of the garbage collector working too slowly or not running often enough? Do the objects in the list have to meet certain requirements to consume less memory during sorting (besides not containing that much data)?
Thx!
In order to sort a List, the default sorting implementation first creates an array copy of all the elements to be sorted. This causes the additional heap consumption you observe while sorting. The copy is necessary because a generic sorting algorithm has no knowledge of the list's structure, for example whether it is random-access or not.
In Java 8, however, sorting was changed to be delegated to each List implementation, which became possible through default methods. For an ArrayList, this additional overhead could thus be removed by sorting the backing array directly. An upgrade to Java 8 would therefore most likely resolve your problem.
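For illustration, here is a minimal sketch of the two entry points (the list contents are placeholders); on Java 8+ both calls end up in ArrayList's own sort, which works directly on the backing array:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class SortDemo {
        public static void main(String[] args) {
            List<String> list = new ArrayList<>(Arrays.asList("banana", "apple", "cherry"));

            // Java 8: List.sort is a default method; ArrayList overrides it
            // to sort its backing array in place, without first copying the
            // elements to a temporary array.
            list.sort(Comparator.naturalOrder());

            // Equivalent; on Java 8+ this simply delegates to list.sort(...).
            Collections.sort(list);
        }
    }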
There is nothing wrong with garbage collection in your case. Large arrays are unfortunately expensive to handle because they typically do not fit into the young generation and can eventually trigger a full collection.
Furthermore, as mentioned in the comments, since Java 7 the actual sorting is performed by the Arrays::sort implementation using TimSort. TimSort requires additional heap space. From the Javadoc:
Temporary storage requirements vary from a small constant for nearly sorted
input arrays to n/2 object references for randomly ordered input arrays.
If this is not acceptable for your use case, you can switch back to the previous merge-sort implementation by setting the system property java.util.Arrays.useLegacyMergeSort to true.
That said, TimSort is still generally more efficient than the legacy merge sort, as the latter requires yet another full copy of the array.
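If you do want the legacy behavior, a minimal sketch follows; note the assumption that the property is in place before the first object sort, since it is only consulted when the sorting code is first used (passing -Djava.util.Arrays.useLegacyMergeSort=true on the command line is the safer option):

    import java.util.Arrays;

    public class LegacySortDemo {
        public static void main(String[] args) {
            // Must happen before the first Arrays.sort / Collections.sort
            // call on objects anywhere in the JVM.
            System.setProperty("java.util.Arrays.useLegacyMergeSort", "true");

            Integer[] values = {3, 1, 2};
            Arrays.sort(values); // now uses the legacy merge sort
        }
    }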
To put it differently, what do I gain by using Stream.Builder.add() to add items to the builder and then using Stream.Builder.build(), versus adding the items in a collection or array and creating a Stream from that?
I assume there is a benefit somewhere in some circumstances but it's not obvious to me...
Assuming the machine has enough memory, using Stream.Builder allows you to add more than Integer.MAX_VALUE elements to it.
Internally, Stream.Builder uses a SpinedBuffer, which is a non-public class.
From SpinedBuffer docs:
An ordered collection of elements. Elements can be added, but not removed.
Goes through a building phase, during which elements can be added, and a
traversal phase, during which elements can be traversed in order but no
further modifications are possible.
One or more arrays are used to store elements. The use of multiple
arrays has better performance characteristics than a single array used by
ArrayList, as when the capacity of the list needs to be increased
no copying of elements is required. This is usually beneficial in the case
where the results will be traversed a small number of times.
So, it also avoids ArrayList resizing.
It has a very interesting internal data structure that is highly optimized for insertions, without the possibility of removal or random access.
Your array and/or Collection pay an additional price for all other functionality that they support.
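A minimal sketch of the builder in use (the element values are placeholders):

    import java.util.stream.Stream;

    public class BuilderDemo {
        public static void main(String[] args) {
            // Elements accumulate in the SpinedBuffer-backed builder;
            // growing it never copies already-stored elements.
            Stream.Builder<String> builder = Stream.builder();
            builder.add("a");
            builder.add("b");
            builder.add("c");
            builder.build().forEach(System.out::println);
        }
    }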
A long time ago I watched a video lecture from the Princeton Coursera MOOC Introduction to Algorithms, which can be found here. It explains the cost of resizing an ArrayList-like structure while adding or removing elements. It turns out that if we want our data structure to support resizing, the add and remove operations go from O(1) to amortized O(1).
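As a concrete sketch of why the cost stays amortized constant, here is a small simulation (the growth formula is the one ArrayList uses; the element count is a placeholder):

    public class GrowthDemo {
        public static void main(String[] args) {
            int capacity = 10; // ArrayList's default capacity
            int resizes = 0;
            for (int size = 1; size <= 1_500_000; size++) {
                if (size > capacity) {
                    capacity = capacity + (capacity >> 1); // 1.5x growth
                    resizes++;
                }
            }
            // Only ~30 resizes for 1.5 million adds, so the total copying
            // work is O(n) overall, i.e. amortized O(1) per add.
            System.out.println("resizes: " + resizes);
        }
    }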
I have been using Java ArrayLists for a couple of years. I had always been sure that they grow and shrink automatically. Only recently, to my great surprise, was I proven wrong in this post. Java ArrayLists do not shrink automatically (even though, of course, they do grow automatically).
Here are my questions:
In my opinion, providing shrinking in ArrayLists would do no harm, as the performance is already amortized O(1) per operation. Why did the Java creators not include this feature in the design?
I know that other data structures, like HashMaps, also do not shrink automatically. Is there any other data structure in Java built on top of arrays that supports automatic shrinking?
What are the tendencies in other languages? What does automatic shrinking look like for lists, dictionaries, maps, and sets in Python, C#, etc.? If they go in the opposite direction to Java, then my question is: why?
The comments already cover most of what you are asking. Here are some thoughts on your questions:
When creating a structure like the ArrayList in Java, the developers make certain decisions regarding runtime and performance. They evidently decided to exclude shrinking from the “normal” operations to avoid the additional runtime cost it would incur.
The question is why you would want to shrink automatically. The ArrayList does not grow that much (the factor is about 1.5; newCapacity = oldCapacity + (oldCapacity >> 1), to be exact). Maybe you also insert in the middle and not just append at the end; then a LinkedList (which is not based on an array, so no shrinking is needed) might be better. It really depends on your use case. If you think you really need everything an ArrayList does, but it has to shrink when removing elements (I doubt you really need this), just extend ArrayList and override the methods, as sketched below. But be careful! If you shrink at every removal, you are back at O(n).
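A minimal sketch of that idea, using ArrayList's public trimToSize() as the shrinking mechanism (the class name is made up for illustration):

    import java.util.ArrayList;

    // Hypothetical example: an ArrayList that trims its backing array
    // after every removal. trimToSize() copies the elements into an
    // exactly-sized array, so each remove becomes O(n).
    class ShrinkingArrayList<E> extends ArrayList<E> {
        @Override
        public E remove(int index) {
            E removed = super.remove(index);
            trimToSize(); // O(n) copy on every removal
            return removed;
        }
    }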
The C# List and the C++ vector behave the same way with regard to shrinking on removal of elements. But the factors for automatic growth vary; even some Java implementations use different factors.
Another issue with automatic shrinking is that you could get into really horrible 'pathological' situations in which every insert into and delete from the list causes the backing array to grow or shrink.
For example, if the backing array's initial capacity is 10, such that the array would grow upon the insertion of the 11th element (capacity would grow to 15), a natural implementation would be to shrink the backing array once the list size dropped below 11. If you had a list whose length kept changing between 10 and 11, you'd be constantly changing the backing array. Not only would that add runtime overhead, but you could start putting lots of pressure on the garbage collector if every operation resulted in a 10- or 15-element array becoming garbage.
Although an ArrayList with shrinking would still have amortized O(1) add and remove operations, each operation would involve more work.
Shrinking just saves some memory at the cost of extra computation, which is not a wise trade-off: Moore's law says that memory capacity roughly doubles every two years, so in algorithm design, time is generally more valuable than space.
I'm returning large collections from a DTO object and was wondering if anyone could point me in the right direction. Any type of collection will do, but I don't know which one is best suited to returning a large number of objects.
I know this can change based on threading and the like, but I'm at least looking for general guidance and benchmarks. Also, I'm required to stay within standard Java collections (no third-party libraries).
As irreputable says: if you need a simple collection, then ArrayList should perform well, because it is backed by a plain array, which is fast by definition and can be copied with system functions such as System.arraycopy.
If you set the initial capacity to a higher value (I don't know what you call a large number), then it will be even faster, because this reduces the amount of incremental reallocation.
Any other collection has some kind of overhead, like computing hash codes or being synchronized.
An ArrayList initialized to the correct size (if you know how many DTOs you'll be adding, or an upper bound) is the simplest and smallest Collection you'll find. By setting its size at initialization, it won't need to resize its internal array, an operation which produces garbage. It's better than directly using an array, which is really low-level and which you'd need to manage manually if it needed resizing (that's what the ArrayList does for you).
To create a pre-sized ArrayList, use the ArrayList(int initialCapacity) constructor.
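A minimal sketch (expectedCount and the string elements are placeholders for your own count and DTOs):

    import java.util.ArrayList;
    import java.util.List;

    public class PreSizedDemo {
        public static void main(String[] args) {
            int expectedCount = 100_000; // a known size or upper bound

            // The backing array is allocated once; no intermediate
            // arrays become garbage while the list fills up.
            List<String> results = new ArrayList<>(expectedCount);
            for (int i = 0; i < expectedCount; i++) {
                results.add("dto-" + i); // stands in for a real DTO
            }
        }
    }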
I'm a student and fairly new to Java. I was looking over the different speeds achieved by two collections in Java, LinkedList and ArrayList. I know that an ArrayList is much, much faster at looking up values and placing them into its indexes. My question is:
how can one make a linked list faster, if at all possible?
Thanks for any help.
zmahir
When talking about speed, perhaps you mean complexity. Indexed retrieval (and appending) for an ArrayList, as for arrays, is O(1), while indexed operations on a LinkedList are O(n). And this cannot be changed; it is 'by definition'.
O(n) means that, in order to insert an object at a given position or retrieve it, you must traverse, in the worst case, all n items in the list. Hence n operations. For an ArrayList this is a single operation.
You probably can't. You don't know the size (well, OK, you can), nor the location of each element. To find element 100 in a linked list, you need to start with item 1, follow its link to item 2, and so on until you find item 100. This makes inserting into such a list at a given index a tedious job.
There are many alternatives depending on your exact goals. You can use B-trees or similar structures to split the large linked list into smaller ones, or use hash-based structures if you want to find items quickly, or use simple arrays. But if you want a list that performs like an ArrayList, why not use an ArrayList?
You can split off regions which are linked to the main linked list; this gives you entry points directly inside the list, so you don't have to walk up to them. See the subList method here: http://download.oracle.com/javase/1.4.2/docs/api/java/util/AbstractList.html. This is useful if you have a number of 'sentences' made out of words, say. You can use a separate linked list to iterate over the sentences, which are sublists of the main linked list.
You can also use a ListIterator when adding, removing, or accessing elements. This helps greatly with the speed of sequential access. See the listIterator method and the ListIterator class: http://download.oracle.com/javase/1.4.2/docs/api/java/util/ListIterator.html.
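A minimal sketch of the iterator-based approach (the list contents are placeholders):

    import java.util.Arrays;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.ListIterator;

    public class ListIteratorDemo {
        public static void main(String[] args) {
            List<Integer> list = new LinkedList<>(Arrays.asList(1, 2, 3, 4));

            // The iterator keeps its position in the chain, so each add()
            // is O(1) instead of an O(n) walk from the head.
            ListIterator<Integer> it = list.listIterator();
            while (it.hasNext()) {
                int value = it.next();
                if (value % 2 == 0) {
                    it.add(value * 10); // inserted right after the current element
                }
            }
            System.out.println(list); // [1, 2, 20, 3, 4, 40]
        }
    }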
The speed of a linked list can be improved by using skip lists: http://igoro.com/archive/skip-lists-are-fascinating/
A linked list uses pointers to walk through its items, so, for example, if you ask for the 5th item, the runtime will start from the first item and walk through each pointer until it reaches the 5th item.
There is really not much you can do about it. A linked list may not be a good choice if you need fast access to items. There are some optimizations, such as creating a circular linked list or a doubly linked list where you can walk back and forth, but this really depends on the business logic and the application requirements.
My advice is to avoid linked lists if they do not match your needs; changing to a different data structure might be the best approach.
As a general rule, data structures are designed to do certain things well. LinkedLists are designed to be faster than ArrayLists at inserting elements and removing elements and about the same as ArrayLists at iterating across the list in order. When you change the way a LinkedList works, you make it no longer a true LinkedList, so there's not really any way to modify them to be faster at something and still be a LinkedList.
You'll need to examine the way you're using this particular collection and decide whether a LinkedList is really the best data structure for your purposes. If you share with us how you're using it, and why you need it to be faster, then we can advise you on which data structure you ought to consider using.
Lots of people smarter than you or I have looked at the implementation of the Java collection classes. If there were an optimization to be made, they would have found it and already made it.
Since the collection classes are pretty much as optimized as they can be, our primary task should be to choose the correct one.
When choosing your collection type, don't forget about things like HashSet. If order doesn't matter, and you don't need to put duplicates in the collection, then HashSet may be appropriate.
I'm a student and fairly new to Java. ... how can one make a linked list faster, if at all possible?
The standard Java collection types (indeed, all data structures implemented in any language!) represent compromises on various "measures", such as:
The amount of memory needed to represent the data structure.
The time taken to perform various operations; e.g. for a "list" the operations of interest are insertion, removal, indexing, contains, iteration and so on.
How easy or hard it is to integrate / reuse the collection type; see below.
So for instance:
ArrayList offers lower memory overheads, fast indexing (O(1)), but slow contains, random insertion and removal (O(N)).
LinkedList has higher memory overheads, slow indexing and contains (O(N)), but faster removal (O(1)) under certain circumstances (e.g., removing the element an iterator is currently positioned at).
The various performance measures are typically determined by the mathematics of the underlying data structures. For example, if you have a chain of nodes, the only way to get the i-th node is to step through them from the beginning. This involves following i pointers.
Sometimes you can modify the data structures to improve one aspect of the performance, but this typically comes at the cost of some other aspect. (For example, you could add a separate index to make indexing of a linked list faster. But the cost of maintaining the index on insertion / deletion would mean that you'd probably be better off using an ArrayList.)
In some cases the integration / reuse requirements have significant impact on performance.
For example, it is theoretically possible to optimize a linked list's space usage by adding a next field to the list element type, combining the element and node objects and saving 16 or so bytes per list entry. However, this would make the list type less general (the member/element class would need to implement a specific interface), and would impose the restriction that an element can belong to at most one list at any time. These restrictions are so limiting that this approach is rarely used in Java.
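A hedged sketch of that idea (the interface and class names are made up for illustration):

    // Hypothetical "intrusive" list: the element itself carries the next
    // pointer, so no separate node object is allocated per entry. The cost:
    // elements must implement this interface, and each element can belong
    // to at most one list at a time.
    interface Linkable<T extends Linkable<T>> {
        T getNext();
        void setNext(T next);
    }

    class Task implements Linkable<Task> {
        final String name;
        private Task next;

        Task(String name) { this.name = name; }
        public Task getNext() { return next; }
        public void setNext(Task next) { this.next = next; }
    }

    class IntrusiveList<T extends Linkable<T>> {
        private T head;

        void addFirst(T element) {
            element.setNext(head); // O(1), no node allocation
            head = element;
        }
    }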
For a second example, consider the problem of inserting at a given position in a linked list. For the LinkedList class, this is normally an O(N) operation, because you have to step through the list to find the position. In theory, if an application could find and remember a position, it should be able to perform the insertion at that position in O(1). Unfortunately, the List APIs provide no way to "remember" a position.
While neither of these examples is a fundamental roadblock to a developer "doing his own thing", they illustrate that using general data structure APIs and general implementations of those APIs has performance implications, and therefore represents a trade-off between performance and ease-of-use.
I'm a bit surprised by the answers here. There is a big difference between the theoretical performance of LinkedLists and ArrayLists and the actual performance of the Java implementations.
What makes the Java LinkedList slower than a theoretical linked list is that it does a lot more than just the core operations. For example, it checks for concurrent modifications and performs other safety checks.
If you know your use case, you can write your own simple implementation of a linked list, and it will be much faster.
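For illustration, a minimal sketch of such a stripped-down list, assuming only push-front and in-order iteration are needed (no bounds checks, no modification counting):

    import java.util.Iterator;

    class SimpleLinkedList<E> implements Iterable<E> {
        private static final class Node<E> {
            final E value;
            final Node<E> next;
            Node(E value, Node<E> next) { this.value = value; this.next = next; }
        }

        private Node<E> head;

        void addFirst(E value) {
            head = new Node<>(value, head); // O(1), no safety checks
        }

        public Iterator<E> iterator() {
            return new Iterator<E>() {
                private Node<E> current = head;
                public boolean hasNext() { return current != null; }
                public E next() {
                    E v = current.value;
                    current = current.next;
                    return v;
                }
            };
        }
    }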
I need a class that implements Iterable and does not need to be safe for concurrent usage. Of the various options, such as LinkedList, HashSet, ArrayList, etc., which is the lightest-weight?
To clarify the use-case, I need to be able to add a number of objects to the Iterable (typically 3 or 4), and then something else needs to iterate over it.
ArrayList. From the Javadoc:
The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking). The constant factor is low compared to that for the LinkedList implementation.
That entirely depends on what you mean by "lightest weight". What operations do you need to do, and how often? Do you know the final size beforehand? Are you trying to save execution time or memory?
I would agree with zkarthik that ArrayList is very often a good choice... but it will behave very badly if you want to create a large collection and then repeatedly remove the first element, for example. There's a good reason for there being so many different collections: they have different performance characteristics for different situations.
They all have very different features and behavior, so you should base your choice on how you will use them. For example, for random access and high locality, use an ArrayList; if you need fast unordered insertion and querying, use a HashSet.
If by 'lightweight' you mean 'best performance', then the question is almost impossible to answer without understanding how the collection will be used. All you've told us so far is that it doesn't need to support concurrent usage, but in order to have any hope of answering the question we'd need to know things like:
How many objects will be stored in the collection (on average)
What is the relative frequency of read and write access
Is random-access required
Is ordered access required
A number of people have suggested ArrayList may be best. However, I seem to recall reading (possibly in Effective Java, 2nd edition) that for certain patterns of usage a Queue performs better than a List, because it does not incur the penalty of supporting random access. In other words, you can add/remove items from a List in any order, but you can only add/remove items in a queue in a specific order (i.e., add to the tail and remove from the head).
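If the access pattern really is strictly add-to-tail/remove-from-head, a sketch of that idea using ArrayDeque (whose Javadoc notes it is likely to be faster than LinkedList when used as a queue):

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class QueueDemo {
        public static void main(String[] args) {
            // O(1) at both ends, with no per-element node allocation
            // (unlike LinkedList) and no O(n) shift (unlike ArrayList.remove(0)).
            Deque<String> queue = new ArrayDeque<>();
            queue.addLast("first");
            queue.addLast("second");
            System.out.println(queue.pollFirst()); // first
        }
    }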