NOTE: As the title already hints, this question is not about the specific java.util.ArrayList implementation of an array-based list, but rather about the raw arrays themselves and how they might behave in a "pure" (meaning completely unoptimized) array-based list implementation. I chose to mention java.util.ArrayList because it is the most prominent example of an array-based list in Java, although it is technically not "pure", as it utilizes preallocation to reduce the operation time of add(). If you want to know why I am asking this specific question without being interested in the java.util.ArrayList() preallocation optimization, I added a little explanation of my use case below.
It is generally known that you can access elements in array-based lists (like Java's ArrayList<E>) with a time complexity of O(1), while adding elements to that list will take O(n). With linked lists, it is the other way round (for a doubly linked list, you could optimize the access to half the execution time).
The reason why adding elements to an array-based list takes O(n) is that an array cannot simply be resized, but has to be reallocated and re-filled. The easiest way to do this would be:
String[] arr = new String[n];
//...
String newElem = "foo";
String[] newArr = new String[n + 1];
int i = 0;
for (String elem : arr) {
    newArr[i++] = elem; // copy each existing element into the new array
}
newArr[i] = newElem; // append the new element in the last slot
arr = newArr;
The time complexity O(n) is clearly visible thanks to the for loop. But there are other ways to copy arrays in Java, for example System.arraycopy().
Sticking to the vanilla for loop solution, even shrinking an array will take O(n), because an array has a fixed size and in order to "shrink" it, you'd have to copy all elements to be retained to a new, smaller array.
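For reference, here is a minimal sketch of the same grow and shrink steps written with System.arraycopy() and Arrays.copyOf(). The class and method names are made up for illustration. Both variants still allocate a new array, since the length of a Java array object is fixed once it is created.

```java
import java.util.Arrays;

class ArrayResizeSketch {
    // Appends one element by allocating a larger array and bulk-copying the old contents.
    static String[] append(String[] arr, String newElem) {
        String[] newArr = new String[arr.length + 1];
        System.arraycopy(arr, 0, newArr, 0, arr.length); // copies all existing references
        newArr[arr.length] = newElem;
        return newArr;
    }

    // "Shrinks" by copying the retained prefix into a new, smaller array.
    static String[] shrink(String[] arr, int newLength) {
        return Arrays.copyOf(arr, newLength); // allocates a new array of newLength
    }

    public static void main(String[] args) {
        String[] arr = {"a", "b", "c"};
        String[] grown = append(arr, "foo");
        String[] shrunk = shrink(grown, 2);
        System.out.println(Arrays.toString(grown));
        System.out.println(Arrays.toString(shrunk));
    }
}
```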
So, here are my questions concerning such array operations and their time complexity:
While the vanilla for loop will always take O(n), is it possible that System.arraycopy() optimizes the "add" operation if there is enough space in the memory to expand the array in place, meaning that it would leave the original array at its place and just add the new element at the end of it?
As the shrinking operation could always be executed with O(1) in theory, does System.arraycopy() always optimize this operation to O(1)?
If System.arraycopy() is not capable of using those optimizations, is there any other way in Java to actually utilize those optimizations which are possible in theory OR will array "resizing" always take O(n), no matter under which circumstances?
TL;DR is there any situation in which the "resizing" of an array in Java will take less than O(n)?
Additional information:
I am using OpenJDK 11 (newest release), but if the answer turns out to be JVM-dependent, I'd like to know how other JVMs would behave in comparison.
For the curious ones who want to know what I want to do with this information:
I am working on a new java.util.List implementation, namely a hybrid list that can store data in an array and in a linked buffer. On certain occasions, the buffer will be flushed into the array, which of course requires that the existing array is resized. But apart from this idea, I want to utilize as many other optimizations on the array part as possible.
To avoid array resizing in general, I experimented with the idea of letting the array persist at a constant size while managing its "valid" range with some other fields. This means that popping the last element of the array would not shrink the array but rather the range of valid elements. Then, when inserting new elements into the array part, the formerly invalid section can be used to shift values into, basically reusing the space that was formerly used by a now-deleted element. If the insert operations exceed the actual array size, elements can still be transferred to the linked buffer to avoid resizing.
To further optimize this, I chose to use the middle of the array as a pivot when deleting certain elements, so the valid range might no longer start at the beginning of the array. Basically, if you delete an element to the left of the pivot, all elements between the start of the valid range and the deleted element get shifted to the right, towards the pivot. Removing an element to the right of the pivot works accordingly. So, after some removals, the array could look like this:
[null null|elem0 elem1 elem2||elem3 elem4 elem5|null null null]
(Where the | at the beginning and at the end mark the valid range and the || marks the pivot)
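To make the idea concrete, here is a rough sketch of a fixed-size array with a movable valid range and a pivot in the middle. Everything in it (the class name RangedArraySketch, the fields start and end, the use of String elements) is a hypothetical illustration, not my actual implementation.

```java
// Hypothetical sketch: the backing array never changes size, only the valid range does.
class RangedArraySketch {
    private final String[] data;
    private int start; // index of the first valid element
    private int end;   // index one past the last valid element

    RangedArraySketch(String[] initial) {
        data = initial.clone();
        start = 0;
        end = data.length;
    }

    int size() { return end - start; }

    int capacity() { return data.length; }

    String get(int index) { return data[start + index]; }

    // Removes the element at logical position index without reallocating.
    // The elements on the outer side of the removed slot are shifted toward the pivot.
    String remove(int index) {
        if (index < 0 || index >= size()) throw new IndexOutOfBoundsException();
        int pos = start + index;
        String removed = data[pos];
        int pivot = data.length / 2;
        if (pos < pivot) {
            // shift [start, pos) one slot to the right, then move start forward
            System.arraycopy(data, start, data, start + 1, pos - start);
            data[start++] = null;
        } else {
            // shift (pos, end) one slot to the left, then move end backward
            System.arraycopy(data, pos + 1, data, pos, end - pos - 1);
            data[--end] = null;
        }
        return removed;
    }
}
```

Each removal still shifts up to half of the valid range, so it stays O(n) in the worst case; what it avoids is the allocation of a new array.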
So, how is this all related to my question?
All of those optimizations build up upon the claim that array resizing is expensive in time, namely O(n). Therefore array resizing is avoided whenever possible. Those optimizations might sound neat, but the code implementing them can get quite messy, especially when implementing the batch operations (addAll(), removeAll(), retainAll()...). So, if it turns out that the array resizing operation itself can be less expensive in some cases (especially shrinking), I would cut out a lot of those optimizations which are then rendered useless, making the code a lot easier in the process.
So, before sticking to my optimization ideas and experiments, I'd like to know whether they are even needed.
Related
I have a need for a data structure that will be able to give preceding and following neighbors for a given int that is part of the structure.
Some criteria I've set for myself:
write once, read many times
contain 100 to 1000 int
be efficient: order of magnitude O(1)
be memory efficient (size of the ints + some housekeeping bits ideally)
implemented in pure Java (no libraries for this, as I want to learn)
items are unique
no concurrency requirements
ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (i.e. there is no contract whatsoever regarding the difference in value between two neighboring ints - any int may be greater or smaller than the int it precedes in the order).
This is in Java, and is mostly theoretical, as I've started using the solution described below.
Things I've considered:
LinkedHashSet: very quick to find an item, order of O(1), and very quick to retrieve next neighbor. No apparent way to get previous neighbor without reverse sorting the set. Boxed Integer objects only.
int[]: very easy on memory because no boxing required, very quick to get previous and next neighbor, retrieval of an item is O(n) though because index is not known and array traversal is required, and that is not acceptable.
What I'm using now is a combination of int[] and HashMap:
HashMap for retrieving index of a specific int in the int[]
int[] for retrieving the neighbors of that int
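A minimal sketch of that combination, with hypothetical names (NeighborIndexSketch, previous, next), could look like this. It assumes the values are unique, as stated in the criteria.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

class NeighborIndexSketch {
    private final int[] order;                                      // externally defined order, unboxed
    private final Map<Integer, Integer> indexOf = new HashMap<>();  // value -> position (boxed twice)

    NeighborIndexSketch(int[] values) {
        order = values.clone();
        for (int i = 0; i < order.length; i++) {
            indexOf.put(order[i], i); // values are assumed to be unique
        }
    }

    // Returns the value stored just before the given value, if any.
    OptionalInt previous(int value) {
        Integer i = indexOf.get(value);
        return (i == null || i == 0) ? OptionalInt.empty() : OptionalInt.of(order[i - 1]);
    }

    // Returns the value stored just after the given value, if any.
    OptionalInt next(int value) {
        Integer i = indexOf.get(value);
        return (i == null || i == order.length - 1) ? OptionalInt.empty() : OptionalInt.of(order[i + 1]);
    }
}
```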
What I like:
neighbor lookup is ideally O(1): two constant-time steps (one map lookup, one array access)
int[] does not do boxing
performance is theoretically very good
What I dislike:
HashMap does boxing twice (key and value)
the ints are stored twice (in both the map and the array)
theoretical memory use could be improved quite a bit
I'd be curious to hear of better solutions.
One solution is to sort the array when you add elements. That way, the previous element is always i-1 and to locate a value, you can use a binary search which is O(log(N)).
The next obvious candidate is a balanced binary tree. For this structure, insert is somewhat expensive but lookup is again O(log(N)).
If the values don't use the full 32-bit range (that is, they fall within a small, known range), then you can make the lookup faster by having a second array where each value is the index in the first and the index is the value you're looking for.
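As a rough sketch of that second-array idea, assuming the values are non-negative and bounded by a known maxValue (both assumptions, as are all names below):

```java
import java.util.Arrays;

class DirectIndexSketch {
    private final int[] order;    // the values in their external order
    private final int[] position; // position[value] = index into order, or -1 if absent

    // maxValue is an assumed upper bound on the (non-negative) values.
    DirectIndexSketch(int[] values, int maxValue) {
        order = values.clone();
        position = new int[maxValue + 1];
        Arrays.fill(position, -1);
        for (int i = 0; i < order.length; i++) {
            position[order[i]] = i;
        }
    }

    // Returns the index of value in the ordered array, or -1 if it is not present.
    int indexOf(int value) {
        return (value < 0 || value >= position.length) ? -1 : position[value];
    }

    int valueAt(int index) { return order[index]; }
}
```

The lookup needs no hashing and no boxing, but the second array costs memory proportional to the value range rather than to the number of items.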
More options: You could look at bit sets but that again depends on the range which the values can have.
Commons Lang has a hash map which uses primitive int as keys: http://grepcode.com/file/repo1.maven.org/maven2/commons-lang/commons-lang/2.6/org/apache/commons/lang/IntHashMap.java
but the type is internal, so you'd have to copy the code to use it.
That means you don't need to autobox anything (unboxing is cheap).
Related:
http://java-performance.info/implementing-world-fastest-java-int-to-int-hash-map/
HashMap and int as key
ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (ie. there is no contract whatsoever regarding the difference in value between two neighboring ints).
This says "Tree" to me. Like Aaron said, expensive insert but efficient lookup, which is what you want if you have write once, read many.
EDIT: Thinking about this a bit more, if a value can only ever have one child and one parent, and given all your other requirements, I think ArrayList will work just fine. It's simple and very fast, even though it's O(n). But if the data set grows, you'll probably be better off using a Map-List combo.
Keep in mind when working with these structures that the theoretical performance in terms of O() doesn't always correspond to real-world performance. You need to take into account your dataset size and overall environment. One example: ArrayList and HashMap. In theory, List is O(n) for unsorted lookup, while Map is O(1). However, there's a lot of overhead in creating and managing entries for a map, which actually gives worse performance on smaller sets than a List.
Since you say you don't have to worry about memory, I'd stay away from array. The complexity of managing the size isn't worth it on your specified data set size.
More specifically, suppose I have an array with duplicates:
{3,2,3,4,2,2,1,4}
I want to have a data structure that supports search and remove the first occurrence of some value faster than O(n), say if the value is 4, then it becomes:
{3,2,3,2,2,1,4}
I also need to iterate the list from head according to the same order. Other operations like get(index) or insert are not needed.
You can use O(n) time to record the original data(say it's an int[]) in your data structure, I just need the later search and remove faster than O(n).
"Search and remove" is considered as ONE operation as shown above.
If I have to make it myself, I would use a LinkedList to store the data, and HashMap to map every key to a list of all occurrence of nodes together with their previous and next ones.
Is it a right approach? Are there any better choices already there in Java?
The data structure you describe, essentially a hybrid linked list and map, I think is the most efficient way of handling your stated problem. You'll have to keep track of the nodes yourself, since Java's LinkedList doesn't provide access to the actual nodes. The AbstractSequentialList may be helpful here.
The index structure you'll need is a map from an element value to the appearances of that element in the list. I recommend a hash table from hashCode % modulus to a linked list of (value, list of main-list nodes).
Note that this approach is still O(n) in the worst case, when you have universal hash collisions; this applies whether you use open or closed hashing. In the average case it should be something closer to O(ln(n)), but I'm not prepared to prove that.
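A minimal sketch of such a hybrid, using java.util.HashMap in place of the hand-rolled hash table described above and hypothetical names throughout, might look like this:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FirstOccurrenceListSketch {
    private static final class Node {
        final int value;
        Node prev, next;
        Node(int value) { this.value = value; }
    }

    private Node head, tail;
    // For each value, its nodes in list order (the front is the first occurrence).
    private final Map<Integer, Deque<Node>> occurrences = new HashMap<>();

    // Builds the list and the index in O(n).
    FirstOccurrenceListSketch(int[] data) {
        for (int v : data) {
            Node n = new Node(v);
            if (tail == null) { head = n; } else { tail.next = n; n.prev = tail; }
            tail = n;
            occurrences.computeIfAbsent(v, k -> new ArrayDeque<>()).addLast(n);
        }
    }

    // Unlinks the first occurrence of value; returns false if the value is absent.
    boolean removeFirst(int value) {
        Deque<Node> nodes = occurrences.get(value);
        if (nodes == null) return false;
        Node n = nodes.pollFirst();
        if (nodes.isEmpty()) occurrences.remove(value);
        if (n.prev == null) head = n.next; else n.prev.next = n.next;
        if (n.next == null) tail = n.prev; else n.next.prev = n.prev;
        return true;
    }

    // Iterates from the head in the original order.
    List<Integer> toList() {
        List<Integer> out = new ArrayList<>();
        for (Node n = head; n != null; n = n.next) out.add(n.value);
        return out;
    }
}
```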
Consider also whether the overhead of keeping track of all of this is really worth the gains. Unless you've actually profiled running code and determined that a LinkedList is causing problems because remove is O(n), stick with that until you do.
Since your requirement is that the first occurrence of the element should be removed and the remaining occurrences retained, there would be no way to do it faster than O(n) as you would definitely have to move through to the end of the list to find out if there is another occurrence. There is no standard api from Oracle in the java package that does this.
For the method add of the ArrayList Java API states:
The add operation runs in amortized constant time, that is, adding n elements requires O(n) time.
I wonder if it is the same time complexity, linear, when using the add method of a LinkedList.
This depends on where you're adding. E.g. if in an ArrayList you add to the front of the list, the implementation will have to shift all items every time, so adding n elements will run in quadratic time.
Similar for the linked list, the implementation in the JDK keeps a pointer to the head and the tail. If you keep appending to the tail, or prepending in front of the head, the operation will run in linear time for n elements. If you append at a different place, the implementation will have to search the linked list for the right place, which might give you worse runtime. Again, this depends on the insertion position; you'll get the worst time complexity if you're inserting in the middle of the list, as the maximum number of elements have to be traversed to find the insertion point.
The actual complexity depends on whether your insertion position is constant (e.g. always at the 10th position), or a function of the number of items in the list (or some arbitrary search on it). The first one will give you O(n) with a slightly worse constant factor, the latter O(n^2).
In most cases, ArrayList outperforms LinkedList on the add() method, as it's simply saving a pointer to an array and incrementing the counter.
If the working array is not large enough, though, ArrayList grows the working array, allocating a new one and copying the content. That's slower than adding a new element to LinkedList, but if you constantly add elements, that only happens O(log(N)) times.
When we talk about "amortized" complexity, we take an average time calculated for some reference task.
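To get a feeling for the numbers, the following sketch simulates appending n elements to an array that grows by roughly 1.5x whenever it is full, and counts the reallocations and copied elements. The growth factor and the initial capacity are assumptions of the simulation, and the names are made up.

```java
class GrowthCountSketch {
    // Simulates n appends to an array that grows by about 1.5x when full.
    // Returns {number of reallocations, total number of elements copied}.
    static long[] simulate(int n, int initialCapacity) {
        int capacity = initialCapacity;
        long reallocations = 0;
        long copied = 0;
        for (int size = 0; size < n; size++) {
            if (size == capacity) {
                copied += size;                            // existing elements are copied over
                capacity = capacity + (capacity >> 1) + 1; // +1 so very small capacities still grow
                reallocations++;
            }
        }
        return new long[] {reallocations, copied};
    }

    public static void main(String[] args) {
        long[] r = simulate(1_000_000, 10);
        System.out.println("reallocations = " + r[0] + ", elements copied = " + r[1]);
    }
}
```

The number of reallocations grows logarithmically with n, while the total number of copied elements stays within a constant multiple of n, which is what "amortized constant time" per add() refers to.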
So, answering your question, it's not the same complexity: it's much faster (though still O(1)) in most cases, and much slower (O(N)) sometimes. What's better for you is better checked with a profiler.
If you mean the add(E) method (not the add(int, E) method), the answer is yes, the time complexity of adding a single element to a LinkedList is constant (adding n elements requires O(n) time)
As Martin Probst indicates, with different positions you get different complexities, but the add(E) operation will always append the element to the tail, resulting in a constant (amortized) time operation
I often* find myself in need of a data structure which has the following properties:
can be initialized with an array of n objects in O(n).
one can obtain a random element in O(1), after this operation the picked element is removed from the structure (without replacement).
one can undo p 'picking without replacement' operations in O(p)
one can remove a specific object (eg by id) from the structure in O(log(n))
one can obtain an array of the objects currently in the structure in O(n).
the complexity (or even possibility) of other actions (eg insert) does not matter. Besides the complexity it should also be efficient for small numbers of n.
Can anyone give me guidelines on implementing such a structure? I currently implemented a structure having all above properties, except the picking of the element takes O(d) with d the number of past picks (since I explicitly check whether it is 'not yet picked'). I can figure out structures allowing picking in O(1), but these have higher complexities on at least one of the other operations.
BTW:
note that O(1) above implies that the complexity is independent from #earlier picked elements and independent from total #elements.
*in monte carlo algorithms (iterative picks of p random elements from a 'set' of n elements).
HashMap has complexity O(1) both for insertion and removal.
You specify a lot of operations, but all of them are nothing other than insertion, removal and traversal:
can be initialized with an array of n objects in O(n).
n * O(1) insertion. HashMap is fine
one can obtain a random element in O(1), after this operation the picked element is removed from the structure. (without replacement)
This is the only op that requires O(n).
one can undo p 'picking without replacement' operations in O(p)
it's an insertion operation: O(1).
one can remove a specific object (eg by id) from the structure in O(log(n))
O(1).
one can obtain an array of the objects currently in the structure in O(n).
you can traverse a HashMap in O(n)
EDIT:
example of picking up a random element in O(n):
HashMap map ....
int randomIntFromZeroToYourHashMapSize = ...
Collection collection = map.values();
Object[] values = collection.toArray(); // O(n): copies every value into a new array
Object picked = values[randomIntFromZeroToYourHashMapSize];
Ok, same answer as 0verbose with a simple fix to get the O(1) random lookup. Create an array which stores the same n objects. Now, in the HashMap, store the pairs <object, index of that object in the array>. For example, say your Objects (strings for simplicity) are:
{"abc" , "def", "ghi"}
Create a list:
List<String> array = new ArrayList<>(Arrays.asList("abc", "def", "ghi"));
Create a HashMap map with the following values:
for (int i = 0; i < array.size(); i++)
{
    map.put(array.get(i), i);
}
O(1) random lookup is easily achieved by picking any index in the array. The only complication that arises is when you delete an object. For that, do:
Find object in map. Get its array index. Let's call this index i (i = map.get(object)) - O(1)
Swap array[i] with array[size of array - 1] (the last element in the array). Reduce the size of the array by 1 (since there is one less number now) - O(1)
Update the index of the new object in position i of the array in map (map.put(array[i], i)) - O(1)
I apologize for the mix of java and cpp notation, hope this helps
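Put together in plain Java, the steps above could be sketched roughly as follows. The class name IndexedBagSketch and its methods are hypothetical, and the objects are assumed to be unique and to have sensible equals()/hashCode() implementations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class IndexedBagSketch<T> {
    private final List<T> items = new ArrayList<>();
    private final Map<T, Integer> indexOf = new HashMap<>(); // object -> its index in items

    boolean add(T item) {
        if (indexOf.containsKey(item)) return false; // objects are assumed to be unique
        indexOf.put(item, items.size());
        items.add(item);
        return true;
    }

    // Removes item in expected O(1) by moving the last element into its slot.
    boolean remove(T item) {
        Integer i = indexOf.remove(item);
        if (i == null) return false;
        int last = items.size() - 1;
        T moved = items.remove(last); // removing the tail of an ArrayList does not shift anything
        if (i != last) {
            items.set(i, moved);
            indexOf.put(moved, i);    // the moved object now lives at index i
        }
        return true;
    }

    // Returns a random element in O(1); the bag must not be empty.
    T random(Random rnd) { return items.get(rnd.nextInt(items.size())); }

    boolean contains(T item) { return indexOf.containsKey(item); }

    int size() { return items.size(); }
}
```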
Here's my analysis of using Collections.shuffle() on an ArrayList:
✔ can be initialized with an array of n objects in O(n).
Yes, although the cost is amortized unless n is known in advance.
✔ one can obtain a random element in O(1), after this operation the picked element is removed from the structure, without replacement.
Yes, choose the last element in the shuffled array; replace the array with a subList() of the remaining elements.
✔ one can undo p 'picking without replacement' operations in O(p).
Yes, append the element to the end of this list via add().
❍ one can remove a specific object (eg by id) from the structure in O(log(n)).
No, it looks like O(n).
✔ one can obtain an array of the objects currently in the structure in O(n).
Yes, using toArray() looks reasonable.
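A small sketch of the shuffle-based approach, under the assumption that removing the last element of the ArrayList is used instead of subList() (the names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ShufflePickSketch {
    private final List<String> remaining;

    ShufflePickSketch(List<String> items, Random rnd) {
        remaining = new ArrayList<>(items);  // O(n) copy
        Collections.shuffle(remaining, rnd); // O(n) shuffle, done once up front
    }

    // Takes the last element of the shuffled list; O(1) for an ArrayList.
    String pick() { return remaining.remove(remaining.size() - 1); }

    // Undoes a pick by appending the element again; amortized O(1).
    void unpick(String item) { remaining.add(item); }

    int size() { return remaining.size(); }
}
```

Note that an element restored with unpick() sits at the tail again, so the next pick() would return it unless the list is reshuffled.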
How about an array (or ArrayList) that's divided into "picked" and "unpicked"? You keep track of where the boundary is, and to pick, you generate a random index below the boundary, then (since you don't care about order), swap the item at that index with the last unpicked item, and decrement the boundary. To unpick, you just increment the boundary.
Update: Forgot about O(log(n)) removal. Not that hard, though, just a little memory-expensive, if you keep a HashMap of IDs to indices.
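Here is a rough sketch of the picked/unpicked boundary on a plain int[]; the HashMap of IDs to indices is left out to keep it short, and all names are hypothetical.

```java
import java.util.Random;

class PickerSketch {
    private final int[] items;
    private int boundary; // items[0..boundary) are unpicked, items[boundary..length) are picked
    private final Random rnd;

    PickerSketch(int[] initial, Random rnd) {
        this.items = initial.clone(); // O(n) initialization
        this.boundary = items.length;
        this.rnd = rnd;
    }

    // Picks a random unpicked item in O(1) and moves it just behind the boundary.
    int pick() {
        if (boundary == 0) throw new IllegalStateException("nothing left to pick");
        int i = rnd.nextInt(boundary);
        int picked = items[i];
        items[i] = items[boundary - 1];
        items[boundary - 1] = picked;
        boundary--;
        return picked;
    }

    // Undoes the last p picks by moving the boundary back.
    void undo(int p) {
        boundary = Math.min(items.length, boundary + p);
    }

    int remaining() { return boundary; }
}
```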
If you poke around online you'll find various IndexedHashSet implementations that all work on more or less this principle -- an array or ArrayList plus a HashMap.
(I'd love to see a more elegant solution, though, if one exists.)
Update 2: Hmm... or does the actual removal become O(n) again, if you have to either recopy the arrays or shift them around?
I'm looking for a collection that offers list semantics, but also allows array semantics. Say I have a list with the following items:
apple orange carrot pear
then my container array would be:
container[0] == apple
container[1] == orange
container[2] == carrot
Then say I delete the orange element:
container[0] == apple
container[1] == carrot
I want to collapse gaps in the array without having to do an explicit resizing, i.e. if I delete container[0], then the container collapses, so that container[1] is now mapped as container[0], and container[2] as container[1], etc. I still need to access the list with array semantics, and null values aren't allowed (in my particular use case).
EDIT:
To answer some questions - I know O(1) is impossible, but I don't want a container with array semantics approaching O(log N). Sort of defeats the purpose, I could just iterate the list.
I originally had some verbiage here on sort order, I'm not sure what I was thinking at the time (Friday beer-o-clock most likely). One of the use-cases is a Qt list that contains images - deleting an image from the list should collapse the list, not necessarily take the last item from the list and throw it in its place. In this case, yes, I do want to preserve list semantics.
The key differences I see as separating list and array:
Array - constant-time access
List - arbitrary insertion
I'm also not overly concerned if rebalancing invalidates iterators.
You could do an ArrayList/Vector (Java/C++) and when you delete, instead swap the last element with the deleted element first. So if you have A B C D E, and you delete C, you'll end up with A B E D. Note that references to E will have to look at 2 instead of 4 now (assuming 0 indexed) but you said sort order isn't a problem.
I don't know if it handles this automatically (optimized for removing from the end easily) but if it's not you could easily write your own array-wrapper class.
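A minimal sketch of such a helper in Java, assuming order does not matter (the name swapRemove is made up):

```java
import java.util.List;

class SwapRemoveSketch {
    // Removes the element at index by overwriting it with the last element
    // and then dropping the tail. Does not preserve order.
    static <T> T swapRemove(List<T> list, int index) {
        int last = list.size() - 1;
        T removed = list.get(index);
        list.set(index, list.get(last));
        list.remove(last); // removing the tail of an ArrayList does not shift elements
        return removed;
    }
}
```

With the list A B C D E, swapRemove(list, 2) returns C and leaves A B E D.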
O(1) might be too much to ask for.
Is O(logn) insert/delete/access time ok? Then you can have a balanced red-black tree with order statistics: http://www.catonmat.net/blog/mit-introduction-to-algorithms-part-seven/
It allows you to insert/delete/access elements by position.
As Micheal was kind enough to point out, Java's TreeMap supports it: http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeMap.html
Also, not sure why you think O(logN) will be as bad as iterating the list!
From my comments to you on some other answer:
For 1 million items, using balanced red-black trees, the worst case is 2log(n+1), i.e. ~40. You need to do no more than 40 compares to find your element and that is the absolute worst case. Red-black trees also cater to the hole/gap disappearing. This is miles ahead of iterating the list (~1/2 million on average!).
With AVL trees instead of red-black trees, the worst case guarantee is even better: 1.44 log(n+1), which is ~29 for a million items.
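Spelled out for n = 10^6, taking the logarithm to base 2, the arithmetic behind those two figures is roughly:

```latex
\begin{align*}
\log_2(10^6 + 1) &\approx 19.93 \\
2 \log_2(10^6 + 1) &\approx 39.9 \quad \text{(red-black tree height bound)} \\
1.44 \log_2(10^6 + 1) &\approx 28.7 \quad \text{(AVL tree height bound)}
\end{align*}
```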
You should use a HashMap, then you will have O(1) expected insertion time; just do a mapping from integers to whatever.
If the order isn't important, then a vector will be fine. Access is O(1), as is insertion using push_back, and removal like this:
swap(container[victim], container.back());
container.pop_back();
EDIT: just noticed the question is tagged C++ and Java. This answer is for C++ only.
I'm not aware of any data structure that provides O(1) random access, insertion, and deletion, so I suspect you'll have to accept some tradeoffs.
LinkedList in Java provides O(1) insertion/deletion at the head or tail of the list, but random access is O(n).
ArrayList provides O(1) random access, but insertion/deletion is only O(1) at the tail of the list. If you insert/delete from the middle of the list, it has to move around the remaining elements in the list. On the bright side, it uses System.arraycopy to move elements, and it's my understanding that this is essentially O(1) on modern architectures because it literally just copies blocks of memory around instead of processing each element individually. I say essentially because there is still work to find enough contiguous blocks of free space, etc. and I'm not sure what the big-O might be on that.
Since you seem to want to insert at arbitrary positions in (near) constant time, I think using a std::deque is your best bet in C++. Unlike the std::vector, a deque (double-ended queue) is implemented as a list of memory pages, i.e. a chunked vector. This makes insertion and deletion at arbitrary positions a constant-time operation (depending only on the page size used in the deque). The data structure also provides random access (“array access”) in near-constant time – it does have to search for the correct page but this is a very fast operation in practice.
Java’s standard container library doesn’t offer anything similar but the implementation is straightforward.
Does the data structure described at http://research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html do anything like what you want?
What about ConcurrentSkipListMap?
Doesn't it do O(log N)?