I find myself frequently doing the following:
Iterator<A> itr = iterableOfA.iterator();
List<B> list = new ArrayList<>(); // how about LinkedList?
while (itr.hasNext()) {
    B obj = itr.next().getB();
    list.add(obj);
}
someMethod(list); // this method takes an Iterable
I have no idea just how many elements are likely to be in iterableOfA — could be 5, could be 5000. In this case, would LinkedList be a better implementation to use here (since list.add(obj) would then be O(1))? As it stands, if iterableOfA has 5000 elements, this will lead to many resizings of the backing array of list.
The other option is to do:
Iterator<A> itr = iterableOfA.iterator();
int size = Iterables.size(iterableOfA); // from Guava
List<B> list = new ArrayList<>(size);
// and the rest...
This means iterating over iterableOfA twice. Which option would be best when the size of the iterable is unknown and can vary wildly:
Just use ArrayList.
Just use LinkedList.
Count the elements in iterableOfA and allocate an ArrayList.
Edit 1
To clarify some details:
I am optimizing primarily for performance and secondarily for memory usage.
list is a short-lived allocation as at the end of the request no code should be holding a reference to it.
Edit 2
For my specific case, I realized that someMethod(list) doesn't handle an iterable with greater than 200 elements, so I decided to go with new ArrayList<>(200) which works well enough for me.
However, in the general case I would have preferred to implement the solution outlined in the accepted answer (wrap in a custom iterable, obviating the need for allocating a list).
All the other answers gave valuable insight into how suitable ArrayList is compared to LinkedList, so on behalf of the general SO community I thank you all!
Which option would be best when the size of the iterable is unknown and can vary wildly
It depends what you are optimizing for.
If you are optimizing for performance, then using ArrayList is probably faster. Even though ArrayList will need to resize the backing array, it does this using an exponential growth pattern. However, it depends on the overheads of iteration.
If you are optimizing for long-term memory usage, consider using ArrayList followed by trimToSize().
If you are optimizing for peak memory usage, the "count first" approach is probably the best. (This assumes that you can iterate twice. If the iterator is actually a wrapper for a lazy calculation, this may be impossible.)
If you are optimizing to reduce GC, then "count first" is probably best, depending on the details of the iteration.
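For the long-term-memory case above, a minimal sketch, reusing the A, B, getB() and someMethod placeholders from the question: fill the list first, then drop the surplus capacity with trimToSize().
ArrayList<B> list = new ArrayList<>();
for (A a : iterableOfA) {
    list.add(a.getB());
}
list.trimToSize(); // shrinks the backing array to the current element count
someMethod(list);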
In all cases you would be advised to:
Profile your application before you spend more time on this issue. In a lot of cases, you will find that it is simply not worth the effort of optimizing.
Benchmark the two alternatives that you are considering, using the classes and typical data structures from your application.
As it stands, if iterableOfA has 5000 elements, this will lead to many resizings of the backing array of list.
The ArrayList class resizes to a new size that is proportional to the current size. That means the number of resizings is O(log N), and the overall cost of N list append calls is O(N).
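To make that concrete, here is a rough simulation of the growth policy (the initial capacity of 10 and the ~1.5x factor reflect current JDK behavior but are implementation details, so treat the number as an estimate):
// Count how many resizes 5000 appends would trigger under a 1.5x growth policy.
int capacity = 10;
int resizes = 0;
for (int n = 1; n <= 5000; n++) {
    if (n > capacity) {
        capacity += capacity >> 1; // new capacity = old + old/2
        resizes++;
    }
}
System.out.println(resizes); // about 16 with these assumptions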
I would completely skip copying the elements to a new collection.
We have utility code for easily wrapping Iterators into Iterables and Filter for converting between types, but the gist of it is:
final Iterable<A> iofA = ...;
Iterable<B> iofB = new Iterable<B>() {
    public Iterator<B> iterator() {
        return new Iterator<B>() {
            private final Iterator<A> _iter = iofA.iterator();
            public boolean hasNext() { return _iter.hasNext(); }
            public B next() { return _iter.next().getB(); }
            public void remove() { throw new UnsupportedOperationException(); } // required before Java 8
        };
    }
};
No additional storage, etc. necessary.
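Usage is then just a matter of passing the lazy view straight through (a sketch, assuming someMethod accepts an Iterable<B> as in the question):
// No intermediate list is ever allocated; each step of someMethod's iteration
// pulls one A from iofA and maps it to a B on the fly.
someMethod(iofB);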
The 3rd option is not bad. For getting the size, most collections just return a counter that they maintain internally; they do not iterate through the entire list. It depends on the implementation, but all the java.util collection classes do it that way.
If you know the potential types of iterableOfA, you can check how they compute their size.
If iterableOfA is going to be some custom implementation and you are not sure how size is computed, LinkedList would be safer. That is because your size varies and the potential for resizing is higher, so you will not get predictable performance.
Also, your choice depends on what operations you perform on the collection that you are filling with B.
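Regarding the size check: a minimal sketch (not tied to any particular Iterable implementation) is to pre-size only when the source is actually a Collection, where size() is a cheap counter read. Guava's Lists.newArrayList(Iterable) does essentially the same check.
List<B> list = (iterableOfA instanceof Collection)
        ? new ArrayList<B>(((Collection<?>) iterableOfA).size())
        : new ArrayList<B>();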
LinkedList is a cache-hostile, memory-eating monster, which its father (Joshua Bloch) regrets.
I'd bet it's not faster in your case, as ArrayList resizing is optimized and also takes amortized O(1) per element.
Basically, the only case when LinkedList is faster, is the following loop:
for (Iterator<E> it = list.iterator(); it.hasNext(); ) {
    E e = it.next();
    if (someCondition(e)) it.remove(); // remove via the iterator, not the element
}
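On Java 8 and later the same filtering can be written with Collection.removeIf, which ArrayList implements as a single compacting pass (a sketch; someCondition stands in for your predicate):
// Removes all matching elements in one pass instead of shifting the tail on
// every individual removal.
list.removeIf(e -> someCondition(e));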
As it stands, if iterableOfA has 5000 elements, this will lead to many resizings of the backing array of list.
Many is something like log(5000 / 10) / log(1.5), i.e., 15. But the count doesn't matter much as the last resizings dominate. You'll be copying each object reference maybe twice, that's cheap.
Assuming you'll be doing anything with the list, it's very cheap.
Iterating just to find out the number of elements might help in some cases, but the speed depends on the input Iterable. So unless you need the speed really badly and you know that the input is never really slow, I'd refrain from such an optimization.
I've got an ArrayList that can be anywhere from 0 to 5000 items long (pretty big objects, too).
At one point I compare it against another ArrayList, to find their intersection. I know this is O(n^2).
Is creating a HashMap alongside this ArrayList, to achieve constant-time lookup, a valid strategy here, in order to reduce the complexity to O(n)? Or is the overhead of another data structure simply not worth it? I believe it would take up no additional space (besides the references).
(I know, I'm sure 'it depends on what I'm doing', but I'm seriously wondering if there's any drawback that makes it pointless, or if it's actually a common strategy to use. And yes, I'm aware of the quote about prematurely optimizing. I'm just curious from a theoretical standpoint).
First of all, a short side note:
And yes, I'm aware of the quote about prematurely optimizing.
What you are asking about here is not "premature optimization"!
You are not talking about replacing a multiplication with some odd bitwise operations "because they are faster (on a 90's PC, in a C-program)". You are thinking about the right data structure for your application pattern. You are considering the application cases (though you did not tell us many details about them). And you are considering the implications that the choice of a certain data structure will have on the asymptotic running time of your algorithms. This is planning, or maybe engineering, but not "premature optimization".
That being said, and to tell you what you already know: It depends.
To elaborate this a bit: It depends on the actual operations (methods) that you perform on these collections, how frequently you perform them, how time-critical they are, and how memory-sensitive the application is.
(For 5000 elements, the latter should not be a problem, as only references are stored - see the discussion in the comments)
In general, I'd also be hesitant to really store the Set alongside the List, if they are always supposed to contain the same elements. This wording is intentional: You should always be aware of the differences between both collections. Primarily: A Set can contain each element only once, whereas a List may contain the same element multiple times.
For all hints, recommendations and considerations, this should be kept in mind.
But even if it is given for granted that the lists will always contain elements only once in your case, then you still have to make sure that both collections are maintained properly. If you really just stored them, you could easily cause subtle bugs:
private Set<T> set = new HashSet<T>();
private List<T> list = new ArrayList<T>();
// Fine
void add(T element)
{
    set.add(element);
    list.add(element);
}

// Fine
void remove(T element)
{
    set.remove(element);
    list.remove(element); // May be expensive, but ... well
}

// Added later, 100 lines below the other methods:
void removeAll(Collection<T> elements)
{
    set.removeAll(elements);
    // Ooops - something's missing here...
}
To avoid this, one could even consider creating a dedicated collection class - something like a FastContainsList that combines a Set and a List, and forwards the contains call to the Set. But you'll quickly notice that it will be hard (or maybe impossible) not to violate the contracts of the Collection and List interfaces with such a collection, unless the clause that "You may not add elements twice" becomes part of the contract...
So again, all this depends on what you want to do with these methods, and which interface you really need. If you don't need the indexed access of List, then it's easy. Otherwise, referring to your example:
At one point I compare it against another ArrayList, to find their intersection. I know this is O(n^2).
You can avoid this by creating the sets locally:
static <T> List<T> computeIntersection(List<T> list0, List<T> list1)
{
    Set<T> set0 = new LinkedHashSet<T>(list0);
    Set<T> set1 = new LinkedHashSet<T>(list1);
    set0.retainAll(set1);
    return new ArrayList<T>(set0);
}
This will have a running time of O(n). Of course, if you do this frequently but rarely change the contents of the lists, there may be options to avoid the copies, but for the reason mentioned above, maintaining the required data structures may become tricky.
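A hypothetical call site for the helper above:
// The result keeps the encounter order of list0, because LinkedHashSet
// preserves insertion order.
List<String> common = computeIntersection(
        Arrays.asList("a", "b", "c"),
        Arrays.asList("b", "c", "d"));
// common -> [b, c]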
If I need to remove elements in a list, will the following be better than using LinkedList:
int j = 0;
List<Object> list = new ArrayList<>(1000000);
...
// fill in the list code here
...
for (ListIterator<Object> i = list.listIterator(); i.hasNext(); j++) {
    Object e = i.next();
    if (checkCondition(e)) {
        i.remove();
        i = list.listIterator(j);
    }
}
?
LinkedList does "remove and add elements" more effectively than ArrayList, but as a doubly-linked list it needs more memory, since each element is wrapped in an Entry object. And I only need one-directional iteration, because I run over the list in ascending order of index.
The answer is: it depends on the frequency and distribution of your adds and removes. If you only have to do a single remove infrequently, then you might use a linked list. However, the main advantage of an ArrayList over a LinkedList is constant-time random access. You can't really get that with a normal linked list (however, look at a skip list for some inspiration). Instead, if you're removing elements relative to other elements (e.g., you need to remove the next element), then you should use a linked list.
There is no simple answer to this:
It depends on what you are optimizing for. Do you care more about the time taken to perform the operations, or the space used by the lists?
It depends on how long the lists are.
It depends on the proportion of elements that you are removing from the lists.
It depends on the other things that you do to the list.
The chances are that one or more of these determining factors is not predictable up-front; i.e. you don't really know. So my advice would be to put this off for now; i.e. just pick one or the other based on gut feeling (or a coin toss). You can revisit the decision later, if you have a quantifiable performance problem in this area ... as demonstrated by cpu or memory usage profiling.
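For reference, the loop from the question can also be simplified: Iterator.remove() does not invalidate the iterator, so there is no need to re-create it after each removal. A sketch, with checkCondition standing in for whatever test you apply:
for (Iterator<Object> it = list.iterator(); it.hasNext(); ) {
    Object e = it.next();
    if (checkCondition(e)) {
        it.remove(); // O(1) per removal on LinkedList, O(n) on ArrayList (tail shift)
    }
}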
I have a Map<byte[], Element> and I want to sort it and write it to disk, so that I have a file with all the elements sorted by key through Guava's UnsignedBytes.lexicographicalComparator.
What I'm doing right now is:
HashMap<byte[], Element> memory;
// ... code creating and populating memory ...
TreeMap<byte[], Element> sortedMap = new TreeMap<byte[], Element>(UnsignedBytes.lexicographicalComparator());
sortedMap.putAll(memory);
MyWriter writer = new MyWriter("myfile.dat");
for (Element element: sortedMap.values())
writer.write(element);
writer.close();
It's probably difficult to make the sorting itself faster than O(n log n); the question is whether I can improve on the navigation of the sorted data. Ideally I'd sort into an ArrayList instead of a TreeMap, so that iterating through it would be very fast.
I thought about putting the HashMap into an ArrayList and running Collections.sort() on it, but that would require more copying than the current solution.
Any ideas?
Edit:
I'm adding here my test with ArrayList, which is 2x faster, but I assume it uses more memory. Any comments on this assumption?
// ArrayList-based implementation 2x faster
ArrayList<Element> sorted = new ArrayList<Element>(memory.size());
sorted.addAll(memory.values());
final Comparator<byte[]> lexic = UnsignedBytes.lexicographicalComparator();
Collections.sort(sorted, new Comparator<Element>() {
    public int compare(Element arg0, Element arg1) {
        return lexic.compare(arg0.getKey(), arg1.getKey());
    }
});
MyWriter writer = new MyWriter(filename);
for (Element element: sorted)
writer.write(element);
writer.close();
Your question was "Any ideas?". I guess anything I could write would be an answer.
I had the same problem as you, and extensively benchmarked the two solutions: use a treemap so items were sorted in advance, or sort them after the fact. My benchmark showed the same result as yours. It's faster to sort after the fact.
I wouldn't be concerned about the fact that the second approach requires more copying. First, faster is faster, right? If the second approach takes fewer CPU cycles then it's better.
If memory is a concern, then keep in mind that treemaps and hashmaps take far more memory per item than an ArrayList, which is backed by a simple object array. Each element in a treemap or hashmap requires at least one object, and usually more. Objects have a lot of overhead, 32 or more bytes. In a flat array each element takes only 4 bytes.
My benchmarks showed that the time to allocate an array from memory was roughly proportional to the size of the array, once you got to an array size over a few dozen bytes. So allocating the ArrayList may be slow if it's really large. Still, I think it's the better bet, so long as there's no danger of running out of memory.
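If the extra ArrayList wrapper bothers you, a plain array works too. A sketch under the question's assumptions (Element.getKey() returns the byte[] key and MyWriter is the writer from the question; the lambda needs Java 8):
Element[] sorted = memory.values().toArray(new Element[0]);
final Comparator<byte[]> lexic = UnsignedBytes.lexicographicalComparator();
Arrays.sort(sorted, (a, b) -> lexic.compare(a.getKey(), b.getKey()));
MyWriter writer = new MyWriter(filename);
for (Element element : sorted) {
    writer.write(element);
}
writer.close();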
I am trying to understand why Java's ArrayDeque is better than Java's LinkedList, as they both implement the Deque interface.
I hardly see anyone using ArrayDeque in their code. If someone could shed more light on how ArrayDeque is implemented, it would be helpful.
If I understand it, I will be more confident using it. I could not clearly understand the JDK implementation as to the way it manages head and tail references.
Linked structures are possibly the worst structure to iterate, with a cache miss on each element. On top of that, they consume way more memory.
If you need to add/remove at both ends, ArrayDeque is significantly better than a linked list. Random access to each element is also O(1) for a cyclic queue.
The only better operation of a linked list is removing the current element during iteration.
I believe that the main performance bottleneck in LinkedList is the fact that whenever you push to either end of the deque, behind the scenes the implementation allocates a new linked list node, which essentially involves the JVM/OS, and that's expensive. Also, whenever you pop from either end, the internal nodes of LinkedList become eligible for garbage collection, and that's more work behind the scenes.
Also, since the linked list nodes are allocated here and there, the CPU cache won't provide much benefit.
If it might be of interest, I have a proof that adding (appending) an element to ArrayList or ArrayDeque runs in amortized constant time; refer to this.
To all the people criticizing LinkedList: consider that most people who have been using List in Java reach for ArrayList and LinkedList simply because those classes predate Java 6 and are the ones taught first in most books.
But that doesn't mean I would blindly take LinkedList's or ArrayDeque's side. If you want to know, take a look at the benchmark below done by Brian (archived).
The test setup considers:
Each test object is a 500 character String. Each String is a different object in memory.
The size of the test array will be varied during the tests.
For each array size/Queue-implementation combination, 100 tests are run and average time-per-test is calculated.
Each test consists of filling each queue with all objects, then removing them all.
Time is measured in milliseconds.
Test Result:
Below 10,000 elements, both LinkedList and ArrayDeque tests averaged at a sub 1 ms level.
As the data sets get larger, the difference between the ArrayDeque and LinkedList average test times gets larger.
At the test size of 9,900,000 elements, the LinkedList approach took ~165% longer than the ArrayDeque approach.
Takeaway:
If your requirement is storing 100 or 200 elements, it wouldn't make much of a difference which of the queues you use.
However, if you are developing on mobile, you may want to use an ArrayList or ArrayDeque with a good guess of the maximum capacity the list is likely to require, because of strict memory constraints.
A lot of existing code was written using LinkedList, so tread carefully when deciding to use an ArrayDeque, especially because it DOESN'T implement the List interface (I think that's reason enough). Your codebase may talk to the List interface extensively, and jumping straight to an ArrayDeque would then be awkward. Using it for internal implementations, however, might be a good idea...
ArrayDeque is new with Java 6, which is why a lot of code (especially projects that try to be compatible with earlier Java versions) don't use it.
It's "better" in some cases because you're not allocating a node for each item to insert; instead all elements are stored in a giant array, which is resized if it gets full.
ArrayDeque and LinkedList both implement the Deque interface, but their implementations are different.
Key differences:
The ArrayDeque class is the resizable-array implementation of the Deque interface, while the LinkedList class is the list implementation
Null elements can be added to a LinkedList but not to an ArrayDeque
ArrayDeque is more efficient than LinkedList for add and remove operations at both ends, while the LinkedList implementation is efficient at removing the current element during iteration
The LinkedList implementation consumes more memory than the ArrayDeque
So if you don't need to support null elements, want lower memory use, and need efficient add/remove of elements at both ends, ArrayDeque is the best choice
Refer to documentation for more details.
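A minimal sketch of the null-handling difference listed above:
Deque<String> linked = new LinkedList<>();
linked.add(null);               // allowed (though rarely a good idea)

Deque<String> array = new ArrayDeque<>();
array.add(null);                // throws NullPointerException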
I don't think ArrayDeque is better than LinkedList. They are different.
ArrayDeque is faster than LinkedList on average. But for adding an element, ArrayDeque takes amortized constant time, and LinkedList takes constant time.
For time-sensitive applications that require all operations to take constant time, only LinkedList should be used.
ArrayDeque's implementation uses an array and requires resizing: occasionally, when the array is full and an element needs to be added, it takes linear time to resize, resulting in the add() method taking linear time. That could be a disaster if the application is very time-sensitive.
A more detailed explanation of Java's implementation of the two data structures is available in the "Algorithms, Part I" course on Coursera offered by Princeton University, taught by Wayne and Sedgewick. The course is free to the public.
The details are explained in the video "Resizing Arrays" in the "Stacks and Queues" section of "Week 2".
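If the occasional linear-time resize matters and an upper bound on the element count is known, the deque can be pre-sized so the hot path never resizes (a sketch; the bound of 10,000 is just an example):
// The constructor argument reserves enough capacity up front; no resize happens
// as long as the deque never holds more than that many elements.
Deque<Integer> queue = new ArrayDeque<>(10_000);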
Although ArrayDeque<E> and LinkedList<E> both implement the Deque<E> interface, ArrayDeque basically uses an object array E[] to keep the elements inside its object, so it generally uses indices to locate the head and tail elements.
In a word, it works just like a Deque (with all of Deque's methods), but uses an array data structure underneath. As for which one is better, it depends on how and where you use them.
That's not always the case.
For example, in the case below LinkedList has better performance than ArrayDeque, according to LeetCode 103.
/**
* Definition for a binary tree node.
* public class TreeNode {
* int val;
* TreeNode left;
* TreeNode right;
* TreeNode(int x) { val = x; }
* }
*/
class Solution {
    public List<List<Integer>> zigzagLevelOrder(TreeNode root) {
        List<List<Integer>> rs = new ArrayList<>();
        if (root == null)
            return rs;
        // 👇 here, LinkedList works better
        Queue<TreeNode> queue = new LinkedList<>();
        queue.add(root);
        boolean left2right = true;
        while (!queue.isEmpty()) {
            int size = queue.size();
            LinkedList<Integer> t = new LinkedList<>();
            while (size-- > 0) {
                TreeNode tree = queue.remove();
                if (left2right)
                    t.add(tree.val);
                else
                    t.addFirst(tree.val);
                if (tree.left != null) {
                    queue.add(tree.left);
                }
                if (tree.right != null) {
                    queue.add(tree.right);
                }
            }
            rs.add(t);
            left2right = !left2right;
        }
        return rs;
    }
}
Time complexity for accessing an element by index is O(1) for ArrayDeque and O(n) for LinkedList. ArrayDeque is not thread-safe, so manual synchronization is necessary if you access it from multiple threads; that absence of built-in synchronization is also part of why it is faster.
What is the fastest list implementation (in java) in a scenario where the list will be created one element at a time then at a later point be read one element at a time? The reads will be done with an iterator and then the list will then be destroyed.
I know that the Big O notation for get is O(1) and add is O(1) for an ArrayList, while LinkedList is O(n) for get and O(1) for add. Does the iterator behave with the same Big O notation?
It depends largely on whether you know the maximum size of each list up front.
If you do, use ArrayList; it will certainly be faster.
Otherwise, you'll probably have to profile. While access to the ArrayList is O(1), creating it is not as simple, because of dynamic resizing.
Another point to consider is that the space-time trade-off is not clear cut. Each Java object has quite a bit of overhead. While an ArrayList may waste some space on surplus slots, each slot is only 4 bytes (or 8 on a 64-bit JVM). Each element of a LinkedList is probably about 50 bytes (perhaps 100 in a 64-bit JVM). So you have to have quite a few wasted slots in an ArrayList before a LinkedList actually wins its presumed space advantage. Locality of reference is also a factor, and ArrayList is preferable there too.
In practice, I almost always use ArrayList.
First Thoughts:
Refactor your code to not need the list.
Simplify the data down to a scalar data type, then use: int[]
Or even just use an array of whatever object you have: Object[] - John Gardner
Initialize the list to the full size: new ArrayList(123);
Of course, as everyone else is mentioning, do performance testing to prove that your new solution is an improvement.
Iterating through a linked list is O(1) per element.
The Big O runtime for each option is the same. Probably the ArrayList will be faster because of better memory locality, but you'd have to measure it to know for sure. Pick whatever makes the code clearest.
Note that iterating through an instance of LinkedList can be O(n^2) if done naively. Specifically:
List<Object> list = new LinkedList<Object>();
for (int i = 0; i < list.size(); i++) {
list.get(i);
}
This is absolutely horrible in terms of efficiency, because the list must be traversed from one end up to index i on every iteration. If you do use LinkedList, be sure to use either an Iterator or Java 5's enhanced for-loop:
for (Object o : list) {
// ...
}
The above code is O(n), since the list is traversed statefully in-place.
To avoid all of the above hassle, just use ArrayList. It's not always the best choice (particularly for space efficiency), but it's usually a safe bet.
There is a new List implementation called GlueList which is faster than all classic List implementations.
Disclaimer: I am the author of this library
You almost certainly want an ArrayList. Both adding and reading are "amortized constant time" (i.e. O(1)) as specified in the documentation (note that this is true even if the list has to increase its size - it's designed like that; see http://java.sun.com/j2se/1.5.0/docs/api/java/util/ArrayList.html ). If you know roughly the number of objects you will be storing, then even the ArrayList size increase is eliminated.
Adding to the end of a linked list is O(1), but the constant multiplier is larger than ArrayList (since you are usually creating a node object every time). Reading is virtually identical to the ArrayList if you are using an iterator.
It's a good rule to always use the simplest structure you can, unless there is a good reason not to. Here there is no such reason.
The exact quote from the documentation for ArrayList is: "The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking). The constant factor is low compared to that for the LinkedList implementation."
I suggest benchmarking it. It's one thing reading the API, but until you try it for yourself, it's academic.
Should be fairly easy to test; just make sure you do meaningful operations, or HotSpot will out-smart you and optimise it all to a NO-OP :)
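A naive single-run timing sketch along those lines (a proper harness such as JMH would give far more trustworthy numbers); the sink accumulator is what keeps HotSpot from eliminating the loops as dead code:
long start = System.nanoTime();
List<Integer> list = new ArrayList<>();   // swap in new LinkedList<>() to compare
for (int i = 0; i < 1_000_000; i++) {
    list.add(i);
}
long sink = 0;
for (int n : list) {
    sink += n;
}
System.out.println(((System.nanoTime() - start) / 1_000_000) + " ms, sink=" + sink);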
I have actually begun to think that any use of data structures with non-deterministic performance, such as ArrayList or HashMap, should be avoided, so I would say only use ArrayList if you can bound its size; for any unbounded list, use LinkedList. That is because I mainly code systems with near-real-time requirements, though.
The main problem is that any memory allocation (which could happen randomly with any add operation) could also cause a garbage collection, and any garbage collection can cause you to miss a target. The larger the allocation, the more likely this is to occur, and this is compounded if you are using the CMS collector. CMS is non-compacting, so finding space for a new linked-list node is generally going to be easier than finding space for a new 10,000-element array.
The more rigorous your approach to coding, the closer you can come to real time with a stock JVM. But choosing only data structures with deterministic behavior is one of the first steps you would have to take.