I have an iterator that gives me n elements. Currently I copy them one by one into an ArrayList and then call Collections.sort() on that list to obtain a sorted ArrayList. This takes n log(n) + n operations. Is there a faster way to do it, i.e. can I already make use of the insertion step to some degree?
The iterator does not provide any ordering; the elements arrive pretty much randomly.
If you only have that iterator, I don't see a faster solution. Note that n log(n) + n is also O(n log n).
If you want to "sort while inserting", you need to do a binary search on each insertion, which would be O(n log n) too. I don't think it would be much faster than what you have.
A TreeSet can save you from implementing the binary search yourself, but it is basically the same logic.
Since an iterator is neither a collection nor a container, it is not possible to sort directly within the iterator, as you already noticed. The method that you are using seems to be the best solution in this case.
If your elements are unique you could drop them into a TreeSet and then copy them out of the TreeSet into an ArrayList. That may not actually be any faster than what you are already doing though.
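A minimal sketch of that TreeSet approach, assuming the elements implement Comparable (Foo and iterator are placeholder names); note that a TreeSet silently drops duplicates:
TreeSet<Foo> sorted = new TreeSet<>();      // keeps elements ordered as they are inserted
while (iterator.hasNext()) {
    sorted.add(iterator.next());            // O(log n) per insertion, O(n log n) total
}
List<Foo> result = new ArrayList<>(sorted); // copy out in sorted order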
Beyond that you are unlikely to be able to optimise further than you already have. Writing your own insertion sort would almost certainly be slower than just using the highly optimised Java sort routines.
You could consider looking at the new Java Streams API in Java 8, though. That would allow you to do this by opening the iterator as a stream, sorting it, then collecting it into your final collection.
http://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html
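For example, here is a hedged sketch under the assumption that the elements are Comparable (Foo and iterator are placeholder names); java.util.Spliterators and java.util.stream.StreamSupport are one way to open an Iterator as a Stream:
Stream<Foo> stream = StreamSupport.stream(
        Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED), false);
List<Foo> sortedList = stream.sorted().collect(Collectors.toList());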
If your array holds objects rather than a raw data type (such as int or double), the cost of copying the objects must be considered. In this situation, sorting an array of indices may be a better approach. Using a search data structure such as a map or set is better only when you need to sort and insert simultaneously.
Related
Java 8 provides java.util.Arrays.parallelSort, which sorts arrays in parallel using the fork-join framework. But there's no corresponding Collections.parallelSort for sorting lists.
I can use toArray, sort that array, and store the result back in my list, but that will temporarily increase memory usage, which if I'm using parallel sorting is already high because parallel sorting only pays off for huge lists. Instead of twice the memory (the list plus parallelSort's working memory), I'm using thrice (the list, the temporary array and parallelSort's working memory). (Arrays.parallelSort documentation says "The algorithm requires a working space no greater than the size of the original array".)
Memory usage aside, Collections.parallelSort would also be more convenient for what seems like a reasonably common operation. (I tend not to use arrays directly, so I'd certainly use it more often than Arrays.parallelSort.)
The library can test for RandomAccess to avoid trying to, e.g., quicksort a linked list, so that can't be the reason for a deliberate omission.
How can I sort a List in parallel without creating a temporary array?
There doesn't appear to be any straightforward way to sort a List in parallel in Java 8. I don't think this is fundamentally difficult; it looks more like an oversight to me.
The difficulty with a hypothetical Collections.parallelSort(list, cmp) is that the Collections implementation knows nothing about the list's implementation or its internal organization. This can be seen by examining the Java 7 implementation of Collections.sort(list, cmp). As you observed, it has to copy the list elements out to an array, sort them, and then copy them back into the list.
This is the big advantage of the List.sort(cmp) extension method over Collections.sort(list, cmp). It might seem that this is merely a small syntactic advantage, being able to write myList.sort(cmp) instead of Collections.sort(myList, cmp). The difference is that myList.sort(cmp), being an interface extension (default) method, can be overridden by the specific List implementation. For example, ArrayList.sort(cmp) sorts the list in place using Arrays.sort(), whereas the default implementation uses the old copyout-sort-copyback technique.
It should be possible to add a parallelSort extension method to the List interface that has similar semantics to List.sort but does the sorting in parallel. This would allow ArrayList to do a straightforward in-place sort using Arrays.parallelSort. (It's not entirely clear to me what the default implementation should do. It might still be worth it to do copyout-parallelSort-copyback.) Since this would be an API change, it can't happen until the next major release of Java SE.
As for a Java 8 solution, there are a couple workarounds, none very pretty (as is typical of workarounds). You could create your own array-based List implementation and override sort() to sort in parallel. Or you could subclass ArrayList, override sort(), grab the elementData array via reflection and call parallelSort() on it. Of course you could just write your own List implementation and provide a parallelSort() method, but the advantage of overriding List.sort() is that this works on the plain List interface and you don't have to modify all the code in your code base to use a different List subclass.
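As a rough illustration of the ArrayList-subclass workaround just mentioned, here is a sketch (not a definitive implementation: it relies on ArrayList's private elementData field, so it is fragile across JDK versions and will fail under strong encapsulation or a security manager):
class ParallelSortArrayList<E> extends ArrayList<E> {
    @Override
    @SuppressWarnings("unchecked")
    public void sort(Comparator<? super E> c) {
        try {
            java.lang.reflect.Field f = ArrayList.class.getDeclaredField("elementData");
            f.setAccessible(true);
            E[] data = (E[]) f.get(this);
            // Only the first size() slots hold live elements; the rest is spare capacity.
            Arrays.parallelSort(data, 0, size(), c);
        } catch (ReflectiveOperationException e) {
            super.sort(c); // fall back to the standard in-place sequential sort
        }
    }
}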
I think you are doomed to use a custom List implementation augmented with your own parallelSort or else change all your other code to store the big data in Array types.
This is the inherent problem with layers of abstract data types. They're meant to isolate the programmer from details of implementation. But when the details of implementation matter - as in the case of underlying storage model for sort - the otherwise splendid isolation leaves the programmer helpless.
The documentation for the standard List sort provides an example. After explaining that mergesort is used, it says:
The default implementation obtains an array containing all elements in this list, sorts the array, and iterates over this list resetting each element from the corresponding position in the array. (This avoids the n² log(n) performance that would result from attempting to sort a linked list in place.)
In other words, "since we don't know the underlying storage model for a List and couldn't touch it if we did, we make a copy organized in a known way." The parenthesized expression is based on the fact that the List "i'th element accessor" on a linked list is Omega(n), so the normal array mergesort implemented with it would be a disaster. In fact it's easy to implement mergesort efficiently on linked lists. The List implementer is just prevented from doing it.
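To make that concrete, here is a small sketch on a hand-rolled singly-linked node type (java.util.LinkedList keeps its nodes private, which is exactly the problem): the merge step of mergesort needs only sequential traversal and pointer rewiring, never random access.
class Node { int val; Node next; Node(int v) { val = v; } }
static Node merge(Node a, Node b) {
    Node dummy = new Node(0); // sentinel head for the merged chain
    Node tail = dummy;
    while (a != null && b != null) {
        if (a.val <= b.val) { tail.next = a; a = a.next; }
        else                { tail.next = b; b = b.next; }
        tail = tail.next;
    }
    tail.next = (a != null) ? a : b; // append whichever run remains
    return dummy.next;
}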
A parallel sort on List has the same problem. The standard sequential sort fixes it with custom sorts in the concrete List implementations. The Java folks just haven't chosen to go there yet. Maybe in Java 9.
Use the following:
yourCollection.parallelStream().sorted().collect(Collectors.toList());
The sorting will be done in parallel because of parallelStream(). I believe this is what you mean by a parallel sort?
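If the elements are not Comparable, the same one-liner takes an explicit comparator (fooList, Foo and getName are placeholder names here); note that, unlike an in-place sort, this builds a new list:
List<Foo> sorted = fooList.parallelStream()
        .sorted(Comparator.comparing(Foo::getName))
        .collect(Collectors.toList());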
Just speculating here, but I see several good reasons for generic sort algorithms preferring to work on arrays instead of List instances:
Element access is performed via method calls. Despite all the optimizations JIT can apply, even for a list that implements RandomAccess, this probably means a lot of overhead compared to plain array accesses which can be optimized very well.
Many algorithms require copying some fragments of the array to temporary structures. There are efficient methods for copying arrays or their fragments. An arbitrary List instance, on the other hand, can't be copied easily. New lists would have to be allocated, which poses two problems. First, this means allocating some new objects, which is likely more costly than allocating arrays. Second, the algorithm would have to choose which implementation of List should be allocated for this temporary structure. There are two obvious solutions, both bad: either just choose some hard-coded implementation, e.g. ArrayList, but then it could just allocate simple arrays as well (and if we're allocating arrays anyway, it's much easier if the source is also an array); or let the user provide some list factory object, which makes the code much more complicated.
Related to the previous issue: there is no obvious way of copying one list into another efficiently, given how the API is designed. The best the List interface offers is the addAll() method, but this is probably not efficient for most cases (think of pre-allocating the new list to its target size vs. adding elements one by one, which is what many implementations do).
Most lists that need to be sorted will be small enough for another copy to not be an issue.
So probably the designers thought of CPU efficiency and code simplicity most of all, and this is easily achieved when the API accepts arrays. Some languages, e.g. Scala, have sort methods that work directly on lists, but this comes at a cost and probably is less efficient than sorting arrays in many cases (or sometimes there will probably just be a conversion to and from array performed behind the scenes).
By combining the existing answers I came up with this code.
This works if you are not interested in creating a custom List class and if you don't mind creating a temporary array (Collections.sort does that anyway).
It sorts the initial list in place and does not create a new one, as the parallelStream solution does.
// Convert List to Array so we can use Arrays.parallelSort rather than Collections.sort.
// Note that Collections.sort begins with this very same conversion, so we're not adding overhead
// in comparison with Collections.sort.
Foo[] fooArr = fooLst.toArray(new Foo[0]);
// Multithread the TimSort. It automatically falls back to a single-threaded sort when the array size is 8192 or less.
Arrays.parallelSort(fooArr, Comparator.comparing(Foo::yourMethod));
// Refill the List using the sorted Array, the same way Collections.sort does it.
ListIterator<Foo> i = fooLst.listIterator();
for (Foo e : fooArr) {
i.next();
i.set(e);
}
Of course, I know about the performance difference between ArrayList and LinkedList. I have run tests myself and seen the huge difference in time and memory for insertion/deletion and iteration between ArrayList and LinkedList for a very big list.
(Correct me if I am wrong.) We generally prefer ArrayList over LinkedList because:
1) In practice we iterate more often than we insert/delete, so we prefer iteration to be faster than insertion/deletion.
2) The memory overhead of LinkedList is much higher than that of ArrayList.
3) There is NO way to define a list that behaves as a LinkedList while inserting/deleting in batch and as an ArrayList while iterating, because ArrayList and LinkedList have fundamentally different data-storage techniques.
Am I wrong about the 3rd point [I hope so :)]? Is there any possibility to get the benefits of these two data structures in a single list? I guess data structure designers must have thought about it.
If you are looking for some more performant collection implementations, check out Javolution. That package provides a FastList and FastTable which may at least reduce the cost of choosing between linked lists and array lists.
You might want to look into Clojure's "vectors" (which are a lot more than a simple array under the hood): http://blog.higher-order.net/2009/02/01/understanding-clojures-persistentvector-implementation/. They are O(log₃₂ n) for lookup and insertion.
Note that these are directly usable from Java! (Actually, they're implemented in Java code.)
There are probably other points to consider, but one aspect that makes me choose LinkedList over ArrayList is:
When I don't need to get an element by index (e.g. when I process all elements)
When I don't know the size when creating my list
Here is an interesting manifesto about this topic.
I'm thinking about filling a collection with a large amount of unique objects.
How does the cost of an insert into a Set (say HashSet) compare to an insert into a List (say ArrayList)?
My feeling is that duplicate elimination in sets might cause a slight overhead.
There is no "duplicate elimination" such as comparing to all existing elements. If you insert into hash set, it's really a dictionary of items by hash code. There's no duplicate checking unless there already are items with the same hash code. Given a reasonable (well-distributed) hash function, it's not that bad.
As Will has noted, because of the dictionary structure HashSet is probably a bit slower than an ArrayList (unless you want to insert "between" existing elements). It also is a bit larger. I'm not sure that's a significant difference though.
You're right: set structures are inherently more complex in order to recognize and eliminate duplicates. Whether this overhead is significant for your case should be tested with a benchmark.
Another factor is memory usage. If your objects are very small, the memory overhead introduced by the set structure can be significant. In the most extreme case (TreeSet<Integer> vs. ArrayList<Integer>) the set structure can require more than 10 times as much memory.
If you're certain your data will be unique, use a List. You can use a Set if you need to enforce that rule.
Sets are faster than Lists if you have a large data set, while the inverse is true for smaller data sets. I haven't personally tested this claim.
Which type of List?
Also, consider which List to use. LinkedLists are faster at adding and removing elements.
ArrayLists are faster at random access (for loops, etc.), but this can be worked around by using the Iterator of a LinkedList. ArrayLists are also much faster at list.toArray().
You have to compare concrete implementations (for example HashSet with ArrayList), because the abstract interfaces Set/List don't really tell you anything about performance.
Inserting into a HashSet is a pretty cheap operation, as long as the hashCode() of the object to be inserted is sane. It will still be slightly slower than ArrayList, because ArrayList's insertion is a simple write into an array (assuming you insert at the end and there is still free space; I don't factor in resizing the internal array, because the same cost applies to HashSet as well).
If the goal is the uniqueness of the elements, you should use an implementation of the java.util.Set interface. The classes java.util.HashSet and java.util.LinkedHashSet have expected O(1) complexity (given a well-distributed hash function) for insert, delete and contains checks.
ArrayList has O(n) complexity for a contains check by object (not by index), since you have to scan the whole list, and for insertion anywhere other than the tail, since you have to shift the underlying array.
You can use a LinkedHashSet, which preserves insertion order and has the same performance characteristics as HashSet (it only takes up a bit more memory).
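A small illustration of that ordering difference (the string values are arbitrary):
Set<String> hashed = new HashSet<>(Arrays.asList("b", "a", "c"));       // iteration order is unspecified
Set<String> linked = new LinkedHashSet<>(Arrays.asList("b", "a", "c")); // iterates in insertion order: b, a, c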
I don't think you can make this judgement simply on the cost of building the collection. Other things that you need to take into account are:
Is the input dataset ordered? Is there a requirement that the output data structure preserves insertion order?
Is there a requirement that the output data structure is ordered (or reordered) based on element values?
Will the output data structure be subsequently modified? How?
Is there a requirement that the output data structure is duplicate free if other elements are added subsequently?
Do you know how many elements are likely to be in the input dataset?
Can you measure the size of the input dataset? (Or is it provided via an iterator?)
Does space utilization matter?
These can all affect your choice of data structure.
Java List:
If you don't have a requirement about whether duplicates must be kept or rejected, then you can use a List instead of a Set.
List is an interface in the Collections framework that extends the Collection interface; ArrayList and LinkedList are implementations of the List interface.
When to use ArrayList or LinkedList
ArrayList: If most of the work in your application is accessing data, you should go for ArrayList, because ArrayList implements the RandomAccess marker interface, which signals that it supports access by index in O(1) time. You can also prefer ArrayList over LinkedList when you want to get data back in insertion order.
LinkedList: If most of your work is insertion or deletion, you should use LinkedList over ArrayList, because in a LinkedList insertion and deletion at a known position happen in O(1) time, whereas in an ArrayList they take O(n) time.
Java Set:
If your application requires that there be no duplicates, you should go for a Set instead of a List, because a Set doesn't store duplicates. Hash-based Sets work on the principle of hashing: when we add an object to the Set, it first looks up the bucket for the object's hashCode, and if an equal object is already present in that bucket, the new object is not added.
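For example (assuming the element class has sensible equals() and hashCode()):
Set<String> names = new HashSet<>();
names.add("alice");                 // returns true, element stored
boolean added = names.add("alice"); // returns false, the duplicate is not stored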
In my program I often use collections to store lists of objects. Currently I use ArrayList to store objects.
My question is: is this the best choice? Maybe it's better to use LinkedList? Or something else?
Criteria to consider are:
Memory usage
Performance
Operations which I need are:
Add element to collection
Iterate through the elements
Any thoughts?
Update: my choice is ArrayList :) based on this discussion as well as the following ones:
When to use LinkedList over ArrayList?
List implementations: does LinkedList really perform so poorly vs. ArrayList and TreeList?
I always default to ArrayList, and would in your case as well, except when
I need thread safety (in which case I start looking at List implementations in java.util.concurrent)
I know I'm going to be doing lots of insertion into and manipulation of the List, or profiling reveals my usage of an ArrayList to be a problem (very rare)
As to what to pick in that second case, this SO.com thread has some useful insights: List implementations: does LinkedList really perform so poorly vs. ArrayList and TreeList?
I know I'm late but, maybe, this page can help you, not only now, but in the future...
LinkedList is faster for adding/removing elements in the middle (i.e. not at the head or tail)
ArrayList is faster for iterating
It's a classic tradeoff between optimizing for insertion vs. retrieval. The common choice for the task as you describe it is ArrayList.
ArrayList is fine for your (and most other) purposes. It has a very small memory overhead and has good amortized performance for most operations. The cases where it is not ideal are relatively rare:
The list is very large
You frequently need to do one of these operations:
Add/remove items during iteration
Remove items from the beginning of the list
If you're only adding at the end of the list, ArrayList should be ok. From the documentation of ArrayList:
The details of the growth policy are not specified beyond the fact that adding an element has constant amortized time cost
and ArrayList should also use less memory than a linked list as you don't need to use space for the links.
It depends on your usage profile.
Do you add to the end of the list? Both are fine for this.
Do you add to the start of the list? LinkedList is better for this.
Do you require random access (will you ever call get(n) on it)? ArrayList is better for this.
Both are good at iterating, both Iterator implementations are O(1) for next().
If in doubt, test your own app with each implementation and make your own choice.
Given your criteria, you should be using the LinkedList.
LinkedList implements the Deque interface, which means that it can add to the start or end of the list in constant time, O(1). In addition, both ArrayList and LinkedList iterate in O(N) time.
You should NOT use ArrayList because of the cost of adding an element when the backing array is full. In that case, adding the element is O(N), because a new array is created and all elements are copied from the old array to the new one.
Also, the ArrayList will take up more memory because its backing array might not be completely filled.
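A small sketch of the Deque operations mentioned above:
Deque<String> deque = new LinkedList<>();
deque.addFirst("head"); // O(1), no shifting or resizing
deque.addLast("tail");  // O(1)
for (String s : deque) {
    System.out.println(s); // O(N) iteration, same as ArrayList
}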
Is there a way to first sort and then search for objects within a linked list of objects?
I thought I would just use one of the sorting methods and then a binary search. What do you think?
Thanks
This is not a good approach, IMO. If you use Collections.sort(list), where the list is a LinkedList, this copies the list to a temporary array, sorts it, and then copies back to the list; i.e. O(N log N) to sort plus 2 * O(N) copies. But when you then try to do a binary search (e.g. using Collections.binarySearch(list)), each search will do O(N) list traversal operations. So you may as well not have bothered sorting the list!
Another approach would be to convert the list to an array or an ArrayList, and then sort and search that array / ArrayList. That gives one copy plus one sort to setup, and O(logN) for each search.
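A minimal sketch of that second approach, assuming the element type is Comparable (linkedList, Foo and key are placeholder names):
List<Foo> copy = new ArrayList<>(linkedList);    // one O(N) copy
Collections.sort(copy);                          // O(N log N) sort
int index = Collections.binarySearch(copy, key); // O(log N) per search; negative if the key is absent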
But neither of these is the best approach. That depends on how many times you need to perform search operations.
If you simply want to do one search on the list, then calling list.contains(...) is O(N) ... and that is better than anything involving sorting and binary searching.
If you want to do multiple searches on a list that never changes, you're probably better off putting the list entries into a HashSet. Constructing a HashSet is O(N) and searching is O(1). (This assumes you don't need your own comparator.)
If you want to do multiple searches on a list that keeps changing where the order IS NOT significant, replace the list with a HashSet. The incremental cost of updating the HashSet will be O(1) for each addition/removal, and O(1) for each search.
If you want to do multiple searches on a list that keeps changing and the order IS significant, replace the list with an insertion-ordered LinkedHashMap. That will be O(1) for each addition/removal, and O(1) for each search ... but with larger constants of proportionality than a HashSet.
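For the multiple-search cases above, a minimal HashSet-based sketch (assuming the elements have proper equals() and hashCode(); list and key are placeholder names):
Set<Foo> index = new HashSet<>(list);   // O(N) to build once
boolean present = index.contains(key);  // O(1) expected per search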
java.util.Collections#sort()
java.util.Collections#binarySearch()
The Collections class has lots of other amazing methods to make programmers' lives easier.
Note that the sort method's implementation will indeed convert the list to an array internally, but you need not explicitly convert the list into an array before calling the method. :)
You may want to question whether searching over a sorted list is the best option for your use case, as this does not perform well. The list sort is O(N log N) and the binary search is O(log N). You might consider making a Set out of your list elements and then searching it via the contains method, which is O(1), if you just want to see whether an element exists. It would be much easier to give you advice on which collection to consider if you could explain more about your use case.
EDIT: Consider performance issues of List sorting if you plan to do this for large lists.