I have a program that uses an ArrayList<T>, where T also implements Comparable<T>. I need to keep that list sorted.
For now, when I insert a new item, I add it to the ArrayList and then invoke Collections.sort(myArrayList).
Does sorting with Collections.sort every time I insert a new item seriously hurt runtime complexity?
Is there a data structure better suited to always keeping the list sorted? I know of a structure called a PriorityQueue, but I also need to be able to get the list's elements by index.
EDIT:
In my specific case, inserting a new item happens much less often than getting an existing item, so good advice could also be to stay with the ArrayList, since it has constant-time access by index. But if you know of anything else...
It seems like Collections.sort is actually the way to go here, since when the collection is already almost sorted, the sort takes close to O(n) time.
Instead of using Collections.sort(myArrayList) after every insertion you might want to do something a bit smarter as you know that every time you insert an element your collection is already ordered.
Collections.sort(myArrayList) takes O(n log n) time, but you can do an ordered insert into an already-ordered collection in O(n) time using Collections.binarySearch. If the collection is sorted in ascending order, Collections.binarySearch returns the index of the element you are looking for if it exists, or (-(insertion point) - 1) if it does not. Before inserting an element, look for it with Collections.binarySearch (O(log n) time); from the result you can derive the index at which to insert the new element. You can then add the element with add(index, element), which takes O(n) time because of the shift. The whole insertion is bounded by that add, so you can do an ordered insert into an ArrayList in O(n) time.
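A minimal sketch of that ordered insert (the insertSorted helper and the Integer example values are just for illustration):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedInsert {
    // Inserts value into an already-sorted list and keeps it sorted:
    // O(log n) for the search, O(n) for the shift done by add(index, element).
    static <T extends Comparable<? super T>> void insertSorted(List<T> list, T value) {
        int index = Collections.binarySearch(list, value);
        if (index < 0) {
            index = -(index + 1);   // binarySearch returns (-(insertion point) - 1) when absent
        }
        list.add(index, value);
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>(List.of(1, 3, 5));
        insertSorted(list, 4);
        System.out.println(list);   // [1, 3, 4, 5]
    }
}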
List is an ordered collection, which means you need the ability to access elements by index. If a collection internally shuffled or sorted its elements, the insertion order would no longer match the order of the elements in the internal data structure, so you could not depend on index-based access anymore. Hence Sun didn't provide a SortedList or a TreeList class. That is why you use Collections.sort(..)
Apache commons-collections does provide a TreeList class but it is not a sorted List and is called so because it uses a tree data structure to store the elements internally. Check its documentation here - http://commons.apache.org/proper/commons-collections/javadocs/api-3.2.1/org/apache/commons/collections/list/TreeList.html
No single standard JDK collection provides both sorted order and index-based retrieval. If the input set is limited to a few hundred or a few thousand elements, you can keep two data structures in parallel.
For example, an ArrayList for index-based retrieval and a TreeMap (or a priority queue) for sorted order.
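A rough sketch of keeping the two structures in sync (the IndexedSortedBag class and its methods are hypothetical, only meant to illustrate the parallel-structures idea; the ArrayList here gives access by insertion index, the TreeMap gives sorted iteration):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class IndexedSortedBag<T extends Comparable<T>> {
    private final List<T> byIndex = new ArrayList<>();          // O(1) access by insertion index
    private final TreeMap<T, Integer> sorted = new TreeMap<>(); // element -> count, keys kept sorted

    void add(T value) {
        byIndex.add(value);                   // amortized O(1)
        sorted.merge(value, 1, Integer::sum); // O(log n)
    }

    T get(int index) {
        return byIndex.get(index);
    }

    Iterable<T> sortedView() {                // duplicates collapse to a single key
        return sorted.keySet();
    }

    public static void main(String[] args) {
        IndexedSortedBag<Integer> bag = new IndexedSortedBag<>();
        bag.add(3); bag.add(1); bag.add(2);
        System.out.println(bag.get(0));        // 3 (insertion index 0)
        System.out.println(bag.sortedView());  // [1, 2, 3]
    }
}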
Related
I have an algorithm where I pass through nodes in a graph in a certain way, occasionally passing through the same node several times, and I need to form a list of the nodes passed, such that a node appears once for the last time I passed it.
For instance, if I passed through nodes A -> B -> C -> A -> C, the list I need in the end is [B, A, C].
What I wanted to do was use a LinkedList, such that every node in the graph would hold a reference to its node in the LinkedList. Then, every time I pass through a graph node, I would remove its corresponding node from the LinkedList and insert it again at the end, and the complexity of the operation would only be O(1).
However, when I began implementing this, I ran into a problem: apparently, the java class LinkedList does not expose its actual list nodes. Using the regular remove methods of LinkedList to remove the list node containing a given graph node is O(n) instead of O(1), negating the whole point of using a LinkedList to begin with.
Naturally, I could implement a linked list myself, but I would rather avoid that - it seems to me that if I have to implement LinkedList in Java, I'm doing something wrong.
So, is there a way to solve this problem without implementing LinkedList on my own? Is there something that I'm missing?
It seems you are expecting a built-in approach; I don't think there is any Collection that provides such functionality. You will have to implement it on your own, as #MartijinCourteaux suggested. Or:
use a sorted set collection like TreeSet<E>, which supports add, remove and contains at a cost of O(log n).
LinkedHashSet<E>: like HashSet<E>, it has O(1) expected performance for add, contains and remove, though likely just slightly below HashSet because of the added expense of maintaining the linked list; in return you get a predictable iteration order without the increased cost associated with TreeSet. However, insertion order is not affected when an element is re-inserted into the set, so remove the earlier occurrence of an element before re-inserting it.
LinkedHashMap keeps the order in which keys were inserted and lets you remove an entry by its key and then put it back at the end. I think that is all you need.
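A small sketch of that idea applied to the traversal above (the LastVisitOrder class and its visit method are made up for illustration):

import java.util.LinkedHashMap;
import java.util.Map;

class LastVisitOrder<N> {
    // LinkedHashMap iterates in insertion order; removing a key and putting it
    // back moves it to the end, and both operations are O(1).
    private final Map<N, Boolean> visits = new LinkedHashMap<>();

    void visit(N node) {
        visits.remove(node);     // drop the earlier occurrence, if any
        visits.put(node, true);  // re-insert at the end
    }

    Iterable<N> order() {
        return visits.keySet();
    }

    public static void main(String[] args) {
        LastVisitOrder<String> o = new LastVisitOrder<>();
        for (String n : new String[] {"A", "B", "C", "A", "C"}) o.visit(n);
        System.out.println(o.order());   // [B, A, C]
    }
}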
Unless your linked list is large, just using a regular ArrayList will give fast performance even with the shuffling. You should also consider using a HashSet if order is not important, a LinkedHashSet if insertion order matters, or a TreeSet if you want it sorted. They don't allow duplicate values but have good big-O performance for insert, delete and contains.
I have an ArrayList that I sort initially. When I add to it, I do:
int index = Collections.binarySearch(Data.fileList, fileEntry, FileData.COMPARE_BY_FNAME);
if (index >= 0) {
    fileEntry = Data.fileList.get(index);            // get the object that matches
} else {
    Data.fileList.add((index + 1) * -1, fileEntry);  // add the new object at the insertion point
}
which adds the entry into the correct location so I don't have to sort again (I believe).
When the ArrayList gets big I end up with duplicate entries, so I assume it is no longer sorted.
I think that when the ArrayList gets past its initial size and is expanded, my collection is no longer sorted.
Q1) Is this true?
Q2) Is there a way to tell if the collection is no longer sorted? Is there a way to tell if the ArrayList has been expanded? Or do I have to do a sort after every insert?
Q3) The ArrayList.size() returns the number of elements in the list. Is there a way to tell the capacity of the list?
Thanks.
-J
Consider using a data structure that guarantees sorted insertion, like a TreeSet. Other than that, I'm guessing that the problem is in your algorithm for sorted insertion in an ArrayList, and it is not related to the fact that the ArrayList is growing.
When the ArrayList gets big I end up with duplicate entries, so I assume it is no longer sorted.
It stays sorted for as long as you keep adding things in the right location. It's not really clear what you mean by "duplicate entries" here - but there's nothing about expanding an ArrayList which will reorder it.
On the other hand, using a sorted collection to start with (as suggested by Óscar López) would make your life simpler.
If you want a collection that is always sorted, ArrayList is not the tool you really want to use. There are collections which specialize in remaining sorted.
If you want to allow duplicates, PriorityQueue is your best bet. If you are okay with ignoring duplicates, TreeSet might be a better option.
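A quick illustration of that difference in duplicate handling (the values are arbitrary):

import java.util.PriorityQueue;
import java.util.TreeSet;

public class DuplicateDemo {
    public static void main(String[] args) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        pq.add(5); pq.add(5); pq.add(1);
        System.out.println(pq.size());   // 3 -- duplicates are kept; poll() returns elements in order

        TreeSet<Integer> set = new TreeSet<>();
        set.add(5); set.add(5); set.add(1);
        System.out.println(set);         // [1, 5] -- the duplicate is silently ignored
    }
}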
The resizing has nothing to do with it. Sort order is not part of the List contract. A list is simply an ordered collection, and the order is up to whoever adds the elements. When an ArrayList resizes, it simply allocates a new underlying array and copies the existing data over to it in the same order.
As mentioned above, you should be using a SortedSet (specifically TreeSet) if you want to keep things sorted as you add them. If you want to allow duplicates the TreeBag class in commons-collections is a good option.
What is the reason we cannot always use a HashMap, even though it is much more efficient than ArrayList or LinkedList for add and remove operations, regardless of the number of elements?
I googled it and found some reasons, but there was always a workaround for using a HashMap, with its advantages still intact.
Lists represent a sequential ordering of elements.
Maps are used to represent a collection of key / value pairs.
While you could use a map as a list, there are some definite downsides of doing so.
Maintaining order:
A list by definition is ordered. You add items and then you are able to iterate back through the list in the order that you inserted the items. When you add items to a HashMap, you are not guaranteed to retrieve the items in the same order you put them in. There are subclasses of HashMap like LinkedHashMap that will maintain the order, but in general order is not guaranteed with a Map.
Key/Value semantics:
The purpose of a map is to store items based on a key that can be used to retrieve the item at a later point. Similar functionality can only be achieved with a list in the limited case where the key happens to be the position in the list.
Code readability
Consider the following examples.
// Adding to a List
list.add(myObject); // adds to the end of the list
map.put(myKey, myObject); // sure, you can do this, but what is myKey?
map.put("1", myObject); // you could use the position as a key but why?
// Iterating through the items
for (Object o : myList) // nice and easy
for (Object o : myMap.values()) // more code and the order is not guaranteed
Collection functionality
Some great utility functions are available for lists via the Collections class. For example ...
// Randomize the list
Collections.shuffle(myList);
// Sort the list
Collections.sort(myList, myComparator);
Lists and Maps are different data structures. Maps are used for when you want to associate a key with a value and Lists are an ordered collection.
Map is an interface in the Java Collections Framework and HashMap is one implementation of the Map interface. A HashMap is efficient for locating a value based on a key and for inserting and deleting values based on a key. The entries of a HashMap are not ordered.
ArrayList and LinkedList are implementations of the List interface. LinkedList provides sequential access and is generally more efficient at inserting and deleting elements in the list; however, it is less efficient at accessing elements in a list. ArrayList provides random access and is more efficient at accessing elements, but is generally slower at inserting and deleting them.
Here are some real-world examples and scenarios of when to use one or the other; it might be of help for somebody else:
HashMap
When you have to use a cache in your application. Redis and Membase are essentially extended hash maps. (The order of the elements doesn't matter; you need quick, O(1), read access to a value by its key.)
LinkedList
When the order is important (elements stay in the order they were added to the LinkedList), the number of elements is unknown (no wasted memory allocation) and you need quick insertion time (O(1)). A list of to-do items that can be listed sequentially as they are added is a good example.
The downside of ArrayList and LinkedList is that when searching through them, the time it takes to find an item grows with the size of the list.
The beauty of hashing is that although each lookup carries some constant overhead, the time taken does not grow with the size of the map. This is because a HashMap finds information by converting the key you are searching for directly into a bucket index, so it can make the jump.
Long story short...
LinkedList: consumes a little more memory than ArrayList, low cost for insertions (add & remove).
ArrayList: consumes less memory but, like LinkedList, takes extra time to search when large.
HashMap: can jump straight to the value, making the search time effectively constant even for large maps. It consumes more memory and, for very small collections, can take longer to find a value than a small list would.
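A conceptual sketch of that jump (this is not the real java.util.HashMap source, just an illustration of mapping a key's hash to a bucket index):

public class BucketDemo {
    static int bucketIndex(Object key, int capacity) {
        return (key.hashCode() & 0x7fffffff) % capacity;  // strip the sign bit, map into the bucket array
    }

    public static void main(String[] args) {
        int capacity = 16;
        // The same key always lands in the same bucket, regardless of how many entries exist.
        System.out.println(bucketIndex("apple", capacity));
        System.out.println(bucketIndex("banana", capacity));
    }
}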
I'm thinking about filling a collection with a large amount of unique objects.
How does the cost of an insert into a Set (say HashSet) compare to a List (say ArrayList)?
My feeling is that duplicate elimination in sets might cause a slight overhead.
There is no "duplicate elimination" such as comparing to all existing elements. If you insert into hash set, it's really a dictionary of items by hash code. There's no duplicate checking unless there already are items with the same hash code. Given a reasonable (well-distributed) hash function, it's not that bad.
As Will has noted, because of the dictionary structure, HashSet insertion is probably a bit slower than ArrayList insertion (unless you want to insert "between" existing elements). The set also takes a bit more memory. I'm not sure that's a significant difference though.
You're right: set structures are inherently more complex in order to recognize and eliminate duplicates. Whether this overhead is significant for your case should be tested with a benchmark.
Another factor is memory usage. If your objects are very small, the memory overhead introduced by the set structure can be significant. In the most extreme case (TreeSet<Integer> vs. ArrayList<Integer>) the set structure can require more than 10 times as much memory.
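If you do want to benchmark it, here is a very naive sketch (the element count is arbitrary, and for serious measurements a harness like JMH is preferable because of JIT warm-up and GC effects):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InsertCostDemo {
    public static void main(String[] args) {
        final int n = 1_000_000;

        long t0 = System.nanoTime();
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < n; i++) list.add(i);
        long listMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) set.add(i);
        long setMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.println("ArrayList: " + listMs + " ms, HashSet: " + setMs + " ms");
    }
}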
If you're certain your data will be unique, use a List. If you need to enforce that uniqueness, use a Set.
Sets are faster than Lists if you have a large data set, while the inverse is true for smaller data sets. I haven't personally tested this claim.
Which type of List?
Also, consider which List to use. LinkedLists are faster at adding and removing elements.
ArrayLists are faster at random access (for loops, etc.), though this can be worked around using the Iterator of a LinkedList. ArrayLists are also much faster at list.toArray().
You have to compare concrete implementations (for example HashSet with ArrayList), because the abstract interfaces Set/List don't really tell you anything about performance.
Inserting into a HashSet is a pretty cheap operation, as long as the hashCode() of the object to be inserted is sane. It will still be slightly slower than ArrayList, because ArrayList's insertion is a simple insertion into an array (assuming you insert at the end and there's still free space; I don't factor in resizing the internal array, because the same cost applies to HashSet as well).
If the goal is uniqueness of the elements, you should use an implementation of the java.util.Set interface. The classes java.util.HashSet and java.util.LinkedHashSet have expected O(1) complexity for insert, delete and contains checks (assuming a well-distributed hash function).
ArrayList has O(n) contains checks for an object (not an index), since you have to scan through the whole list, and O(n) insertion if the insertion is not at the tail of the list, since you have to shift the underlying array.
You can use LinkedHashSet, which preserves insertion order and has the same performance characteristics as HashSet (it just takes a bit more memory).
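For example (the strings are arbitrary):

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class OrderDemo {
    public static void main(String[] args) {
        Set<String> hash = new HashSet<>();
        Set<String> linked = new LinkedHashSet<>();
        for (String s : new String[] {"banana", "apple", "cherry", "apple"}) {
            hash.add(s);    // duplicate rejected, iteration order unspecified
            linked.add(s);  // duplicate rejected, iteration follows first insertion
        }
        System.out.println(hash);    // order not guaranteed
        System.out.println(linked);  // [banana, apple, cherry]
    }
}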
I don't think you can make this judgement simply on the cost of building the collection. Other things that you need to take into account are:
Is the input dataset ordered? Is there a requirement that the output data structure preserves insertion order?
Is there a requirement that the output data structure is ordered (or reordered) based on element values?
Will the output data structure be subsequently modified? How?
Is there a requirement that the output data structure is duplicate free if other elements are added subsequently?
Do you know how many elements are likely to be in the input dataset?
Can you measure the size of the input dataset? (Or is it provided via an iterator?)
Does space utilization matter?
These can all affect your choice of data structure.
Java List:
If you don't have a requirement about duplicates (whether to keep them out or not), you can use a List instead of a Set.
List is an interface in the Collections Framework that extends the Collection interface; ArrayList and LinkedList are implementations of the List interface.
When to use ArrayList or LinkedList
ArrayList: If most of the work in your application is accessing data, you should go for ArrayList. ArrayList implements the RandomAccess marker interface, which signals that it supports O(1) access to elements by index. Use ArrayList over LinkedList when you want to get data according to insertion order.
LinkedList: If most of your work is insertion or deletion, you should use LinkedList over ArrayList, because in a LinkedList insertion and deletion at a known position (for example through an iterator) happen in O(1) time, whereas in an ArrayList they take O(n) time; finding that position in a LinkedList is still O(n), though. See the sketch below.
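A small sketch of that difference (the list contents are arbitrary): removal through an iterator costs O(1) per element for a LinkedList once the position is reached, while an ArrayList has to shift its tail each time.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class RemoveDemo {
    // Removes every even number while iterating. The remove itself is O(1) per
    // element for LinkedList but O(n) for ArrayList; note that *reaching* a
    // given position in a LinkedList is still O(n).
    static void removeEvens(List<Integer> list) {
        Iterator<Integer> it = list.iterator();
        while (it.hasNext()) {
            if (it.next() % 2 == 0) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> linked = new LinkedList<>(List.of(1, 2, 3, 4, 5));
        List<Integer> array  = new ArrayList<>(List.of(1, 2, 3, 4, 5));
        removeEvens(linked);
        removeEvens(array);
        System.out.println(linked + " " + array);   // [1, 3, 5] [1, 3, 5]
    }
}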
Java Set:
If your application requires that there be no duplicates, you should go for a Set instead of a List, because a Set doesn't store duplicates. HashSet works on the principle of hashing: when you add an object, it uses the object's hashCode to find the right bucket, and if an equal object is already present in that bucket, the new object is not added.
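For instance, add returns false when an equal element is already present:

import java.util.HashSet;
import java.util.Set;

public class NoDuplicatesDemo {
    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        System.out.println(set.add("alpha"));  // true  -- newly added
        System.out.println(set.add("alpha"));  // false -- an equal element is already present
        System.out.println(set.size());        // 1
    }
}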
Is there a way to first sort and then search for an object within a linked list of objects?
I thought I would just use one of the sorting methods and then a binary search. What do you think?
Thanks
This is not a good approach, IMO. If you use Collections.sort(list), where the list is a LinkedList, this copies the list to a temporary array, sorts it, and then copies it back to the list; i.e. O(NlogN) to sort plus 2 * O(N) copies. But when you then try to do a binary search (e.g. using Collections.binarySearch(list)), each search will do O(N) list traversal operations. So you may as well not have bothered sorting the list!
Another approach would be to convert the list to an array or an ArrayList, and then sort and search that array / ArrayList. That gives one copy plus one sort to setup, and O(logN) for each search.
But neither of these is the best approach. That depends on how many times you need to perform search operations.
If you simply want to do one search on the list, then calling list.contains(...) is O(N) ... and that is better than anything involving sorting and binary searching.
If you want to do multiple searches on a list that never changes, you're probably better off putting the list entries into a HashSet. Constructing a HashSet is O(N) and searching is O(1). (This assumes you don't need your own comparator.)
If you want to do multiple searches on a list that keeps changing where the order IS NOT significant, replace the list with a HashSet. The incremental cost of updating the HashSet will be O(1) for each addition/removal, and O(1) for each search.
If you want to do multiple searches on a list that keeps changing and the order IS significant, replace the list with an insertion-ordered LinkedHashMap. That will be O(1) for each addition/removal, and O(1) for each search ... but with larger constants of proportionality than for a HashSet.
java.util.Collections#sort()
java.util.Collections#binarySearch()
The Collections class has lots of other amazing methods to make programmers life easier.
Note that the sort method's implementation will indeed convert the list to an array, but you need not explicitly convert the list into an array before calling the method :)
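A minimal usage sketch of those two methods on a LinkedList (the element values are arbitrary):

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class SortAndSearchDemo {
    public static void main(String[] args) {
        List<String> names = new LinkedList<>(List.of("carol", "alice", "bob"));

        Collections.sort(names);                         // O(N log N); dumps to an array internally
        int i = Collections.binarySearch(names, "bob");  // O(log N) comparisons, but traversal cost
                                                         // on a LinkedList, as noted above
        System.out.println(names + " -> index " + i);    // [alice, bob, carol] -> index 1
    }
}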
You may want to question whether searching over a sorted list is the best option for your use case, as this does not perform well. The list sort is O(NlogN) and the binary search is O(logN). You might consider making a Set out of your list elements and then searching it via the contains method, which is O(1), if you just want to see whether an element exists. It would be much easier to give you advice on which collection to consider if you could explain more about your use case.
EDIT: Consider performance issues of List sorting if you plan to do this for large lists.