I'm using it to sort objects based on particular attributes of each object (e.g. date and quantity). When I attach a comparator to a TreeSet, what kind of performance will I get when I add N values to it and let it sort itself?
From the documentation for TreeSet, emphasis mine:
This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
Your choice of comparator does not matter assuming your comparator is O(1) with respect to the size of the set (which it generally is).
How many values you plan to insert does not change the analysis of a single insert: each add costs O(log n) in the current size of the set, so adding N values costs O(N log N) overall.
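As a rough sketch of what such a comparator-backed TreeSet could look like: the Order class, its fields and the sample values below are assumptions made up for illustration, not something from the question. The comparator reads a fixed number of fields, so each comparison is O(1) with respect to the set size.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class ComparatorTreeSetDemo {
    // Hypothetical element type; the field names are assumptions for illustration.
    static final class Order {
        final String id;
        final LocalDate date;
        final int quantity;

        Order(String id, LocalDate date, int quantity) {
            this.id = id;
            this.date = date;
            this.quantity = quantity;
        }
    }

    // Sort by date, then quantity. The final tie-break on id matters because
    // TreeSet treats compare(a, b) == 0 as "same element" and drops the second one.
    static final Comparator<Order> BY_DATE_THEN_QUANTITY =
            Comparator.comparing((Order o) -> o.date)
                    .thenComparingInt(o -> o.quantity)
                    .thenComparing(o -> o.id);

    // Adds three orders in arbitrary order and returns their ids in iteration order.
    static List<String> sortedIds() {
        TreeSet<Order> orders = new TreeSet<>(BY_DATE_THEN_QUANTITY);
        orders.add(new Order("c", LocalDate.of(2024, 3, 1), 5));
        orders.add(new Order("a", LocalDate.of(2024, 1, 15), 9));
        orders.add(new Order("b", LocalDate.of(2024, 1, 15), 2));

        List<String> ids = new ArrayList<>();
        for (Order o : orders) {
            ids.add(o.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(sortedIds()); // [b, a, c]
    }
}
```

Each add() walks one root-to-leaf path of the tree, calling the comparator once per level.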
Related
LinkedHashSet - This implementation spares its clients from the unspecified, generally chaotic ordering provided by HashSet, without incurring the increased cost associated with TreeSet.
The same is said about LinkedHashMap vs TreeMap.
What is this increased cost (LinkedHashMap vs TreeMap) exactly?
Does that mean that TreeSet needs more memory per element? LinkedHashSet needs more memory for two additional links, but TreeSet needs additional memory to store a Map.Entry per element (because it is implicitly based on TreeMap). Besides, LinkedHashSet is based on HashMap, which also has the Map.Entry overhead per element...
So the difference is how fast a new element is added (in case of TreeSet it takes longer due to some "sorting").
What are other significant increased costs?
TreeSet/TreeMap have a higher time complexity for operations such as add(), contains() (for TreeSet), put(), containsKey() (for TreeMap), etc., since they require logarithmic time to locate elements in the tree (or add elements to the tree), while LinkedHashSet/LinkedHashMap require expected constant time for those operations.
In terms of memory requirements, there's a very small difference:
TreeMap entries hold key, value, 3 Entry references (left, right, parent) and a boolean.
LinkedHashMap entries hold key, value, 3 Entry references (next, before, after) and an int.
When iterating a HashSet, the iteration order is generally the order of the hash of the object, which is generally not too useful if you want a predictable order.
If sane ordering is important you would generally need to use a TreeSet which iterates in sorted order but at a cost because maintaining the sorted order adds to the complexity of the process.
A LinkedHashSet can be used as a middle-ground solution to the seemingly insane ordering of a HashSet by ensuring that the iteration order is at least consistent by using the insertion order.
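A small sketch of the three iteration orders side by side; the class name and sample words are made up for the example. Only the LinkedHashSet and TreeSet orders are guaranteed, so the HashSet line carries no expected output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class IterationOrderDemo {
    // Adds the same words (including one duplicate) and returns the iteration order.
    static List<String> iterate(Set<String> set) {
        set.addAll(Arrays.asList("pear", "apple", "fig", "apple"));
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        // HashSet: order depends on hash codes and table size, not specified.
        System.out.println(iterate(new HashSet<>()));
        // LinkedHashSet: insertion order -> [pear, apple, fig]
        System.out.println(iterate(new LinkedHashSet<>()));
        // TreeSet: sorted order -> [apple, fig, pear]
        System.out.println(iterate(new TreeSet<>()));
    }
}
```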
I have a few questions about the Collections Framework's TreeSet that I am putting here.
Is the only functional difference between the TreeSet and ArrayList classes the constraint of unique elements, plus the elements being sorted in TreeSet?
The prefix Tree creates confusion about whether to visualize a TreeSet as a hierarchical data structure or a linear one. Mathematical sets are linear data structures, while the name Tree in computing indicates a hierarchical one.
Is there really any similarity or relation between the tree data structure and Java's TreeSet, or is the name TreeSet just a coincidence?
I mean, it doesn't seem that a set would have anything to do with parent-child relationships.
EDIT - Looks like I was confused about what I was trying to ask, which got clarified after pondering the comments and answers. I guess my main question should have been "why is a mathematical set data structure (sorted or unsorted) implemented via a tree?" and that is a duplicate of How to implement Set data structure?
Is the only functional difference between the TreeSet and ArrayList
classes the constraint of unique elements, plus the elements being
sorted in TreeSet?
That is the major difference, apart from the internal implementation, and it enables TreeSet to provide methods like subSet, tailSet and headSet, which are not possible with an ArrayList.
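To make the difference concrete, here is a minimal sketch with made-up numbers. The same input goes into an ArrayList and a TreeSet, and the TreeSet is then queried by value range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class TreeSetVsArrayListDemo {
    // Sample input with one duplicate (10); the values are arbitrary.
    static final List<Integer> INPUT = Arrays.asList(40, 10, 30, 10, 20);

    static List<Integer> asArrayList() {
        return new ArrayList<>(INPUT);
    }

    static TreeSet<Integer> asTreeSet() {
        return new TreeSet<>(INPUT);
    }

    public static void main(String[] args) {
        System.out.println(asArrayList());      // [40, 10, 30, 10, 20] (duplicates, insertion order)

        TreeSet<Integer> set = asTreeSet();
        System.out.println(set);                // [10, 20, 30, 40] (unique, sorted)
        System.out.println(set.headSet(30));    // [10, 20]  (strictly less than 30)
        System.out.println(set.tailSet(30));    // [30, 40]  (greater than or equal to 30)
        System.out.println(set.subSet(20, 40)); // [20, 30]  (from inclusive, to exclusive)
    }
}
```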
The prefix Tree creates confusion about whether to visualize a
TreeSet as a hierarchical data structure or a linear one. Mathematical
sets are linear data structures, while the name Tree in computing
indicates a hierarchical one.
Yes, it is a hierarchical structure. Internally the implementation is a red-black binary tree.
Is there really any similarity or relation between the tree data structure
and Java's TreeSet, or is the name TreeSet just a coincidence?
The internal implementation is a R-B binary tree.
On a side note, since these two are different data structures, the time complexity of TreeSet is completely different from ArrayList for the same set of operations. For example, add is amortized O(1) for ArrayList but O(log n) for TreeSet, search is O(n) for ArrayList but O(log n) for TreeSet, and so on...
TreeSet is a real tree, not a coincidence.
So there are many differences from ArrayList.
For example, performance (I mean Big-O) is totally different.
In terms of usage it is just a Set, plus some extra goodies like having a definite sequence. However, it is internally implemented as a tree.
The naming convention here is the same as with HashSet, another Set that is internally implemented as a hash table.
Internally, TreeSet is stored as a tree structure, and this fact influences the complexity of its operations. Most operations require O(log n) steps for tree-based structures, whereas array-based structures work in constant time for the most common read-only operations. HashSet is backed by an array (a hash table), which allows expected constant-time access to its values.
They also provide different functionality. HashSet just stores elements. It behaves like a mathematical set, as you said, in a linear manner.
TreeSet provides more operations: take a look at the NavigableSet and SortedSet interfaces it implements. Elements of a TreeSet are always sorted. In return, they need an ordering rule, provided either by implementing the Comparable interface or by passing a separate Comparator object.
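The two ways of supplying the ordering rule can be sketched like this. The class name and words are invented for the example: String already implements Comparable, and the by-length Comparator is just one possible external rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class OrderingRulesDemo {
    static final List<String> WORDS = Arrays.asList("kiwi", "fig", "banana");

    // Natural ordering: String implements Comparable<String>, so no Comparator is needed.
    static List<String> naturalOrder() {
        return new ArrayList<>(new TreeSet<>(WORDS));
    }

    // External rule: shortest first, ties broken alphabetically so that
    // two different words of equal length are not treated as duplicates.
    static List<String> byLength() {
        TreeSet<String> set = new TreeSet<>(
                Comparator.comparingInt(String::length)
                        .thenComparing(Comparator.<String>naturalOrder()));
        set.addAll(WORDS);
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        System.out.println(naturalOrder()); // [banana, fig, kiwi]
        System.out.println(byLength());     // [fig, kiwi, banana]
    }
}
```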
I'm running some benchmarks. One of my tests depends on order, so I'm using a TreeSet for that. My second test doesn't, so I'm using a HashSet for it.
I know that insertion is slower for the TreeSet. But what about iterating through all elements?
TreeSet internally uses a TreeMap, which is a red-black tree (a special type of BST).
An in-order traversal of a BST is O(n).
HashSet internally uses a HashMap, which uses an array for holding Entry objects.
Here, too, traversal should be O(n) (more precisely, proportional to the number of elements plus the capacity of the backing array).
Unless you write a benchmark it is going to be difficult to prove which is faster.
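A very naive timing sketch, assuming nothing beyond the JDK. The class name, element count and round counts are arbitrary choices, and a serious measurement would use a proper harness such as JMH, since JIT warm-up and GC can easily distort numbers like these.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class IterationTimingSketch {
    // Fills the given set with 0..n-1 and returns it.
    static Set<Integer> fill(Set<Integer> set, int n) {
        for (int i = 0; i < n; i++) {
            set.add(i);
        }
        return set;
    }

    // Iterates over every element; the sum keeps the loop from being optimized away.
    static long sum(Set<Integer> set) {
        long total = 0;
        for (int value : set) {
            total += value;
        }
        return total;
    }

    // Times `rounds` full iterations of the set, in nanoseconds.
    static long timeNanos(Set<Integer> set, int rounds) {
        long checksum = 0;
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            checksum += sum(set);
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("checksum: " + checksum); // printed so the work is observable
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 1_000_000; // arbitrary size for the sketch
        Set<Integer> hash = fill(new HashSet<>(), n);
        Set<Integer> tree = fill(new TreeSet<>(), n);

        // Warm-up rounds so the JIT has compiled the loop before measuring.
        timeNanos(hash, 5);
        timeNanos(tree, 5);

        System.out.println("HashSet iteration ns: " + timeNanos(hash, 20));
        System.out.println("TreeSet iteration ns: " + timeNanos(tree, 20));
    }
}
```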
From a similar post (Hashset vs Treeset):
HashSet is much faster than TreeSet (constant-time versus log-time for most operations like add, remove and contains) but offers no ordering guarantees like TreeSet.
HashSet:
class offers constant time performance for the basic operations (add, remove, contains and size).
it does not guarantee that the order of elements will remain constant over time
iteration performance depends on the initial capacity and the load factor of the HashSet.
It's quite safe to accept default load factor but you may want to specify an initial capacity that's about twice the size to which you expect the set to grow.
TreeSet:
guarantees log(n) time cost for the basic operations (add, remove and contains)
guarantees that the elements of the set will be sorted (ascending natural order, or the order you specify via its constructor)
doesn't offer any tuning parameters for iteration performance
offers a few handy methods to deal with the ordered set like first(), last(), headSet(), and tailSet() etc
Important points:
Both guarantee duplicate-free collection of elements
It is generally faster to add elements to the HashSet and then convert the collection to a TreeSet for a duplicate-free sorted traversal.
Neither of these implementations is synchronized. That is, if multiple threads access a set concurrently and at least one of the threads modifies the set, it must be synchronized externally.
LinkedHashSet is in some sense intermediate between HashSet and TreeSet. It is implemented as a hash table with a linked list running through it, so it provides insertion-ordered iteration, which is not the same as the sorted traversal guaranteed by TreeSet.
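One way to do the external synchronization mentioned above is the wrapper from java.util.Collections. This is only a sketch: the class name, thread setup and element counts are invented for the example.

```java
import java.util.Collections;
import java.util.SortedSet;
import java.util.TreeSet;

class SynchronizedSetDemo {
    // Two threads add disjoint halves of 0..1999 to one shared, wrapped TreeSet.
    static int fillFromTwoThreads() {
        SortedSet<Integer> shared = Collections.synchronizedSortedSet(new TreeSet<Integer>());

        Thread evens = new Thread(() -> {
            for (int i = 0; i < 2000; i += 2) {
                shared.add(i);
            }
        });
        Thread odds = new Thread(() -> {
            for (int i = 1; i < 2000; i += 2) {
                shared.add(i);
            }
        });

        evens.start();
        odds.start();
        try {
            evens.join();
            odds.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }

        // Single calls such as add() and size() are synchronized by the wrapper.
        // Iterating still requires a manual synchronized (shared) { ... } block.
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(fillFromTwoThreads()); // 2000
    }
}
```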
So the choice depends entirely on your needs, but I feel that even if you need an ordered collection you should still prefer a HashSet to build the Set and then convert it into a TreeSet.
e.g. Set<String> s = new TreeSet<String>(hashSet);
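Spelled out a little more, that one-liner could be used like this; the class and method names are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class HashThenSortDemo {
    // Collects into a HashSet first, then sorts once by copying into a TreeSet.
    static List<String> dedupeThenSort(List<String> input) {
        Set<String> hashSet = new HashSet<>(input);  // duplicates are dropped here
        Set<String> sorted = new TreeSet<>(hashSet); // one pass that builds the sorted set
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        System.out.println(dedupeThenSort(Arrays.asList("b", "a", "b", "c"))); // [a, b, c]
    }
}
```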
If you want stable ordering with (nearly) the performance of a HashSet, then use a LinkedHashSet. You will still get constant-time operations, whereas I would assume a TreeSet will get you logarithmic time.
I'm thinking about filling a collection with a large amount of unique objects.
How does the cost of an insert into a Set (say HashSet) compare to a List (say ArrayList)?
My feeling is that duplicate elimination in sets might cause a slight overhead.
There is no "duplicate elimination" such as comparing to all existing elements. If you insert into hash set, it's really a dictionary of items by hash code. There's no duplicate checking unless there already are items with the same hash code. Given a reasonable (well-distributed) hash function, it's not that bad.
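A small sketch of that lookup in action. Tag and RawTag are hypothetical types invented for the example. HashSet only calls equals() on elements that land in the same bucket, so a class that does not override equals/hashCode never produces "duplicates".

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class DuplicateCheckDemo {
    // Hypothetical value type that defines equality by name.
    static final class Tag {
        final String name;

        Tag(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Tag && ((Tag) o).name.equals(name);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name);
        }
    }

    // Same shape, but equals/hashCode are inherited from Object (identity based).
    static final class RawTag {
        final String name;

        RawTag(String name) {
            this.name = name;
        }
    }

    static int sizeAfterAddingTwoEqualTags() {
        Set<Tag> tags = new HashSet<>();
        tags.add(new Tag("java"));
        tags.add(new Tag("java")); // same hash code and equals() is true: not added
        return tags.size();
    }

    static int sizeAfterAddingTwoRawTags() {
        Set<RawTag> tags = new HashSet<>();
        tags.add(new RawTag("java"));
        tags.add(new RawTag("java")); // different objects, identity equality: both kept
        return tags.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterAddingTwoEqualTags()); // 1
        System.out.println(sizeAfterAddingTwoRawTags());   // 2
    }
}
```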
As Will has noted, because of the dictionary structure HashSet is probably a bit slower than an ArrayList (unless you want to insert "between" existing elements). It also is a bit larger. I'm not sure that's a significant difference though.
You're right: set structures are inherently more complex in order to recognize and eliminate duplicates. Whether this overhead is significant for your case should be tested with a benchmark.
Another factor is memory usage. If your objects are very small, the memory overhead introduced by the set structure can be significant. In the most extreme case (TreeSet<Integer> vs. ArrayList<Integer>) the set structure can require more than 10 times as much memory.
If you're certain your data will be unique, use a List. You can use a Set to enforce this rule.
Sets are faster than Lists if you have a large data set, while the inverse is true for smaller data sets. I haven't personally tested this claim.
Which type of List?
Also, consider which List to use. LinkedLists are faster at adding and removing elements at the ends, or at a position you have already reached with an iterator.
ArrayLists are faster at random access (indexed for loops, etc.), but this can be worked around by using the Iterator of a LinkedList. ArrayLists are also much faster at list.toArray().
You have to compare concrete implementations (for example HashSet with ArrayList), because the abstract interfaces Set/List don't really tell you anything about performance.
Inserting into a HashSet is a pretty cheap operation, as long as the hashCode() of the object to be inserted is sane. It will still be slightly slower than an ArrayList, because the ArrayList's insertion is a simple write into an array (assuming you insert at the end and there's still free space; I don't factor in resizing the internal array, because the same cost applies to HashSet as well).
If the goal is the uniqueness of the elements, you should use an implementation of the java.util.Set interface. The classes java.util.HashSet and java.util.LinkedHashSet have O(alpha) complexity (close to O(1) in the best case) for insert, delete and contains checks.
ArrayList has O(n) complexity for an object (not index) contains check (you have to scan the whole list) and for insertion (if the insertion is not at the tail of the list, you have to shift part of the underlying array).
You can use a LinkedHashSet, which preserves insertion order and offers the same capabilities as HashSet (it only takes up a bit more memory).
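For instance, removing duplicates from a list while keeping the first-seen order could be sketched as follows; the helper name dedupe is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class OrderPreservingDedupeDemo {
    // Copies the input through a LinkedHashSet: duplicates are dropped,
    // and the first occurrence of each element keeps its position.
    static <T> List<T> dedupe(List<T> input) {
        return new ArrayList<>(new LinkedHashSet<>(input));
    }

    public static void main(String[] args) {
        System.out.println(dedupe(Arrays.asList(3, 1, 3, 2, 1))); // [3, 1, 2]
    }
}
```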
I don't think you can make this judgement simply on the cost of building the collection. Other things that you need to take into account are:
Is the input dataset ordered? Is there a requirement that the output data structure preserves insertion order?
Is there a requirement that the output data structure is ordered (or reordered) based on element values?
Will the output data structure be subsequently modified? How?
Is there a requirement that the output data structure is duplicate free if other elements are added subsequently?
Do you know how many elements are likely to be in the input dataset?
Can you measure the size of the input dataset? (Or is it provided via an iterator?)
Does space utilization matter?
These can all affect your choice of data structure.
Java List:
If you have no requirement about whether duplicates are kept, you can use a List instead of a Set.
List is an interface in the Collections framework that extends the Collection interface. ArrayList and LinkedList are implementations of the List interface.
When to use ArrayList or LinkedList
ArrayList: If most of the work in your application is accessing data, you should go for ArrayList. ArrayList is backed by an array, so it can access an element by index in O(1) time; it advertises this by implementing RandomAccess, which is a marker interface. Like LinkedList, it returns data in insertion order.
LinkedList: If most of your work is insertion or deletion, you should use LinkedList over ArrayList, because in a LinkedList insertion and deletion at either end (or at a position you have already reached with an iterator) happen in O(1) time, whereas in an ArrayList they are O(n) in general.
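A short sketch of how code can check for the RandomAccess marker and where each list type is cheap. The class name, helper name and sample data are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.RandomAccess;

class ListChoiceDemo {
    // RandomAccess has no methods; it only signals that get(index) is fast.
    static boolean supportsFastIndexing(List<?> list) {
        return list instanceof RandomAccess;
    }

    public static void main(String[] args) {
        ArrayList<String> array = new ArrayList<>(Arrays.asList("a", "b", "c"));
        LinkedList<String> linked = new LinkedList<>(Arrays.asList("a", "b", "c"));

        System.out.println(supportsFastIndexing(array));  // true
        System.out.println(supportsFastIndexing(linked)); // false

        System.out.println(array.get(1)); // b (direct array lookup)

        linked.addFirst("start");         // O(1): only the head links change
        System.out.println(linked);       // [start, a, b, c]
    }
}
```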
Java Set:
If your application requires that there be no duplicates, you should go for a Set instead of a List, because a Set doesn't store duplicates. HashSet, for example, works on the principle of hashing: when you add an object, it first uses the object's hashCode to find a bucket, and if that bucket already holds an element that is equal to the new object (same hash code and equals() returns true), the object is not added.
What are the advantages of each structure?
In my program I will be performing these steps and I was wondering which data structure above I should be using:
Taking in an unsorted array and
adding them to a sorted structure.
Traversing through sorted data and removing the right one
Adding data (never removing) and returning that structure as an array
When do you know when to use a TreeSet or LinkedList? What are the advantages of each structure?
In general, you decide on a collection type based on the structural and performance properties that you need it to have. For instance, a TreeSet is a Set, and therefore does not allow duplicates and does not preserve insertion order of elements. By contrast a LinkedList is a List and therefore does allow duplicates and does preserve insertion order. On the performance side, TreeSet gives you O(logN) insertion and deletion, whereas LinkedList gives O(1) insertion at the beginning or end, and O(N) insertion at a selected position or deletion.
The details are all spelled out in the respective class and interface javadocs, but a useful summary may be found in the Java Collections Cheatsheet.
In practice though, the choice of collection type is intimately connected to algorithm design. The two need to be done in parallel. (It is no good deciding that your algorithm requires a collection with properties X, Y and Z, and then discovering that no such collection type exists.)
In your use-case, it looks like TreeSet would be a better fit. There is no efficient way (i.e. better than O(N^2)) to sort a large LinkedList that doesn't involve turning it into some other data structure to do the sorting. There is no efficient way (i.e. better than O(N)) to insert an element into the correct position in a previously sorted LinkedList. The third part (copying to an array) works equally well with a LinkedList or TreeSet; it is an O(N) operation in both cases.
[I'm assuming that the collections are large enough that the big O complexity predicts the actual performance accurately ... ]
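The three steps from the question could be sketched with a TreeSet roughly as follows. The method name, parameters and sample values are assumptions for illustration, and any duplicates in the input array are silently dropped by the set.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.TreeSet;

class SortedWorkflowDemo {
    static Integer[] process(int[] unsorted, int toRemove, int toAdd) {
        // Step 1: add the unsorted input to a sorted structure, O(log n) per element.
        TreeSet<Integer> sorted = new TreeSet<>();
        for (int value : unsorted) {
            sorted.add(value);
        }

        // Step 2: traverse in sorted order and remove the matching element.
        // Iterator.remove() is the safe way to delete during traversal.
        for (Iterator<Integer> it = sorted.iterator(); it.hasNext(); ) {
            if (it.next() == toRemove) {
                it.remove();
                break;
            }
        }

        // Step 3: add more data, then return everything as an array, O(n).
        sorted.add(toAdd);
        return sorted.toArray(new Integer[0]);
    }

    public static void main(String[] args) {
        Integer[] result = process(new int[] {5, 3, 9, 1}, 3, 4);
        System.out.println(Arrays.toString(result)); // [1, 4, 5, 9]
    }
}
```

If the element to remove is known by value, sorted.remove(toRemove) does the same job in O(log n) without a traversal.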
The real power and advantage of TreeSet lies in the interface it implements: NavigableSet.
Why is it so powerful, and in which cases?
The NavigableSet interface adds, for example, these 3 nice methods:
headSet(E toElement, boolean inclusive)
tailSet(E fromElement, boolean inclusive)
subSet(E fromElement, boolean fromInclusive, E toElement, boolean toInclusive)
These methods make it possible to build an efficient (very fast) search.
Example: we need to find all the names that sort between Milla and Wladimir:
TreeSet<String> authors = new TreeSet<String>();
authors.add("Andreas Gryphius");
authors.add("Fjodor Michailowitsch Dostojewski");
authors.add("Alexander Puschkin");
authors.add("Ruslana Lyzhichko");
authors.add("Wladimir Klitschko");
authors.add("Andrij Schewtschenko");
authors.add("Wayne Gretzky");
authors.add("Johann Jakob Christoffel");
authors.add("Milla Jovovich");
authors.add("Taras Schewtschenko");
System.out.println(authors.subSet("Milla", "Wladimir"));
output:
[Milla Jovovich, Ruslana Lyzhichko, Taras Schewtschenko, Wayne Gretzky]
TreeSet doesn't go over all the elements; it locates the first and last elements of the range and returns a view of the set that contains the elements in between.
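The boolean flags on the three NavigableSet methods control whether each bound itself is included. A minimal sketch, reusing a few of the author names from the example (the class name is made up):

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class InclusiveBoundsDemo {
    static TreeSet<String> authors() {
        return new TreeSet<>(Arrays.asList(
                "Milla Jovovich",
                "Ruslana Lyzhichko",
                "Taras Schewtschenko",
                "Wayne Gretzky",
                "Wladimir Klitschko"));
    }

    public static void main(String[] args) {
        NavigableSet<String> set = authors();

        // Lower bound excluded, upper bound included.
        System.out.println(set.subSet("Milla Jovovich", false, "Wladimir Klitschko", true));
        // [Ruslana Lyzhichko, Taras Schewtschenko, Wayne Gretzky, Wladimir Klitschko]

        // Everything up to and including the bound.
        System.out.println(set.headSet("Ruslana Lyzhichko", true));
        // [Milla Jovovich, Ruslana Lyzhichko]

        // Everything strictly after the bound.
        System.out.println(set.tailSet("Wayne Gretzky", false));
        // [Wladimir Klitschko]
    }
}
```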
TreeSet:
TreeSet uses a red-black tree underneath, so the set can be thought of as a dynamic search tree. When you need a structure that is read and written frequently and must also keep its order, TreeSet is a good choice.
If you want to keep it sorted and it's append-mostly, TreeSet with a Comparator is your best bet. With a LinkedList, the code would have to traverse the list from the beginning to decide where to place an item. Finding the insertion point in a LinkedList is O(n), while TreeSet is O(log(n)) for the basic operations.
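To illustrate the traversal cost, here is a sketch of what keeping a LinkedList sorted by hand looks like. The insertSorted helper is hypothetical, written only for this comparison; with a TreeSet, a plain add() does the same job.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.ListIterator;

class SortedInsertDemo {
    // Walks the list from the head until it finds a larger element: O(n) per insert.
    static void insertSorted(LinkedList<Integer> list, int value) {
        ListIterator<Integer> it = list.listIterator();
        while (it.hasNext()) {
            if (it.next() > value) {
                it.previous(); // step back so the new value lands before the larger element
                break;
            }
        }
        it.add(value); // the link update itself is O(1) once the position is known
    }

    public static void main(String[] args) {
        LinkedList<Integer> list = new LinkedList<>(Arrays.asList(1, 3, 5));
        insertSorted(list, 4);
        insertSorted(list, 0);
        insertSorted(list, 6);
        System.out.println(list); // [0, 1, 3, 4, 5, 6]
    }
}
```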
The most important point when choosing a data structure is its inherent limitations. For example, if you use a TreeSet to store objects and at run time your algorithm changes attributes of these objects that affect equals comparisons (or the ordering used by the set) while the objects are elements of the set, get ready for some strange bugs.
The Javadoc for the Set interface states that:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
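A minimal sketch of the kind of bug this warning is about. Item is a made-up class with a mutable key. The behavior is formally unspecified, so the result shown is what the standard TreeMap-backed TreeSet does for this particular insertion order, not something the API promises.

```java
import java.util.Comparator;
import java.util.TreeSet;

class MutableElementDemo {
    // Hypothetical element whose ordering key can be changed after insertion.
    static final class Item {
        int key;

        Item(int key) {
            this.key = key;
        }
    }

    static boolean stillFoundAfterMutation() {
        TreeSet<Item> set = new TreeSet<>(Comparator.comparingInt((Item i) -> i.key));
        Item low = new Item(1);
        set.add(new Item(2)); // becomes the root
        set.add(low);         // placed to the left of the root
        set.add(new Item(3)); // placed to the right of the root

        // Mutating the key does not move the node: it stays on the left side,
        // but a lookup for key 10 now searches the right side of the tree.
        low.key = 10;

        return set.contains(low);
    }

    public static void main(String[] args) {
        // The element is still in the set (size is 3), yet contains() cannot find it.
        System.out.println(stillFoundAfterMutation()); // false
    }
}
```

The usual fix is to remove the element, change it, and add it back, or to make the compared fields immutable.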
Interface Set Java Doc