Java collection insertion: Set vs. List

Java collection insertion: Set vs. List - java

I'm thinking about filling a collection with a large amount of unique objects.
How is the cost of an insert in a Set (say HashSet) compared to an List (say ArrayList)?
My feeling is that duplicate elimination in sets might cause a slight overhead.

There is no "duplicate elimination" such as comparing to all existing elements. If you insert into hash set, it's really a dictionary of items by hash code. There's no duplicate checking unless there already are items with the same hash code. Given a reasonable (well-distributed) hash function, it's not that bad.
As Will has noted, because of the dictionary structure HashSet is probably a bit slower than an ArrayList (unless you want to insert "between" existing elements). It also is a bit larger. I'm not sure that's a significant difference though.

You're right: set structures are inherently more complex in order to recognize and eliminate duplicates. Whether this overhead is significant for your case should be tested with a benchmark.
Another factor is memory usage. If your objects are very small, the memory overhead introduced by the set structure can be significant. In the most extreme case (TreeSet<Integer> vs. ArrayList<Integer>) the set structure can require more than 10 times as much memory.

If you're certain your data will be unique, use a List. You can use a Set to enforce this rule.
Sets are faster than Lists if you have a large data set, while the inverse is true for smaller data sets. I haven't personally tested this claim.
Which type of List?
Also, consider which List to use. LinkedLists are faster at adding, removing elements.
ArrayLists are faster at random access (for loops, etc), but this can be worked around using the Iterator of a LinkedList. ArrayLists are are much faster at: list.toArray().

You have to compare concrete implementations (for example HashSet with ArrayList), because the abstract interfaces Set/List don't really tell you anything about performance.
Inserting into a HashSet is a pretty cheap operation, as long as the hashCode() of the object to be inserted is sane. It will still be slightly slower than ArrayList, because it's insertion is a simple insertion into an array (assuming you insert in the end and there's still free space; I don't factor in resizing the internal array, because the same cost applies to HashSet as well).

If the goal is the uniqueness of the elements, you should use an implementation of the java.util.Set interface. The class java.util.HashSet and java.util.LinkedHashSet have O(alpha) (close to O(1) in the best case) complexity for insert, delete and contains check.
ArrayList have O(n) for object (not index) contains check (you have to scroll through the whole list) and insertion (if the insertion is not in tail of the list, you have to shift the whole underline array).
You can use LinkedHashSet that preserve the order of insertion and have the same potentiality of HashSet (takes up only a bit more of memory).

I don't think you can make this judgement simply on the cost of building the collection. Other things that you need to take into account are:
Is the input dataset ordered? Is there a requirement that the output data structure preserves insertion order?
Is there a requirement that the output data structure is ordered (or reordered) based on element values?
Will the output data structure be subsequently modified? How?
Is there a requirement that the output data structure is duplicate free if other elements are added subsequently?
Do you know how many elements are likely to be in the input dataset?
Can you measure the size of the input dataset? (Or is it provided via an iterator?)
Does space utilization matter?
These can all effect your choice of data structure.

Java List:
If you don't have such requirement that you have to keep duplicate or not. Then you can use List instead of Set.
List is an interface in Collection framework. Which extends Collection interface. and ArrayList, LinkedList is the implementation of List interface.
When to use ArrayList or LinkedList
ArrayList: If you have such requirement that in your application mostly work is accessing the data. Then you should go for ArrayList. because ArrayList implements RtandomAccess interface which is Marker Interface. because of Marker interface ArrayList have capability to access the data in O(1) time. and you can use ArrayList over LinkedList where you want to get data according to insertion order.
LinkedList: If you have such requirement that your mostly work is insertion or deletion. Then you should use LinkedList over the ArrayList. because in LinkedList insertion and deletion happen in O(1) time whereas in ArrayList it's O(n) time.
Java Set:
If you have requirement in your application that you don't want any duplicates. Then you should go for Set instead of List. Because Set doesn't store any duplicates. Because Set works on the principle of Hashing. If we add object in Set then first it checks object's hashCode in the bucket if it's find any hashCode present in it's bucked then it'll not add that object.

Related

Fast LinkedList search and delete in java

I am using Java's Linkedlist in my project. I have to build a delete function that removes an element with a specified unique id (id is a filed in my class) in the Linkedlist. As per the Java official document, were I to use LinkedList.remove, the runtime would be O(n) as the process happens in two steps, the first of which is a linear search with a runtime of O(n) followed by the actual delete which takes O(1).
In an attempt to speed things up, I wanted to use a binary tree for lookup, where each node in the tree is (id, reference to the node in the linkedlist). I am not exactly sure how to implement this in Java. In C/C++, one could just store a pointer as reference to the node in the linkedlist.
==
If you are wondering why I have to use LinkedList, it's because I am building an order-matching engine for exchanges. LinkedList offers superior runtime as far as insert is concerned. I am also using insertion sort to keep prices in the orderbook sorted. Priority queue does not suit my needs because I have to show the sorted order book in real time.

Have you seen the video of Stroustrup's conference talk where he showed that you should use std::vector unless you have measured a performance benefit of not using std::vector? He showed that std::vector is almost always the correct collection to use, and showed that it is faster than linked list even when inserting and deleting in the middle.
Now translate that to Java: use ArrayList unless you have measured better performance with something else.
Why is that? With modern processor architectures, there is a locality benefit: elements that you compare together, elements that you process together, are all stored next to each other in memory and are likely to be in the CPU's cache at the same time. This allows them to be fetched and written to much faster than when they're in main memory. This is not the case with a linked list, where elements are allocated individually and spread all over the place. (This locality benefit is much more pronounced in C++ where you have the actual objects next to each other, but it's still valid to a smaller extent in Java, where you have the references next to each other, albeit not the actual objects.)
Now with ArrayList, you can keep the orders sorted by price, and use binary search to insert an order in the right place.
If your performance measurement shows that LinkedList is preferable, then unfortunately Java doesn't give you access to the internal representation – the actual nodes – of the LinkedList, so you'll have to homebrew your own list.

Why are you using a List?
If you have a unique id for each object, why not put it in a Map with the id as the key? If you choose a HashMap is implementation removal is O(1). If you implement using LinkedHashMap you can preserve insertion order as well.
LinkedList insertion is superior to....what?
HashMap get/put complexity

You can easily solve this by having a small change.
First have an object that has your value and id as fields
class MyElement implements Comparable{
int id,value;
//Implement compareTo() to sort based on values
//Override equals() method to compare ids
//Override hashcode() to return the id
}
Now use a TreeSet to store these objects.
In this data structure the incoming objects get sorted and deletion and insertion also find lower time complexity of O(log n)

To preserve order by id and to get good performance use TreeMap. Put, remove and get operations will be O(log n).
EDIT:
For preserving order of insertion of elements for each id you can use TreeMap < Integer, ArrayList < T > >, i.e. for each id you can save elements with particular id in list in order of insertion.

When to use HashMap over LinkedList or ArrayList and vice-versa

What is the reason why we cannot always use a HashMap, even though it is much more efficient than ArrayList or LinkedList in add,remove operations, also irrespective of the number of the elements.
I googled it and found some reasons, but there was always a workaround for using HashMap, with advantages still alive.

Lists represent a sequential ordering of elements.
Maps are used to represent a collection of key / value pairs.
While you could use a map as a list, there are some definite downsides of doing so.
Maintaining order:
A list by definition is ordered. You add items and then you are able to iterate back through the list in the order that you inserted the items. When you add items to a HashMap, you are not guaranteed to retrieve the items in the same order you put them in. There are subclasses of HashMap like LinkedHashMap that will maintain the order, but in general order is not guaranteed with a Map.
Key/Value semantics:
The purpose of a map is to store items based on a key that can be used to retrieve the item at a later point. Similar functionality can only be achieved with a list in the limited case where the key happens to be the position in the list.
Code readability
Consider the following examples.
// Adding to a List
list.add(myObject); // adds to the end of the list
map.put(myKey, myObject); // sure, you can do this, but what is myKey?
map.put("1", myObject); // you could use the position as a key but why?
// Iterating through the items
for (Object o : myList) // nice and easy
for (Object o : myMap.values()) // more code and the order is not guaranteed
Collection functionality
Some great utility functions are available for lists via the Collections class. For example ...
// Randomize the list
Collections.shuffle(myList);
// Sort the list
Collections.sort(myList, myComparator);

Lists and Maps are different data structures. Maps are used for when you want to associate a key with a value and Lists are an ordered collection.
Map is an interface in the Java Collection Framework and a HashMap is one implementation of the Map interface. HashMap are efficient for locating a value based on a key and inserting and deleting values based on a key. The entries of a HashMap are not ordered.
ArrayList and LinkedList are an implementation of the List interface. LinkedList provides sequential access and is generally more efficient at inserting and deleting elements in the list, however, it is it less efficient at accessing elements in a list. ArrayList provides random access and is more efficient at accessing elements but is generally slower at inserting and deleting elements.

I will put here some real case examples and scenarios when to use one or another, it might be of help for somebody else:
HashMap
When you have to use cache in your application. Redis and membase are some type of extended HashMap. (Doesn't matter the order of the elements, you need quick ( O(1) ) read access (a value), using a key).
LinkedList
When the order is important (they are ordered as they were added to the LinkedList), the number of elements are unknown (don't waste memory allocation) and you require quick insertion time ( O(1) ). A list of to-do items that can be listed sequentially as they are added is a good example.

The downfall of ArrayList and LinkedList is that when iterating through them, depending on the search algorithm, the time it takes to find an item grows with the size of the list.
The beauty of hashing is that although you sacrifice some extra time searching for the element, the time taken does not grow with the size of the map. This is because the HashMap finds information by converting the element you are searching for, directly into the index, so it can make the jump.
Long story short...
LinkedList: Consumes a little more memory than ArrayList, low cost for insertions(add & remove)
ArrayList: Consumes low memory, but similar to LinkedList, and takes extra time to search when large.
HashMap: Can perform a jump to the value, making the search time constant for large maps. Consumes more memory and takes longer to find the value than small lists.

When do you know when to use a TreeSet or LinkedList?

What are the advantages of each structure?
In my program I will be performing these steps and I was wondering which data structure above I should be using:
Taking in an unsorted array and
adding them to a sorted structure1.
Traversing through sorted data and removing the right one
Adding data (never removing) and returning that structure as an array

When do you know when to use a TreeSet or LinkedList? What are the advantages of each structure?
In general, you decide on a collection type based on the structural and performance properties that you need it to have. For instance, a TreeSet is a Set, and therefore does not allow duplicates and does not preserve insertion order of elements. By contrast a LinkedList is a List and therefore does allow duplicates and does preserve insertion order. On the performance side, TreeSet gives you O(logN) insertion and deletion, whereas LinkedList gives O(1) insertion at the beginning or end, and O(N) insertion at a selected position or deletion.
The details are all spelled out in the respective class and interface javadocs, but a useful summary may be found in the Java Collections Cheatsheet.
In practice though, the choice of collection type is intimately connected to algorithm design. The two need to be done in parallel. (It is no good deciding that your algorithm requires a collection with properties X, Y and Z, and then discovering that no such collection type exists.)
In your use-case, it looks like TreeSet would be a better fit. There is no efficient way (i.e. better than O(N^2)) to sort a large LinkedList that doesn't involve turning it into some other data structure to do the sorting. There is no efficient way (i.e. better than O(N)) to insert an element into the correct position in a previously sorted LinkedList. The third part (copying to an array) works equally well with a LinkedList or TreeSet; it is an O(N) operation in both cases.
[I'm assuming that the collections are large enough that the big O complexity predicts the actual performance accurately ... ]

The genuine power and advantage of TreeSet lies in interface it realizes - NavigableSet
Why is it so powerfull and in which case?
Navigable Set interface add for example these 3 nice methods:
headSet(E toElement, boolean inclusive)
tailSet(E fromElement, boolean inclusive)
subSet(E fromElement, boolean fromInclusive, E toElement, boolean toInclusive)
These methods allow to organize effective search algorithm(very fast).
Example: we need to find all the names which start with Milla and end with Wladimir:
TreeSet<String> authors = new TreeSet<String>();
authors.add("Andreas Gryphius");
authors.add("Fjodor Michailowitsch Dostojewski");
authors.add("Alexander Puschkin");
authors.add("Ruslana Lyzhichko");
authors.add("Wladimir Klitschko");
authors.add("Andrij Schewtschenko");
authors.add("Wayne Gretzky");
authors.add("Johann Jakob Christoffel");
authors.add("Milla Jovovich");
authors.add("Taras Schewtschenko");
System.out.println(authors.subSet("Milla", "Wladimir"));
output:
[Milla Jovovich, Ruslana Lyzhichko, Taras Schewtschenko, Wayne Gretzky]
TreeSet doesn't go over all the elements, it finds first and last elemenets and returns a new Collection with all the elements in the range.

TreeSet:
TreeSet uses Red-Black tree underlying. So the set could be thought as a dynamic search tree. When you need a structure which is operated read/write frequently and also should keep order, the TreeSet is a good choice.

If you want to keep it sorted and it's append-mostly, TreeSet with a Comparator is your best bet. The JVM would have to traverse the LinkedList from the beginning to decide where to place an item. LinkedList = O(n) for any operations, TreeSet = O(log(n)) for basic stuff.

The most important point when choosing a data structure are its inherent limitations. For example if you use TreeSet to store objects and during run-time your algorithm changes attributes of these objects which affect equal comparisons while the object is an element of the set, get ready for some strange bugs.
The Java Doc for Set interface state that:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
Interface Set Java Doc

confusing java data structures

Maybe the title is not appropriate but I couldn't think of any other at this moment. My question is what is the difference between LinkedList and ArrayList or HashMap and THashMap .
Is there a tree structure already for Java(ex:AVL,red-black) or balanced or not balanced(linked list). If this kind of question is not appropriate for SO please let me know I will delete it. thank you

ArrayList and LinkedList are implementations of the List abstraction. The first holds the elements of the list in an internal array which is automatically reallocated as necessary to make space for new elements. The second constructs a doubly linked list of holder cells, each of which refers to a list element. While the respective operations have identical semantics, they differ considerably in performance characteristics. For example:
The get(int) operation on an ArrayList takes constant time, but it takes time proportional to the length of the list for a LinkedList.
Removing an element via the Iterator.remove() takes constant time for a LinkedList, but it takes time proportional to the length of the list for an ArrayList.
The HashMap and THashMap are both implementations of the Map abstraction that are use hash tables. The difference is in the form of hash table data structure used in each case. The HashMap class uses closed addressing which means that each bucket in the table points to a separate linked list of elements. The THashMap class uses open addressing which means that elements that hash to the same bucket are stored in the table itself. The net result is that THashMap uses less memory and is faster than HashMap for most operations, but is much slower if you need the map's set of key/value pairs.
For more detail, read a good textbook on data structures. Failing that, look up the concepts in Wikipedia. Finally, take a look at the source code of the respective classes.

Read the API docs for the classes you have mentioned. The collections tutorial also explains the differences fairly well.
java.util.TreeMap is based on a red-black tree.

Regarding the lists:
Both comply with the List interface, but their implementation is different, and they differ in the efficiency of some of their operations.
ArrayList is a list stored internally as an array. It has the advantage of random access, but a single item addition is not guaranteed to run in constant time. Also, removal of items is inefficient.
A LinkedList is implemented as a doubly connected linked list. It does not support random access, but removing an item while iterating through it is efficient.

As I remember, both (LinkedList and ArrayList) are the lists. But they have defferent inner realization.

What is the need of collection framework in java?

What is the need of Collection framework in Java since all the data operations(sorting/adding/deleting) are possible with Arrays and moreover array is suitable for memory consumption and performance is also better compared with Collections.
Can anyone point me a real time data oriented example which shows the difference in both(array/Collections) of these implementations.

Arrays are not resizable.
Java Collections Framework provides lots of different useful data types, such as linked lists (allows insertion anywhere in constant time), resizeable array lists (like Vector but cooler), red-black trees, hash-based maps (like Hashtable but cooler).
Java Collections Framework provides abstractions, so you can refer to a list as a List, whether backed by an array list or a linked list; and you can refer to a map/dictionary as a Map, whether backed by a red-black tree or a hashtable.
In other words, Java Collections Framework allows you to use the right data structure, because one size does not fit all.

Several reasons:
Java's collection classes provides a higher level interface than arrays.
Arrays have a fixed size. Collections (see ArrayList) have a flexible size.
Efficiently implementing a complicated data structures (e.g., hash tables) on top of raw arrays is a demanding task. The standard HashMap gives you that for free.
There are different implementation you can choose from for the same set of services: ArrayList vs. LinkedList, HashMap vs. TreeMap, synchronized, etc.
Finally, arrays allow covariance: setting an element of an array is not guaranteed to succeed due to typing errors that are detectable only at run time. Generics prevent this problem in arrays.
Take a look at this fragment that illustrates the covariance problem:
String[] strings = new String[10];
Object[] objects = strings;
objects[0] = new Date(); // <- ArrayStoreException: java.util.Date

Collection classes like Set, List, and Map implementations are closer to the "problem space." They allow developers to complete work more quickly and turn in more readable/maintainable code.

For each class in the Collections API there's a different answer to your question. Here are a few examples.
LinkedList: If you remove an element from the middle of an array, you pay the cost of moving all of the elements to the right of the removed element. Not so with a linked list.
Set: If you try to implement a set with an array, adding an element or testing for an element's presence is O(N). With a HashSet, it's O(1).
Map: To implement a map using an array would give the same performance characteristics as your putative array implementation of a set.

It depends upon your application's needs. There are so many types of collections, including:
HashSet
ArrayList
HashMap
TreeSet
TreeMap
LinkedList
So for example, if you need to store key/value pairs, you will have to write a lot of custom code if it will be based off an array - whereas the Hash* collections should just work out of the box. As always, pick the right tool for the job.

Well the basic premise is "wrong" since Java included the Dictionary class since before interfaces existed in the language...
collections offer Lists which are somewhat similar to arrays, but they offer many more things that are not. I'll assume you were just talking about List (and even Set) and leave Map out of it.
Yes, it is possible to get the same functionality as List and Set with an array, however there is a lot of work involved. The whole point of a library is that users do not have to "roll their own" implementations of common things.
Once you have a single implementation that everyone uses it is easier to justify spending resources optimizing it as well. That means when the standard collections are sped up or have their memory footprint reduced that all applications using them get the improvements for free.
A single interface for each thing also simplifies every developers learning curve - there are not umpteen different ways of doing the same thing.
If you wanted to have an array that grows over time you would probably not put the growth code all over your classes, but would instead write a single utility method to do that. Same for deletion and insertion etc...
Also, arrays are not well suited to insertion/deletion, especially when you expect that the .length member is supposed to reflect the actual number of contents, so you would spend a huge amount of time growing and shrinking the array. Arrays are also not well suited for Sets as you would have to iterate over the entire array each time you wanted to do an insertion to check for duplicates. That would kill any perceived efficiency.

Arrays are not efficient always. What if you need something like LinkedList? Looks like you need to learn some data structure : http://en.wikipedia.org/wiki/List_of_data_structures

Java Collections came up with different functionality,usability and convenience.
When in an application we want to work on group of Objects, Only ARRAY can not help us,Or rather they might leads to do things with some cumbersome operations.
One important difference, is one of usability and convenience, especially given that Collections automatically expand in size when needed:
Collections came up with methods to simplify our work.
Each one has a unique feature:
List- Essentially a variable-size array;
You can usually add/remove items at any arbitrary position;
The order of the items is well defined (i.e. you can say what position a given item goes in in the list).
Used- Most cases where you just need to store or iterate through a "bunch of things" and later iterate through them.
Set- Things can be "there or not"— when you add items to a set, there's no notion of how many times the item was added, and usually no notion of ordering.
Used- Remembering "which items you've already processed", e.g. when doing a web crawl;
Making other yes-no decisions about an item, e.g. "is the item a word of English", "is the item in the database?" , "is the item in this category?" etc.
Here you find use of each collection as per scenario:

Collection is the framework in Java and you know that framework is very easy to use rather than implementing and then use it and your concern is that why we don't use the array there are drawbacks of array like it is static you have to define the size of row at least in beginning, so if your array is large then it would result primarily in wastage of large memory.
So you can prefer ArrayList over it which is inside the collection hierarchy.
Complexity is other issue like you want to insert in array then you have to trace it upto define index so over it you can use LinkedList all functions are implemented only you need to use and became your code less complex and you can read there are various advantages of collection hierarchy.

Collection framework are much higher level compared to Arrays and provides important interfaces and classes that by using them we can manage groups of objects with a much sophisticated way with many methods already given by the specific collection.
For example:
ArrayList - It's like a dynamic array i.e. we don't need to declare its size, it grows as we add elements to it and it shrinks as we remove elements from it, during the runtime of the program.
LinkedList - It can be used to depict a Queue(FIFO) or even a Stack(LIFO).
HashSet - It stores its element by a process called hashing. The order of elements in HashSet is not guaranteed.
TreeSet - TreeSet is the best candidate when one needs to store a large number of sorted elements and their fast access.
ArrayDeque - It can also be used to implement a first-in, first-out(FIFO) queue or a last-in, first-out(LIFO) queue.
HashMap - HashMap stores the data in the form of key-value pairs, where key and value are objects.
Treemap - TreeMap stores key-value pairs in a sorted ascending order and retrieval speed of an element out of a TreeMap is quite fast.
To learn more about Java collections, check out this article.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.