Reasons for using a Bag in Java

Reasons for using a Bag in Java - java

I am currently studying about Algorithms & Data Structures and while I was reading over the Book of Algorithms 4th edition, I discovered the Bag data-structure together with the Stack and Queue.
After reading the the explanation of it, it is still unclear to me why would I prefer using a Bag (which has no remove() method) over other data-structures such as Stack, Queue, LinkedList or a Set?
As far as I can understand from the Book, the implementation of a Bag, is the same as for a Stack, just replacing the name of push() to add() and remove the pop() method.
So the idea of a Bag is basically having the ability to collect items and then iterate through the collected items, check if a bag is empty and find the number of items in it.
But under which circumstances I would better using a Bag over one of the mentioned above Collections? And why a Bag doesn't have a remove() method basically? is there a specific reason for it?
Thanks in advance.

Stack is ADT of the collection of elements with specific remove order = LIFO (last-in-first-out), allows duplicates,
Queue is ADT of the collection of elements with specific remove order = FIFO (first-in-first-out), allows duplicates,
LinkedList is implementation of the list,
Set is ADT of the collection of elements which disallows duplicates,
Bag is ADT of the collection of elements which allows duplicates.
In general, anything that holds an elements is Collection.
Any collection which allows duplicates is Bag, otherwise it is Set.
Any bag which access elements via index is List.
Bag which appends new element after the last one and has a method to remove element from the head (first index) is Queue.
Bag which appends new element after the last one and has a method to remove element from the tail (last index) is Stack.
Example: In Java, LinkedList is a collection, bag, list, queue and also you can work with it as it was a stack since it support stack operations (add~addLast~push, peekLast, removeLast~pop), so you can call it also stack. The reason, why it does not implement Stack interface is, that peek method is reserved by Queue implementation which retrieves the head of the list (first element). Therefore in case of LinkedList, the "stack methods" are derived from Deque.
Whether Bag contains remove(Object) or not may depend on the implementation e. g. you can implement your own Bag type which supports this operation. Also you can implement get(int) operation to access object on specified index. Time complexity of the get(int) would depend on your implementation e. g. one can implement Bag via linked-list so the complexity would be at average O(n/2), other one via resizable array (array-list) with direct access to the element via index, so the complexity would be O(1).
But the main idea of the Bag is, that it allows duplicates and iteration through this collection. Whether it supports another useful operations depends on implementator's design decision.
Which one of the collection type to use dependes on your needs, if duplicates are not desired, you would use Set instead of Bag. Moreover, if you care about remove order you would pick Stack or Queue which are basically Bags with specific remove order. You can think of Bag as super-type of the Stack and Queue which extends its api by specific operations.
Most of the time, you just need to collect objects and process them in some way (iteration + element processing). So you will use the most simple Bag implementation which is one directional linked-list.

Bag is an unordered collection of values that may have duplicates. When comparing a stack to a bag, the first difference is that for stacks,
order matters.
Bag only supports the add and iterate operations. You cannot remove items from a bag-it’s possible to remove elements from a stack.-. After checking if the container is actually empty, clients can iterate through its elements; since the actual order is unspecified by definition, clients must not rely on it.
Bags are useful when you need to collect objects and process them as a whole set rather than individually. For example, you could collect samples and then, later, compute statistics on them, such as average or standard deviation—the order is
irrelevant in that case.
in terms of priority queues, a bag is a Priority queue for which element
removal (top()-Returns and extracts the element with the highest priority. ) is disabled. Priority Queue api has, top, peek,insert,remove and update methods. it’s possible to peek one element at a time, and the
priority of each element is given by a random number from a uniform distribution. Priorities also change at every iteration.

Related

Why don't we count linear search cost as a prerequisite bottleneck for the insertion operation of a linked list, compared to ArrayList?

I have had this question for a while but I have been unsatisfied with the answers because the distinctions appear to be arbitrary and more like conventional wisdom that is sort of blindly accepted rather than assessed critically.
In an ArrayList it is said that insertion cost (for a single element) is linear. If we are inserting at index p for 0 <= p < n where n is the size of the list, then the remaining n-p elements are shifted over first before the new element is copied into position p.
In a LinkedList, it is said that insertion cost (for a single element) is constant. For instance if we already have a node and we want to insert after it, we rearrange some pointers and it's done quickly. But getting this node in the first place, I don't see how it can be done other than a linear search first (assuming it isn't a trivial case like prepending at the start of the list or appending at the end).
And yet in the case of the LinkedList, we don't count that initial search time. To me this is confusing because it's sort of like saying "The ice cream is free... after you pay for it." It's like, well, of course it is... but that sort of skips the hard part of paying for it. Of course inserting in a LinkedList is going to be constant time if you already have the node you want, but getting that node in the first place may take some extra time! I could easily say that inserting in an ArrayList is constant time... after I move the remaining n-p elements.
So I don't understand why this distinction is made for one but not the other. You could argue that insertion is considered constant for LinkedLists because of the cases where you insert at the front or back where linear time operations are not required, whereas in an ArrayList, insertion requires copying of the suffix array after position p, but I could easily counter that by saying if we insert at the back of an ArrayList, it is amortized constant time and doesn't require extra copying in most cases unless we reach capacity.
In other words we separate the linear stuff from the constant stuff for LinkedList, but we don't separate them for the ArrayList, even though in both cases, the linear operations may not be invoked or not invoked.
So why do we consider them separate for LinkedList and not for ArrayList? Or are they only being defined here in the context where LinkedList is overwhelmingly used for head/tail appends and prepends as opposed to elements in the middle?

This is basically a limitation of the Java interface for List and LinkedList, rather than a fundamental limitation of linked lists. That is, in Java there is no convenient concept of "a pointer to a list node".
Every type of list has a few different concepts loosely associated with the idea of pointing to a particular item:
The idea of a "reference" to a specific item in a list
The integer position of an item in the list
The value of a item that may be in the list (possibly multiple times)
The most general concept is the first one, and is usually encapsulated in the idea of an iterator. As it happens, the simple way to implement an iterator for an array backed list is simply to wrap an integer which refers to the position of the item in a list. So for array lists only, the first and second ways of referring to items are pretty tightly bound.
For other list types, however, and even for most other container types (trees, hashes, etc) that is not the case. The generic reference to an item is usually something like a pointer to the wrapper structure around one item (e.g., HashMap.Entry or LinkedList.Entry). For these structures the idea of accessing the nth element isn't necessary natural or even possible (e.g., unordered collections like sets and many hash maps).
Perhaps unfortunately, Java made the idea of getting an item by its index a first-class operation. Many of the operations directly on List objects are implemented in terms of list indexes: remove(int index), add(int index, ...), get(int index), etc. So it's kind of natural to think of those operations as being the fundamental ones.
For LinkedList though it's more fundamental to use a pointer to a node to refer to an object. Rather than passing around a list index, you'd pass around the pointer. After inserting an element, you'd get a pointer to the element.
In C++ this concept is embodied in the concept of the iterator, which is the first class way to refer to items in collections, including lists. So does such a "pointer" exist in Java? It sure does - it's the Iterator object! Usually you think of an Iterator as being for iteration, but you can also think of it as pointing to a particular object.
So the key observation is: given an pointer (iterator) to an object, you can remove and add from linked lists in constant time, but from an array-like list this takes linear time in general. There is no inherent need to search for an object before deleting it: there are plenty of scenarios where you can maintain or take as input such a reference, or where you are processing the entire list, and here the constant time deletion of linked lists does change the algorithmic complexity.
Of course, if you need to do something like delete the first entry containing the value "foo" that implies both a search and a delete operation. Both array-based and linked lists taken O(n) for search, so they don't vary here - but you can meaningfully separate the search and delete operations.
So you could, in principle, pass around Iterator objects rather than list indexes or object values - at least if your use case supports it. However, at the top I said that "Java has no convenient notion of a pointer to a list node". Why?
Well because actually using Iterator is actually very inconvenient. First of all, it's tough to get an Iterator to an object in the first place: for example, and unlike C++, the add() methods don't return an Iterator - so to get a pointer to the item you just added, you need to go ahead and iterate over the list or use the listIterator(int index) call, which is inherently inefficient for linked lists. Many methods (e.g., subList()) support only a version that takes indexes, but not Iterators - even when such a method could be efficiently supported.
Add to that the restrictions around iterator invalidation when the list is modified, and they actually become pretty useless for referring to elements except in immutable lists.
So Java's support of pointers to list elements is pretty half-hearted an so it's tough to leverage the constant time operations that linked list offers, except in cases such as adding to the front of a list, or deleting items during iteration.
It's not limited to lists, either - the ConcurrentQueue is also a linked structure which supports constant time deletes, but you can't reliably use that ability from Java.

If you're using a LinkedList, chances are you're not going to use it for a random access insert. LinkedList offers constant time for push (insert at the beginning) or add (because it has a ref to the final element IIRC). You are correct in your suspicion that an insert into a random index (e.g. insert sorted) will take linear time - not constant.
ArrayList, by contrast, is worst case linear. Most of the time it simply does an arraycopy to shift the indices (which is a low-level shift that is constant time). Only when you need to resize the backing array will it take linear time.

Which Java Collection should I use?

In this question How can I efficiently select a Standard Library container in C++11? is a handy flow chart to use when choosing C++ collections.
I thought that this was a useful resource for people who are not sure which collection they should be using so I tried to find a similar flow chart for Java and was not able to do so.
What resources and "cheat sheets" are available to help people choose the right Collection to use when programming in Java? How do people know what List, Set and Map implementations they should use?

Since I couldn't find a similar flowchart I decided to make one myself.
This flow chart does not try and cover things like synchronized access, thread safety etc or the legacy collections, but it does cover the 3 standard Sets, 3 standard Maps and 2 standard Lists.
This image was created for this answer and is licensed under a Creative Commons Attribution 4.0 International License. The simplest attribution is by linking to either this question or this answer.
Other resources
Probably the most useful other reference is the following page from the oracle documentation which describes each Collection.
HashSet vs TreeSet
There is a detailed discussion of when to use HashSet or TreeSet here:
Hashset vs Treeset
ArrayList vs LinkedList
Detailed discussion: When to use LinkedList over ArrayList?

Summary of the major non-concurrent, non-synchronized collections
Collection: An interface representing an unordered "bag" of items, called "elements". The "next" element is undefined (random).
Set: An interface representing a Collection with no duplicates.
HashSet: A Set backed by a Hashtable. Fastest and smallest memory usage, when ordering is unimportant.
LinkedHashSet: A HashSet with the addition of a linked list to associate elements in insertion order. The "next" element is the next-most-recently inserted element.
TreeSet: A Set where elements are ordered by a Comparator (typically natural ordering). Slowest and largest memory usage, but necessary for comparator-based ordering.
EnumSet: An extremely fast and efficient Set customized for a single enum type.
List: An interface representing a Collection whose elements are ordered and each have a numeric index representing its position, where zero is the first element, and (length - 1) is the last.
ArrayList: A List backed by an array, where the array has a length (called "capacity") that is at least as large as the number of elements (the list's "size"). When size exceeds capacity (when the (capacity + 1)-th element is added), the array is recreated with a new capacity of (new length * 1.5)--this recreation is fast, since it uses System.arrayCopy(). Deleting and inserting/adding elements requires all neighboring elements (to the right) be shifted into or out of that space. Accessing any element is fast, as it only requires the calculation (element-zero-address + desired-index * element-size) to find it's location. In most situations, an ArrayList is preferred over a LinkedList.
LinkedList: A List backed by a set of objects, each linked to its "previous" and "next" neighbors. A LinkedList is also a Queue and Deque. Accessing elements is done starting at the first or last element, and traversing until the desired index is reached. Insertion and deletion, once the desired index is reached via traversal is a trivial matter of re-mapping only the immediate-neighbor links to point to the new element or bypass the now-deleted element.
Map: An interface representing an Collection where each element has an identifying "key"--each element is a key-value pair.
HashMap: A Map where keys are unordered, and backed by a Hashtable.
LinkedhashMap: Keys are ordered by insertion order.
TreeMap: A Map where keys are ordered by a Comparator (typically natural ordering).
Queue: An interface that represents a Collection where elements are, typically, added to one end, and removed from the other (FIFO: first-in, first-out).
Stack: An interface that represents a Collection where elements are, typically, both added (pushed) and removed (popped) from the same end (LIFO: last-in, first-out).
Deque: Short for "double ended queue", usually pronounced "deck". A linked list that is typically only added to and read from either end (not the middle).
Basic collection diagrams:
Comparing the insertion of an element with an ArrayList and LinkedList:

Even simpler picture is here. Intentionally simplified!
Collection is anything holding data called "elements" (of the same type). Nothing more specific is assumed.
List is an indexed collection of data where each element has an index. Something like the array, but more flexible.
Data in the list keep the order of insertion.
Typical operation: get the n-th element.
Set is a bag of elements, each elements just once (the elements are distinguished using their equals() method.
Data in the set are stored mostly just to know what data are there.
Typical operation: tell if an element is present in the list.
Map is something like the List, but instead of accessing the elements by their integer index, you access them by their key, which is any object. Like the array in PHP :)
Data in Map are searchable by their key.
Typical operation: get an element by its ID (where ID is of any type, not only int as in case of List).
The differences
Set vs. Map: in Set you search data by themselves, whilst in Map by their key.
N.B. The standard library Sets are indeed implemented exactly like this: a map where the keys are the Set elements themselves, and with a dummy value.
List vs. Map: in List you access elements by their int index (position in List), whilst in Map by their key which os of any type (typically: ID)
List vs. Set: in List the elements are bound by their position and can be duplicate, whilst in Set the elements are just "present" (or not present) and are unique (in the meaning of equals(), or compareTo() for SortedSet)

It is simple: if you need to store values with keys mapped to them go for the Map interface, otherwise use List for values which may be duplicated and finally use the Set interface if you don’t want duplicated values in your collection.
Here is the complete explanation http://javatutorial.net/choose-the-right-java-collection , including flowchart etc

Map
If choosing a Map, I made this table summarizing the features of each of the ten implementations bundled with Java 11.

Common collections, Common collections

How does java implement the conversion of LinkedList to ArrayList in Java?

I am implementing a public method that needs a data structure that needs to be able to handle insertion at two ends. Since ArrayList.add(0,key) will take O(N) time, I decide to use a LinkedList instead - the add and addFirst methods should both take O(1) time.
However, in order to work with existing API, my method needs to return an ArrayList.
So I have two approaches:
(1) use LinkedList,
do all the addition of N elements where N/2 will be added to the front and N/2 will be added to the end.
Then convert this LinkedList to ArrayList by calling the ArrayList constructor:
return new ArrayList<key>(myLinkedList);
(2) use ArrayList and call ArrayList.add(key) to add N/2 elements to the back and call ArrayList.add(0,key) to add N/2 elements to the front. Return this ArrayList.
Can anyone comment on which option is more optimized in terms of time complexity? I am not sure how Java implements the constructor of ArrayList - which is the key factor that decides which option is better.
thanks.

The first method iterates across the list:
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/ArrayList.html#ArrayList(java.util.Collection)
Constructs a list containing the elements of the specified collection, in the order they are returned by the collection's iterator.
which, you can reasonably infer, uses the iterator interface.
The second method will shift elements every time you add to the front (and resize every once in a while):
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/ArrayList.html#add(int, E)
Inserts the specified element at the specified position in this list. Shifts the element currently at that position (if any) and any subsequent elements to the right (adds one to their indices).
Given the official assumptions regarding the functions, the first method is more efficient.
FYI: you may get more mileage using LinkedList.toArray

I would suggest that you use an ArrayDeque which is faster than a LinkedList to insert elements at two ends and consumes less memory. Then convert it to an ArrayList using method #1.

Traversing which object type is the fastest in java

I was wondering which Java collection types are traversed fastest. Collections I am most interested in are...
array
LinkedList
Queue
PriorityLinkedList
HashMap

Actually among concrete classes of Collection interface , traversing will be fast through array. Its because as you know it traverse with the index of the element.Since it follows the index pattern so,traversing through index it makes our traversing fast. Why not others? Let me explain one by one..
1.LinkedList : LinkedList follows the insertion order.If you traverse the data and searching for elements,for every element it will search from beginning. So traversing becomes slow.
2.Queue : LinkedList and PriorityQueue are two concrete classes of Queue. The elements of the priority queue are ordered according to their natural ordering, or by a Comparator provided at queue construction time, depending on which constructor is used.It's not guaranteed to traverse the elements of the priority queue in any particular order.If you need ordered traversal, consider using Arrays.sort(pq.toArray()). So it becomes useless for traversing provided if you traverse without sorting it explicitly.
3.HashMap: If you use Map instead of Collection , traversing is not guaranteed here because it works on hashcode of the key element. So here again traversing becomes useless. You can directly search the element by providing key-value of the element.
4.PriorityLinkedList: This class does not exist in Java APIs.

When do you know when to use a TreeSet or LinkedList?

What are the advantages of each structure?
In my program I will be performing these steps and I was wondering which data structure above I should be using:
Taking in an unsorted array and
adding them to a sorted structure1.
Traversing through sorted data and removing the right one
Adding data (never removing) and returning that structure as an array

When do you know when to use a TreeSet or LinkedList? What are the advantages of each structure?
In general, you decide on a collection type based on the structural and performance properties that you need it to have. For instance, a TreeSet is a Set, and therefore does not allow duplicates and does not preserve insertion order of elements. By contrast a LinkedList is a List and therefore does allow duplicates and does preserve insertion order. On the performance side, TreeSet gives you O(logN) insertion and deletion, whereas LinkedList gives O(1) insertion at the beginning or end, and O(N) insertion at a selected position or deletion.
The details are all spelled out in the respective class and interface javadocs, but a useful summary may be found in the Java Collections Cheatsheet.
In practice though, the choice of collection type is intimately connected to algorithm design. The two need to be done in parallel. (It is no good deciding that your algorithm requires a collection with properties X, Y and Z, and then discovering that no such collection type exists.)
In your use-case, it looks like TreeSet would be a better fit. There is no efficient way (i.e. better than O(N^2)) to sort a large LinkedList that doesn't involve turning it into some other data structure to do the sorting. There is no efficient way (i.e. better than O(N)) to insert an element into the correct position in a previously sorted LinkedList. The third part (copying to an array) works equally well with a LinkedList or TreeSet; it is an O(N) operation in both cases.
[I'm assuming that the collections are large enough that the big O complexity predicts the actual performance accurately ... ]

The genuine power and advantage of TreeSet lies in interface it realizes - NavigableSet
Why is it so powerfull and in which case?
Navigable Set interface add for example these 3 nice methods:
headSet(E toElement, boolean inclusive)
tailSet(E fromElement, boolean inclusive)
subSet(E fromElement, boolean fromInclusive, E toElement, boolean toInclusive)
These methods allow to organize effective search algorithm(very fast).
Example: we need to find all the names which start with Milla and end with Wladimir:
TreeSet<String> authors = new TreeSet<String>();
authors.add("Andreas Gryphius");
authors.add("Fjodor Michailowitsch Dostojewski");
authors.add("Alexander Puschkin");
authors.add("Ruslana Lyzhichko");
authors.add("Wladimir Klitschko");
authors.add("Andrij Schewtschenko");
authors.add("Wayne Gretzky");
authors.add("Johann Jakob Christoffel");
authors.add("Milla Jovovich");
authors.add("Taras Schewtschenko");
System.out.println(authors.subSet("Milla", "Wladimir"));
output:
[Milla Jovovich, Ruslana Lyzhichko, Taras Schewtschenko, Wayne Gretzky]
TreeSet doesn't go over all the elements, it finds first and last elemenets and returns a new Collection with all the elements in the range.

TreeSet:
TreeSet uses Red-Black tree underlying. So the set could be thought as a dynamic search tree. When you need a structure which is operated read/write frequently and also should keep order, the TreeSet is a good choice.

If you want to keep it sorted and it's append-mostly, TreeSet with a Comparator is your best bet. The JVM would have to traverse the LinkedList from the beginning to decide where to place an item. LinkedList = O(n) for any operations, TreeSet = O(log(n)) for basic stuff.

The most important point when choosing a data structure are its inherent limitations. For example if you use TreeSet to store objects and during run-time your algorithm changes attributes of these objects which affect equal comparisons while the object is an element of the set, get ready for some strange bugs.
The Java Doc for Set interface state that:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
Interface Set Java Doc

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.