Retrieve Least Element, Elements are Dynamically-Ordered

Retrieve Least Element, Elements are Dynamically-Ordered - java

I have collection of elements from which I need to retrieve the least/minimum element.
Normally I would use a PriorityQueue as they are designed specifically for this purpose, and offer O(log(n)) time for dequeing methods.
However, the elements in my array have a dynamic order, ie there natural order changes unpredictably over time. I assume PriorityQueue and other such Sorted collections sort an element when inserted, and then leave it. If this is so PriorityQueue wouldn't work for dynamically-ordered elements. Am I correct in my assumption? Or would PriorityQueue still be appropriate in this situation?
If I can't use PriorityQueue, Collections.min would be my next instinct. However this iterates over the entire collection, which presumably gives O(n) time. Is this the next best solution?
What is the best collection/method to use to retrieve the least element from a collection, given that the natural order of the elements may change unpredictably over time?
Edit:
The order of several elements changes per retrieval operation
Edit 2:
The compare algorithm remains constant, however the values of the fields which it assesses vary unpredictably between retrievals.

I think if the change is truly "unpredictable" you may be stuck with Collections.min(). However, maybe for some other collections like PriorityQueue you could try, before calling for the min.
Add something that you KNOW is the min.
Remove that
Then ask again for the "real" min and hope that your little kludge resorted things...
Alternatively, do you know if the order has changed over time? e.g. some OrderChangedEvent can be fired? If so, recreate the sorted whatever as needed.

A possible way to do this would be to extend PriorityQueue that contains a list as one of the fields. This list will store the java.lang.Object.hashCode() of each object. Whenever an add, peek, poll, offer, etc. is called on the PriorityQueue, the queue will check the hash codes of each element and make see if any element changed. If they have, it will re-order the elements that have changed. Then, it will replace the hashcodes of the changed elements in the list. I don't know how fast this will be, but I suspect it will be faster than O(n).

Without any further assumption on the operations you are going to do, you can't achieve better performance than with a PriorityQueue or another O(log(n))-insert collection (TreeSet , for example, but you lose the O(1)-peek).
As you correctly assumed Collections.min(Collection, Comparator) is a linear operation.
But it depends on how often you need to change the ordering: for example if you only need to change it once in a while and still keep a "standard" ordering, min() is a viable option, but if you need to switch ordering completely then you will probably be better off with reordering the queue/set (that is, traversing and adding all the elements in a new one), tough at a O(nlog(n)) cost. Using Collections.sort(List, Comparator) may be effective if you need a lot of reordering compared to inserts, but requires you to use a List.
Of course if you can make somewhat strong assumptions on the types of sorting you will need (for example, if it can be restricted to a part of the data) you could write your own collection.
Edit:
So you have a (more or less) finite number of orderings (never mind that it's the same type of comparison over different fields, it's different Comparators and that's what matters)? If that's the case, you can probably achieve best performance by using m queues that reference the same objects, each using a different comparator (the simplest method, really). This way you have:
constant time access
O(m*logn(n)) inserts (to insert in every queue)
O(m*n) removals (to remove from every queue)
no ordering costs (as it's handled by the inserts)
slightly larger memory cost (probably negligible)
additional O(n*log(n)) cost the first time a particolar ordering is requested
Supposing a value of m orders of magnitude smaller than n, this is comparable to optimal (single-ordering PriorityQueue) performance. For convenience, you can wrap this into a custom collection that takes a Comparator parameter on retrieval operations, and use it as a key for an HashMap of all the PriorityQueues.
Edit #2:
In that case, there is no better solution than running min() on every retrieval (unless you can make assumptions on the changes of the data); this also means that it's better to just use an ArrayList as the collection, since it has basically the lowest possible cost on every operation and you will not benefit from PriorityQueue's natural ordering anyway. You will end up with linear cost on retrieval (for min) and constant on insertion and deletion: this is optimal as there is no sorting algorithm that has less than Ω(n) and Θ(nlog n) anyway.
As a side note, ordered collections work on the assumption that values will not change after insertion; this is because there is no cost-effective way to monitor the changes nor to reorder them "in place".

Can't you use a java TreeSet which keeps the collection sorted at all times. You need to implement the Comparable interface on your objects to do so. Checkout http://docs.oracle.com/javase/1.4.2/docs/api/java/util/TreeSet.html

Related

Fast LinkedList search and delete in java

I am using Java's Linkedlist in my project. I have to build a delete function that removes an element with a specified unique id (id is a filed in my class) in the Linkedlist. As per the Java official document, were I to use LinkedList.remove, the runtime would be O(n) as the process happens in two steps, the first of which is a linear search with a runtime of O(n) followed by the actual delete which takes O(1).
In an attempt to speed things up, I wanted to use a binary tree for lookup, where each node in the tree is (id, reference to the node in the linkedlist). I am not exactly sure how to implement this in Java. In C/C++, one could just store a pointer as reference to the node in the linkedlist.
==
If you are wondering why I have to use LinkedList, it's because I am building an order-matching engine for exchanges. LinkedList offers superior runtime as far as insert is concerned. I am also using insertion sort to keep prices in the orderbook sorted. Priority queue does not suit my needs because I have to show the sorted order book in real time.

Have you seen the video of Stroustrup's conference talk where he showed that you should use std::vector unless you have measured a performance benefit of not using std::vector? He showed that std::vector is almost always the correct collection to use, and showed that it is faster than linked list even when inserting and deleting in the middle.
Now translate that to Java: use ArrayList unless you have measured better performance with something else.
Why is that? With modern processor architectures, there is a locality benefit: elements that you compare together, elements that you process together, are all stored next to each other in memory and are likely to be in the CPU's cache at the same time. This allows them to be fetched and written to much faster than when they're in main memory. This is not the case with a linked list, where elements are allocated individually and spread all over the place. (This locality benefit is much more pronounced in C++ where you have the actual objects next to each other, but it's still valid to a smaller extent in Java, where you have the references next to each other, albeit not the actual objects.)
Now with ArrayList, you can keep the orders sorted by price, and use binary search to insert an order in the right place.
If your performance measurement shows that LinkedList is preferable, then unfortunately Java doesn't give you access to the internal representation – the actual nodes – of the LinkedList, so you'll have to homebrew your own list.

Why are you using a List?
If you have a unique id for each object, why not put it in a Map with the id as the key? If you choose a HashMap is implementation removal is O(1). If you implement using LinkedHashMap you can preserve insertion order as well.
LinkedList insertion is superior to....what?
HashMap get/put complexity

You can easily solve this by having a small change.
First have an object that has your value and id as fields
class MyElement implements Comparable{
int id,value;
//Implement compareTo() to sort based on values
//Override equals() method to compare ids
//Override hashcode() to return the id
}
Now use a TreeSet to store these objects.
In this data structure the incoming objects get sorted and deletion and insertion also find lower time complexity of O(log n)

To preserve order by id and to get good performance use TreeMap. Put, remove and get operations will be O(log n).
EDIT:
For preserving order of insertion of elements for each id you can use TreeMap < Integer, ArrayList < T > >, i.e. for each id you can save elements with particular id in list in order of insertion.

Java - Most efficient matching method

Assuming one needs to store a list of items, but it can be stored in any variable type; what would be the most efficient type, if used mostly for matching?
To clarify, a list of items needs to be contained, but the form it's contained in doesn't matter (enum, list, hashmap, Arraylist, etc..)
This list of items would be matched against on a regular basis, but not edited. What would the most efficient storage method be, assuming you only need to write to the list once, but could be matching multiple times per second?
Note: No multi-threading

A HashSet (and HashMap) offers O(1) complexity. Also note that you should create a large enough HashSet with small loadfactor which means that after a hashcode check the elements in the result bucket will also be found very quickly (in a bucket there is a sequential search). Optimally each bucket should contain 1 element at the most.
You can read more about the concept of capacity and load factor in the Javadoc of HashMap.
An even faster solution would be if the number of items is no more than 64 is to create an Enum for them and use EnumSet or EnumMap which stores the elements in a long and uses simple and very fast bit operations to test if an element is in the set or map (a contains operation is just a simple bitmask test).
If you choose to go with the HashSet and not with the Enum approach, know that HashSet uses the hashCode() and equals() methods of the elements. You might consider overriding them to provide a faster implementation knowing the internals of the items you wish to store.
A trivial optimization of overriding the hashCode() can be for example to cache a once computed hash code in the item itself if it doesn't change (and subsequent calls to hashCode() should just return the cached value).

From your description it seems that order doesn't matter. If this is so, use a Set. Java's standard implementation is the HashSet.

Most efficient for repeated lookup would almost certainly be an EnumSet
... Enum sets are represented internally as bit vectors. This representation is extremely compact and efficient. The space and time performance of this class should be good enough to allow its use as a high-quality, typesafe alternative to traditional int-based "bit flags." Even bulk operations (such as containsAll and retainAll) should run very quickly if their argument is also an enum set.
...
Implementation note: All basic operations execute in constant time. They are likely (though not guaranteed) to be much faster than their HashSet counterparts. Even bulk operations execute in constant time if their argument is also an enum set.

Java collection insertion: Set vs. List

I'm thinking about filling a collection with a large amount of unique objects.
How is the cost of an insert in a Set (say HashSet) compared to an List (say ArrayList)?
My feeling is that duplicate elimination in sets might cause a slight overhead.

There is no "duplicate elimination" such as comparing to all existing elements. If you insert into hash set, it's really a dictionary of items by hash code. There's no duplicate checking unless there already are items with the same hash code. Given a reasonable (well-distributed) hash function, it's not that bad.
As Will has noted, because of the dictionary structure HashSet is probably a bit slower than an ArrayList (unless you want to insert "between" existing elements). It also is a bit larger. I'm not sure that's a significant difference though.

You're right: set structures are inherently more complex in order to recognize and eliminate duplicates. Whether this overhead is significant for your case should be tested with a benchmark.
Another factor is memory usage. If your objects are very small, the memory overhead introduced by the set structure can be significant. In the most extreme case (TreeSet<Integer> vs. ArrayList<Integer>) the set structure can require more than 10 times as much memory.

If you're certain your data will be unique, use a List. You can use a Set to enforce this rule.
Sets are faster than Lists if you have a large data set, while the inverse is true for smaller data sets. I haven't personally tested this claim.
Which type of List?
Also, consider which List to use. LinkedLists are faster at adding, removing elements.
ArrayLists are faster at random access (for loops, etc), but this can be worked around using the Iterator of a LinkedList. ArrayLists are are much faster at: list.toArray().

You have to compare concrete implementations (for example HashSet with ArrayList), because the abstract interfaces Set/List don't really tell you anything about performance.
Inserting into a HashSet is a pretty cheap operation, as long as the hashCode() of the object to be inserted is sane. It will still be slightly slower than ArrayList, because it's insertion is a simple insertion into an array (assuming you insert in the end and there's still free space; I don't factor in resizing the internal array, because the same cost applies to HashSet as well).

If the goal is the uniqueness of the elements, you should use an implementation of the java.util.Set interface. The class java.util.HashSet and java.util.LinkedHashSet have O(alpha) (close to O(1) in the best case) complexity for insert, delete and contains check.
ArrayList have O(n) for object (not index) contains check (you have to scroll through the whole list) and insertion (if the insertion is not in tail of the list, you have to shift the whole underline array).
You can use LinkedHashSet that preserve the order of insertion and have the same potentiality of HashSet (takes up only a bit more of memory).

I don't think you can make this judgement simply on the cost of building the collection. Other things that you need to take into account are:
Is the input dataset ordered? Is there a requirement that the output data structure preserves insertion order?
Is there a requirement that the output data structure is ordered (or reordered) based on element values?
Will the output data structure be subsequently modified? How?
Is there a requirement that the output data structure is duplicate free if other elements are added subsequently?
Do you know how many elements are likely to be in the input dataset?
Can you measure the size of the input dataset? (Or is it provided via an iterator?)
Does space utilization matter?
These can all effect your choice of data structure.

Java List:
If you don't have such requirement that you have to keep duplicate or not. Then you can use List instead of Set.
List is an interface in Collection framework. Which extends Collection interface. and ArrayList, LinkedList is the implementation of List interface.
When to use ArrayList or LinkedList
ArrayList: If you have such requirement that in your application mostly work is accessing the data. Then you should go for ArrayList. because ArrayList implements RtandomAccess interface which is Marker Interface. because of Marker interface ArrayList have capability to access the data in O(1) time. and you can use ArrayList over LinkedList where you want to get data according to insertion order.
LinkedList: If you have such requirement that your mostly work is insertion or deletion. Then you should use LinkedList over the ArrayList. because in LinkedList insertion and deletion happen in O(1) time whereas in ArrayList it's O(n) time.
Java Set:
If you have requirement in your application that you don't want any duplicates. Then you should go for Set instead of List. Because Set doesn't store any duplicates. Because Set works on the principle of Hashing. If we add object in Set then first it checks object's hashCode in the bucket if it's find any hashCode present in it's bucked then it'll not add that object.

When do you know when to use a TreeSet or LinkedList?

What are the advantages of each structure?
In my program I will be performing these steps and I was wondering which data structure above I should be using:
Taking in an unsorted array and
adding them to a sorted structure1.
Traversing through sorted data and removing the right one
Adding data (never removing) and returning that structure as an array

When do you know when to use a TreeSet or LinkedList? What are the advantages of each structure?
In general, you decide on a collection type based on the structural and performance properties that you need it to have. For instance, a TreeSet is a Set, and therefore does not allow duplicates and does not preserve insertion order of elements. By contrast a LinkedList is a List and therefore does allow duplicates and does preserve insertion order. On the performance side, TreeSet gives you O(logN) insertion and deletion, whereas LinkedList gives O(1) insertion at the beginning or end, and O(N) insertion at a selected position or deletion.
The details are all spelled out in the respective class and interface javadocs, but a useful summary may be found in the Java Collections Cheatsheet.
In practice though, the choice of collection type is intimately connected to algorithm design. The two need to be done in parallel. (It is no good deciding that your algorithm requires a collection with properties X, Y and Z, and then discovering that no such collection type exists.)
In your use-case, it looks like TreeSet would be a better fit. There is no efficient way (i.e. better than O(N^2)) to sort a large LinkedList that doesn't involve turning it into some other data structure to do the sorting. There is no efficient way (i.e. better than O(N)) to insert an element into the correct position in a previously sorted LinkedList. The third part (copying to an array) works equally well with a LinkedList or TreeSet; it is an O(N) operation in both cases.
[I'm assuming that the collections are large enough that the big O complexity predicts the actual performance accurately ... ]

The genuine power and advantage of TreeSet lies in interface it realizes - NavigableSet
Why is it so powerfull and in which case?
Navigable Set interface add for example these 3 nice methods:
headSet(E toElement, boolean inclusive)
tailSet(E fromElement, boolean inclusive)
subSet(E fromElement, boolean fromInclusive, E toElement, boolean toInclusive)
These methods allow to organize effective search algorithm(very fast).
Example: we need to find all the names which start with Milla and end with Wladimir:
TreeSet<String> authors = new TreeSet<String>();
authors.add("Andreas Gryphius");
authors.add("Fjodor Michailowitsch Dostojewski");
authors.add("Alexander Puschkin");
authors.add("Ruslana Lyzhichko");
authors.add("Wladimir Klitschko");
authors.add("Andrij Schewtschenko");
authors.add("Wayne Gretzky");
authors.add("Johann Jakob Christoffel");
authors.add("Milla Jovovich");
authors.add("Taras Schewtschenko");
System.out.println(authors.subSet("Milla", "Wladimir"));
output:
[Milla Jovovich, Ruslana Lyzhichko, Taras Schewtschenko, Wayne Gretzky]
TreeSet doesn't go over all the elements, it finds first and last elemenets and returns a new Collection with all the elements in the range.

TreeSet:
TreeSet uses Red-Black tree underlying. So the set could be thought as a dynamic search tree. When you need a structure which is operated read/write frequently and also should keep order, the TreeSet is a good choice.

If you want to keep it sorted and it's append-mostly, TreeSet with a Comparator is your best bet. The JVM would have to traverse the LinkedList from the beginning to decide where to place an item. LinkedList = O(n) for any operations, TreeSet = O(log(n)) for basic stuff.

The most important point when choosing a data structure are its inherent limitations. For example if you use TreeSet to store objects and during run-time your algorithm changes attributes of these objects which affect equal comparisons while the object is an element of the set, get ready for some strange bugs.
The Java Doc for Set interface state that:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
Interface Set Java Doc

default Collection type

Assume you need to store/retrieve items in a Collection, don't care about ordering, and duplicates are allowed, what type of Collection do you use?
By default, I've always used ArrayList, but I remember reading/hearing somewhere that a Queue implementation may be a better choice. A List allows items to be added/retrieved/removed at arbitrary positions, which incurs a performance penalty. As a Queue does not provide this facility it should in theory be faster when this facility is not required.
I realise that all discussions about performance are somewhat meaningless, the only thing that really matters is measurement. Nevertheless, I'm interested to know what others use for a Collection, when they don't care about ordering, and duplicates are allowed, and why?

"It depends". The question you really need to answer first is "What do I want to use the collection for?"
If you often insert / remove items on one of the ends (beginning, end) a Queue will be better than a ArrayList. However in many cases you create a Collection in order to just read from it. In this case a ArrayList is far more efficient: As it is implemented as an array, you can iterate over it quite efficient (same applies for a LinkedList). However a LinkedList uses references to link single items together. So if you do not need random removals of items (in the middle), a ArrayList is better: An ArrayList will use less memory as the items don't need the storage for the reference to the next/prev item.
To sum it up:
ArrayList = good if you insert once and read often (random access or sequential)
LinkedList = good if you insert/remove often at random positions and read only sequential
ArrayDeque (java6 only) = good if you insert/remove at start/end and read random or sequential

As a default, I tend to prefer LinkedList to ArrayList. Obviously, I use them not through the List interface, but rather through the Collection interface.
Over the time, I've indeed found out that when I need a generic collection, it's more or less to put some things in, then iterate over it. If I need more evolved behaviour (say random access, sorting or unicity checks), I will then maybe change the used implementation, but before that I will change the used interface to the most appropriated. This way, I can ensure feature is provided before to concentrate on optimization and implementation.

ArrayList basicly contains an array inside (that's why it is called ArrayList). And operations like addd/remove at arbitrary positions are done in a straightforward way, so if you don't use them - there is no harm to performance.

If ordering and duplicates is not a problem and case is only for storing,
I use ArrayList, As it implements all the list operations. Never felt any performance issues with these operations (Never impacted my projects either). Actually using these operations have simple usage & I don't need to care how its managed internally.
Only if multiple threads will be accessing this list I use Vector because its methods are synchronized.
Also ArrayList and Vector are collections which you learn first :).

It depends on what you know about it.
If I have no clue, I tend to go for a linked list, since the penalty for adding/removing at the end is constant. If I have a rough idea of the maximum size of it, I go for an arraylist with the capacity specified, because it is faster if the estimation is good. If I really know the exact size I tend to go for a normal array; although that isn't really a collection type.

I realise that all discussions about performance are somewhat meaningless, the only thing that really matters is measurement.
That's not necessarily true.
If your knowledge of how the application is going to work tells you that certain collections are going to be very large, then it is a good idea to pick the right collection type. But the right collection type depends crucially on how the collections are going to be used; i.e. on the algorithms.
For example, if your application is likely to be dominated by testing if a collection holds a given object, the fact that Collection.contains(Object) is O(N) for both LinkedList<T> and ArrayList<T> might mean that neither is an appropriate collection type. Instead, maybe you should represent the collection as a HashMap<T, Integer>, where the Integer represents the number of occurrences of a T in the "collection". That will give you O(1) testing and removal, at the cost of more space overheads and slower (though still O(1)) insertion.
But the thing to stress is that if you are likely to be dealing with really large collections, there should be no such thing as a "default" collection type. You need to think about the collection in the context of the algorithms. (And the flip side is that if the collections are always going to be small, it probably makes little difference which collection type you pick.)

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.