ConcurrentSkipListSet and re-sorting (java) - java

I am using a ConcurrentSkipListSet that is accessed from multiple threads. The values used by the compareTo method of the underlying objects change over time. Because of this, I want to 'update' the ordering of the set (by re-sorting it, or something similar).
However, java.util.Collections.sort(list) doesn't work on it, and just rebuilding the set is probably too slow (and would break its thread-safety). Is there any other solution I should look at?
It does not have to lead to an optimal sort (which is near-impossible with concurrency and changing values anyway). Near-optimal would suffice, as long as any remove/add calls remain thread-safe (this would be a real issue if the set were rebuilt while sorting).

Every time you edit an item such that its sort order may change, you have to remove it from the set, then change the key, and then re-insert it.
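A minimal sketch of that remove, mutate, re-insert sequence, assuming a hypothetical Item class whose score field drives the ordering (neither name is from the question). Each individual set call is thread-safe, but the three steps together are not atomic, so other threads may briefly not see the item; the sketch also assumes only one thread updates a given item at a time.

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;

class ReinsertDemo {
    // Hypothetical element type: 'score' drives the ordering, 'id' breaks ties.
    static final class Item {
        final String id;
        volatile int score;

        Item(String id, int score) {
            this.id = id;
            this.score = score;
        }
    }

    static final Comparator<Item> BY_SCORE =
            Comparator.<Item>comparingInt(i -> i.score).thenComparing(i -> i.id);

    // Remove under the old key, mutate, then re-insert under the new key.
    // The removal must happen BEFORE the mutation, otherwise the set
    // searches in the wrong place and may not find the element.
    static void updateScore(ConcurrentSkipListSet<Item> set, Item item, int newScore) {
        boolean wasPresent = set.remove(item);
        item.score = newScore;
        if (wasPresent) {
            set.add(item);
        }
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<Item> set = new ConcurrentSkipListSet<>(BY_SCORE);
        Item a = new Item("a", 1);
        Item b = new Item("b", 2);
        set.add(a);
        set.add(b);
        updateScore(set, a, 5);
        System.out.println(set.first().id); // prints "b": 'a' now sorts last
    }
}
```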
Dr Cliff Click at Azul Systems has a very nice presentation of how they do lock-free hash-tables using tombstones and such. If you go towards writing your own skip-list/tree to make the reordering of an item into a single - and hopefully faster - op, then you might also go this lock-free route too. And be sure to share your results :)

These types of collections in the Java API do not support mutable elements (i.e. elements whose compareTo result changes while they are in the collection). As such, the only way to do it is to re-assemble a new collection in an atomic way, or, as Will suggests, to perform a remove, mutate and re-insert of the element.
HashSet has the same problem: the hash bucket is calculated on insertion of an object, so you won't be able to find the object with set.contains( ... ) if you later mutate it in a way that changes its hash code.
To be exact, collections like ConcurrentSkipListSet and HashSet perform their comparisons/hashing on insertion and removal. The only collections that 'support' mutable elements do not perform special insertion logic based on the state of the elements (e.g. an ArrayList).
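To illustrate the HashSet case, here is a small sketch with a hypothetical mutable Point class (not from the original posts). The Set contract leaves the behaviour unspecified, so the result shown is what the OpenJDK HashSet does in this particular case, not a guarantee.

```java
import java.util.HashSet;
import java.util.Set;

class MutableKeyDemo {
    // Hypothetical mutable element whose equals/hashCode depend on 'x'.
    static final class Point {
        int x;

        Point(int x) {
            this.x = x;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Point && ((Point) o).x == x;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(x);
        }
    }

    // Adds an element, mutates it in place, then looks it up again.
    static boolean foundAfterMutation() {
        Set<Point> set = new HashSet<>();
        Point p = new Point(1);
        set.add(p);          // stored in the bucket for hash code 1
        p.x = 2;             // hash code is now 2, but the bucket is unchanged
        return set.contains(p);
    }

    public static void main(String[] args) {
        // On OpenJDK this prints false: the lookup goes to the wrong bucket,
        // even though the object is still physically in the set.
        System.out.println(foundAfterMutation());
    }
}
```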
The documentation for the Set interface states:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
and the documentation for the SortedSet interface states:
Note that the ordering maintained by a sorted set (whether or not an explicit comparator is provided) must be consistent with equals if the sorted set is to correctly implement the Set interface. (See the Comparable interface or Comparator interface for a precise definition of consistent with equals.) This is so because the Set interface is defined in terms of the equals operation, but a sorted set performs all element comparisons using its compareTo (or compare) method, so two elements that are deemed equal by this method are, from the standpoint of the sorted set, equal. The behavior of a sorted set is well-defined even if its ordering is inconsistent with equals; it just fails to obey the general contract of the Set interface.

Related

To store unique element in a collection with natural order

While I was solving a Java test I came up with the following question:
You need to store elements in a collection that guarantees that no
duplicates are stored and all elements can be accessed in natural
order. Which interface provides that capability?
A. java.util.Map
B. java.util.Set
C. java.util.List
D. java.util.Collection
I have no idea what is the right case here? We can store the same element in any of these collections unless in a Set, but the Set doesn't provide the natural order. What's wrong?
The correct answer for that test is Set. Remember that it's asking for an interface that could provide that; given the right implementation, the Set interface could provide it.
The Map interface doesn't make any guarantees around what order things are stored, as that's implementation specific. However, if you use the right implementation (that is, TreeMap as spelled out by the docs), then you're guaranteed a natural ordering and no duplicate entries.
However, there's no requirement about key-value pairs.
The Set interface also doesn't make any guarantees around what order things are stored in, as that's implementation specific. But, like TreeMap, TreeSet is a set that can be used to store things in a natural order with no duplicates.
Here's how it'd look.
Set<String> values = new TreeSet<>();
The List interface will definitely allow duplicates, which instantly rules it out.
The Collection interface doesn't have anything directly implementing it, but it is the patriarch of the entire collections hierarchy. So, in theory, code like this is legal:
Collection<String> values = new TreeSet<>();
...but you'd lose information about what kind of collection it actually was, so I'd discourage its usage.
TreeSet would give you ordering (either natural ordering by default or custom ordering via a Comparator).
More generally, SortedSet is the interface that offers both uniqueness and ordering:
A Set that further provides a total ordering on its elements. The elements are ordered using their natural ordering, or by a Comparator typically provided at sorted set creation time. The set's iterator will traverse the set in ascending element order. Several additional operations are provided to take advantage of the ordering.
If by natural order, you mean order of insertion, then LinkedHashSet is your go to Set implementation.
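A quick sketch of the difference between the two implementations mentioned above; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.TreeSet;

class OrderDemo {
    // TreeSet: duplicates dropped, iteration in natural (here alphabetical) order.
    static List<String> naturalOrder(List<String> in) {
        return new ArrayList<>(new TreeSet<>(in));
    }

    // LinkedHashSet: duplicates dropped, iteration in insertion order.
    static List<String> insertionOrder(List<String> in) {
        return new ArrayList<>(new LinkedHashSet<>(in));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("pear", "apple", "pear", "fig");
        System.out.println(naturalOrder(input));   // [apple, fig, pear]
        System.out.println(insertionOrder(input)); // [pear, apple, fig]
    }
}
```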
The correct answers are:
SortedSet gives guarantees, regarding natural order of elements.
TreeSet is typical implementation
Strictly speaking, when choosing from the above, List is the only one of these interfaces that has a defined order of iteration; however, it does allow duplicates.
Set and Map, on the other hand, do not allow duplicates (of keys, for Map), but they also do not define the order of iteration; they are unordered by default, with HashSet/HashMap being the typical examples.
Collection guarantees neither.
So, strictly speaking, none of the suggested interfaces provides the desired capability.
However, as others suggested, there are more specific types that do provide natural ordering of elements and no duplicates, mainly the SortedSet interface and its TreeSet implementation.
To further elaborate on why Set is not a good option: if you have a variable, say mySet, and you want it to be ordered, users are going to be surprised when you declare it with the Set interface. Imagine the following code:
public int processMyDataStructure(Set<?> set) {
    int result = 0;
    // some calculation that assumes set is ordered
    return result;
}
and users provide you a HashSet as the argument: you are going to get wrong behavior from your method, because Set does not guarantee ordering. To avoid this, you should have asked for a SortedSet rather than a Set.
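As a rough sketch of that advice, a method that depends on ordering can state it in its signature. The method name below is hypothetical.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

class SortedParamDemo {
    // Taking SortedSet makes the ordering requirement part of the contract:
    // a caller cannot pass a plain HashSet without a compile error.
    static int smallest(SortedSet<Integer> set) {
        return set.first(); // throws NoSuchElementException if the set is empty
    }

    public static void main(String[] args) {
        SortedSet<Integer> values = new TreeSet<>(Arrays.asList(3, 1, 2));
        System.out.println(smallest(values)); // 1
    }
}
```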
I came across this question yesterday in an interview test and need to comment on it: the question (assuming one of the listed answers A, B, C or D has to be correct) is plainly wrong. There is no correct answer listed.
Nothing in the Set interface guarantees the order in which the elements are returned. And there is no such thing as the "right implementation" that could theoretically do the job, as Makoto suggests in his accepted answer, because we are not asked about any implementation here, but about whether the interface provides the requested capability.
So, the test question with the answers provided is misguided.
Referring a bit more to the accepted answer, there is one more reason for it to be wrong. Specifically, Makoto argues that "The List interface will definitely allow duplicates, which instantly rules it out." This argument can be countered with a quote from the List specification:
It is not inconceivable that someone might wish to implement a list that prohibits duplicates, by throwing runtime exceptions when the user attempts to insert them, but we expect this usage to be rare
So in my opinion, all of the answers given are equally wrong or, as the accepted answer would have it, equally correct, since we are free to write an implementation of List (or Map, or Collection) that behaves in any way we wish (within the boundaries set by the interface specification). But interfaces and their specifications are there to guarantee a contract, and this question is really about them, not about possible implementations.

Java - Most efficient matching method

Assuming one needs to store a list of items, but it can be stored in any variable type; what would be the most efficient type, if used mostly for matching?
To clarify, a list of items needs to be contained, but the form it's contained in doesn't matter (enum, list, hashmap, Arraylist, etc..)
This list of items would be matched against on a regular basis, but not edited. What would the most efficient storage method be, assuming you only need to write to the list once, but could be matching multiple times per second?
Note: No multi-threading
A HashSet (and HashMap) offers expected O(1) lookups. Also note that you should create a large enough HashSet with a small load factor, so that after the hash code check the element is also found quickly within its bucket (a bucket is searched sequentially). Ideally each bucket should contain at most 1 element.
You can read more about the concept of capacity and load factor in the Javadoc of HashMap.
If the number of items is no more than 64, an even faster solution is to create an enum for them and use EnumSet (or EnumMap if you need values attached). For up to 64 constants, EnumSet stores its elements in a single long and uses simple and very fast bit operations to test membership (a contains operation is just a bitmask test).
If you choose to go with the HashSet and not with the Enum approach, know that HashSet uses the hashCode() and equals() methods of the elements. You might consider overriding them to provide a faster implementation knowing the internals of the items you wish to store.
A trivial optimization of overriding the hashCode() can be for example to cache a once computed hash code in the item itself if it doesn't change (and subsequent calls to hashCode() should just return the cached value).
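A sketch of the cached hash code idea together with a pre-sized HashSet. CachedHashKey is a hypothetical immutable key type, and the capacity and load factor values are arbitrary examples rather than recommendations. The unsynchronized cache is fine here because the question rules out multi-threading.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class CachedHashKey {
    private final String[] parts;
    private int hash; // cached hash code; 0 means "not computed yet"

    CachedHashKey(String... parts) {
        this.parts = parts.clone(); // defensive copy keeps the key immutable
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CachedHashKey
                && Arrays.equals(parts, ((CachedHashKey) o).parts);
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {            // compute once, reuse on later calls
            h = Arrays.hashCode(parts);
            hash = h;
        }
        return h;
    }

    public static void main(String[] args) {
        // Initial capacity 64 and load factor 0.5 are example values only.
        Set<CachedHashKey> lookup = new HashSet<>(64, 0.5f);
        lookup.add(new CachedHashKey("a", "b"));
        System.out.println(lookup.contains(new CachedHashKey("a", "b"))); // true
    }
}
```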
From your description it seems that order doesn't matter. If this is so, use a Set. Java's standard implementation is the HashSet.
Most efficient for repeated lookup would almost certainly be an EnumSet:
... Enum sets are represented internally as bit vectors. This representation is extremely compact and efficient. The space and time performance of this class should be good enough to allow its use as a high-quality, typesafe alternative to traditional int-based "bit flags." Even bulk operations (such as containsAll and retainAll) should run very quickly if their argument is also an enum set.
...
Implementation note: All basic operations execute in constant time. They are likely (though not guaranteed) to be much faster than their HashSet counterparts. Even bulk operations execute in constant time if their argument is also an enum set.
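For illustration, a minimal EnumSet lookup; the Color enum and the WARM set are invented for this example.

```java
import java.util.EnumSet;
import java.util.Set;

class EnumSetDemo {
    enum Color { RED, GREEN, BLUE, YELLOW }

    // Built once, then only read, which matches the "write once, match often" use case.
    static final Set<Color> WARM = EnumSet.of(Color.RED, Color.YELLOW);

    static boolean isWarm(Color c) {
        return WARM.contains(c);
    }

    public static void main(String[] args) {
        System.out.println(isWarm(Color.RED));  // true
        System.out.println(isWarm(Color.BLUE)); // false
    }
}
```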

Retrieve Least Element, Elements are Dynamically-Ordered

I have collection of elements from which I need to retrieve the least/minimum element.
Normally I would use a PriorityQueue, as it is designed specifically for this purpose and offers O(log(n)) time for dequeuing methods.
However, the elements in my collection have a dynamic order, i.e. their natural order changes unpredictably over time. I assume PriorityQueue and other such sorted collections position an element when it is inserted, and then leave it. If so, PriorityQueue wouldn't work for dynamically-ordered elements. Am I correct in my assumption? Or would PriorityQueue still be appropriate in this situation?
If I can't use PriorityQueue, Collections.min would be my next instinct. However this iterates over the entire collection, which presumably gives O(n) time. Is this the next best solution?
What is the best collection/method to use to retrieve the least element from a collection, given that the natural order of the elements may change unpredictably over time?
Edit:
The order of several elements changes per retrieval operation
Edit 2:
The compare algorithm remains constant, however the values of the fields which it assesses vary unpredictably between retrievals.
I think if the change is truly "unpredictable" you may be stuck with Collections.min(). However, for some other collections like PriorityQueue you could try the following before asking for the min:
Add something that you KNOW is the min.
Remove that.
Then ask again for the "real" min and hope that your little kludge re-sorted things...
Alternatively, do you know if the order has changed over time? e.g. some OrderChangedEvent can be fired? If so, recreate the sorted whatever as needed.
A possible way to do this would be to extend PriorityQueue with a list as one of its fields. This list would store the hashCode() of each object (assuming hashCode() is overridden to reflect the fields that drive the ordering; the default java.lang.Object.hashCode() does not change when fields change). Whenever add, peek, poll, offer, etc. is called on the PriorityQueue, the queue would check the hash code of each element to see if any element has changed. If any have, it would re-order the changed elements and then replace their hash codes in the list. I don't know how fast this will be; note that checking every element is itself an O(n) pass, so it cannot beat Collections.min() asymptotically.
Without any further assumption on the operations you are going to do, you can't achieve better performance than with a PriorityQueue or another O(log(n))-insert collection (TreeSet, for example, but you lose the O(1) peek).
As you correctly assumed, Collections.min(Collection, Comparator) is a linear operation.
But it depends on how often you need to change the ordering: for example, if you only need to change it once in a while and still keep a "standard" ordering, min() is a viable option, but if you need to switch ordering completely then you will probably be better off reordering the queue/set (that is, traversing it and adding all the elements to a new one), though at an O(n log(n)) cost. Using Collections.sort(List, Comparator) may be effective if you need a lot of reordering compared to inserts, but requires you to use a List.
Of course if you can make somewhat strong assumptions on the types of sorting you will need (for example, if it can be restricted to a part of the data) you could write your own collection.
Edit:
So you have a (more or less) finite number of orderings (never mind that it's the same type of comparison over different fields, it's different Comparators and that's what matters)? If that's the case, you can probably achieve best performance by using m queues that reference the same objects, each using a different comparator (the simplest method, really). This way you have:
constant time access
O(m*log(n)) inserts (to insert in every queue)
O(m*n) removals (to remove from every queue)
no ordering costs (as it's handled by the inserts)
slightly larger memory cost (probably negligible)
additional O(n*log(n)) cost the first time a particular ordering is requested
Supposing a value of m orders of magnitude smaller than n, this is comparable to optimal (single-ordering PriorityQueue) performance. For convenience, you can wrap this into a custom collection that takes a Comparator parameter on retrieval operations and uses it as a key for a HashMap of all the PriorityQueues.
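The wrapper described above might look roughly like this. MultiOrderQueue and its method names are hypothetical, and the sketch assumes callers reuse the same Comparator instances, since comparators created from lambdas are only equal to themselves when used as HashMap keys. It also assumes the ordering fields do not change while elements are queued.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class MultiOrderQueue<E> {
    private final List<E> elements = new ArrayList<>();
    // One queue per ordering, created lazily the first time it is requested.
    private final Map<Comparator<? super E>, PriorityQueue<E>> queues = new HashMap<>();

    void add(E e) {
        elements.add(e);
        for (PriorityQueue<E> q : queues.values()) {
            q.add(e); // O(log n) per existing ordering
        }
    }

    boolean remove(E e) {
        boolean removed = elements.remove(e);
        for (PriorityQueue<E> q : queues.values()) {
            q.remove(e); // O(n) per existing ordering
        }
        return removed;
    }

    // Returns the least element under the given ordering, or null if empty.
    E peek(Comparator<? super E> order) {
        PriorityQueue<E> q = queues.computeIfAbsent(order, c -> {
            PriorityQueue<E> fresh = new PriorityQueue<>(c);
            fresh.addAll(elements); // one-off cost for a new ordering
            return fresh;
        });
        return q.peek();
    }
}
```

Usage would be along the lines of keeping comparators in constants (for example BY_LENGTH and ALPHABETICAL) and calling peek(BY_LENGTH) or peek(ALPHABETICAL) as needed.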
Edit #2:
In that case, there is no better solution than running min() on every retrieval (unless you can make assumptions about how the data changes); this also means that it's better to just use an ArrayList as the collection, since it has basically the lowest possible cost on every operation and you will not benefit from PriorityQueue's ordering anyway. You will end up with linear cost on retrieval (for min) and amortized constant cost on insertion (deletion is also constant if you first swap the element with the last one). This is about as good as it gets, since re-establishing order after arbitrary changes means re-sorting, which costs Ω(n) even in the best case and Θ(n log n) in general.
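A small sketch of this approach, with a hypothetical Task class whose priority field can change at any time. Because nothing is kept sorted, mutating the field in place is harmless.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class DynamicMinDemo {
    // Hypothetical element type with a mutable ordering field.
    static final class Task {
        final String name;
        int priority;

        Task(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    static final Comparator<Task> BY_PRIORITY = Comparator.comparingInt(t -> t.priority);

    // Linear scan using the CURRENT field values; throws
    // NoSuchElementException if the list is empty.
    static Task least(List<Task> tasks) {
        return Collections.min(tasks, BY_PRIORITY);
    }

    public static void main(String[] args) {
        List<Task> tasks = new ArrayList<>();
        Task a = new Task("a", 1);
        Task b = new Task("b", 2);
        tasks.add(a);
        tasks.add(b);
        System.out.println(least(tasks).name); // a
        a.priority = 10;                       // order changes between retrievals
        System.out.println(least(tasks).name); // b
    }
}
```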
As a side note, ordered collections work on the assumption that values will not change after insertion; this is because there is no cost-effective way to monitor the changes nor to reorder them "in place".
Can't you use a Java TreeSet, which keeps the collection sorted at all times? You need to implement the Comparable interface on your objects to do so. Check out http://docs.oracle.com/javase/1.4.2/docs/api/java/util/TreeSet.html

How does the search algorithm work with objects in a java collection such as HashSet?

The question really is regarding objects that change dynamically in a collection. Does the "contains" method go and compare each of the object individually every time or does it do something clever?
If you have 10000 entries in a collection, I would have expected it to work a bit more cleverly but not sure. Or if not is there a way to optimise it by adding a hook that would tell the collection object to update hashcodes for the objects that have changed??
Additional Question:
Thanks for answers below... Can I also ask what happens in case of ArrayList? I could not find anything in the documentation that says not to put mutable objects in ArrayList. Does that mean the search algorithm simply goes and compares against hashcode of each object??
They hash the object and look it up by its hash code. If it is there, it will compare the objects themselves. This is because two or more objects that have the same hash might not be the same object.
Since Java's hash collections use buckets (chaining), they have to look at all the objects in the bucket. These objects are kept in a linked list (not java.util.LinkedList, but a custom list)
This is generally very efficient, and the HashSet.contains() method is amortized O(1) (constant time).
Java's docs have an answer to the second part of your question:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
A HashSet computes the hash code of an element when it's added to the set. It stores this in a way which makes it very efficient to find all elements with the same hash code.
Then when you call contains(), it simply has to compute the hash code of the value you're looking for, and find all elements in the set with the same hash code. There may be multiple elements as hash codes aren't unique, but there are likely to be far fewer elements with matching hash codes than there are elements within the set itself. Each matching element is then checked with equals until either a match is found or we've run out of candidates.
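A tiny demonstration of why the equals check is still needed after the hash code match: the strings "Aa" and "BB" are a well-known pair with identical hash codes. The class and method names are made up for this example.

```java
import java.util.HashSet;
import java.util.Set;

class CollisionDemo {
    // "Aa" and "BB" both hash to 2112 under String.hashCode().
    static boolean sameHash() {
        return "Aa".hashCode() == "BB".hashCode();
    }

    // Same hash code, so "BB" is looked up in the bucket holding "Aa",
    // but equals() rejects the candidate.
    static boolean containsOther() {
        Set<String> set = new HashSet<>();
        set.add("Aa");
        return set.contains("BB");
    }

    public static void main(String[] args) {
        System.out.println(sameHash());      // true
        System.out.println(containsOther()); // false
    }
}
```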
EDIT: To answer the second part, which somehow I'd missed on first reading, you won't be able to find the element again. You mustn't change an element used as a key in a hash table or an element in a hash set in any equality-affecting manner, or you will basically break things.
The simple answer is — no, nothing clever happens. If you expect an object's state to change in a way that affects its hashCode() and equals(...) behavior, then you must not store it in a HashSet, nor any other Set. To quote from http://download.oracle.com/javase/6/docs/api/java/util/Set.html:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
A HashSet uses a HashMap under the hood. Therefore, the contains operation uses the hashCode() method in the object to check if it's present in the hash table implemented by HashMap.

java constantly sorted list with quick retrieval

I'm looking for a constantly sorted list in Java, which can also be used to retrieve an object very quickly. PriorityQueue works great for the "constantly sorted" requirement, and HashMap works great for the fast retrieval by key, but I need both in the same list. At one point I had written my own, but it does not implement the collections interfaces (so it can't be used as a drop-in replacement for a java.util.List etc.), and I'd rather stick to standard Java classes if possible.
Is there such a list out there? Right now I'm using 2 lists, a priority queue and a hashmap, both contain the same objects. I use the priority queue to traverse the first part of the list in sorted order, the hashmap for fast retrieval by key (I need to do both operations interchangeably), but I'm hoping for a more elegant solution...
Edit: I should add that I need to have the list sorted by a different comparator than what is used for retrieval by key; the list is sorted by a long value, while the key for retrieval is a String.
Since you're already using HashMap, that implies that you have unique keys. Assuming that you want to order by those keys, TreeMap is your answer.
It sounds like what you're talking about is a collection with an automatically-maintained index.
Try looking at GlazedLists which use "list pipelines" to efficiently propagate changes -- their SortedList class should do the job.
edit: missed your retrieval-by-key requirement. That can be accomplished with GlazedLists.syncEventListToMap and GlazedLists.syncEventListToMultimap -- syncEventListToMap works if there are no duplicate keys, and syncEventListToMultimap works if there are duplicate keys. The nice part about this approach is that you can create multiple maps based on different indices.
If you want to use TreeMaps for indices -- which may give you better performance -- you need to keep your TreeMaps privately encapsulated within a custom class of your choosing, that exposes the interfaces/methods you want, and create accessors/mutators for that class to keep the indices in sync with the collection. Be sure to deal with concurrency issues (via synchronized methods or locks or whatever) if you access the collection from multiple threads.
edit: finally, if fast traversal of the items in sorted order is important, consider using ConcurrentSkipListMap instead of TreeMap -- not for its concurrency, but for its fast traversal. Skip lists are linked lists with multiple levels of linkage, one that traverses all items, the next that traverses every K items on average (for a given constant K), the next that traverses every K^2 items on average, etc.
TreeMap
http://download.oracle.com/javase/6/docs/api/java/util/TreeMap.html
Go with a TreeSet.
A NavigableSet implementation based on a TreeMap. The elements are ordered using their natural ordering, or by a Comparator provided at set creation time, depending on which constructor is used.
This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
I haven't tested this so I might be wrong, so consider this just an attempt.
Use TreeMap, wrap the key of this map as an object which has two attributes (the string which you use as the key in hashmap and the long which you use to maintain the sort order in PriorityQueue). Now for this object, override the equals and hashcode method using the string. Implement the comparable interface using the long.
Why don't you encapsulate your solution to a class that implements Collection or Map?
This way you could simply delegate the retrieval methods to the faster/better suiting collection. Just make sure that calls to write-methods (add/remove/put) will be forwarded to both collections. Remember indirect accesses, like iterator.remove(). Most of these methods are optional to implement, but you have to deactivate them (Collections.unmodifiableXXX will help here in most cases).
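One way such an encapsulating class could look, as a sketch only: IndexedSortedStore and its methods are hypothetical, it does not implement Collection or Map, and it uses a TreeSet instead of the PriorityQueue from the question so that iteration is fully sorted and removal is O(log n). It is not thread-safe.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class IndexedSortedStore {
    private static final class Entry {
        final String key;
        final long rank;

        Entry(String key, long rank) {
            this.key = key;
            this.rank = rank;
        }
    }

    // Two views of the same entries, kept in sync by every write method.
    private final Map<String, Entry> byKey = new HashMap<>();
    private final TreeSet<Entry> byRank = new TreeSet<>(
            Comparator.comparingLong((Entry e) -> e.rank).thenComparing(e -> e.key));

    // Inserts the key, or replaces its rank if the key already exists.
    void put(String key, long rank) {
        remove(key);
        Entry e = new Entry(key, rank);
        byKey.put(key, e);
        byRank.add(e);
    }

    boolean remove(String key) {
        Entry old = byKey.remove(key);
        return old != null && byRank.remove(old);
    }

    // Fast lookup by key; returns null if the key is absent.
    Long rankOf(String key) {
        Entry e = byKey.get(key);
        if (e == null) {
            return null;
        }
        return e.rank;
    }

    // Keys in ascending rank order.
    List<String> keysInOrder() {
        List<String> out = new ArrayList<>();
        for (Entry e : byRank) {
            out.add(e.key);
        }
        return out;
    }
}
```

Because entries are immutable and a rank change goes through put, the sorted view never holds an element whose ordering has changed underneath it.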
