Java - Most efficient matching method - java

Assuming one needs to store a list of items, but it can be stored in any variable type; what would be the most efficient type, if used mostly for matching?
To clarify, a list of items needs to be contained, but the form it's contained in doesn't matter (enum, list, hashmap, Arraylist, etc..)
This list of items would be matched against on a regular basis, but not edited. What would the most efficient storage method be, assuming you only need to write to the list once, but could be matching multiple times per second?
Note: No multi-threading

A HashSet (and HashMap) offers O(1) complexity. Also note that you should create a large enough HashSet with small loadfactor which means that after a hashcode check the elements in the result bucket will also be found very quickly (in a bucket there is a sequential search). Optimally each bucket should contain 1 element at the most.
You can read more about the concept of capacity and load factor in the Javadoc of HashMap.
An even faster solution would be if the number of items is no more than 64 is to create an Enum for them and use EnumSet or EnumMap which stores the elements in a long and uses simple and very fast bit operations to test if an element is in the set or map (a contains operation is just a simple bitmask test).
If you choose to go with the HashSet and not with the Enum approach, know that HashSet uses the hashCode() and equals() methods of the elements. You might consider overriding them to provide a faster implementation knowing the internals of the items you wish to store.
A trivial optimization of overriding the hashCode() can be for example to cache a once computed hash code in the item itself if it doesn't change (and subsequent calls to hashCode() should just return the cached value).

From your description it seems that order doesn't matter. If this is so, use a Set. Java's standard implementation is the HashSet.

Most efficient for repeated lookup would almost certainly be an EnumSet
... Enum sets are represented internally as bit vectors. This representation is extremely compact and efficient. The space and time performance of this class should be good enough to allow its use as a high-quality, typesafe alternative to traditional int-based "bit flags." Even bulk operations (such as containsAll and retainAll) should run very quickly if their argument is also an enum set.
...
Implementation note: All basic operations execute in constant time. They are likely (though not guaranteed) to be much faster than their HashSet counterparts. Even bulk operations execute in constant time if their argument is also an enum set.

Related

Map.of vs EnumMap for immutable map with enum keys

An EnumMap uses the restriction that all keys of a map will be from the same enum to gain performance benefits:
Enum maps are represented internally as arrays. This representation is extremely compact and efficient.
In this case, the keys and values are stored in separate arrays and the values are ordinal-ordered. Iteration is done with the internal EnumMapIterator class.
An immutable map created by the various Map.of methods use the restriction that the map will not change structurally to gain performance benefits. If the map is not of size 0 or 1, it uses the MapN internal class that also stores its entries in an array. In this case, the value is stored 1 index after its key. Iteration is done with the internal MapNIterator.
For an immutable map of enum keys of size 2 or more, which answers both of the above's requirements, which map performs better? (Criteria could be space efficiency, time efficiency for containsKey, containsValue, get, and iteration efficiency of entrySet, keySet and values.)
which map gives better space efficiency, and time efficiency for its operations and iteration: containsKey, containsValue, get, entrySet, keySet and values?
You're raising 1 + 6 (or 2 * 6, depending on how it's understood) questions, that's a bit too much. If you want a definite answer, you have to concentrate on a single thing and profile it (nobody's gonna do it for you unless you find a very interesting problem).
The space efficiency for an EnumMap simply must be better. There's no need to store the keys as a shared enum array can be used. There's no need for a holes-containing hash lookup array.
There may be exceptions like a small map based on a huge enum.
The most important operation is get. With an EnumMap, it involves no lookup, just a trivial class comparison and an array access. With Map.of(...), there's a loop, which for enums, usually terminates after the first iteration as Enum.hashCode is IMHO stupid, but usually well distributed.
As containsKey is based on the same lookup, it's clear.
I doubt, I've ever used containsValue, but it doesn't do anything smarter than linear search. I'd expect a tiny win for EnumMap because of the holes (needing trivial null test, but causing branch mispredictions).
The remaining three operations are not worth looking up as they return a collection containing no data and simply pointing to the map, i.e., a constant time operation. For example, map.keySet().contains(x) simply delegates to map.containsKey().
The efficiency of the iteration would be a more interesting question, but you didn't ask it.

Using EnumSet or EnumMap on arbitrary keys

We know that EnumSet and EnumMap are faster than HashSet/HashMap due to the power of bit manipulation. But are we actually harnessing the true power of EnumSet/EnumMap when it really matters? If we have a set of millions of record and we want to find out if some object is present in that set or not, can we take advantage of EnumSet's speed?
I checked around but haven't found anything discussing this. Everywhere the usual stuff is found i.e. because EnumSet and EnumMap uses a predefined set of keys lookups on small collections are very fast. I know enums are compile-time constants but can we have best of both worlds - an EnumSet-like data structure without needing enums as keys?
Interesting insight; the short answer is no, but your question is exploring some good data-structure design concepts which I'll try to discuss.
First, lets talk about HashMap (HashSet uses a HashMap internally, so they share most behavior); a hash-based data structure is powerful because it is fast and general. It's fast (i.e. approximately O(1)) because we can find the key we're looking for with a very small number of computations. Roughly, we have an array of lists of keys, convert the key to an integer index into that array, then look through the associated list for the key. As the mapping gets bigger, the backing array is repeatedly resized to hold more lists. Assuming the lists are evenly distributed, this lookup is very fast. And because this works for any generic object (that has a proper .hashcode() and .equals()) it's useful for just about any application.
Enums have several interesting properties, but for the purpose of efficient lookup we only care about two of them - they're generally small, and they have a fixed number of values. Because of this, we can do better than HashMap; specifically, we can map every possible value to a unique integer, meaning we don't need to compute a hash, and we don't need to worry about hashes colliding. So EnumMap simply stores an array of the same size as the enum and looks up directly into it:
// From Java 7's EnumMap
public V get(Object key) {
return (isValidKey(key) ?
unmaskNull(vals[((Enum)key).ordinal()]) : null);
}
Stripping away some of the required Map sanity checks, it's simply:
return vals[key.ordinal()];
Note that this is conceptually no different than a standard HashMap, it's simply avoiding a few computations. EnumSet is a little more clever, using the bits in one or more longs to represent array indices, but functionally it's no different than the EnumMap case - we allocate enough slots to cover all possible enum values and can use their integer .ordinal() rather than compute a hash.
So how much faster than HashMap is EnumMap? It's clearly faster, but in truth it's not that much faster. HashMap is already a very efficient data structure, so any optimization on it will only yield marginally better results. In particular, both HashMap and EnumMap are asymptotically the same speed (O(1)), meaning as they get bigger, they behave equally well. This is the primary reason why there isn't a more general data-structure like EnumMap - because it wouldn't be worth the effort relative to HashMap.
The second reason we don't want a more general "FiniteKeysMap" is it would make our lives as users more complicated, which would be worthwhile if it were a marked speed increase, but since it's not would just be a hassle. We would have to define an interface (and probably also a factory pattern) for any type that could be a key in this map. The interface would need to provide a guarantee that every unique instance returns a unique hashcode in the range [0-n), and also provide a way for the map to get n and potentially all n elements. Those last two operations would be better as static methods, but since we can't define static methods in an interface, they'd have to either be passed in directly to every map we create, or a separate factory object with this information would have to exist and be passed to the map/set at construction. Because enums are part of the language, they get all of those benefits for free, meaning there's no such cost for end-user programmers to need to take advantage of.
Furthermore, it would be very easy to make mistakes with this interface; say you have a type that has exactly 100,000 unique values. Should it implement our interface? It could. But you'd likely actually be shooting yourself in the foot. This would eat up a lot of unnecessary memory, since our FiniteKeysMap would allocate a new 100,000 length array to represent even an empty map. Generally speaking, that sort of wasted space is not worth the marginal improvement such a data structure would provide.
In short, while your idea is possible, it is not practical. HashMap is so efficient that attempting to create a separate data structure for a very limited number of cases would add far more complexity than value.
For the specific case of faster .contains() checks, you might like Bloom Filters. It's a set-like data structure that very efficiently stores very large sets, with the condition that it may sometimes incorrectly say an element is in the set when it is not (but not the other way around - if it says an element isn't in the set, it's definitely not). Guava provides a nice BloomFilter implementation.

Why an EnumSet or an EnumMap is likely to be more performant than their hashed counterparts?

The following is from the Implementation Note section of Java doc of EnumMap :
Implementation note: All basic operations execute in constant time.
They are likely (though not guaranteed) to be faster than their
HashMap counterparts.
I have seen a similar line in the java doc for EnumSet also . I want to know why is it more likely that EnumSets and EnumMaps will be more faster than their hashed counterparts ?
EnumSet is backed by a bit array. Since the number of different items you can put in EnumSet is known in advance, we can simply reserve one bit for each enum value. You can imagine similar optimization for Set<Byte> or Set<Short>, but it's not feasible for Set<Integer> (you'd need 0.5 GiB of memory for 2^32 bits) or in general.
Thus basic operations like exists or add ar constant time (just like HashSet) but they simply need to examine or set one bit. No hashCode() computation. This is why EnumSet is faster. Also more complex operations like union or easily implemented using bit manipulation techniques.
In OpenJDK there are two implementations of EnumSet: RegularEnumSet capable of handling enum with up to 64 values in long and JumboEnumSet for bigger enums (using long[]). But it's just an implementation detail.
EnumMap works on similar principles, but it uses Object[] to store values while key (index) is implicitly inferred from Enum.ordinal().
EnumSet uses array[] inside and a natural order with consequent consequences:
fast: get/set is called by constant time O(1), hashCode() is not used to find a right bucket
memory efficiency: the size of array is predefined by Enum size and is not dynamic
EnumMap uses Enum as a key based on the same principals as EnumSet where hash code collision is not possible

Retrieve Least Element, Elements are Dynamically-Ordered

I have collection of elements from which I need to retrieve the least/minimum element.
Normally I would use a PriorityQueue as they are designed specifically for this purpose, and offer O(log(n)) time for dequeing methods.
However, the elements in my array have a dynamic order, ie there natural order changes unpredictably over time. I assume PriorityQueue and other such Sorted collections sort an element when inserted, and then leave it. If this is so PriorityQueue wouldn't work for dynamically-ordered elements. Am I correct in my assumption? Or would PriorityQueue still be appropriate in this situation?
If I can't use PriorityQueue, Collections.min would be my next instinct. However this iterates over the entire collection, which presumably gives O(n) time. Is this the next best solution?
What is the best collection/method to use to retrieve the least element from a collection, given that the natural order of the elements may change unpredictably over time?
Edit:
The order of several elements changes per retrieval operation
Edit 2:
The compare algorithm remains constant, however the values of the fields which it assesses vary unpredictably between retrievals.
I think if the change is truly "unpredictable" you may be stuck with Collections.min(). However, maybe for some other collections like PriorityQueue you could try, before calling for the min.
Add something that you KNOW is the min.
Remove that
Then ask again for the "real" min and hope that your little kludge resorted things...
Alternatively, do you know if the order has changed over time? e.g. some OrderChangedEvent can be fired? If so, recreate the sorted whatever as needed.
A possible way to do this would be to extend PriorityQueue that contains a list as one of the fields. This list will store the java.lang.Object.hashCode() of each object. Whenever an add, peek, poll, offer, etc. is called on the PriorityQueue, the queue will check the hash codes of each element and make see if any element changed. If they have, it will re-order the elements that have changed. Then, it will replace the hashcodes of the changed elements in the list. I don't know how fast this will be, but I suspect it will be faster than O(n).
Without any further assumption on the operations you are going to do, you can't achieve better performance than with a PriorityQueue or another O(log(n))-insert collection (TreeSet , for example, but you lose the O(1)-peek).
As you correctly assumed Collections.min(Collection, Comparator) is a linear operation.
But it depends on how often you need to change the ordering: for example if you only need to change it once in a while and still keep a "standard" ordering, min() is a viable option, but if you need to switch ordering completely then you will probably be better off with reordering the queue/set (that is, traversing and adding all the elements in a new one), tough at a O(nlog(n)) cost. Using Collections.sort(List, Comparator) may be effective if you need a lot of reordering compared to inserts, but requires you to use a List.
Of course if you can make somewhat strong assumptions on the types of sorting you will need (for example, if it can be restricted to a part of the data) you could write your own collection.
Edit:
So you have a (more or less) finite number of orderings (never mind that it's the same type of comparison over different fields, it's different Comparators and that's what matters)? If that's the case, you can probably achieve best performance by using m queues that reference the same objects, each using a different comparator (the simplest method, really). This way you have:
constant time access
O(m*logn(n)) inserts (to insert in every queue)
O(m*n) removals (to remove from every queue)
no ordering costs (as it's handled by the inserts)
slightly larger memory cost (probably negligible)
additional O(n*log(n)) cost the first time a particolar ordering is requested
Supposing a value of m orders of magnitude smaller than n, this is comparable to optimal (single-ordering PriorityQueue) performance. For convenience, you can wrap this into a custom collection that takes a Comparator parameter on retrieval operations, and use it as a key for an HashMap of all the PriorityQueues.
Edit #2:
In that case, there is no better solution than running min() on every retrieval (unless you can make assumptions on the changes of the data); this also means that it's better to just use an ArrayList as the collection, since it has basically the lowest possible cost on every operation and you will not benefit from PriorityQueue's natural ordering anyway. You will end up with linear cost on retrieval (for min) and constant on insertion and deletion: this is optimal as there is no sorting algorithm that has less than Ω(n) and Θ(nlog n) anyway.
As a side note, ordered collections work on the assumption that values will not change after insertion; this is because there is no cost-effective way to monitor the changes nor to reorder them "in place".
Can't you use a java TreeSet which keeps the collection sorted at all times. You need to implement the Comparable interface on your objects to do so. Checkout http://docs.oracle.com/javase/1.4.2/docs/api/java/util/TreeSet.html

default Collection type

Assume you need to store/retrieve items in a Collection, don't care about ordering, and duplicates are allowed, what type of Collection do you use?
By default, I've always used ArrayList, but I remember reading/hearing somewhere that a Queue implementation may be a better choice. A List allows items to be added/retrieved/removed at arbitrary positions, which incurs a performance penalty. As a Queue does not provide this facility it should in theory be faster when this facility is not required.
I realise that all discussions about performance are somewhat meaningless, the only thing that really matters is measurement. Nevertheless, I'm interested to know what others use for a Collection, when they don't care about ordering, and duplicates are allowed, and why?
"It depends". The question you really need to answer first is "What do I want to use the collection for?"
If you often insert / remove items on one of the ends (beginning, end) a Queue will be better than a ArrayList. However in many cases you create a Collection in order to just read from it. In this case a ArrayList is far more efficient: As it is implemented as an array, you can iterate over it quite efficient (same applies for a LinkedList). However a LinkedList uses references to link single items together. So if you do not need random removals of items (in the middle), a ArrayList is better: An ArrayList will use less memory as the items don't need the storage for the reference to the next/prev item.
To sum it up:
ArrayList = good if you insert once and read often (random access or sequential)
LinkedList = good if you insert/remove often at random positions and read only sequential
ArrayDeque (java6 only) = good if you insert/remove at start/end and read random or sequential
As a default, I tend to prefer LinkedList to ArrayList. Obviously, I use them not through the List interface, but rather through the Collection interface.
Over the time, I've indeed found out that when I need a generic collection, it's more or less to put some things in, then iterate over it. If I need more evolved behaviour (say random access, sorting or unicity checks), I will then maybe change the used implementation, but before that I will change the used interface to the most appropriated. This way, I can ensure feature is provided before to concentrate on optimization and implementation.
ArrayList basicly contains an array inside (that's why it is called ArrayList). And operations like addd/remove at arbitrary positions are done in a straightforward way, so if you don't use them - there is no harm to performance.
If ordering and duplicates is not a problem and case is only for storing,
I use ArrayList, As it implements all the list operations. Never felt any performance issues with these operations (Never impacted my projects either). Actually using these operations have simple usage & I don't need to care how its managed internally.
Only if multiple threads will be accessing this list I use Vector because its methods are synchronized.
Also ArrayList and Vector are collections which you learn first :).
It depends on what you know about it.
If I have no clue, I tend to go for a linked list, since the penalty for adding/removing at the end is constant. If I have a rough idea of the maximum size of it, I go for an arraylist with the capacity specified, because it is faster if the estimation is good. If I really know the exact size I tend to go for a normal array; although that isn't really a collection type.
I realise that all discussions about performance are somewhat meaningless, the only thing that really matters is measurement.
That's not necessarily true.
If your knowledge of how the application is going to work tells you that certain collections are going to be very large, then it is a good idea to pick the right collection type. But the right collection type depends crucially on how the collections are going to be used; i.e. on the algorithms.
For example, if your application is likely to be dominated by testing if a collection holds a given object, the fact that Collection.contains(Object) is O(N) for both LinkedList<T> and ArrayList<T> might mean that neither is an appropriate collection type. Instead, maybe you should represent the collection as a HashMap<T, Integer>, where the Integer represents the number of occurrences of a T in the "collection". That will give you O(1) testing and removal, at the cost of more space overheads and slower (though still O(1)) insertion.
But the thing to stress is that if you are likely to be dealing with really large collections, there should be no such thing as a "default" collection type. You need to think about the collection in the context of the algorithms. (And the flip side is that if the collections are always going to be small, it probably makes little difference which collection type you pick.)

Categories

Resources