Sparse Map with Enum keys - java

I need to create a Map with enum keys where only a small fraction of the enum constants will be actually inserted. What is the best approach?
An EnumMap seems wasteful here, since the length of its underlying array equals the total number of enum constants, no matter how few keys are actually present.

I suggest using an ordinary HashMap.
Computing hashes for enums is both easy and cheap. There should be no significant memory overhead, since you are not duplicating the enum objects, but instead creating multiple references to the same object. For this reason, there should be little difference between storing an integer key and storing a reference to an enum object.
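For illustration, a minimal sketch of that approach, assuming a hypothetical Color enum where only a couple of the constants ever end up in the map:

import java.util.HashMap;
import java.util.Map;

public class SparseEnumKeyDemo {

    // Hypothetical enum with several constants, only a few of which are used as keys.
    enum Color { RED, ORANGE, YELLOW, GREEN, BLUE, INDIGO, VIOLET }

    public static void main(String[] args) {
        Map<Color, String> labels = new HashMap<>();
        labels.put(Color.RED, "stop");
        labels.put(Color.GREEN, "go");

        // The keys are just references to existing enum constants; the map only
        // allocates entries for what was actually inserted.
        System.out.println(labels.get(Color.RED));   // stop
        System.out.println(labels.get(Color.BLUE));  // null, never inserted
    }
}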

I'd also go with either a HashMap or a TreeMap (depending on whether you need a deterministic Iterator or not). Since your data is sparse, any real or imagined overhead is unlikely to be a significant performance hurdle.

Related

Map.of vs EnumMap for immutable map with enum keys

An EnumMap uses the restriction that all keys of a map will be from the same enum to gain performance benefits:
Enum maps are represented internally as arrays. This representation is extremely compact and efficient.
In this case, the keys and values are stored in separate arrays and the values are ordinal-ordered. Iteration is done with the internal EnumMapIterator class.
An immutable map created by the various Map.of methods use the restriction that the map will not change structurally to gain performance benefits. If the map is not of size 0 or 1, it uses the MapN internal class that also stores its entries in an array. In this case, the value is stored 1 index after its key. Iteration is done with the internal MapNIterator.
For an immutable map of enum keys of size 2 or more, which satisfies both of the above requirements, which map performs better? (Criteria could be space efficiency, time efficiency for containsKey, containsValue and get, and iteration efficiency of entrySet, keySet and values.)
Which map gives better space efficiency, and time efficiency for its operations and iteration: containsKey, containsValue, get, entrySet, keySet and values?
You're raising 1 + 6 (or 2 * 6, depending on how it's understood) questions; that's a bit too much. If you want a definite answer, you have to concentrate on a single thing and profile it (nobody's gonna do it for you unless you find a very interesting problem).
The space efficiency of an EnumMap simply must be better: there's no need to store the keys, since the shared array of enum constants can be used, and there's no need for a hash lookup array containing holes.
There may be exceptions like a small map based on a huge enum.
The most important operation is get. With an EnumMap, it involves no hash lookup, just a trivial class comparison and an array access. With Map.of(...), there's a loop, which, for enums, usually terminates after the first iteration, as Enum.hashCode is IMHO stupid but usually well distributed.
As containsKey is based on the same lookup, it's clear.
I doubt I've ever used containsValue, but it does nothing smarter than a linear search. I'd expect a tiny win for EnumMap because of the holes (they only need a trivial null test, but can cause branch mispredictions).
The remaining three operations are not worth looking at, as they just return a view collection that holds no data of its own and simply points back to the map, i.e., a constant-time operation. For example, map.keySet().contains(x) simply delegates to map.containsKey(x).
The efficiency of the iteration would be a more interesting question, but you didn't ask it.
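For reference, a small sketch of the two kinds of map being compared, using a hypothetical Direction enum; both answer the same get and containsKey calls, they just reach the value differently, as described above:

import java.util.EnumMap;
import java.util.Map;

public class EnumMapVsMapOf {

    // Hypothetical enum, used only for illustration.
    enum Direction { NORTH, EAST, SOUTH, WEST }

    public static void main(String[] args) {
        // EnumMap: values stored in an ordinal-indexed array.
        Map<Direction, String> enumMap = new EnumMap<>(Direction.class);
        enumMap.put(Direction.NORTH, "up");
        enumMap.put(Direction.SOUTH, "down");

        // Map.of: immutable MapN, key and value stored adjacently in one array.
        Map<Direction, String> immutable = Map.of(
                Direction.NORTH, "up",
                Direction.SOUTH, "down");

        // Same results; only the internal lookup differs.
        System.out.println(enumMap.get(Direction.NORTH));          // up
        System.out.println(immutable.get(Direction.NORTH));        // up
        System.out.println(enumMap.containsKey(Direction.EAST));   // false
        System.out.println(immutable.containsKey(Direction.EAST)); // false
    }
}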

Using EnumSet or EnumMap on arbitrary keys

We know that EnumSet and EnumMap are faster than HashSet/HashMap due to the power of bit manipulation. But are we actually harnessing the true power of EnumSet/EnumMap when it really matters? If we have a set of millions of records and want to find out whether some object is present in that set or not, can we take advantage of EnumSet's speed?
I checked around but haven't found anything discussing this. Everywhere I find the usual explanation: because EnumSet and EnumMap use a predefined set of keys, lookups on small collections are very fast. I know enums are compile-time constants, but can we have the best of both worlds - an EnumSet-like data structure without needing enums as keys?
Interesting insight; the short answer is no, but your question is exploring some good data-structure design concepts which I'll try to discuss.
First, let's talk about HashMap (HashSet uses a HashMap internally, so they share most behavior); a hash-based data structure is powerful because it is fast and general. It's fast (i.e. approximately O(1)) because we can find the key we're looking for with a very small number of computations. Roughly, we have an array of lists of keys, convert the key to an integer index into that array, then look through the associated list for the key. As the mapping gets bigger, the backing array is repeatedly resized to hold more lists. Assuming the lists are evenly distributed, this lookup is very fast. And because this works for any object with a proper .hashCode() and .equals() implementation, it's useful for just about any application.
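Conceptually, that lookup boils down to something like the following sketch (not the real java.util.HashMap code, which additionally spreads the hash bits and masks against a power-of-two table length):

public class HashLookupSketch {

    // Conceptual sketch of mapping a key onto a bucket index.
    static int bucketIndex(Object key, int tableLength) {
        return Math.floorMod(key.hashCode(), tableLength);
    }

    public static void main(String[] args) {
        // The entry is then found by scanning the (ideally very short) list stored
        // at table[bucketIndex] and comparing candidates with equals().
        System.out.println(bucketIndex("some key", 16));
    }
}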
Enums have several interesting properties, but for the purpose of efficient lookup we only care about two of them - they're generally small, and they have a fixed number of values. Because of this, we can do better than HashMap; specifically, we can map every possible value to a unique integer, meaning we don't need to compute a hash, and we don't need to worry about hashes colliding. So EnumMap simply stores an array of the same size as the enum and looks up directly into it:
// From Java 7's EnumMap
public V get(Object key) {
    return (isValidKey(key) ?
            unmaskNull(vals[((Enum)key).ordinal()]) : null);
}
Stripping away some of the required Map sanity checks, it's simply:
return vals[key.ordinal()];
Note that this is conceptually no different from a standard HashMap; it's simply avoiding a few computations. EnumSet is a little more clever, using the bits in one or more longs to represent array indices, but functionally it's no different from the EnumMap case - we allocate enough slots to cover all possible enum values and can use their integer .ordinal() rather than compute a hash.
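The bit trick can be sketched roughly like this (a simplified illustration of the idea, not the actual EnumSet implementation, using a hypothetical Flag enum):

public class BitSetSketch {

    // Hypothetical enum, small enough to fit in one long (at most 64 constants).
    enum Flag { A, B, C, D }

    public static void main(String[] args) {
        long elements = 0L;                  // empty "set"

        elements |= 1L << Flag.B.ordinal();  // add(B): set bit 1
        elements |= 1L << Flag.D.ordinal();  // add(D): set bit 3

        boolean hasB = (elements & (1L << Flag.B.ordinal())) != 0;  // contains(B)
        boolean hasC = (elements & (1L << Flag.C.ordinal())) != 0;  // contains(C)

        System.out.println(hasB);  // true
        System.out.println(hasC);  // false
    }
}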
So how much faster than HashMap is EnumMap? It's clearly faster, but in truth it's not that much faster. HashMap is already a very efficient data structure, so any optimization on it will only yield marginally better results. In particular, both HashMap and EnumMap are asymptotically the same speed (O(1)), meaning as they get bigger, they behave equally well. This is the primary reason why there isn't a more general data-structure like EnumMap - because it wouldn't be worth the effort relative to HashMap.
The second reason we don't want a more general "FiniteKeysMap" is that it would make our lives as users more complicated, which would be worthwhile for a marked speed increase, but since there isn't one it would just be a hassle. We would have to define an interface (and probably also a factory pattern) for any type that could be a key in this map. The interface would need to guarantee that every unique instance returns a unique hash code in the range [0, n), and also provide a way for the map to get n and potentially all n elements. Those last two operations would be better as static methods, but since an interface can't require its implementors to provide them statically, they'd either have to be passed directly to every map we create, or live in a separate factory object that is passed to the map/set at construction. Because enums are part of the language, they get all of this for free, so end-user programmers pay no such cost to take advantage of it.
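To make that burden concrete, such an interface might look roughly like the following; everything here is hypothetical and only exists to show what users would have to implement and pass around:

// Hypothetical contract a "FiniteKeysMap" would need from every key type.
interface FiniteKey {
    // Must be unique per instance and lie in the range [0, universe size).
    int index();
}

// Hypothetical companion object describing the key universe, needed because an
// interface can't force a type to expose this information statically.
interface FiniteKeyUniverse<K extends FiniteKey> {
    int size();          // how many distinct keys exist
    K keyAt(int index);  // recover the key for a given index
}

// A map built on this contract would have to be handed a universe at construction,
// which is exactly the ceremony that enums let us skip.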
Furthermore, it would be very easy to make mistakes with this interface; say you have a type that has exactly 100,000 unique values. Should it implement our interface? It could. But you'd likely be shooting yourself in the foot. This would eat up a lot of unnecessary memory, since our FiniteKeysMap would allocate a new 100,000-element array to represent even an empty map. Generally speaking, that sort of wasted space is not worth the marginal improvement such a data structure would provide.
In short, while your idea is possible, it is not practical. HashMap is so efficient that attempting to create a separate data structure for a very limited number of cases would add far more complexity than value.
For the specific case of faster .contains() checks, you might like Bloom Filters. It's a set-like data structure that very efficiently stores very large sets, with the condition that it may sometimes incorrectly say an element is in the set when it is not (but not the other way around - if it says an element isn't in the set, it's definitely not). Guava provides a nice BloomFilter implementation.
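A rough usage sketch, assuming Guava is on the classpath; the funnel, expected size and false-positive rate are just placeholders:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterSketch {
    public static void main(String[] args) {
        // Expect ~1,000,000 insertions with a 1% false-positive probability.
        BloomFilter<String> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        seen.put("record-42");

        System.out.println(seen.mightContain("record-42"));  // true
        System.out.println(seen.mightContain("record-43"));  // almost certainly false
    }
}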

Java - Most efficient matching method

Assuming one needs to store a list of items, and it can be stored in any type of variable, what would be the most efficient type if it is used mostly for matching?
To clarify, a list of items needs to be stored, but the form it's stored in doesn't matter (enum, List, HashMap, ArrayList, etc.).
This list of items would be matched against on a regular basis, but not edited. What would the most efficient storage method be, assuming you only need to write to the list once, but could be matching multiple times per second?
Note: No multi-threading
A HashSet (and HashMap) offers O(1) complexity. Also note that you should create a large enough HashSet with a small load factor, which means that after the hash code check the elements in the resulting bucket will also be found very quickly (within a bucket there is a sequential search). Ideally, each bucket should contain at most one element.
You can read more about the concept of capacity and load factor in the Javadoc of HashMap.
An even faster solution, if the number of items is no more than 64, is to create an enum for them and use an EnumSet or EnumMap, whose operations are simple and very fast: an EnumSet stores the elements as bits of a single long, so a contains operation is just a simple bitmask test.
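A minimal sketch of the enum approach, assuming the items can be modeled by a hypothetical Item enum:

import java.util.EnumSet;
import java.util.Set;

public class EnumSetMatching {

    // Hypothetical enum standing in for the fixed list of items to match against.
    enum Item { APPLE, BANANA, CHERRY, DATE }

    // Built once, then queried many times per second.
    private static final Set<Item> ALLOWED = EnumSet.of(Item.APPLE, Item.CHERRY);

    public static void main(String[] args) {
        System.out.println(ALLOWED.contains(Item.APPLE));   // true, a single bitmask test
        System.out.println(ALLOWED.contains(Item.BANANA));  // false
    }
}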
If you choose to go with the HashSet and not with the Enum approach, know that HashSet uses the hashCode() and equals() methods of the elements. You might consider overriding them to provide a faster implementation knowing the internals of the items you wish to store.
A trivial optimization when overriding hashCode() is, for example, to cache the once-computed hash code in the item itself if it never changes, so that subsequent calls to hashCode() just return the cached value.
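For example, a sketch of that caching idea for an immutable item class; the fields are made up, and the trick is only safe because the state never changes:

import java.util.Objects;

public final class CachedHashItem {
    private final String name;
    private final int category;

    // Computed once in the constructor; valid only because the fields are final.
    private final int cachedHash;

    public CachedHashItem(String name, int category) {
        this.name = name;
        this.category = category;
        this.cachedHash = Objects.hash(name, category);
    }

    @Override
    public int hashCode() {
        return cachedHash;  // no recomputation on repeated HashSet lookups
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CachedHashItem)) return false;
        CachedHashItem other = (CachedHashItem) o;
        return category == other.category && Objects.equals(name, other.name);
    }
}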
From your description it seems that order doesn't matter. If this is so, use a Set. Java's standard implementation is the HashSet.
Most efficient for repeated lookup would almost certainly be an EnumSet:
... Enum sets are represented internally as bit vectors. This representation is extremely compact and efficient. The space and time performance of this class should be good enough to allow its use as a high-quality, typesafe alternative to traditional int-based "bit flags." Even bulk operations (such as containsAll and retainAll) should run very quickly if their argument is also an enum set.
...
Implementation note: All basic operations execute in constant time. They are likely (though not guaranteed) to be much faster than their HashSet counterparts. Even bulk operations execute in constant time if their argument is also an enum set.

Why an EnumSet or an EnumMap is likely to be more performant than their hashed counterparts?

The following is from the implementation note section of the Javadoc of EnumMap:
Implementation note: All basic operations execute in constant time. They are likely (though not guaranteed) to be faster than their HashMap counterparts.
I have seen a similar line in the Javadoc for EnumSet as well. I want to know why it is likely that an EnumSet or an EnumMap will be faster than its hashed counterpart.
EnumSet is backed by a bit vector. Since the number of different items you can put in an EnumSet is known in advance, we can simply reserve one bit for each enum value. You can imagine a similar optimization for Set<Byte> or Set<Short>, but it's not feasible for Set<Integer> (you'd need 0.5 GiB of memory for 2^32 bits) or in general.
Thus basic operations like contains or add are constant time (just like in HashSet), but they only need to examine or set one bit; there is no hashCode() computation. This is why EnumSet is faster. More complex operations like union are also easily implemented using bit manipulation techniques.
In OpenJDK there are two implementations of EnumSet: RegularEnumSet, which handles enums with up to 64 values in a single long, and JumboEnumSet for bigger enums (using a long[]). But that's just an implementation detail.
EnumMap works on similar principles, but it uses an Object[] to store the values, while the key (index) is implicitly derived from Enum.ordinal().
EnumSet uses a plain array-like representation internally, in the enum's natural (ordinal) order, with the following consequences:
fast: basic operations like add, contains, get and put run in constant time, O(1), and hashCode() is not needed to find the right bucket
memory efficient: the size of the backing storage is fixed by the number of enum constants and never has to grow dynamically
EnumMap uses the enum constants as keys on the same principle as EnumSet, so hash code collisions are not possible

Should I use TreeMap or HashMap for wrapping named parameters?

In most cases, there will be only 0-5 parameters in the map. I guess TreeMap might have a smaller footprint, because it's less sparse than HashMap. But I'm not sure.
Or, maybe it's even better to write my own Map in such case?
The main difference is that TreeMap is a SortedMap, and HashMap is not. If you need your map to be sorted, use a TreeMap, if not then use a HashMap. The performance characteristics and memory usage can vary, but if you only have 0-5 entries then there will be no noticeable difference.
I would not recommend you write your own map unless you need functionality which is not available from the standard Maps, which it sounds like you don't.
I guess TreeMap might have a smaller footprint, because it's less sparse than HashMap.
That may actually be wrong, because empty HashMap slots are null and thus take up little space, and TreeMap entries have a higher overhead than HashMap entries because of the child pointers and color flag.
In any case, it's only a concern if you have hundreds of thousands of such maps.
I guess you don't need the entries in the map to be ordered, so a HashMap is fine for you.
And 5 entries are not a performance concern.
Writing your own Map would mean implementing a dozen methods; I don't think that is what you need.
If your roughly 5 keys are always the same (or drawn from a small, fixed set of keys), and you usually query them with string literals anyway, and only seldom have to actually parse the keys from user input or similar, then you might think about using an enum type as the key type of an EnumMap. This should be even more efficient than a HashMap. The difference will only matter if you have many of these maps, though.
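A small sketch of that suggestion, with a hypothetical Param enum standing in for the handful of known parameter names:

import java.util.EnumMap;
import java.util.Map;

public class NamedParams {

    // Hypothetical fixed set of parameter names.
    enum Param { HOST, PORT, TIMEOUT, RETRIES, DEBUG }

    public static void main(String[] args) {
        Map<Param, Object> params = new EnumMap<>(Param.class);
        params.put(Param.HOST, "localhost");
        params.put(Param.PORT, 8080);

        // Lookups by literal keys, no hashing involved.
        System.out.println(params.get(Param.HOST));     // localhost
        System.out.println(params.get(Param.TIMEOUT));  // null, not set
    }
}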
