An EnumMap uses the restriction that all keys of a map will be from the same enum to gain performance benefits:
Enum maps are represented internally as arrays. This representation is extremely compact and efficient.
In this case, the keys and values are stored in separate arrays and the values are ordinal-ordered. Iteration is done with the internal EnumMapIterator class.
An immutable map created by the various Map.of methods uses the restriction that the map will not change structurally to gain performance benefits. If the map is not of size 0 or 1, it uses the internal MapN class, which also stores its entries in an array. In this case, each value is stored one index after its key. Iteration is done with the internal MapNIterator.
For an immutable map with enum keys and size 2 or more, which satisfies the requirements of both implementations above, which map performs better: which gives better space efficiency, better time efficiency for containsKey, containsValue and get, and better iteration efficiency for entrySet, keySet and values?
You're raising 1 + 6 (or 2 * 6, depending on how it's counted) questions; that's a bit too much. If you want a definite answer, you have to concentrate on a single thing and profile it (nobody's going to do it for you unless you find a very interesting problem).
The space efficiency of an EnumMap simply must be better. There's no need to store the keys, as a shared array of the enum constants can be used. There's no need for a hash lookup array containing holes.
There may be exceptions like a small map based on a huge enum.
The most important operation is get. With an EnumMap, it involves no lookup, just a trivial class comparison and an array access. With Map.of(...), there's a probing loop, which for enums usually terminates after the first iteration, as Enum.hashCode, which is IMHO stupid (it's just the identity hash code), is usually well distributed.
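To make that concrete, here is a rough sketch of the two lookups. This is a simplified illustration, not the actual JDK source (the real EnumMap also masks nulls, and the real MapN handles the empty map):

// Simplified sketch of the two lookup strategies; not the JDK source.
class LookupSketch {
    // EnumMap-style get: no hashing, just an array access by ordinal.
    static Object enumStyleGet(Object[] vals, Enum<?> key) {
        return vals[key.ordinal()];
    }

    // MapN-style get: key at even index i, its value at i + 1;
    // hash once, then probe linearly in steps of two.
    static Object mapNStyleGet(Object[] table, Object key) {
        int idx = Math.floorMod(key.hashCode(), table.length >> 1) << 1;
        while (true) {
            Object k = table[idx];
            if (k == null) return null;               // empty slot: key absent
            if (key.equals(k)) return table[idx + 1]; // value sits next to its key
            if ((idx += 2) == table.length) idx = 0;  // wrap around, keep probing
        }
    }
}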
As containsKey is based on the same lookup, it's clear.
I doubt I've ever used containsValue, but it doesn't do anything smarter than a linear search. I'd expect a tiny win for EnumMap because of the holes (each needs only a trivial null test, but they can cause branch mispredictions).
The remaining three operations are not worth examining, as they return a view collection containing no data and simply pointing back to the map, i.e., a constant-time operation. For example, map.keySet().contains(x) simply delegates to map.containsKey(x).
The efficiency of the iteration itself would be a more interesting question, but that's the one thing you'd really have to benchmark.
Related
We know that EnumSet and EnumMap are faster than HashSet/HashMap due to the power of bit manipulation. But are we actually harnessing the true power of EnumSet/EnumMap when it really matters? If we have a set of millions of records and we want to find out whether some object is present in that set, can we take advantage of EnumSet's speed?
I checked around but haven't found anything discussing this. Everywhere I find the usual stuff, i.e. that because EnumSet and EnumMap use a predefined set of keys, lookups on small collections are very fast. I know enums are compile-time constants, but can we have the best of both worlds: an EnumSet-like data structure without needing enums as keys?
Interesting insight; the short answer is no, but your question is exploring some good data-structure design concepts which I'll try to discuss.
First, let's talk about HashMap (HashSet uses a HashMap internally, so they share most behavior); a hash-based data structure is powerful because it is fast and general. It's fast (i.e. approximately O(1)) because we can find the key we're looking for with a very small number of computations. Roughly, we have an array of lists of keys, convert the key to an integer index into that array, then look through the associated list for the key. As the mapping gets bigger, the backing array is repeatedly resized to hold more lists. Assuming the lists are evenly distributed, this lookup is very fast. And because this works for any object with proper .hashCode() and .equals() implementations, it's useful for just about any application.
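As a toy illustration of that idea (bucket by hash, then scan the bucket; the real HashMap is considerably more refined):

import java.util.ArrayList;
import java.util.List;

// Toy hash-based set: an array of buckets, each a short list of keys.
class ToyHashSet {
    private final List<List<String>> buckets = new ArrayList<List<String>>();

    ToyHashSet(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<String>());
        }
    }

    private List<String> bucketFor(String key) {
        // Mask off the sign bit so the index is never negative.
        return buckets.get((key.hashCode() & 0x7fffffff) % buckets.size());
    }

    void add(String key) {
        List<String> bucket = bucketFor(key);
        if (!bucket.contains(key)) bucket.add(key);
    }

    boolean contains(String key) {
        // Fast as long as the buckets stay short and evenly filled.
        return bucketFor(key).contains(key);
    }
}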
Enums have several interesting properties, but for the purpose of efficient lookup we only care about two of them - they're generally small, and they have a fixed number of values. Because of this, we can do better than HashMap; specifically, we can map every possible value to a unique integer, meaning we don't need to compute a hash, and we don't need to worry about hashes colliding. So EnumMap simply stores an array of the same size as the enum and looks up directly into it:
// From Java 7's EnumMap
public V get(Object key) {
    return (isValidKey(key) ?
            unmaskNull(vals[((Enum)key).ordinal()]) : null);
}
Stripping away some of the required Map sanity checks, it's simply:
return vals[key.ordinal()];
Note that this is conceptually no different than a standard HashMap, it's simply avoiding a few computations. EnumSet is a little more clever, using the bits in one or more longs to represent array indices, but functionally it's no different than the EnumMap case - we allocate enough slots to cover all possible enum values and can use their integer .ordinal() rather than compute a hash.
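Roughly, in the spirit of RegularEnumSet (which covers enums with at most 64 constants; this is a sketch, not the exact JDK code):

// One bit per enum constant: bit i is set iff the constant with ordinal i is present.
class BitVectorSketch<E extends Enum<E>> {
    private long elements;

    void add(E e) {
        elements |= 1L << e.ordinal();
    }

    boolean contains(E e) {
        return (elements & (1L << e.ordinal())) != 0;
    }
}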
So how much faster than HashMap is EnumMap? It's clearly faster, but in truth it's not that much faster. HashMap is already a very efficient data structure, so any optimization on it will only yield marginally better results. In particular, both HashMap and EnumMap are asymptotically the same speed (O(1)), meaning as they get bigger, they behave equally well. This is the primary reason why there isn't a more general data-structure like EnumMap - because it wouldn't be worth the effort relative to HashMap.
The second reason we don't want a more general "FiniteKeysMap" is that it would make our lives as users more complicated, which would be worthwhile if it were a marked speed increase, but since it's not, it would just be a hassle. We would have to define an interface (and probably also a factory pattern) for any type that could be a key in this map. The interface would need to guarantee that every unique instance returns a unique hash code in the range [0, n), and also provide a way for the map to get n and potentially all n elements. Those last two operations would be better as static methods, but since we can't define static methods in an interface (at least not before Java 8), they'd either have to be passed in directly to every map we create, or a separate factory object with this information would have to exist and be passed to the map/set at construction. Because enums are part of the language, they get all of those benefits for free, meaning end-user programmers pay no such cost to take advantage of them.
Furthermore, it would be very easy to make mistakes with this interface; say you have a type that has exactly 100,000 unique values. Should it implement our interface? It could. But you'd likely actually be shooting yourself in the foot. This would eat up a lot of unnecessary memory, since our FiniteKeysMap would allocate a new 100,000 length array to represent even an empty map. Generally speaking, that sort of wasted space is not worth the marginal improvement such a data structure would provide.
In short, while your idea is possible, it is not practical. HashMap is so efficient that attempting to create a separate data structure for a very limited number of cases would add far more complexity than value.
For the specific case of faster .contains() checks, you might like Bloom Filters. It's a set-like data structure that very efficiently stores very large sets, with the condition that it may sometimes incorrectly say an element is in the set when it is not (but not the other way around - if it says an element isn't in the set, it's definitely not). Guava provides a nice BloomFilter implementation.
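A minimal usage sketch with Guava (the record IDs here are made up for illustration):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterDemo {
    public static void main(String[] args) {
        // Sized for ~1,000,000 insertions with a ~1% false-positive rate.
        BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        seen.put("record-42");

        System.out.println(seen.mightContain("record-42")); // true
        System.out.println(seen.mightContain("record-43")); // false, barring a rare false positive
    }
}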
Assuming one needs to store a list of items, but it can be stored in any variable type, what would be the most efficient type if it's used mostly for matching?
To clarify, a list of items needs to be contained, but the form it's contained in doesn't matter (enum, List, HashMap, ArrayList, etc.).
This list of items would be matched against on a regular basis, but not edited. What would the most efficient storage method be, assuming you only need to write to the list once, but could be matching multiple times per second?
Note: No multi-threading
A HashSet (and HashMap) offers O(1) complexity. Also note that you should create a large enough HashSet with a small load factor, which means that after the hash code check the elements in the resulting bucket will also be found very quickly (within a bucket there is a sequential search). Optimally, each bucket should contain at most one element.
You can read more about the concept of capacity and load factor in the Javadoc of HashMap.
An even faster solution, if the number of items is no more than 64, is to create an enum for them and use EnumSet or EnumMap, which stores the elements in a single long and uses simple, very fast bit operations to test whether an element is in the set or map (a contains operation is just a simple bitmask test).
If you choose to go with the HashSet and not with the Enum approach, know that HashSet uses the hashCode() and equals() methods of the elements. You might consider overriding them to provide a faster implementation knowing the internals of the items you wish to store.
A trivial optimization when overriding hashCode() is, for example, to cache the once-computed hash code in the item itself if it doesn't change (subsequent calls to hashCode() then just return the cached value).
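For example (a hypothetical immutable item, in the same spirit as java.lang.String's own hash caching):

final class Item {
    private final String name;
    private int hash; // cached hash code; 0 means "not computed yet"

    Item(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Item && name.equals(((Item) o).name);
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {            // compute once, then reuse
            h = name.hashCode();
            hash = h;
        }
        return h;
    }
}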
From your description it seems that order doesn't matter. If this is so, use a Set. Java's standard implementation is the HashSet.
Most efficient for repeated lookups would almost certainly be an EnumSet:
... Enum sets are represented internally as bit vectors. This representation is extremely compact and efficient. The space and time performance of this class should be good enough to allow its use as a high-quality, typesafe alternative to traditional int-based "bit flags." Even bulk operations (such as containsAll and retainAll) should run very quickly if their argument is also an enum set.
...
Implementation note: All basic operations execute in constant time. They are likely (though not guaranteed) to be much faster than their HashSet counterparts. Even bulk operations execute in constant time if their argument is also an enum set.
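A minimal usage sketch (Permission is a made-up enum; the pattern matches the question's write-once, match-often scenario):

import java.util.EnumSet;
import java.util.Set;

public class EnumSetDemo {
    enum Permission { READ, WRITE, EXECUTE, DELETE }

    public static void main(String[] args) {
        // Written once...
        Set<Permission> allowed = EnumSet.of(Permission.READ, Permission.WRITE);

        // ...matched many times per second; contains is essentially one bitmask test.
        System.out.println(allowed.contains(Permission.READ));   // true
        System.out.println(allowed.contains(Permission.DELETE)); // false
    }
}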
I need to create a Map with enum keys where only a small fraction of the enum constants will be actually inserted. What is the best approach?
An EnumMap seems inefficient, since the length of its underlying array is always equal to the total number of enum constants, no matter how few entries the map actually holds.
I suggest using an ordinary HashMap.
Computing hashes for enums is both easy and cheap. There should be no significant memory overhead, since you are not duplicating the enum objects, but instead creating multiple references to the same object. For this reason, there should be little difference between storing an integer key and storing a reference to an enum object.
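A minimal sketch (Direction is a made-up enum standing in for your much larger one):

import java.util.HashMap;
import java.util.Map;

public class SparseEnumKeysDemo {
    // Stand-in for a much larger enum of which only a few constants are used.
    enum Direction { NORTH, EAST, SOUTH, WEST, NORTH_EAST, SOUTH_EAST, SOUTH_WEST, NORTH_WEST }

    public static void main(String[] args) {
        // A HashMap only allocates entries for the keys actually inserted.
        Map<Direction, String> labels = new HashMap<Direction, String>();
        labels.put(Direction.NORTH, "up");
        labels.put(Direction.SOUTH, "down");
        System.out.println(labels.get(Direction.NORTH)); // up
    }
}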
I'd also go with either a HashMap or a TreeMap (depending on whether you need a deterministic Iterator or not). Since your data is sparse, any real or imagined overhead is unlikely to be a significant performance hurdle.
In one of my Java 6 projects I have an array of LinkedHashMap instances as input to a method which has to iterate through all keys (i.e. through the union of the key sets of all maps) and work with the associated values. Not all keys exist in all maps and the method should not go through each key more than once or alter the input maps.
My current implementation looks like this:
Set<Object> keyset = new HashSet<Object>();
for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        if (keyset.add(key)) {
            ...
        }
    }
}
The HashSet instance ensures that no key will be acted upon more than once.
Unfortunately this part of the code is rather critical performance-wise, as it is called very frequently. In fact, according to the profiler over 10% of the CPU time is spent in the HashSet.add() method.
I am trying to optimise this code as much as possible. The use of LinkedHashMap with its more efficient iterators (in comparison to the plain HashMap) was a significant boost, but I was hoping to reduce what is essentially book-keeping time to a minimum.
Putting all the keys in the HashSet beforehand using addAll() proved to be less efficient, due to the cost of calling HashSet.contains() afterwards.
At the moment I am looking at whether I can use a bitmap (well, a boolean[] to be exact) to avoid the HashSet completely, but it may not be possible at all, depending on my key range.
Is there a more efficient way to do this? Preferably something that will not pose restrictions on the keys?
EDIT:
A few clarifications and comments:
I do need all the values from the maps - I cannot drop any of them.
I also need to know which map each value came from. The missing part (...) in my code would be something like this:
for (Map<Object, Object> m : input) {
    Object v = m.get(key);
    // Do something with v
}
A simple example to get an idea of what I need to do with the maps would be to print all maps in parallel like this:
Key  Map0  Map1  Map2
F    1     null  2
B    2     3     null
C    null  null  5
...
That's not what I am actually doing, but you should get the idea.
The input maps are extremely variable. In fact, each call of this method uses a different set of them. Therefore I would not gain anything by caching the union of their keys.
My keys are all String instances. They are sort-of-interned on the heap using a separate HashMap, since they are pretty repetitive, therefore their hash code is already cached and most hash validations (when the HashMap implementation is checking whether two keys are actually equal, after their hash codes match) boil down to an identity comparison (==). The profiler confirms that only 0.5% of the CPU time is spent on String.equals() and String.hashCode().
EDIT 2:
Based on the suggestions in the answers, I made a few tests, profiling and benchmarking along the way. I ended up with roughly a 7% increase in performance. What I did:
I set the initial capacity of the HashSet to double the collective size of all input maps. This gained me something in the region of 1-2%, by eliminating most (all?) resize() calls in the HashSet.
I used Map.entrySet() for the map I am currently iterating. I had originally avoided this approach due to the additional code and the fear that the extra checks and Map.Entry getter method calls would outweigh any advantages. It turned out that the overall code was slightly faster.
I am sure that some people will start screaming at me, but here it is: Raw types. More specifically I used the raw form of HashSet in the code above. Since I was already using Object as its content type, I do not lose any type safety. The cost of that useless checkcast operation when calling HashSet.add() was apparently important enough to produce a 4% increase in performance when removed. Why the JVM insists on checking casts to Object is beyond me...
I can't provide a replacement for your approach, but here are a few suggestions to (slightly) optimize the existing code:
Consider initializing the hash set with an appropriate capacity (based on the sum of the sizes of all maps). This avoids or reduces resizing of the set during add operations; see the sketch after these suggestions.
Consider iterating over entrySet() instead of keySet() wherever you also need the values; this saves a separate get() lookup per key and should be faster.
Have a look at the implementations of equals() and hashCode() - if they are "expensive", then you have a negative impact on the add method.
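A sketch of the first suggestion, assuming input is the array of maps from the question:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PresizedKeySet {
    static Set<Object> create(Map<Object, Object>[] input) {
        int expectedKeys = 0;
        for (Map<Object, Object> map : input) {
            expectedKeys += map.size();
        }
        // With the default load factor of 0.75, doubling the expected
        // number of keys comfortably avoids any rehash during the adds.
        return new HashSet<Object>(2 * expectedKeys);
    }
}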
How you avoid using a HashSet depends on what you are doing.
I would only calculate the union once each time the input is changed. This should be relatively rare compared with the number of lookups.
// on an update.
Map<Key, Value> union = new LinkedHashMap<Key, Value>();
for (Map<Key, Value> map : input)
    union.putAll(map);

// on a lookup.
Value value = union.get(key);

// process each key once
for (Map.Entry<Key, Value> entry : union.entrySet()) {
    // do something.
}
Option A is to use the .values() method and iterate through it. But I suppose you already thought of that.
If the code is called so often, then it might be worth creating additional structures (depending on how often the data is changed). Create a new HashMap: every key in any of your hash maps becomes a key in this one, and its value is the list of the HashMaps in which that key appears (see the sketch below).
This will help if the data is somewhat static (relative to the frequency of queries), so the overhead of managing the structure is comparatively small, and if the key space is not very dense (keys do not repeat themselves a lot across different HashMaps), as it will save a lot of unneeded contains() calls.
Of course, if you are mixing data structures it is better if you encapsulate all in your own data structure.
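A sketch of that index, kept Java 6 compatible to match the question (names are illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KeyIndex {
    // Maps each key to the list of input maps that contain it.
    static Map<Object, List<Map<Object, Object>>> build(Map<Object, Object>[] input) {
        Map<Object, List<Map<Object, Object>>> index =
                new HashMap<Object, List<Map<Object, Object>>>();
        for (Map<Object, Object> map : input) {
            for (Object key : map.keySet()) {
                List<Map<Object, Object>> maps = index.get(key);
                if (maps == null) {
                    maps = new ArrayList<Map<Object, Object>>();
                    index.put(key, maps);
                }
                maps.add(map);
            }
        }
        return index;
    }
}

Each key then appears exactly once in the index, and is only matched against the maps that actually hold it.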
You could take a look at Guava's Sets.union() http://guava-libraries.googlecode.com/svn/tags/release04/javadoc/com/google/common/collect/Sets.html#union(java.util.Set,%20java.util.Set)
In most cases, there will be only 0-5 parameters in the map. I guess TreeMap might have a smaller footprint, because it's less sparse than HashMap. But I'm not sure.
Or, maybe it's even better to write my own Map in such case?
The main difference is that TreeMap is a SortedMap, and HashMap is not. If you need your map to be sorted, use a TreeMap, if not then use a HashMap. The performance characteristics and memory usage can vary, but if you only have 0-5 entries then there will be no noticeable difference.
I would not recommend you write your own map unless you need functionality which is not available from the standard Maps, which it sounds like you don't.
I guess TreeMap might have a smaller footprint, because it's less sparse than HashMap.
That may actually be wrong, because empty HashMap slots are null and thus take up little space, and TreeMap entries have a higher overhead than HashMap entries because of the child pointers and color flag.
In any case, it's only a concern if you have hundreds of thousands of such maps.
I guess you don't need the entries in the Map to be ordered, so a HashMap is fine for you.
And 5 entries are not a performance concern.
Writing your own Map would mean implementing a dozen methods; I don't think that is what you need.
If your roughly 5 keys are always the same (or drawn from a small set of keys), and you usually refer to them via literals anyway, and only seldom have to actually parse the keys from user input or similar, then you may consider using an enum type as the key type of an EnumMap. This should be even more efficient than a HashMap. The difference will only matter if you have many of these maps, though.
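A minimal sketch (Param is a hypothetical enum for the fixed parameter keys):

import java.util.EnumMap;
import java.util.Map;

public class EnumMapDemo {
    enum Param { HOST, PORT, USER, TIMEOUT }

    public static void main(String[] args) {
        Map<Param, String> params = new EnumMap<Param, String>(Param.class);
        params.put(Param.HOST, "localhost");
        params.put(Param.PORT, "8080");
        System.out.println(params.get(Param.HOST)); // a plain array access by ordinal
    }
}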