We use a HashMap<Integer, SomeType>() with more than a million entries. I consider that large.
But integers are their own hash code. Couldn't we save memory with a, say, IntegerHashMap<Integer, SomeType>() that uses a special Map.Entry, using int directly instead of a pointer to an Integer object? In our case, that would save 1000000x the memory required for an Integer object.
Any faults in my line of thought? Too special to be of general interest? (at least, there is an EnumHashMap)
add1. The first generic parameter of IntegerHashMap is used to make it closely similar to the other Map implementations. It could be dropped, of course.
add2. The same should be possible with other maps and collections. For example ToIntegerHashMap<KeyType, Integer>, IntegerHashSet<Integer>, etc.
What you're looking for is a "Primitive collections" library. They are usually much better with memory usage and performance. One of the oldest/popular libraries was called "Trove". However, it is a bit outdated now. The main active libraries in use now are:
Goldman Sach Collections
Fast Util
Koloboke
See Benchmarks Here
Some words of caution:
"integers are their own hash code" I'd be very careful with this statement. Depending on the integers you have, the distribution of keys may be anything from optimal to terrible. Ideally, I'd design the map so that you can pass in a custom IntFunction as hashing strategy. You can still default this to (i) -> i if you want, but you probably want to introduce a modulo factor, or your internal array will be enormous. You may even want to use an IntBinaryOperator, where one param is the int and the other is the number of buckets.
I would drop the first generic param. You probably don't want to implement Map<Integer, SomeType>, because then you will have to box / unbox in all your methods, and you will lose all your optimizations (except space). Trying to make a primitive collection compatible with an object collection will make the whole exercise pointless.
Related
So I am working on some data-structure, whose operations tend to generate lots of different instances of a certain type. The datasets can be large enough to potentially yield millions of such objects. It is crucial that I "memoize" them, since there are recurring patterns in these calculations.
Typically the way memoization is done is that you simply have a set (say, a HashMap with its keys as its values) of all instances ever created. Any time an operation would return a result, instead the set is searched for an existing, identical object. If the object is found, it is returned instead, and the result we looked up with instantly becomes garbage.
Now, this is straight-forward enough. However, HashMap is not a perfect fit for this use case. There are two points I will address here:
Lack of support for custom equality and hashing function
Lack of support for "equivalent key" lookup.
The first has already been discussed (question 5453226), though not sufficiently for my purposes; solutions in the form of "just wrap your key types with another object" are a non-starter. If the key is a relatively small object (e.g. an array or a string of small size) the overhead cost of these wrappers can be nearly 2X.
To illustrate the second point, let's say the data type I'd like to memoize is a (typically small) int[]. Suppose I have made an operation and it has yielded a result, but it is not exactly in an int[], rather, it consists of a int[] x and separately a length int length. What I would like to do now is to look up the set for an array that equals (as a sequence of ints) to the data in x[0..length-1]. In principle, there should be no problem to accomplish this, provided that the user supplies equality and hashing predicates that match the ones used for the key type itself. In principle, the "compatible key" (x and length here) doesn't even have to be materialized as an object.
The lack of "compatible key" lookups may cause to program to create temporary key objects that are likely to be thrown away afterwards.
Problem 1 prevents me from using something like int[] as a key type in a Map, since it doesn't have the hashing and equality functions that I want. Further, I actually want to use shallow / reference-only comparisons and hashing on these objects once I'm past the memoization step, since then x is an identical object to y iff x == y. Here we have the same key type, but different desired equality/hashing behavior.
In C++, the functionality in (1) is offered out of the box by most hash-table based containers. (2) is given by some container types, for example the associative container templates in the Boost library.
A third function I'd be interested in (less important) is the insert_check/insert_commit idea, where we first check if a matching key exist, and we also get some implementation-defined marker back (e.g. bucket index). Then if we do create a new key and want to insert it, we give back the marker and it's inserted to the right place in the data structure. There is no reason why the hash of the same key should be computed twice.
If the compiler (or JIT) is clever enough to avoid double lookup in this scenario, that is a preferable solution - it's easier not to have to worry about it. I just never actually tested if it is.
Are there any libraries known to offer any of these capabilities in Java? Do I have to roll my own? Am I missing something?
We know that EnumSet and EnumMap are faster than HashSet/HashMap due to the power of bit manipulation. But are we actually harnessing the true power of EnumSet/EnumMap when it really matters? If we have a set of millions of record and we want to find out if some object is present in that set or not, can we take advantage of EnumSet's speed?
I checked around but haven't found anything discussing this. Everywhere the usual stuff is found i.e. because EnumSet and EnumMap uses a predefined set of keys lookups on small collections are very fast. I know enums are compile-time constants but can we have best of both worlds - an EnumSet-like data structure without needing enums as keys?
Interesting insight; the short answer is no, but your question is exploring some good data-structure design concepts which I'll try to discuss.
First, lets talk about HashMap (HashSet uses a HashMap internally, so they share most behavior); a hash-based data structure is powerful because it is fast and general. It's fast (i.e. approximately O(1)) because we can find the key we're looking for with a very small number of computations. Roughly, we have an array of lists of keys, convert the key to an integer index into that array, then look through the associated list for the key. As the mapping gets bigger, the backing array is repeatedly resized to hold more lists. Assuming the lists are evenly distributed, this lookup is very fast. And because this works for any generic object (that has a proper .hashcode() and .equals()) it's useful for just about any application.
Enums have several interesting properties, but for the purpose of efficient lookup we only care about two of them - they're generally small, and they have a fixed number of values. Because of this, we can do better than HashMap; specifically, we can map every possible value to a unique integer, meaning we don't need to compute a hash, and we don't need to worry about hashes colliding. So EnumMap simply stores an array of the same size as the enum and looks up directly into it:
// From Java 7's EnumMap
public V get(Object key) {
return (isValidKey(key) ?
unmaskNull(vals[((Enum)key).ordinal()]) : null);
}
Stripping away some of the required Map sanity checks, it's simply:
return vals[key.ordinal()];
Note that this is conceptually no different than a standard HashMap, it's simply avoiding a few computations. EnumSet is a little more clever, using the bits in one or more longs to represent array indices, but functionally it's no different than the EnumMap case - we allocate enough slots to cover all possible enum values and can use their integer .ordinal() rather than compute a hash.
So how much faster than HashMap is EnumMap? It's clearly faster, but in truth it's not that much faster. HashMap is already a very efficient data structure, so any optimization on it will only yield marginally better results. In particular, both HashMap and EnumMap are asymptotically the same speed (O(1)), meaning as they get bigger, they behave equally well. This is the primary reason why there isn't a more general data-structure like EnumMap - because it wouldn't be worth the effort relative to HashMap.
The second reason we don't want a more general "FiniteKeysMap" is it would make our lives as users more complicated, which would be worthwhile if it were a marked speed increase, but since it's not would just be a hassle. We would have to define an interface (and probably also a factory pattern) for any type that could be a key in this map. The interface would need to provide a guarantee that every unique instance returns a unique hashcode in the range [0-n), and also provide a way for the map to get n and potentially all n elements. Those last two operations would be better as static methods, but since we can't define static methods in an interface, they'd have to either be passed in directly to every map we create, or a separate factory object with this information would have to exist and be passed to the map/set at construction. Because enums are part of the language, they get all of those benefits for free, meaning there's no such cost for end-user programmers to need to take advantage of.
Furthermore, it would be very easy to make mistakes with this interface; say you have a type that has exactly 100,000 unique values. Should it implement our interface? It could. But you'd likely actually be shooting yourself in the foot. This would eat up a lot of unnecessary memory, since our FiniteKeysMap would allocate a new 100,000 length array to represent even an empty map. Generally speaking, that sort of wasted space is not worth the marginal improvement such a data structure would provide.
In short, while your idea is possible, it is not practical. HashMap is so efficient that attempting to create a separate data structure for a very limited number of cases would add far more complexity than value.
For the specific case of faster .contains() checks, you might like Bloom Filters. It's a set-like data structure that very efficiently stores very large sets, with the condition that it may sometimes incorrectly say an element is in the set when it is not (but not the other way around - if it says an element isn't in the set, it's definitely not). Guava provides a nice BloomFilter implementation.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
When to use HashMap over LinkedList or ArrayList and vice-versa
Since coming across Maps in Java i have been using them extensively. In particular HashMap is an excellent option for many scenarios. It seems that it trumps an ArrayList in every category - some say that iteration isn't predictable but for that we have the LinkedHashMap.
So, my question is: why not use a HashMap all the time provided we have a solid immutable key?
Additionally, is it appropriate to use a something like a HashMap for a very small (<10) number of items or is there some additional overhead I am not considering?
Use an ArrayList when your keys are sequential integers. (If they aren't based at 0, just use an offset.) It's much more efficient for access (particularly random access) and update. Otherwise, sure, HashMap (or, as you say, LinkedHashMap) are extremely useful data structures.
I believe that the default initial size for a HashMap is 16 buckets, so there's a bit of overhead for very small lists. But unless you're creating a lot of maps, it shouldn't be a factor in your coding.
HashMap has a significant amount of overhead compared to an array (or ArrayList):
You need to hash the key to get an index into the backing array, then store the value and the key. This is slower and uses more memory than an array. This is more important when your key is something large or complicated, since it will take longer to hash or take more space.
You also need to hash the key every time you look up a value.
When you resize an ArrayList, you just create a new array and copy everything. When you resize a HashMap, you create a new array, then calculate the hashes all over again (so they'll be spread out through the new array).
HashMap's perform badly when they're full, so they generally leave something like 25% of their space empty.
These are all pretty minor, so you could get away with just using HashMaps all the time (in fact, this is what PHP seems to do), but it's wasteful to use a HashMap when you don't really need it.
A comparison might be helpful: Anything you can do with an integer, you can also do with a string, so why do we have integers? Because they're smaller and faster to work with (and provide some nice guarantees, like that they always contain a number).
I use Maps all the time - it is one of the most powerful and versatile data structures. I mostly use LinkedHashMap but, when working with Strings as key I use TreeMap because of the additional benefit of having the keys sorted.
However:
if your key is an int and you plan to use all the keys 0..n,use
an array (remember - int is more efficient than Integer). But a map is better if you have "sparse values"
if you need a list of unindexed items, use a linkedlist
if you need to store unique elements, use a set (why waste space to keep the values if you just need the keys)!
Remember - Java gives you very powerful collections (Set, Map,List) and, for each one, multiple implementations with different features - they are there for a reason.
Every data structure has its use, even if many can be implemented using a map as a backend, the most appropriate data structure is... more appropriate (and, usually, more efficient, with less overhead and providing more functionalities)
The size does not matter - 5 or 500 elements, if it looks like a map, use a map (there maybe few exceptions and corner cases where you need maximum efficiency and hard coded values are better). But if it looks like a set - use a set!
I have a class representing a set of values that will be used as a key in maps.
This class is immutable and I want to make it a singleton for every distinct set of values, using the static factory pattern.
The goal is to prevent identical objects from being created many (100+) times and to optimize the equals method.
I am looking for the best way to cache and reuse previous instances of this class.
The first thing that pops to mind is a simple hashmap, but are there alternatives?
There are two situations:
If the number of distinct objects is small and fixed, you should use an enum
They're not instantiatiable beyond the declared constants, and EnumMap is optimized for it
Otherwise, you can cache immutable instances as you planned:
If the values are indexable by numbers in a contiguous range, then an array can be used
This is how e.g. Integer cache instances in a given range for valueOf
Otherwise you can use some sort of Map
Depending on the usage pattern, you may choose to only cache, say, the last N instances, instead of all instances created so far. This is the approach used in e.g. re.compile in Python's regular expression module. If N is small enough (e.g. 5), then a simple array with a linear search may also work just fine.
For Map based solution, perhaps a useful implementation is java.util.LinkedHashMap, which allows you to enforce LRU-like policies if you #Override the removeEldestEntry.
There is also LRUMap from Apache Commons Collections that implement this policy more directly.
See also
Java Tutorials/enums
Effective Java 2nd Edition, Item 1: Consider static factory methods instead of constructors
Wikipedia/Flyweight pattern
Related questions
Easy, simple to use LRU cache in java
LRU LinkedHashMap that limits size based on available memory
Guava MapMaker, soft keys and values, etc
How would you implement an LRU cache in Java 6?
What you're trying to make sounds like an example of the Flyweight Pattern, so looking for references to that might help clarify your thinking.
Storing them in a map of some sort is indeed a common implementation.
What do your objects look like? If your objects are fairly simple I think you should consider not caching them - object creation is usually quite fast. I think you should evaluate whether the possibly small performance boost is worth the added complexity and effort of a cache.
Assume you need to store/retrieve items in a Collection, don't care about ordering, and duplicates are allowed, what type of Collection do you use?
By default, I've always used ArrayList, but I remember reading/hearing somewhere that a Queue implementation may be a better choice. A List allows items to be added/retrieved/removed at arbitrary positions, which incurs a performance penalty. As a Queue does not provide this facility it should in theory be faster when this facility is not required.
I realise that all discussions about performance are somewhat meaningless, the only thing that really matters is measurement. Nevertheless, I'm interested to know what others use for a Collection, when they don't care about ordering, and duplicates are allowed, and why?
"It depends". The question you really need to answer first is "What do I want to use the collection for?"
If you often insert / remove items on one of the ends (beginning, end) a Queue will be better than a ArrayList. However in many cases you create a Collection in order to just read from it. In this case a ArrayList is far more efficient: As it is implemented as an array, you can iterate over it quite efficient (same applies for a LinkedList). However a LinkedList uses references to link single items together. So if you do not need random removals of items (in the middle), a ArrayList is better: An ArrayList will use less memory as the items don't need the storage for the reference to the next/prev item.
To sum it up:
ArrayList = good if you insert once and read often (random access or sequential)
LinkedList = good if you insert/remove often at random positions and read only sequential
ArrayDeque (java6 only) = good if you insert/remove at start/end and read random or sequential
As a default, I tend to prefer LinkedList to ArrayList. Obviously, I use them not through the List interface, but rather through the Collection interface.
Over the time, I've indeed found out that when I need a generic collection, it's more or less to put some things in, then iterate over it. If I need more evolved behaviour (say random access, sorting or unicity checks), I will then maybe change the used implementation, but before that I will change the used interface to the most appropriated. This way, I can ensure feature is provided before to concentrate on optimization and implementation.
ArrayList basicly contains an array inside (that's why it is called ArrayList). And operations like addd/remove at arbitrary positions are done in a straightforward way, so if you don't use them - there is no harm to performance.
If ordering and duplicates is not a problem and case is only for storing,
I use ArrayList, As it implements all the list operations. Never felt any performance issues with these operations (Never impacted my projects either). Actually using these operations have simple usage & I don't need to care how its managed internally.
Only if multiple threads will be accessing this list I use Vector because its methods are synchronized.
Also ArrayList and Vector are collections which you learn first :).
It depends on what you know about it.
If I have no clue, I tend to go for a linked list, since the penalty for adding/removing at the end is constant. If I have a rough idea of the maximum size of it, I go for an arraylist with the capacity specified, because it is faster if the estimation is good. If I really know the exact size I tend to go for a normal array; although that isn't really a collection type.
I realise that all discussions about performance are somewhat meaningless, the only thing that really matters is measurement.
That's not necessarily true.
If your knowledge of how the application is going to work tells you that certain collections are going to be very large, then it is a good idea to pick the right collection type. But the right collection type depends crucially on how the collections are going to be used; i.e. on the algorithms.
For example, if your application is likely to be dominated by testing if a collection holds a given object, the fact that Collection.contains(Object) is O(N) for both LinkedList<T> and ArrayList<T> might mean that neither is an appropriate collection type. Instead, maybe you should represent the collection as a HashMap<T, Integer>, where the Integer represents the number of occurrences of a T in the "collection". That will give you O(1) testing and removal, at the cost of more space overheads and slower (though still O(1)) insertion.
But the thing to stress is that if you are likely to be dealing with really large collections, there should be no such thing as a "default" collection type. You need to think about the collection in the context of the algorithms. (And the flip side is that if the collections are always going to be small, it probably makes little difference which collection type you pick.)