Efficient deduplication (memoization) of objects in Java, shortcomings of HashSet

So I am working on a data structure whose operations tend to generate lots of different instances of a certain type. The datasets can be large enough to potentially yield millions of such objects. It is crucial that I "memoize" them, since there are recurring patterns in these calculations.
Typically the way memoization is done is that you simply have a set (say, a HashMap with its keys as its values) of all instances ever created. Any time an operation would return a result, instead the set is searched for an existing, identical object. If the object is found, it is returned instead, and the result we looked up with instantly becomes garbage.
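As a concrete baseline, here is a minimal sketch of that pattern over a plain HashMap (class and method names are mine, not from any library):

import java.util.HashMap;
import java.util.Map;

/** Canonicalizes ("memoizes") instances so that equal values share one object. */
final class Interner<T> {
    private final Map<T, T> pool = new HashMap<>();

    /** Returns the canonical instance equal to candidate, registering it if it is new. */
    T intern(T candidate) {
        T existing = pool.get(candidate);   // first probe: hash + equals
        if (existing != null) {
            return existing;                // candidate instantly becomes garbage
        }
        pool.put(candidate, candidate);     // second probe: the hash is computed again
        return candidate;
    }
}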
Now, this is straightforward enough. However, HashMap is not a perfect fit for this use case. There are two points I will address here:
1. Lack of support for custom equality and hashing functions.
2. Lack of support for "equivalent key" lookup.
The first has already been discussed (question 5453226), though not sufficiently for my purposes; solutions in the form of "just wrap your key types with another object" are a non-starter. If the key is a relatively small object (e.g. an array or a string of small size) the overhead cost of these wrappers can be nearly 2X.
To illustrate the second point, let's say the data type I'd like to memoize is a (typically small) int[]. Suppose I have performed an operation and it has yielded a result, but not exactly as an int[]; rather, it consists of an int[] x and, separately, a length int length. What I would like to do now is to search the set for an array that is equal (as a sequence of ints) to the data in x[0..length-1]. In principle, there should be no problem accomplishing this, provided that the user supplies equality and hashing predicates that match the ones used for the key type itself. In principle, the "compatible key" (x and length here) doesn't even have to be materialized as an object.
The lack of "compatible key" lookups may cause the program to create temporary key objects that are likely to be thrown away immediately afterwards.
Problem 1 prevents me from using something like int[] as a key type in a Map, since it doesn't have the hashing and equality functions that I want. Further, I actually want to use shallow / reference-only comparisons and hashing on these objects once I'm past the memoization step, since then x is an identical object to y iff x == y. Here we have the same key type, but different desired equality/hashing behavior.
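To make the wish-list concrete, here is a rough sketch (written for this question, not an existing library) of an open-addressing intern pool for int[] with the Arrays-style hashing/equality I want, plus a compatible-key lookup on (x, length) that only materializes an array on a miss:

import java.util.Arrays;

/** Sketch of an intern pool for small int[] values with a "compatible key" lookup. */
final class IntArrayPool {
    private int[][] table = new int[1024][];   // power-of-two capacity; null = empty slot
    private int size;

    /** Interns the first length ints of x, allocating a new array only on a miss. */
    int[] intern(int[] x, int length) {
        if (size * 4 >= table.length * 3) {    // keep load factor below 0.75
            grow();
        }
        int mask = table.length - 1;
        int i = hash(x, length) & mask;
        while (true) {
            int[] candidate = table[i];
            if (candidate == null) {
                int[] copy = Arrays.copyOf(x, length);   // materialize the key now
                table[i] = copy;
                size++;
                return copy;
            }
            if (equalsRange(candidate, x, length)) {
                return candidate;                        // existing canonical instance
            }
            i = (i + 1) & mask;                          // linear probing
        }
    }

    private static int hash(int[] x, int length) {
        int h = 1;
        for (int k = 0; k < length; k++) {
            h = 31 * h + x[k];               // mirrors Arrays.hashCode over the prefix
        }
        return h;
    }

    private static boolean equalsRange(int[] stored, int[] x, int length) {
        if (stored.length != length) {
            return false;
        }
        for (int k = 0; k < length; k++) {
            if (stored[k] != x[k]) {
                return false;
            }
        }
        return true;
    }

    private void grow() {
        int[][] old = table;
        table = new int[old.length * 2][];
        int mask = table.length - 1;
        for (int[] a : old) {
            if (a == null) {
                continue;
            }
            int i = hash(a, a.length) & mask;
            while (table[i] != null) {
                i = (i + 1) & mask;
            }
            table[i] = a;                    // keep the same canonical instances
        }
    }
}

Once everything is interned this way, the arrays can be compared with == and hashed by identity afterwards (e.g. in an IdentityHashMap), which is exactly the shallow behavior I want past the memoization step.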
In C++, the functionality in (1) is offered out of the box by most hash-table based containers. (2) is given by some container types, for example the associative container templates in the Boost library.
A third function I'd be interested in (less important) is the insert_check/insert_commit idea, where we first check whether a matching key exists, and we also get some implementation-defined marker back (e.g. a bucket index). Then, if we do create a new key and want to insert it, we hand back the marker and it is inserted into the right place in the data structure. There is no reason why the hash of the same key should be computed twice.
If the compiler (or JIT) is clever enough to avoid double lookup in this scenario, that is a preferable solution - it's easier not to have to worry about it. I just never actually tested if it is.
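For what it's worth, the JDK's HashMap.computeIfAbsent hashes and probes only once per call, so in the narrow case where the candidate object is its own lookup key the double work can be avoided without relying on the JIT. A minimal sketch (it does nothing for the compatible-key lookup above):

import java.util.HashMap;
import java.util.Map;

final class SingleProbeInterner<T> {
    private final Map<T, T> pool = new HashMap<>();

    T intern(T candidate) {
        // One hash computation and one probe; candidate is inserted only if no equal key exists.
        return pool.computeIfAbsent(candidate, k -> k);
    }
}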
Are there any libraries known to offer any of these capabilities in Java? Do I have to roll my own? Am I missing something?

Related

Are the hashcodes returned by the System.identityHashCode method uniquely assigned to each object?
Since a hashcode is an int, and an int therefore has 4,294,967,296 possible values, does the JVM guarantee at least one unique hashcode for every object within such a quantity of objects?
The pigeonhole principle applies.
If you have 4 pigeon holes, and 5 pigeons which must all find a pigeon hole to roost in, then at least one pigeon hole is going to have more than one pigeon in it.
Obvious, right?
The same applies here. There are only 2^32 different pigeonholes (hash codes), because the value is an int, and an int in Java is defined as a 32-bit number, so only 2^32 different possible values exist. That is a big, big number: about 4 billion.
However, there is nothing in the Java spec that decrees that no more than 4 billion pigeons (objects) can ever exist. If more than 4 billion objects ever exist, no algorithm one could possibly design could promise uniqueness, because of this principle. QED.
NB: You can also use the pigeonhole principle to prove that a universal compressor (a tool that can compress anything, guaranteeing that the compressed result is always smaller or equal) cannot exist: if it actually compresses anything, there must as a consequence be some stream of bits for which it produces a larger output. You can use it to prove that (int) (Math.random() * 10) is not quite uniformly random, and why you should therefore use random.nextInt(10) instead (which is). It's a surprisingly useful principle for proving things in computer science!
Now, one could imagine an int-based coding system which promises unique codes until you hit 4 billion unique objects, but making such a promise is incredibly complicated, and itself a memory hog, if it has to work for any and all objects.
Java makes no such promise, and System.identityHashCode is consequently guaranteed neither to return unique numbers (it is completely impossible to make that promise) nor to have a 'perfect' distribution (hash codes as spread out as they could be, i.e. no reuse until 4 billion objects exist simultaneously). Note that 'exist' is already complicated: when does an object truly 'disappear'?
In practice, identityHashCode is based on memory positions, so its distribution is very, very good. But there is a difference between "it is highly unlikely any 2 objects would ever have the same identity hash code" and "we guarantee no two objects share an identity hash code".
No, this is not unique for all objects. It all depends on your JVM implementation. You can only be sure that for the same object this method returns the same value.
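A small demonstration of what is and isn't guaranteed (the identity hash codes printed will vary between runs, and a collision between distinct objects is possible, just unlikely):

public class IdentityHashDemo {
    public static void main(String[] args) {
        String a = new String("key");
        String b = new String("key");

        System.out.println(a.equals(b));                  // true: same content
        System.out.println(a.hashCode() == b.hashCode()); // true: equal objects have equal hashCode
        // Distinct objects: identity hash codes are usually different, but uniqueness is NOT guaranteed.
        System.out.println(System.identityHashCode(a) == System.identityHashCode(b));
        // Stable for the same object within one run.
        System.out.println(System.identityHashCode(a) == System.identityHashCode(a)); // true
    }
}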
From what I've read, rather than reserve space for every object's identity hash code, at least some versions of the JVM reserve a couple of bits in the object's header that indicate which of three conditions applies:
The identity hash code has not yet been examined.
The identity hash code has been examined, and the object has not been moved since then.
The identity hash code was examined, and then the object was moved.
Getting the identity hashCode when an object is in state #1 will change it to state #2 and return a value derived from the object's address. Getting the identity hashCode when an object is in state #2 will simply perform that same computation on the address.
If an object in state #1 needs to be relocated, it will simply stay in state #1. If an object is relocated while it's in state #2, the system will make a copy with an extra four bytes reserved for the hash code, compute a hash code based on the old address, store it in those reserved bytes, and mark the object as being in state #3. After that, any attempt to read the hash code will report the saved value.
If an object is created and then either relocated or destroyed, the space it previously occupied will likely be used at some time in the future to hold a new object. Such an object may very well have the same address as the old one; if it does, the identity hash code will likely match.
Note that the hashCode function isn't expected to be good enough to avoid all collisions, but merely good enough to turn a large number of comparisons into a small number. If a hash code can reduce the number of comparisons from 10,000 to 10, that's a much bigger win than reducing the number of comparisons from 10 to 1 would be.

Improve memory usage: IntegerHashMap

We use a HashMap<Integer, SomeType>() with more than a million entries. I consider that large.
But integers are their own hash code. Couldn't we save memory with a, say, IntegerHashMap<Integer, SomeType>() that uses a special Map.Entry, using int directly instead of a pointer to an Integer object? In our case, that would save 1000000x the memory required for an Integer object.
Any faults in my line of thought? Too special to be of general interest? (At least there is an EnumMap.)
Addendum 1: The first generic parameter of IntegerHashMap is only there to keep it closely similar to the other Map implementations. It could be dropped, of course.
Addendum 2: The same should be possible for other maps and collections, for example ToIntegerHashMap<KeyType, Integer>, IntegerHashSet<Integer>, etc.
What you're looking for is a "primitive collections" library. They are usually much better with memory usage and performance. One of the oldest and most popular libraries is "Trove"; however, it is a bit outdated now. The main actively maintained libraries are:
Goldman Sachs Collections
fastutil
Koloboke
See Benchmarks Here
Some words of caution:
"integers are their own hash code" I'd be very careful with this statement. Depending on the integers you have, the distribution of keys may be anything from optimal to terrible. Ideally, I'd design the map so that you can pass in a custom IntFunction as hashing strategy. You can still default this to (i) -> i if you want, but you probably want to introduce a modulo factor, or your internal array will be enormous. You may even want to use an IntBinaryOperator, where one param is the int and the other is the number of buckets.
I would drop the first generic param. You probably don't want to implement Map<Integer, SomeType>, because then you will have to box / unbox in all your methods, and you will lose all your optimizations (except space). Trying to make a primitive collection compatible with an object collection will make the whole exercise pointless.
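To make these cautions concrete, here is a toy sketch (not taken from any of the libraries above) of an int-keyed map with a pluggable hashing strategy. I've used IntUnaryOperator (int -> int) and a power-of-two table with masking in place of an explicit modulo:

import java.util.function.IntUnaryOperator;

/** Toy open-addressing map with primitive int keys; no boxing on put/get. */
final class IntKeyMap<V> {
    private final IntUnaryOperator hash;
    private int[] keys = new int[16];          // capacity stays a power of two
    private Object[] values = new Object[16];
    private boolean[] used = new boolean[16];
    private int size;

    IntKeyMap() {
        this(i -> i);                          // "integers are their own hash code" default
    }

    IntKeyMap(IntUnaryOperator hash) {
        this.hash = hash;
    }

    void put(int key, V value) {
        if (size * 4 >= keys.length * 3) {     // resize at 75% load
            resize();
        }
        int mask = keys.length - 1;
        int i = hash.applyAsInt(key) & mask;   // "modulo" via power-of-two mask
        while (used[i] && keys[i] != key) {
            i = (i + 1) & mask;                // linear probing
        }
        if (!used[i]) {
            used[i] = true;
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    V get(int key) {
        int mask = keys.length - 1;
        int i = hash.applyAsInt(key) & mask;
        while (used[i]) {
            if (keys[i] == key) {
                return (V) values[i];
            }
            i = (i + 1) & mask;
        }
        return null;
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        int[] oldKeys = keys;
        Object[] oldValues = values;
        boolean[] oldUsed = used;
        keys = new int[oldKeys.length * 2];
        values = new Object[oldValues.length * 2];
        used = new boolean[oldUsed.length * 2];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                put(oldKeys[i], (V) oldValues[i]);
            }
        }
    }
}

A real library adds key mixing, removal, iteration and load-factor tuning; the point here is only that nothing boxes the key.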

Using EnumSet or EnumMap on arbitrary keys

We know that EnumSet and EnumMap are faster than HashSet/HashMap due to the power of bit manipulation. But are we actually harnessing the true power of EnumSet/EnumMap when it really matters? If we have a set of millions of records and we want to find out whether some object is present in that set or not, can we take advantage of EnumSet's speed?
I checked around but haven't found anything discussing this. Everywhere the usual stuff is found, i.e. that because EnumSet and EnumMap use a predefined set of keys, lookups on small collections are very fast. I know enums are compile-time constants, but can we have the best of both worlds - an EnumSet-like data structure without needing enums as keys?
Interesting insight; the short answer is no, but your question is exploring some good data-structure design concepts which I'll try to discuss.
First, let's talk about HashMap (HashSet uses a HashMap internally, so they share most behavior); a hash-based data structure is powerful because it is fast and general. It's fast (i.e. approximately O(1)) because we can find the key we're looking for with a very small number of computations. Roughly, we have an array of lists of keys, convert the key to an integer index into that array, then look through the associated list for the key. As the mapping gets bigger, the backing array is repeatedly resized to hold more lists. Assuming the lists are evenly distributed, this lookup is very fast. And because this works for any generic object (that has a proper .hashCode() and .equals()), it's useful for just about any application.
Enums have several interesting properties, but for the purpose of efficient lookup we only care about two of them - they're generally small, and they have a fixed number of values. Because of this, we can do better than HashMap; specifically, we can map every possible value to a unique integer, meaning we don't need to compute a hash, and we don't need to worry about hashes colliding. So EnumMap simply stores an array of the same size as the enum and looks up directly into it:
// From Java 7's EnumMap
public V get(Object key) {
    return (isValidKey(key) ?
            unmaskNull(vals[((Enum)key).ordinal()]) : null);
}
Stripping away some of the required Map sanity checks, it's simply:
return vals[key.ordinal()];
Note that this is conceptually no different than a standard HashMap, it's simply avoiding a few computations. EnumSet is a little more clever, using the bits in one or more longs to represent array indices, but functionally it's no different than the EnumMap case - we allocate enough slots to cover all possible enum values and can use their integer .ordinal() rather than compute a hash.
So how much faster than HashMap is EnumMap? It's clearly faster, but in truth it's not that much faster. HashMap is already a very efficient data structure, so any optimization on it will only yield marginally better results. In particular, both HashMap and EnumMap are asymptotically the same speed (O(1)), meaning as they get bigger, they behave equally well. This is the primary reason why there isn't a more general data-structure like EnumMap - because it wouldn't be worth the effort relative to HashMap.
The second reason we don't want a more general "FiniteKeysMap" is that it would make our lives as users more complicated, which would be worthwhile if it were a marked speed increase, but since it's not, it would just be a hassle. We would have to define an interface (and probably also a factory pattern) for any type that could be a key in this map. The interface would need to guarantee that every unique instance returns a unique hashcode in the range [0, n), and also provide a way for the map to get n and potentially all n elements. Those last two operations would be better as static methods, but since we can't define static methods in an interface (at least prior to Java 8), they'd either have to be passed in directly to every map we create, or a separate factory object with this information would have to exist and be passed to the map/set at construction. Because enums are part of the language, they get all of those benefits for free, meaning end-user programmers pay no such cost to take advantage of them.
Furthermore, it would be very easy to make mistakes with this interface; say you have a type that has exactly 100,000 unique values. Should it implement our interface? It could. But you'd likely actually be shooting yourself in the foot. This would eat up a lot of unnecessary memory, since our FiniteKeysMap would allocate a new 100,000 length array to represent even an empty map. Generally speaking, that sort of wasted space is not worth the marginal improvement such a data structure would provide.
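Purely for illustration, the contract such a hypothetical "FiniteKeysMap" would have to impose might look roughly like this (every name is invented):

/** Hypothetical contract a key type would have to satisfy. */
interface FiniteKey {
    /** Must return an index unique to this instance, in the range [0, universe size). */
    int finiteIndex();
}

/** The map must also be told the universe size up front, e.g. via a factory. */
final class FiniteKeysMap<K extends FiniteKey, V> {
    private final Object[] values;

    FiniteKeysMap(int keyUniverseSize) {
        // An empty map already pays for the whole key universe, as discussed above.
        this.values = new Object[keyUniverseSize];
    }

    void put(K key, V value) {
        values[key.finiteIndex()] = value;
    }

    @SuppressWarnings("unchecked")
    V get(K key) {
        return (V) values[key.finiteIndex()];
    }
}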
In short, while your idea is possible, it is not practical. HashMap is so efficient that attempting to create a separate data structure for a very limited number of cases would add far more complexity than value.
For the specific case of faster .contains() checks, you might like Bloom Filters. It's a set-like data structure that very efficiently stores very large sets, with the condition that it may sometimes incorrectly say an element is in the set when it is not (but not the other way around - if it says an element isn't in the set, it's definitely not). Guava provides a nice BloomFilter implementation.
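A minimal usage sketch (the expected-insertions and false-positive-probability values are just examples):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterExample {
    public static void main(String[] args) {
        // Sized for about one million entries with roughly a 1% false-positive rate.
        BloomFilter<Integer> seen = BloomFilter.create(Funnels.integerFunnel(), 1_000_000, 0.01);

        seen.put(42);

        System.out.println(seen.mightContain(42));  // true
        System.out.println(seen.mightContain(7));   // almost certainly false; never a false negative
    }
}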

Assuming I have a good key when is NOT appropriate to use a map in java [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
When to use HashMap over LinkedList or ArrayList and vice-versa
Since coming across Maps in Java I have been using them extensively. In particular, HashMap is an excellent option for many scenarios. It seems to trump an ArrayList in every category - some say that iteration order isn't predictable, but for that we have LinkedHashMap.
So, my question is: why not use a HashMap all the time provided we have a solid immutable key?
Additionally, is it appropriate to use something like a HashMap for a very small (<10) number of items, or is there some additional overhead I am not considering?
Use an ArrayList when your keys are sequential integers. (If they aren't based at 0, just use an offset.) It's much more efficient for access (particularly random access) and update. Otherwise, sure, HashMap (or, as you say, LinkedHashMap) are extremely useful data structures.
I believe that the default initial size for a HashMap is 16 buckets, so there's a bit of overhead for very small maps. But unless you're creating a lot of maps, it shouldn't be a factor in your coding.
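A quick sketch of the sequential-integer-keys case with an offset (the key range is invented for the example):

import java.util.ArrayList;
import java.util.List;

class OffsetListExample {
    private static final int FIRST_ID = 1000;   // keys run 1000, 1001, 1002, ...
    private final List<String> namesById = new ArrayList<>();

    void add(String name) {
        namesById.add(name);                    // this entry's id is FIRST_ID + its index
    }

    String lookup(int id) {
        return namesById.get(id - FIRST_ID);    // direct index, no hashing, no boxing
    }
}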
HashMap has a significant amount of overhead compared to an array (or ArrayList):
You need to hash the key to get an index into the backing array, then store the value and the key. This is slower and uses more memory than an array. This is more important when your key is something large or complicated, since it will take longer to hash or take more space.
You also need to hash the key every time you look up a value.
When you resize an ArrayList, you just create a new array and copy everything. When you resize a HashMap, you create a new array, then calculate the hashes all over again (so they'll be spread out through the new array).
HashMaps perform badly when they're nearly full, so they generally leave something like 25% of their space empty.
These are all pretty minor, so you could get away with just using HashMaps all the time (in fact, this is what PHP seems to do), but it's wasteful to use a HashMap when you don't really need it.
A comparison might be helpful: Anything you can do with an integer, you can also do with a string, so why do we have integers? Because they're smaller and faster to work with (and provide some nice guarantees, like that they always contain a number).
I use Maps all the time - they are among the most powerful and versatile data structures. I mostly use LinkedHashMap but, when working with Strings as keys, I use TreeMap because of the additional benefit of having the keys sorted.
However:
if your key is an int and you plan to use all the keys 0..n, use an array (remember - int is more efficient than Integer). But a map is better if you have "sparse values"
if you need a list of unindexed items, use a LinkedList
if you need to store unique elements, use a Set (why waste space keeping the values if you just need the keys!)
Remember - Java gives you very powerful collections (Set, Map, List) and, for each one, multiple implementations with different features - they are there for a reason.
Every data structure has its use: even if many can be implemented using a map as a backend, the most appropriate data structure is... more appropriate (and usually more efficient, with less overhead and more functionality).
The size does not matter - 5 or 500 elements: if it looks like a map, use a map (there may be a few exceptions and corner cases where you need maximum efficiency and hard-coded values are better). But if it looks like a set - use a set!

Writing hashCode methods for heterogeneous keys

I have a Java HashMap whose keys are instances of java.lang.Object, that is: the keys are of different types. The hashCode values of two key objects of different types are likely to be the same when they contain identical variable values.
In order to improve the performance of the get method for my HashMap, I'm inclined to mix the name of the Java type into the hashCode methods of my key objects. I have not seen examples of this elsewhere, and so my this-might-be-wacky alarm went off. Do you think mixing the type into hashCode is a good idea? Should I mix in the class name, or the hashCode of the relevant Class object?
I wouldn't mix the type name in - but if you're controlling the hashCode algorithm already, why not just change it so that they won't clash? For example, if you're using the common "add and multiply" approach, you could start off with different base cases or use different multipliers.
Before you worry about this too much though, have you actually measured how often you're really getting collisions with real data? Is this definitely a problem, or are you just concerned that it might be a problem?
I think your this-might-be-wacky alarm should have gone off when you decided to have keys of different types. But let's assume this is a case where Object is really the way to go.
You should try it without mixing in the type name first, and stress-test the performance if this particular lookup turns out to be a hotspot in the system. Chances are the performance doesn't matter that much.
Like Jon implied, the performance of the hash map is improved by reducing collisions. Mixing in the type name is just as likely to increase collisions as it is to reduce them. To keep your hashmap in peak condition, you want the likelihood of any particular hashcode to be about the same as any other over the domain of valid key values. So the probability of a hashcode of 10 should be about the same as the probability of 100 or any other number. That way the hash table buckets fill evenly (in all likelihood). So whether you have an object of type A or type B should not matter; what matters is just the probability distribution of the hashcodes over all occurring key values.
Years later...
Apart from it being a premature optimization, it's not a bad idea and the overhead is tiny. Choy's recommendation to profile first is surely good in general, but sometimes a simple optimization takes much less time than the profiling. This seems to be such a case.
I'd use a different multiplier as already suggested and mix in getClass().hashCode().
Or maybe getClass().getName().hashCode(), as it stays consistent across JVM invocations, which might be helpful if you want a reproducible HashMap iteration order for easier debugging. Note that you should never rely on such reproducibility, and that there are quite a few things that can destroy it.
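For illustration, here is a sketch of such a key type (the fields are invented), combining a per-class component with the field-based hash:

import java.util.Objects;

final class PointKey {
    private final int x;
    private final int y;

    PointKey(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PointKey)) return false;
        PointKey other = (PointKey) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        // Mixing in the class name keeps another key type with the same field values
        // out of the same bucket in most cases, and is stable across JVM runs.
        return 31 * getClass().getName().hashCode() + Objects.hash(x, y);
    }
}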
