Here's the sample code from Item 9:
public final class PhoneNumber {
    private final short areaCode;
    private final short prefix;
    private final short lineNumber;

    @Override
    public int hashCode() {
        int result = 17;
        result = 31 * result + areaCode;
        result = 31 * result + prefix;
        result = 31 * result + lineNumber;
        return result;
    }
}
Pg 48 states: "the value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting."
I understand the concept of multiplication by 2 being equivalent to bit shifting. I also know that we'll still get an overflow (hence information loss) when we multiply a large number by a large odd prime number. What I don't get is why information loss arising from multiplication by large odd primes is preferable to information loss arising from multiplication by large even numbers.
With an even multiplier the least significant bit, after multiplication, is always zero. With an odd multiplier the least significant bit is either one or zero depending on what the previous value of result was. Hence the even multiplier throws away the information in the low bit, while the odd multiplier preserves it.
There is no such thing as a large even prime - the only even prime is 2.
That aside - the general point of using a medium-sized prime number rather than a small one like 3 or 5 is to minimize the chance that two objects will end up with the same hash value, overflow or not.
The risk of overflow is not the issue per se; the real issue is how well distributed the hashcode values will be for the set of objects being hashed. Because hashcodes are used in data structures like HashSet, HashMap etc., you want to minimize the number of objects that could potentially share the same hash code to optimize lookup times on those collections.
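To make the shifting argument concrete, here is a small illustrative sketch (not from the book; the multiplier 32 and the field values are arbitrary choices for the demo). With an even multiplier like 32, an early field is eventually shifted out of the 32-bit result entirely, while an odd multiplier such as 31 is invertible mod 2^32 and never destroys it:

public class EvenMultiplierDemo {
    static int hash(int multiplier, int[] fields) {
        int result = 17;
        for (int f : fields) {
            result = multiplier * result + f;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {1, 0, 0, 0, 0, 0, 0, 0};   // a and b differ only in the first field
        int[] b = {2, 0, 0, 0, 0, 0, 0, 0};
        // 32^7 = 2^35 == 0 (mod 2^32), so the first field is shifted out entirely.
        System.out.println(hash(32, a) == hash(32, b)); // true  -> guaranteed collision
        System.out.println(hash(31, a) == hash(31, b)); // false -> multiplying by an odd
                                                        //          number mod 2^32 is invertible
    }
}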
Related
I'm noodling through an anagram hash function, already solved several different ways, but I'm looking for extreme performance as an exercise. I already submitted a solution that passed all the given tests (beating out 100% of all competitors by at least 1ms), but I believe that although it "won", it has a weakness that just wasn't triggered. It is subject to integer overflow in a way that could affect the results.
The gist of the solution was to combine multiple commutative operations, each taking some number of bits, and concatenate them into one long variable. I chose xor, sum, and product. The xor operation cleanly fits within a fixed number of bits. The sum operation might overflow, but because overflow simply wraps around, it would still arrive at the same result if letters and their corresponding values are rearranged. I wouldn't worry, for example, about whether this function would overflow.
private short sumHash(String s) {
    short hash = 0;
    for (char c : s.toCharArray()) {
        hash += c;
    }
    return hash;
}
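For example (an illustrative check, not part of the submitted solution), any pair of anagrams produces the same sum, and wrap-around on longer inputs does not change that, because overflowing addition is still commutative:

System.out.println(sumHash("listen") == sumHash("silent")); // true for any pair of anagrams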
Where I run into trouble is in the inclusion of products. If I make a function that returns the product of a list of values (such as character values in a String), then, at the very least, the result could be rendered inaccurate if the product overflowed to exactly zero.
private short productHash(String s) {
    short hash = 1;
    for (char c : s.toCharArray()) {
        hash *= c;
    }
    return hash;
}
Is there any safe and performant way to avoid this weakness so that the function gains the benefit of the commutative property of multiplication to produce the same value for anagrams, but can't ever encounter a product that overflows to zero?
Sure, if you're willing to go to some lengths to do it. The simplest solution that occurs to me is to write
hash *= primes[c];
where primes is an array that maps each possible character to a distinct odd prime. Overflowing to zero can only happen if the "true" product in infinite-precision arithmetic is a multiple of 2^32, and if you're multiplying by odd primes, that's impossible.
(You do run into the problem that the hash itself will always be odd, but you could shift it right one bit to obtain a more fully mixed hash.)
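A minimal sketch of that idea, assuming the input is restricted to lowercase letters 'a'..'z' and using int for the running product (the PRIMES table and the method name are illustrative, not from the original answer):

private static final int[] PRIMES = {
    3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
    47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103
};

private int oddPrimeProductHash(String s) {
    int hash = 1;
    for (char c : s.toCharArray()) {
        // Every factor is odd, so the running product is always odd and can
        // never become a multiple of 2^32, i.e. it can never wrap to zero.
        hash *= PRIMES[c - 'a'];
    }
    return hash;
}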
You will only hit zero if
a * b = 0 mod 2^64
which is equivalent to there being an integer k such that
a * b = k * 2^64
That is, we only get into trouble if the factors together contribute a multiple of 2^64, which can only happen if some factors are even. Therefore, the easiest solution is ensuring that all factors are odd, for instance like this:
for (char ch : chars) {
    hash *= (ch << 1) | 1;
}
This allows you to keep 63 bits of information.
Note however that this technique will only avoid collisions caused by overflow, but not collisions caused by multipliers that share a common factor. If you wish to avoid that, too, you'll need coprime multipliers, which is easiest achieved if they are prime.
The naive way to avoid overflow, is to use a larger type such as int or long. However, for your purposes, modulo arithmetic might make more sense. You can do (a * b) % p for a prime p to maintain commutativity. (There is some deep mathematics here called Group Theory, if you are interested in learning more.) You will need to limit p to be small enough that each a * b does not overflow. The easiest way to do this is to pick a p so that (p - 1)^2 can still be represented in a short or whatever data type you are using.
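Here is a minimal sketch of that modular approach in int arithmetic; the prime 32749 is an arbitrary choice, small enough that hash * c (with hash < 32749 and c < 2^16) cannot overflow an int:

private static final int P = 32749; // prime; (P - 1) * 65535 < Integer.MAX_VALUE

private int productHashMod(String s) {
    int hash = 1;
    for (char c : s.toCharArray()) {
        // (a * b) % P is commutative, so anagrams still collide on purpose.
        // A character value that is a multiple of P would zero the hash, but
        // that cannot happen for ordinary letters.
        hash = (hash * c) % P;
    }
    return hash;
}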
I have the following piece of code:
long[] blocks = new long[(someClass.getMemberArray().length - 1) / 64 + 1];
Basically the someClass.getMemberArray() can return an array that could be much larger than 64 and the code tries to determine how many blocks of len 64 are needed for subsequent processing.
I am confused about the logic and how this works. It seems to me that just doing:
long[] blocks = new long[(int) Math.ceil(someClass.getMemberArray().length / 64.0)];
should work too and looks simpler.
Can someone help me understanding the -1 and +1 reasoning in the original snippet, how it works and if the ceil would fail in some cases?
As you correctly commented, the -1/+1 is required to get the correct number of blocks, including only partially filled ones. It effectively rounds up.
(But it has something that could be considered a bug: if the array has length 0, which would require 0 blocks, it returns 1. This is because integer division truncates toward zero, i.e. rounds UP for negative numbers, so (0 - 1)/64 yields 0. However, this may be a feature if zero blocks are not allowed for some reason. It definitely requires a comment, though.)
The reasoning for the first, original line is that it only uses integer arithmetic, which should translate into just a few basic and fast machine instructions on most computers.
The second solution involves floating-point arithmetic and casting. Traditionally, floating-point arithmetic was MUCH slower on most processors, which probably was the reasoning for the first solution. However, on modern CPUs with integrated floating-point support, the performance depends more on other things like cache lines and pipelining.
Personally, I don't really like either solution, as it's not really obvious what it does. So I would suggest the following:
int arrayLength = someClass.getMemberArray().length;
int blockCount = ceilDiv(arrayLength, 64);
long[] blocks = new long[blockCount];
//...

/**
 * Integer division, rounding up.
 * @return the quotient a/b, rounded up.
 */
static int ceilDiv(int a, int b) {
    assert b > 0 : b; // Doesn't work for non-positive divisors.
    // Java division truncates toward zero.
    int quotient = a / b;
    // For a positive dividend that is not a multiple of b, round up.
    // (For a <= 0, truncation toward zero is already the ceiling.)
    if (a % b != 0 && a > 0) {
        quotient++;
    }
    return quotient;
}
It's wordy, but at least it's clear what should happen, and it provides a general solution which works for all integers (except negative divisors). Unfortunately, most languages don't provide an elegant "round up integer division" solution.
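As a quick illustrative check (not part of the original answer), ceilDiv agrees with the (length - 1) / 64 + 1 expression for every non-negative length except 0:

for (int len = 0; len <= 130; len++) {
    int original = (len - 1) / 64 + 1;   // the snippet in question
    int viaCeil  = ceilDiv(len, 64);
    // They agree everywhere except len == 0, where the original yields 1.
    assert (len == 0) ? (original == 1 && viaCeil == 0) : (original == viaCeil);
}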
I don't see why the -1 would be necessary, but the +1 is likely there to rectify the case where the result of the division gets rounded down to the nearest whole number (which is every case except those where the division is exact).
I'm playing with hash tables and using a corpus of ~350,000 English words which I'd like to try to evenly distribute. Thus, I try to fit them into an array of length 810,049 (the closest prime larger than two times the input size) and I was baffled to see that a straightforward FNV1 implementation like this:
public int getHash(String s, int mod) {
    final BigInteger MOD = new BigInteger(Integer.toString(mod));
    final BigInteger FNV_offset_basis = new BigInteger("14695981039346656037");
    final BigInteger FNV_prime = new BigInteger("1099511628211");
    BigInteger hash = new BigInteger(FNV_offset_basis.toString());
    for (int i = 0; i < s.length(); i++) {
        int charValue = s.charAt(i);
        hash = hash.multiply(FNV_prime).mod(MOD);
        hash = hash.xor(BigInteger.valueOf((int) charValue & 0xffff)).mod(MOD);
    }
    return hash.mod(MOD).intValue();
}
results in 64,000 collisions which is a lot, 20% of the input basically. What's wrong with my implementation? Is the approach somehow flawed?
EDIT: to add to that, I've also tried and implemented other hashing algorithms like sdbm and djb2 and they all perform just the same, equally poorly. All have these ~65k collisions on this corpus. When I changed the corpus to just 350,000 integers represented as strings, a bit of variance starts to occur (like one algorithm has 20,000 collisions and the other has 40,000), but still the number of collisions is astoundingly high. Why?
EDIT2: I've just tested it and Java's built-in .hashCode() results in just as many collisions, and even something ridiculously naive, like taking the product of all the character codes modulo 810,049 as the hash, performs only about 50% worse than all those well-known algorithms (60k collisions vs. 90k with the naive approach).
Since mod is a parameter to your hash function I presume it is the range into which you want the hash normalized, i.e. for your specific use case you are expecting it to be 810,049. I assume this because:
The algorithm calls for the calculations to be done modulo 2^n, where n is the number of bits in the desired hash.
Given that the offset basis and FNV prime are constants within the module, and are equal to the parameters for a 64-bit hash, the value of mod should also be fixed at 2^64.
Since it is not, I assume it is the desired final output range.
In other words, given a fixed offset basis and FNV Prime, there is no reason to pass in the mod parameter -- it is dictated by the other two FNV parameters.
If all the above is correct then the implementation is wrong. You should be doing the calculations mod 2^64 and applying a final remainder operation with 810,049.
Also (but this may not be important), the algorithm calls for xoring the lower 8 bits with an ASCII character, whereas you are xoring with 16 bits. I am not sure this will make a difference since for ASCII the high-order byte will be zero anyway and it will behave exactly as if you were xoring only 8 bits.
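For illustration, here is a minimal sketch of that fix, keeping the arithmetic in a 64-bit long (overflow gives the mod-2^64 behaviour the algorithm calls for) and reducing into the table range only at the end; fnv1Hash and the hex constants are just this sketch's spelling of the standard FNV-64 parameters quoted in the question:

static int fnv1Hash(String s, int mod) {
    long hash = 0xcbf29ce484222325L;       // FNV-64 offset basis, 14695981039346656037
    final long FNV_PRIME = 0x100000001b3L; // FNV-64 prime, 1099511628211
    for (int i = 0; i < s.length(); i++) {
        hash *= FNV_PRIME;                 // wraps mod 2^64 automatically
        hash ^= (s.charAt(i) & 0xff);      // FNV is defined over octets
    }
    return (int) Math.floorMod(hash, (long) mod); // reduce into the table range only at the end
}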
I was reviewing the source code of Arrays.hashCode(char[] c).
I am not entirely convinced that the algorithm it uses will work well in all cases.
public static int hashCode(int a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (int element : a)
        result = 31 * result + element;

    return result;
}
Does the hash function implemented here really distribute all input arrays uniformly? And why is the prime 31 used here?
Why use the prime number 31?
This can be split into two parts:
Why a prime number?
Here we need to understand that our goal is to get a unique hash code for an object, which will help us find that object in O(1) time.
The key word here is unique.
Primes
Primes are unique numbers. They are unique in that the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself, of course) due to the fact that a prime is used to compose it. This property is used in hashing functions.
Why number 31?
From Effective Java
Because it's an odd prime, and it's "traditional" to use primes. It's also one less than a power of two, which permits a bitwise optimization.
Here's the full quote, from Item 9: Always override hashCode when you override equals:
The value 31 was chosen because it's an odd prime. If it were even and
multiplication overflowed, information would be lost, as
multiplication by 2 is equivalent to shifting. The advantage of using
a prime is less clear, but it is traditional.
A nice property of 31 is that the multiplication can be replaced by a
shift (§15.19) and subtraction for better performance:
31 * i == (i << 5) - i. Modern VMs do this sort of optimization
automatically.
While the recipe in this item yields reasonably good hash functions,
it does not yield state-of-the-art hash functions, nor do Java
platform libraries provide such hash functions as of release 1.6.
Writing such hash functions is a research topic, best left to
mathematicians and theoretical computer scientists.
Perhaps a later release of the platform will provide state-of-the-art
hash functions for its classes and utility methods to allow average
programmers to construct such hash functions. In the meantime, the
techniques described in this item should be adequate for most
applications.
This is a very good source.
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
See this post: Why does Java's hashCode() in String use 31 as a multiplier?
That's where TheEwook's answer is from.
Generally, you use primes because they don't have any factors and will distribute better modulo N where N is the size of the range you are binning into. 31 is a small, odd prime so it works well. However, as the various sources you will find on the Internets will indicate, a small prime like 31 may lead to more collisions than a larger prime (especially if the values being hashed are not well distributed to begin with), so you could pick a larger prime if you found the performance to not be as good as you'd like.
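As a quick illustrative check (not from either answer), the shift-and-subtract identity holds for every int, even when the multiplication overflows, because both sides are computed mod 2^32:

for (int i : new int[] {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE}) {
    System.out.println(i + ": " + (31 * i == (i << 5) - i)); // always true
}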
Why were 181783497276652981 and 8682522807148012 chosen in Random.java?
Here's the relevant source code from Java SE JDK 1.7:
/**
 * Creates a new random number generator. This constructor sets
 * the seed of the random number generator to a value very likely
 * to be distinct from any other invocation of this constructor.
 */
public Random() {
    this(seedUniquifier() ^ System.nanoTime());
}

private static long seedUniquifier() {
    // L'Ecuyer, "Tables of Linear Congruential Generators of
    // Different Sizes and Good Lattice Structure", 1999
    for (;;) {
        long current = seedUniquifier.get();
        long next = current * 181783497276652981L;
        if (seedUniquifier.compareAndSet(current, next))
            return next;
    }
}

private static final AtomicLong seedUniquifier
    = new AtomicLong(8682522807148012L);
So, invoking new Random() without any seed parameter takes the current "seed uniquifier" and XORs it with System.nanoTime(). Then it uses 181783497276652981 to create another seed uniquifier to be stored for the next time new Random() is called.
The literals 181783497276652981L and 8682522807148012L are not placed in constants, but they don't appear anywhere else.
At first the comment gives me an easy lead. Searching online for that article yields the actual article. 8682522807148012 doesn't appear in the paper, but 181783497276652981 does appear -- as a substring of another number, 1181783497276652981, which is 181783497276652981 with a 1 prepended.
The paper claims that 1181783497276652981 is a number that yields good "merit" for a linear congruential generator. Was this number simply mis-copied into Java? Does 181783497276652981 have an acceptable merit?
And why was 8682522807148012 chosen?
Searching online for either number yields no explanation, only this page that also notices the dropped 1 in front of 181783497276652981.
Could other numbers have been chosen that would have worked as well as these two numbers? Why or why not?
Was this number simply mis-copied into Java?
Yes, seems to be a typo.
Does 181783497276652981 have an acceptable merit?
This could be determined using the evaluation algorithm presented in the paper. But the merit of the "original" number is probably higher.
And why was 8682522807148012 chosen?
Seems to be random. It could be the result of System.nanoTime() when the code was written.
Could other numbers have been chosen that would have worked as well as these two numbers?
Not every number would be equally "good". So, no.
Seeding Strategies
There are differences in the default-seeding schema between different versions and implementation of the JRE.
public Random() { this(System.currentTimeMillis()); }
public Random() { this(++seedUniquifier + System.nanoTime()); }
public Random() { this(seedUniquifier() ^ System.nanoTime()); }
The first one is not acceptable if you create multiple RNGs in a row. If their creation times fall in the same millisecond range, they will give completely identical sequences. (same seed => same sequence)
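As an illustrative sketch of that failure mode (not JDK code), two generators seeded with the same millisecond timestamp produce identical output:

long now = System.currentTimeMillis();
java.util.Random r1 = new java.util.Random(now);
java.util.Random r2 = new java.util.Random(now);
System.out.println(r1.nextInt() == r2.nextInt()); // always true: same seed, same sequence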
The second one is not thread safe. Multiple threads can get identical RNGs when initializing at the same time. Additionally, seeds of subsequent initializations tend to be correlated. Depending on the actual timer resolution of the system, the seed sequence could be linearly increasing (n, n+1, n+2, ...). As stated in How different do random seeds need to be? and the referenced paper Common defects in initialization of pseudorandom number generators, correlated seeds can generate correlation among the actual sequences of multiple RNGs.
The third approach creates randomly distributed and thus uncorrelated seeds, even across threads and subsequent initializations.
So the current java docs:
This constructor sets the seed of the random number generator to a
value very likely to be distinct from any other invocation of this
constructor.
could be extended by "across threads" and "uncorrelated".
Seed Sequence Quality
But the randomness of the seeding sequence is only as good as the underlying RNG.
The RNG used for the seed sequence in this Java implementation uses a multiplicative linear congruential generator (MLCG) with c = 0 and m = 2^64. (The modulus 2^64 is implicitly given by the overflow of 64-bit long integers.)
Because of the zero c and the power-of-2 modulus, the "quality" (cycle length, bit-correlation, ...) is limited. As the paper says, besides the overall cycle length, every single bit has its own cycle length, which decreases exponentially for less significant bits. Thus, lower bits have a smaller repetition pattern. (The result of seedUniquifier() should be bit-reversed before it is truncated to 48 bits in the actual RNG.)
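To illustrate the low-bit weakness (the two constants are the JDK's; the rest of the snippet is only a sketch): for a multiplicative LCG mod 2^64 with an odd multiplier, the low 8 bits of the state repeat with a period of at most 64 steps, far shorter than the overall cycle:

long x = 8682522807148012L;
long firstLowByte = x & 0xFF;
int steps = 0;
do {
    x *= 181783497276652981L;  // one seedUniquifier() step; mod 2^64 via long overflow
    steps++;
} while ((x & 0xFF) != firstLowByte);
System.out.println("low 8 bits repeat after " + steps + " steps"); // always <= 64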
But it is fast! And to avoid unnecessary compare-and-set-loops, the loop body should be fast. This probably explains the usage of this specific MLCG, without addition, without xoring, just one multiplication.
And the mentioned paper presents a list of good "multipliers" for c=0 and m=2^64, as 1181783497276652981.
All in all: A for effort @ JRE-developers ;) But there is a typo.
(But who knows, unless someone evaluates it, there is the possibility that the missing leading 1 actually improves the seeding RNG.)
But some multipliers are definitely worse:
"1" leads to a constant sequence.
"2" leads to a single-bit-moving sequence (somehow correlated)
...
The inter-sequence-correlation for RNGs is actually relevant for (Monte Carlo) Simulations, where multiple random sequences are instantiated and even parallelized. Thus a good seeding strategy is necessary to get "independent" simulation runs. Therefore the C++11 standard introduces the concept of a Seed Sequence for generating uncorrelated seeds.
If you consider that the equation used for the random number generator is:
X(n+1) = (a * X(n) + c) mod m
where X(n+1) is the next number, a is the multiplier, X(n) is the current number, c is the increment and m is the modulus.
If you look further into Random, a, c and m are defined in the header of the class
private static final long multiplier = 0x5DEECE66DL; //= 25214903917 -- 'a'
private static final long addend = 0xBL; //= 11 -- 'c'
private static final long mask = (1L << 48) - 1; //= 2 ^ 48 - 1 -- 'm'
and looking at the method protected int next(int bits), this is where the equation is implemented:
nextseed = (oldseed * multiplier + addend) & mask;
//X(n+1) = (X(n) * a + c ) mod m
This implies that the method seedUniquifier() is actually producing X(n), or in the first case at initialisation X(0), which is actually 8682522807148012 * 181783497276652981; this value is then modified further by the value of System.nanoTime(). This algorithm is consistent with the equation above, but with X(0) = 8682522807148012, a = 181783497276652981, m = 2^64 and c = 0. But as the mod m is performed by the long overflow, the above equation just becomes
X(n+1) = a * X(n)
Looking at the paper, the value a = 1181783497276652981 is for m = 2^64, c = 0. So it appears to just be a typo, and the value 8682522807148012 for X(0) appears to be a seemingly randomly chosen number from legacy code for Random. As seen here. But the merit of these chosen numbers could still be valid, although, as mentioned by Thomas B., probably not as "good" as the one in the paper.
EDIT - The original thoughts below have since been clarified, so they can be disregarded, but I'm leaving them for reference.
This led me to the following conclusions:
The reference to the paper is not for the value itself, but for the methods used to obtain such values, given the different values of a, c and m
It is mere coincidence that the value is otherwise the same apart from the leading 1, and the comment is misplaced (still struggling to believe this, though)
OR
There has been a serious misunderstanding of the tables in the paper, and the developers just chose a value at random; by the time it is multiplied out, what was the point of using the table value in the first place, especially as you can just provide your own seed value anyway, in which case these values are not even taken into account?
So to answer your question
Could other numbers have been chosen that would have worked as well as these two numbers? Why or why not?
Yes, any number could have been used; in fact, if you specify a seed value when you instantiate Random, you are using some other value. This value does not have any effect on the performance of the generator; that is determined by the values of a, c and m, which are hard-coded within the class.
As per the link you provided, they have chosen (after adding the missing 1 :) ) the best multiplier from the m = 2^64 table, because a long can't hold a number from the 2^128 table.