I'm reading through Chapter 3 of Joshua Bloch's Effective Java. In Item 8: Always override hashCode when you override equals, the author uses the following combining step in his hashing function:
result = 37 * result + c;
He then explains why 37 was chosen (emphasis added):
The multiplier 37 was chosen because it is an odd prime. If it was even and
the multiplication overflowed, information would be lost because multiplication
by two is equivalent to shifting. The advantages of using a prime number are less
clear, but it is traditional to use primes for this purpose.
My question is why does it matter that the combining factor (37) is odd? Wouldn't multiplication overflow result in a loss of information regardless of whether the factor was odd or even?
Consider what happens when a positive value is repeatedly multiplied by two in a base-2 representation -- all the set bits eventually march off the end, leaving you with zero.
An even multiplier would result in hash codes with less diversity.
Odd numbers, on the other hand, may result in overflow, but without loss of diversity.
The purpose of a hashCode is to produce bits that look random based on the input (especially the lower bits, since these are used the most).
When you multiply by 2, the lowest bit can only be 0, which lacks randomness. If you multiply by an odd number, the lowest bit can be 0 or 1.
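A minimal sketch (my own illustration, not from the book) makes the difference visible: repeatedly multiplying an int by 2 eventually shifts every set bit out the top, while an odd multiplier overflows without ever collapsing to zero.

public class EvenVsOdd {
    public static void main(String[] args) {
        int even = 1, odd = 1;
        for (int i = 0; i < 32; i++) {
            even *= 2;  // each step is a left shift; after 32 steps the int is 0
            odd *= 37;  // overflows too, but an odd product is never 0
        }
        System.out.println(even); // 0
        System.out.println(odd);  // some nonzero value
    }
}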
A related question: what do you get here?
public static void main(String... args) {
    System.out.println(factorial(66));
}

public static long factorial(int n) {
    long product = 1;
    for (; n > 1; n--)
        product *= n;   // silently overflows the 64-bit long
    return product;
}
prints
0
Every second number is even, every fourth is a multiple of 4, and so on. In fact 66! contains the factor 2 at least 64 times (33 multiples of 2, 16 of 4, 8 of 8, 4 of 16, 2 of 32, and 1 of 64), so all 64 bits of the long result are zero.
The explanation lies in number theory, specifically in the greatest common divisor (GCD) of your multiplier and your modulus.
An example may help. Let's say that instead of 32 bits you have only 2 bits to represent a number, so there are only 4 values (residue classes): 0, 1, 2 and 3. An overflow in the CPU behaves exactly like a modulo operation:
Class   x2   mod 4   x2   mod 4
  0      0      0      0      0
  1      2      2      4      0
  2      4      0      0      0
  3      6      2      4      0
After two operations only one value (class) is left, so you have 'lost' information.
Class   x3   mod 4   x3   mod 4   ...
  0      0      0      0      0
  1      3      3      9      1
  2      6      2      6      2
  3      9      1      3      3
This can go on forever and you still have all 4 classes, so you don't lose information.
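Here is the same 2-bit experiment as a small sketch (my own illustration): with multiplier 2 the set of reachable classes collapses to {0}, while with multiplier 3 all four classes survive indefinitely.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ClassSurvival {
    // apply "multiply then overflow" (here: mod 4) to every surviving class
    static Set<Integer> step(Set<Integer> classes, int multiplier) {
        Set<Integer> next = new HashSet<>();
        for (int c : classes) next.add((c * multiplier) % 4);
        return next;
    }

    public static void main(String[] args) {
        for (int m : new int[] {2, 3}) {
            Set<Integer> classes = new HashSet<>(Arrays.asList(0, 1, 2, 3));
            for (int i = 0; i < 5; i++) classes = step(classes, m);
            System.out.println("multiplier " + m + ": " + classes);
            // multiplier 2: [0]          -- information lost
            // multiplier 3: [0, 1, 2, 3] -- all classes preserved
        }
    }
}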
The key is that the GCD of your multiplier and your modulus must be 1. That holds for every odd multiplier, because the modulus here is always a power of 2. The multiplier doesn't have to be prime, and it doesn't have to be 37 specifically. Avoiding information loss is just one criterion for picking 37; others include how well the resulting values are distributed.
Non-math simple version of why...
Prime numbers are used for hashing to keep diversity.
Perhaps diversity is more important because of Set and Map implementations. These implementations use the low-order bits of an object's hash code to index their internal arrays of entries.
For example, a HashMap whose internal table (array) has size 8 will use the last 3 bits of the hash code to address a table entry:
// from java.util.HashMap (pre-Java 8)
static int indexFor(int h, int length) {
    return h & (length - 1);
}
Integer's hashCode() is not actually defined this way (it simply returns the value itself), but if it were
hash = 4 * number;
most table slots would be empty while a few would hold far too many entries. That would lead to extra iterations and comparison operations when searching for a particular entry.
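A quick sketch (hypothetical multipliers, my own illustration) of what that looks like with a table of 8 buckets, comparing hash = 4 * n against an odd multiplier such as 31:

import java.util.Arrays;

public class BucketSpread {
    public static void main(String[] args) {
        int[] evenBuckets = new int[8], oddBuckets = new int[8];
        for (int n = 0; n < 1000; n++) {
            evenBuckets[(4 * n) & 7]++;   // only indexes 0 and 4 are ever hit
            oddBuckets[(31 * n) & 7]++;   // all 8 indexes get used evenly
        }
        System.out.println(Arrays.toString(evenBuckets));
        System.out.println(Arrays.toString(oddBuckets));
    }
}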
I guess Joshua Bloch's main concern was to distribute hash values as evenly as possible, to optimize the performance of collections by spreading objects evenly across Maps and Sets. Prime numbers intuitively seem like a good factor for achieving that distribution.
Prime numbers aren't strictly necessary to ensure diversity; what's necessary is that the factor be relatively prime to the modulus.
Since the modulus for binary arithmetic is always a power of two, any odd number is relatively prime, and would suffice. If you were to take a modulus other than by overflow, though, a prime number would continue to ensure diversity (assuming you didn't choose the same prime...).
Related
I was just wondering why primes are used in a class's hashCode() method. For example, when using Eclipse to generate my hashCode() method, the prime number 31 is always used:
public int hashCode() {
    final int prime = 31;
    //...
}
References:
Here is a good primer on hashCode and an article on how hashing works that I found (C#, but the concepts are transferable):
Eric Lippert's Guidelines and rules for GetHashCode()
Prime numbers are chosen to best distribute data among hash buckets. If the distribution of inputs is random and evenly spread, then the choice of the hash code/modulus does not matter. It only has an impact when there is a certain pattern to the inputs.
This is often the case when dealing with memory locations. For example, all 32-bit integers are aligned to addresses divisible by 4. Check out the table below to visualize the effects of using a prime vs. non-prime modulus:
Input   Modulo 8   Modulo 7
  0        0          0
  4        4          4
  8        0          1
 12        4          5
 16        0          2
 20        4          6
 24        0          3
 28        4          0
Notice the almost-perfect distribution when using a prime modulus vs. a non-prime modulus.
However, although the above example is largely contrived, the general principle is that when dealing with a pattern of inputs, using a prime number modulus will yield the best distribution.
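The table above is easy to reproduce; here is a short sketch (mine) that prints the same comparison:

public class AlignedAddresses {
    public static void main(String[] args) {
        System.out.println("Input   mod 8   mod 7");
        for (int addr = 0; addr <= 28; addr += 4) {   // 4-byte-aligned inputs
            System.out.printf("%5d   %5d   %5d%n", addr, addr % 8, addr % 7);
        }
    }
}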
Because you want the number you are multiplying by and the number of buckets you are inserting into to have orthogonal prime factorizations.
Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all). Similar entries will collide. Not good for a hash function.
31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern Java HashMap implementations keep the number of buckets at a power of 2).
For what it's worth, Effective Java, 2nd Edition hand-waves around the mathematics and just says that the reasons to choose 31 are:
Because it's an odd prime, and it's "traditional" to use primes
It's also one less than a power of two, which allows a bitwise optimization
Here's the full quote, from Item 9: Always override hashCode when you override equals:
The value 31 was chosen because it's an odd prime. If it were even and multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional.
A nice property of 31 is that the multiplication can be replaced by a shift (§15.19) and subtraction for better performance:
31 * i == (i << 5) - i
Modern VMs do this sort of optimization automatically.
While the recipe in this item yields reasonably good hash functions, it does not yield state-of-the-art hash functions, nor do Java platform libraries provide such hash functions as of release 1.6. Writing such hash functions is a research topic, best left to mathematicians and theoretical computer scientists.
Perhaps a later release of the platform will provide state-of-the-art hash functions for its classes and utility methods to allow average programmers to construct such hash functions. In the meantime, the techniques described in this item should be adequate for most applications.
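For the curious, here is a quick sanity check (my own, not from the book) that the identity 31 * i == (i << 5) - i holds for every int, overflow included, because both sides are computed modulo 2^32:

public class ShiftCheck {
    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : samples) {
            if (31 * i != (i << 5) - i) throw new AssertionError(i);
        }
        System.out.println("31 * i == (i << 5) - i for all samples");
    }
}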
Rather simplistically, it can be said that using a multiplier with numerous divisors will result in more hash collisions. Since for effective hashing we want to minimize the number of collisions, we try to use a multiplier that has fewer divisors. A prime number by definition has exactly two distinct, positive divisors.
Related questions
Java hashCode from one field - the recipe, plus example of using Apache Commons Lang's builders
is it incorrect to define an hashcode of an object as the sum, multiplication, whatever, of all class variables hashcodes?
Absolute Beginner's Guide to Bit Shifting?
I heard that 31 was chosen so that the compiler can optimize the multiplication to left-shift 5 bits then subtract the value.
Here's a citation a little closer to the source.
It boils down to:
31 is prime, which reduces collisions
31 produces a good distribution, with a reasonable tradeoff in speed
First you compute the hash value modulo 2^32 (the size of an int), so you want something relatively prime to 2^32 (relatively prime means there are no common divisors other than 1). Any odd number would do for that.
Then for a given hash table the index is usually computed from the hash value modulo the size of the hash table, so you want something that is relatively prime to the size of the hash table. Often the sizes of hash tables are chosen as prime numbers for that reason. In the case of Java the Sun implementation makes sure that the size is always a power of two, so an odd number would suffice here, too. There is also some additional massaging of the hash keys to limit collisions further.
The bad effect, if the hash-table size and the multiplier had a common factor n, could be that in certain circumstances only 1/n of the entries in the hash table would be used.
The reason why prime numbers are used is to minimize collisions when the data exhibits some particular patterns.
First things first: If the data is random then there’s no need for a prime number, you can do a mod operation against any number and you will have the same number of collisions for each possible value of the modulus.
But when data is not random then strange things happen. For example consider numeric data that is always a multiple of 10.
If we use mod 4 we find:
10 mod 4 = 2
20 mod 4 = 0
30 mod 4 = 2
40 mod 4 = 0
50 mod 4 = 2
So of the 4 possible remainders (0, 1, 2, 3), only 0 and 2 ever occur, which means heavy collisions. That is bad.
If we use a prime number like 7:
10 mod 7 = 3
20 mod 7 = 6
30 mod 7 = 2
40 mod 7 = 4
50 mod 7 = 1
etc
We also note that 5, even though it is prime, is not a good choice: all our keys are multiples of 5. This means we have to choose a prime that doesn't divide our keys; choosing a large prime is usually enough.
At the risk of being repetitive: the reason prime numbers are used is to neutralize the effect of patterns in the keys on the distribution of collisions of a hash function.
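To see the effect in aggregate, here is a short sketch (my own) counting how keys that are multiples of 10 fall into buckets mod 4 versus mod 7:

import java.util.Arrays;

public class PatternedKeys {
    public static void main(String[] args) {
        int[] mod4 = new int[4], mod7 = new int[7];
        for (int k = 10; k <= 1000; k += 10) {  // patterned keys: multiples of 10
            mod4[k % 4]++;
            mod7[k % 7]++;
        }
        System.out.println("mod 4: " + Arrays.toString(mod4)); // only buckets 0 and 2 used
        System.out.println("mod 7: " + Arrays.toString(mod7)); // all seven buckets used
    }
}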
31 is also specific to Java, which uses an int as the hash data type, giving at most 2^32 distinct hash values. There is no point in using larger Fermat or Mersenne primes.
It generally helps achieve a more even spread of your data among the hash buckets, especially for low-entropy keys.
I am reading the implementation details of Java 8's HashMap. Can anyone tell me why its initial array size is 16 specifically? What is so special about 16, and why is it always a power of two? Thanks.
The reason why powers of 2 appear everywhere is that when numbers are expressed in binary (as they are in circuits), certain math operations on powers of 2 are simpler and faster to perform (think about how easy arithmetic with powers of 10 is in the decimal system we use). For example, multiplication is not a very efficient operation in computers; circuits use a method similar to the one you use when multiplying two multi-digit numbers by hand. Multiplying or dividing by a power of 2, by contrast, only requires the computer to shift bits to the left (for multiplying) or to the right (for dividing).
And as for why 16 for HashMap? 10 is a commonly used default for dynamically growing structures (arbitrarily chosen), and 16 is not far off - but is a power of 2.
You can do modulus very efficiently for a power of 2. n % d = n & (d-1) when d is a power of 2, and modulus is used to determine which index an item maps to in the internal array - which means it occurs very often in a Java HashMap. Modulus requires division, which is also much less efficient than using the bitwise and operator. You can convince yourself of this by reading a book on Digital Logic.
The reason why bitwise and works this way for powers of two is because every power of 2 is expressed as a single bit set to 1. Let's say that bit is t. When you subtract 1 from a power of 2, you set every bit below t to 1, and every bit above t (as well as t) to 0. Bitwise and therefore saves the values of all bits below position t from the number n (as expressed above), and sets the rest to 0.
But how does that help us? Remember that when dividing by a power of 10, you can count the number of zeroes following the 1, and take that number of digits starting from the least significant of the dividend in order to find the remainder. Example: 637989 % 1000 = 989. A similar property applies to binary numbers with only one bit set to 1, and the rest set to 0. Example: 100101 % 001000 = 000101
There's one more reason to choose hash & (n - 1) over modulo: negative hashes. hashCode() is of type int, which of course can be negative. In Java, the remainder of a negative number is also negative, while the result of & is not.
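Both points can be checked in a few lines (a sketch of mine):

public class IndexTricks {
    public static void main(String[] args) {
        int d = 16;                          // a power-of-two table size
        for (int n = 0; n < 1000; n++) {
            if (n % d != (n & (d - 1)))      // identical for non-negative n
                throw new AssertionError(n);
        }
        System.out.println(-7 % 16);         // -7: Java's % keeps the dividend's sign
        System.out.println(-7 & 15);         // 9:  & always yields a valid index
    }
}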
Another reason is that you want all of the slots in the array to be equally likely to be used. Since hash() is evenly distributed over 32 bits, if the array size didn't divide into the hash space, then there would be a remainder causing lower indexes to have a slightly higher chance of being used. Ideally, not just the hash, but (hash() % array_size) is random and evenly distributed.
But this only really matters for data with a small hash range (like a byte or character).
The hashCode() method of class Boolean is implemented like this:
public int hashCode() {
    return value ? 1231 : 1237;
}
Why does it use 1231 and 1237? Why not something else?
1231 and 1237 are just two (sufficiently large) arbitrary prime numbers. Any other two large prime numbers would do fine.
Why primes?
Suppose for a second that we picked composite numbers (non-primes), say 1000 and 2000. When inserting booleans into a hash table, true and false would go into buckets 1000 % N and 2000 % N respectively (where N is the number of buckets).
Now notice that
1000 % 8 same bucket as 2000 % 8
1000 % 10 same bucket as 2000 % 10
1000 % 20 same bucket as 2000 % 20
....
in other words, it would lead to many collisions.
This is because the factorization of 1000 (2^3 · 5^3) and the factorization of 2000 (2^4 · 5^3) have so many common factors. Thus prime numbers are chosen, since they are unlikely to have any common factors with the bucket size.
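A sketch (my own) that counts, across table sizes 2 through 100, how often the two hash values land in the same bucket:

public class CollisionCount {
    static int sameBucketCount(int h1, int h2) {
        int count = 0;
        for (int buckets = 2; buckets <= 100; buckets++) {
            if (h1 % buckets == h2 % buckets) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(sameBucketCount(1000, 2000)); // 10 (every size dividing 1000)
        System.out.println(sameBucketCount(1231, 1237)); // 3  (only sizes dividing 6)
    }
}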
Why large primes? Wouldn't 2 and 3 do?
When computing hash codes for composite objects, it's common to add the hash codes of the components. If values that are too small are used in a hash set with a large number of buckets, there is a risk of ending up with an uneven distribution of objects.
Do collisions matter? Booleans just have two different values anyway?
Maps can contain booleans together with other objects. Also, as pointed out by Drunix, a common way to create hash functions for composite objects is to reuse the subcomponents' hash code implementations, in which case it's good to return large primes.
Related questions:
Why use a prime number in hashCode?
What is a sensible prime for hashcode calculation?
Why does Java's hashCode() in String use 31 as a multiplier?
In addition to everything said above, it could also be a small easter egg from the developers:
true: 1231 => 1 + 2 + 3 + 1 = 7, and 7 is a lucky number in European tradition;
false: 1237 => 1 + 2 + 3 + 7 = 13, and 13 (the devil's dozen) is an unlucky number.
In the example Josh gives of the flawed random method that generates a positive random number with a given upper bound n, I don't understand two of the flaws he states.
The method from the book is:
private static final Random rnd = new Random();

// Common but deeply flawed
static int random(int n) {
    return Math.abs(rnd.nextInt()) % n;
}
He says that if n is a small power of 2, the sequence of random numbers will repeat itself after a short period of time. Why is this the case? The documentation for Random.nextInt() says: "Returns the next pseudorandom, uniformly distributed int value from this random number generator's sequence." So shouldn't the sequence repeat itself for any small n? Why does this apply only to powers of 2?
Next he says that if n is not a power of 2, some numbers will be returned on average more frequently than others. Why does this occur, if Random.nextInt() generates random integers that are uniformly distributed? (He provides a code snippet which clearly demonstrates this but I don't understand why this is the case, and how this is related to n being a power of 2).
Question 1: if n is a small power of 2, the sequence of random numbers that are generated will repeat itself after a short period of time.
This is not a corollary of anything Josh is saying; rather, it is simply a known property of linear congruential generators. Wikipedia has the following to say:
A further problem of LCGs is that the lower-order bits of the generated sequence have a far shorter period than the sequence as a whole if m is set to a power of 2. In general, the n-th least significant digit in the base-b representation of the output sequence, where b^k = m for some integer k, repeats with at most period b^n.
This is also noted in the Javadoc:
Linear congruential pseudo-random number generators such as the one implemented by this class are known to have short periods in the sequence of values of their low-order bits.
The other version of the function, Random.nextInt(int), works around this by using different bits in this case (emphasis mine):
The algorithm treats the case where n is a power of two specially: it returns the correct number of high-order bits from the underlying pseudo-random number generator.
This is a good reason to prefer Random.nextInt(int) over using Random.nextInt() and doing your own range transformation.
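A minimal sketch of the fix, simply delegating the range transformation to the library:

import java.util.Random;

public class RandomRange {
    private static final Random rnd = new Random();

    // nextInt(n) avoids both the short-period and the bias problems
    static int random(int n) {
        return rnd.nextInt(n);  // uniform on [0, n)
    }

    public static void main(String[] args) {
        System.out.println(random(10));
    }
}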
Question 2: Next he says that if n is not a power of 2, some numbers will be returned on average more frequently than others.
There are 2^32 distinct numbers that can be returned by nextInt(). If you try to put them into n buckets by using % n, and n isn't a power of 2, some buckets will have more numbers than others. This means that some outcomes will occur more frequently than others even though the original distribution was uniform.
Let's look at this using small numbers. Say nextInt() returned four equiprobable outcomes: 0, 1, 2 and 3. Let's see what happens if we apply % 3 to them:
0 maps to 0
1 maps to 1
2 maps to 2
3 maps to 0
As you can see, the algorithm would return 0 twice as frequently as it would return each of 1 and 2.
This does not happen when n is a power of two, since 2^32 is evenly divisible by any smaller power of two. Consider n = 2:
0 maps to 0
1 maps to 1
2 maps to 0
3 maps to 1
Here, 0 and 1 occur with the same frequency.
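The same counting argument, as a sketch of mine:

import java.util.Arrays;

public class ModuloBias {
    public static void main(String[] args) {
        int[] mod3 = new int[3], mod2 = new int[2];
        for (int x = 0; x <= 3; x++) {  // four equiprobable outcomes
            mod3[x % 3]++;
            mod2[x % 2]++;
        }
        System.out.println(Arrays.toString(mod3)); // [2, 1, 1] -- biased toward 0
        System.out.println(Arrays.toString(mod2)); // [2, 2]    -- uniform
    }
}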
Additional resources
Here are some additional -- if only tangentially relevant -- resources related to LCGs:
Spectral tests are statistical tests used to assess the quality of LCGs.
A collection of classical pseudorandom number generators with linear structures has some pretty scatterplots (the generator used in Java is called DRAND48).
There is an interesting discussion on crypto.SE about predicting values from Java's generator.
1) When n is a power of 2, rnd % n is equivalent to selecting a few of the lower bits of the original value. The lower bits of numbers produced by the kind of generator Java uses are known to be "less random" than the higher bits; it's simply a property of the formula used to generate the numbers.
2) Imagine that the largest possible value returned by random() is 10, and n = 7. Taking the value mod 7 maps 7, 8, 9 and 10 to 0, 1, 2 and 3 respectively. So even if the original numbers are uniformly distributed, the result will be heavily biased toward the lower numbers, because 0 through 3 will appear twice as often as 4, 5 and 6. Here this happens regardless of whether n is a power of two. But if the largest value were instead, say, 15 (which is 2^4 - 1), then any n that is a power of two would give a uniform distribution, because the total number of possible values would be exactly divisible by the number of possible remainders, leaving no "excess" values at the end of the range to cause bias.