I was reviewing the source code of Arrays.hashCode(int[] a), and
I am not convinced that the algorithm it applies will work well in all cases.
public static int hashCode(int a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (int element : a)
        result = 31 * result + element;

    return result;
}
Does the hash function implemented here really distribute all input arrays uniformly? And why is the prime 31 used?
Why use the prime number 31?
This can be split into two parts:
Why a prime number?
Here we need to understand that our goal is to get a unique hash code for an object, which will help us find that object in O(1) time.
The key word here is unique.
Primes
Primes are unique numbers. They are unique in that the product of a
prime with any other number has the best chance of being unique (not
as unique as the prime itself, of course) because a prime is used to
compose it. This property is used in hashing functions.
Why number 31?
From Effective Java
Because it's an odd prime, and it's "traditional" to use primes.
It's also one less than a power of two, which permits a bitwise
optimization.
Here's the full quote, from Item 9: Always override hashCode when you override equals:
The value 31 was chosen because it's an odd prime. If it were even and
multiplication overflowed, information would be lost, as
multiplication by 2 is equivalent to shifting. The advantage of using
a prime is less clear, but it is traditional.
A nice property of 31 is that the multiplication can be replaced by a
shift (§15.19) and a subtraction for better performance:
31 * i == (i << 5) - i
Modern VMs do this sort of optimization automatically.
While the recipe in this item yields reasonably good hash functions,
it does not yield state-of-the-art hash functions, nor do Java
platform libraries provide such hash functions as of release 1.6.
Writing such hash functions is a research topic, best left to
mathematicians and theoretical computer scientists.
Perhaps a later release of the platform will provide state-of-the-art
hash functions for its classes and utility methods to allow average
programmers to construct such hash functions. In the meantime, the
techniques described in this item should be adequate for most
applications.
This is a very good source.
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
See this post: Why does Java's hashCode() in String use 31 as a multiplier?
That's where TheEwook's answer is from.
Generally, you use primes because they don't have any factors and will distribute better modulo N where N is the size of the range you are binning into. 31 is a small, odd prime so it works well. However, as the various sources you will find on the Internets will indicate, a small prime like 31 may lead to more collisions than a larger prime (especially if the values being hashed are not well distributed to begin with), so you could pick a larger prime if you found the performance to not be as good as you'd like.
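To experiment with that advice, here is a minimal sketch (the method name and the alternative prime are mine, purely for illustration) of the Arrays.hashCode-style loop with the multiplier factored out, so different primes can be compared for collisions on your own key distribution:

// Hypothetical variant of the Arrays.hashCode(int[]) loop with a
// configurable odd-prime multiplier, for comparing distributions.
static int polynomialHash(int[] a, int prime) {
    if (a == null)
        return 0;
    int result = 1;
    for (int element : a)
        result = prime * result + element;  // wraps mod 2^32 on overflow, by design
    return result;
}

polynomialHash(a, 31) reproduces Arrays.hashCode(a); trying, say, polynomialHash(a, 8191) (a larger Mersenne prime) lets you measure whether collisions improve for your inputs.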
Related
I'm noodling through an anagram hash function, already solved several different ways, but I'm looking for extreme performance as an exercise. I already submitted a solution that passed all the given tests (beating out 100% of all competitors by at least 1ms), but I believe that although it "won", it has a weakness that just wasn't triggered. It is subject to integer overflow in a way that could affect the results.
The gist of the solution was to combine multiple commutative operations, each taking some number of bits, and concatenate them into one long variable. I chose xor, sum, and product. The xor operation cleanly fits within a fixed number of bits. The sum operation might overflow, but because of the way overflow is addressed, it would still arrive at the same result if letters and their corresponding values are rearranged. I wouldn't worry, for example, about whether this function would overflow.
private short sumHash(String s) {
    short hash = 0;
    for (char c : s.toCharArray()) {
        hash += c;
    }
    return hash;
}
Where I run into trouble is in the inclusion of products. If I make a function that returns the product of a list of values (such as character values in a String), then, at the very least, the result could be rendered inaccurate if the product overflowed to exactly zero.
private short productHash(String s) {
    short hash = 1;
    for (char c : s.toCharArray()) {
        hash *= c;
    }
    return hash;
}
Is there any safe and performant way to avoid this weakness so that the function gains the benefit of the commutative property of multiplication to produce the same value for anagrams, but can't ever encounter a product that overflows to zero?
Sure, if you're willing to go to some lengths to do it. The simplest solution that occurs to me is to write
hash *= primes[c];
where primes is an array that maps each possible character to a distinct odd prime. Overflowing to zero can only happen if the "true" product in infinite-precision arithmetic is a multiple of the accumulator's modulus (2^16 for a short, 2^32 for an int), and if you're multiplying by odd primes, that's impossible.
(You do run into the problem that the hash itself will always be odd, but you could shift it right one bit to obtain a more fully mixed hash.)
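A minimal sketch of that idea, assuming an int accumulator and lowercase input only (the prime table and method name are mine, not from the answer):

// Hypothetical lookup table: maps 'a'..'z' to distinct odd primes.
// Since every factor is odd, the running product can never become a
// multiple of a power of two, so it can never overflow to zero.
private static final int[] PRIMES = {
    3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
    47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103
};

private int primeProductHash(String s) {
    int hash = 1;
    for (char c : s.toCharArray()) {
        hash *= PRIMES[c - 'a'];  // commutative, so anagrams collide by design
    }
    return hash >>> 1;  // drop the always-set low bit, as noted above
}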
You will only hit zero if
a * b = 0 mod 2^64
which is equivalent to there being an integer k such that
a * b = k * 2^64
That is, we only get into trouble if the factors contribute enough powers of 2, which can happen only if some factors are even. Therefore, the easiest solution is ensuring that all factors are odd, for instance like this:
for (char ch : chars) {
    hash *= (ch << 1) | 1;
}
This allows you to keep 63 bits of information.
Note however that this technique will only avoid collisions caused by overflow, but not collisions caused by multipliers that share a common factor. If you wish to avoid that, too, you'll need coprime multipliers, which is easiest achieved if they are prime.
The naive way to avoid overflow is to use a larger type such as int or long. However, for your purposes, modular arithmetic might make more sense. You can do (a * b) % p for a prime p to maintain commutativity. (There is some deep mathematics here, called group theory, if you are interested in learning more.) You will need to limit p to be small enough that each a * b does not overflow. The easiest way to do this is to pick a p so that (p - 1)^2 can still be represented in the type used for the intermediate product.
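A minimal sketch of that modular approach, returning a short as in the question and assuming int intermediates (Java promotes short arithmetic to int anyway); the specific prime and method name are mine:

// Hypothetical modular product hash. Working mod a prime p keeps the
// product commutative (anagram-friendly), and because p is prime and
// every character is mapped to a nonzero residue, the running product
// can never collapse to zero.
private short modProductHash(String s) {
    final int p = 31991;  // prime; (p - 1)^2 < 2^31, so int intermediates are safe
    int hash = 1;
    for (char c : s.toCharArray()) {
        int v = (c % (p - 1)) + 1;  // map into 1..p-1 so no factor is 0 mod p
        hash = (hash * v) % p;
    }
    return (short) hash;  // result is in 1..p-1, which fits in a short
}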
I need to check whether X is divisible by Y or not; I don't need the actual remainder in other cases.
I'm using the "mod" operator.
if (X.mod(Y).equals(BigInteger.ZERO)) {
    // do something
}
Now, I'm interested only in whether X is divisible by Y; I don't need the actual remainder in other cases.
I'm looking for a faster way to check divisibility when the dividend is fixed. More precisely, I want to check a large number (a potential prime) against many smaller primes before going to the Lucas-Lehmer test.
I was just wondering: can we make some look-ahead assumption based on the last one or two digits of X and Y, and decide whether to go for the mod at all (when there is no chance of getting zero)?
Java BigIntegers (like most numbers in computers since about 1980) are binary, so the only moduli that can be optimized by looking at the last 'digits' (binary digits = bits) are powers of 2, and the only power of 2 that is prime is 2^1 = 2. BigInteger.testBit(0) tests that directly. However, most software that generates large should-be primes is for cryptography (like RSA, Rabin, DH, DSA) and ensures never to test an even candidate in the first place; see e.g. FIPS 186-4 A.1.1.2 (or earlier).
Since your actual goal is not as stated in the title, but to test whether one (large) integer is not divisible by any of several small primes, the mathematically fastest way is to form their product -- in general any common multiple, preferably the least, but for distinct primes the product is the LCM -- and compute its GCD with the candidate using Euclid's algorithm. If the GCD is 1, no prime factor of the product is shared with, and thus divides, the candidate. This requires several BigInteger divideAndRemainder operations, but it handles all of your tests in one fell swoop.
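A minimal sketch of that GCD trick (the particular prime list and method name are mine, for illustration):

import java.math.BigInteger;

// Hypothetical batch divisibility test: the candidate shares a factor
// with the product of the small primes iff gcd(candidate, product) != 1.
static boolean divisibleByAnySmallPrime(BigInteger candidate) {
    // product of the primes 3..41 (skip 2 if candidates are always odd)
    BigInteger product = BigInteger.valueOf(
            3L * 5 * 7 * 11 * 13 * 17 * 19 * 23 * 29 * 31 * 37 * 41);
    return !candidate.gcd(product).equals(BigInteger.ONE);
}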
A middle way is to 'bunch' several small primes whose product is less than 2^31 or 2^63, take BigInteger.mod (or .remainder) by that product as .intValue() or .longValue() respectively, and test that (if nonzero) for divisibility by each of the small primes using int or long operations, which are much faster than the BigInteger ones. Repeat for several bunches if needed. BigInteger.probablePrime and related routines do exactly this (primes 3..41 against a long) for candidates up to 95 bits, above which they consider an Eratosthenes-style sieve more efficient. (In either case followed by Miller-Rabin and Lucas-Lehmer.)
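A sketch of that bunching idea under the same assumptions (one bunch of the primes 3..41, whose product fits in a long; names are mine):

import java.math.BigInteger;

static final int[] SMALL_PRIMES = {3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41};
static final long BUNCH_PRODUCT = 152125131763605L;  // product of SMALL_PRIMES

// One BigInteger mod, then cheap long arithmetic per small prime.
static boolean divisibleByAnyInBunch(BigInteger candidate) {
    long r = candidate.mod(BigInteger.valueOf(BUNCH_PRODUCT)).longValue();
    if (r == 0) return true;  // divisible by every prime in the bunch
    for (int p : SMALL_PRIMES) {
        if (r % p == 0) return true;
    }
    return false;
}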
When measuring things like this in Java, remember that if you execute some method 'a lot', where the exact definition of 'a lot' can vary and be hard to pin down, all common JVMs will JIT-compile the code, changing the performance radically. If you are doing it a lot be sure to measure the compiled performance, and if you aren't doing it a lot the performance usually doesn't matter. There are many existing questions here on SO about pitfalls in 'microbenchmark(s)' for Java.
There are algorithms to check divisibility, but it's plural: each algorithm covers a particular group of numbers, e.g. divisible by 3, divisible by 4, etc. A list of some algorithms can be found e.g. on Wikipedia. There is no general, high-performance algorithm that could be used for any given number; otherwise, whoever found it would be famous and every divisible-by implementation out there would use it.
I was just wondering why primes are used in a class's hashCode() method. For example, when using Eclipse to generate my hashCode() method, the prime number 31 is always used:
public int hashCode() {
    final int prime = 31;
    // ...
}
References:
Here is a good primer on hash codes and an article on how hashing works that I found (C#, but the concepts are transferable):
Eric Lippert's Guidelines and rules for GetHashCode()
Prime numbers are chosen to best distribute data among hash buckets. If the distribution of inputs is random and evenly spread, then the choice of the hash code/modulus does not matter. It only has an impact when there is a certain pattern to the inputs.
This is often the case when dealing with memory locations. For example, all 32-bit integers are aligned to addresses divisible by 4. Check out the table below to visualize the effects of using a prime vs. non-prime modulus:
Input    Modulo 8    Modulo 7
    0        0           0
    4        4           4
    8        0           1
   12        4           5
   16        0           2
   20        4           6
   24        0           3
   28        4           0
Notice the almost-perfect distribution when using a prime modulus vs. a non-prime modulus.
However, although the above example is largely contrived, the general principle is that when dealing with a pattern of inputs, using a prime number modulus will yield the best distribution.
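A quick way to reproduce the table above for yourself (the input pattern and moduli are just the example's values):

// Prints the bucket index for inputs that are all multiples of 4,
// under a power-of-two modulus vs. a prime modulus.
public static void main(String[] args) {
    for (int input = 0; input <= 28; input += 4) {
        System.out.printf("%2d -> mod 8: %d, mod 7: %d%n",
                input, input % 8, input % 7);
    }
}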
Because you want the number you are multiplying by and the number of buckets you are inserting into to have orthogonal prime factorizations.
Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all). Similar entries will collide. Not good for a hash function.
31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern Java HashMap implementations keep the number of buckets to a power of 2).
For what it's worth, Effective Java, 2nd Edition hand-waves around the mathematics and just says that the reasons to choose 31 are:
Because it's an odd prime, and it's "traditional" to use primes
It's also one less than a power of two, which permits a bitwise optimization
Here's the full quote, from Item 9: Always override hashCode when you override equals:
The value 31 was chosen because it's an odd prime. If it were even and multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional.
A nice property of 31 is that the multiplication can be replaced by a shift (§15.19) and subtraction for better performance:
31 * i == (i << 5) - i
Modern VMs do this sort of optimization automatically.
While the recipe in this item yields reasonably good hash functions, it does not yield state-of-the-art hash functions, nor do Java platform libraries provide such hash functions as of release 1.6. Writing such hash functions is a research topic, best left to mathematicians and theoretical computer scientists.
Perhaps a later release of the platform will provide state-of-the-art hash functions for its classes and utility methods to allow average programmers to construct such hash functions. In the meantime, the techniques described in this item should be adequate for most applications.
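Since the quote states the shift-and-subtract identity without demonstrating it, here is a trivial illustration (the method name is mine):

// 31 * i == (i << 5) - i for every int i, including when the
// multiplication overflows, because both sides wrap identically
// in 32-bit two's-complement arithmetic.
static int times31(int i) {
    return (i << 5) - i;  // e.g. times31(7) == 217 == 31 * 7
}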
Rather simplistically, it can be said that using a multiplier with numerous divisors will result in more hash collisions. Since for effective hashing we want to minimize the number of collisions, we try to use a multiplier that has fewer divisors. A prime number by definition has exactly two distinct, positive divisors.
Related questions
Java hashCode from one field - the recipe, plus example of using Apache Commons Lang's builders
is it incorrect to define an hashcode of an object as the sum, multiplication, whatever, of all class variables hashcodes?
Absolute Beginner's Guide to Bit Shifting?
I heard that 31 was chosen so that the compiler can optimize the multiplication to left-shift 5 bits then subtract the value.
Here's a citation a little closer to the source.
It boils down to:
31 is prime, which reduces collisions
31 produces a good distribution, with a reasonable tradeoff in speed
First you compute the hash value modulo 2^32 (the size of an int), so you want something relatively prime to 2^32 (relatively prime means there are no common divisors other than 1). Any odd number would do for that.
Then for a given hash table the index is usually computed from the hash value modulo the size of the hash table, so you want something that is relatively prime to the size of the hash table. Often the sizes of hash tables are chosen as prime numbers for that reason. In the case of Java the Sun implementation makes sure that the size is always a power of two, so an odd number would suffice here, too. There is also some additional massaging of the hash keys to limit collisions further.
The bad effect if the hash table and the multiplier had a common factor n could be that in certain circumstances only 1/n entries in the hash table would be used.
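As an aside, the power-of-two table size mentioned above is what lets the index computation avoid a true mod entirely; a sketch of the common idiom (names are mine):

// With a power-of-two capacity, hash % capacity reduces to a bitwise
// AND with (capacity - 1), which is why an odd multiplier suffices there.
static int bucketIndex(int hash, int powerOfTwoCapacity) {
    return hash & (powerOfTwoCapacity - 1);
}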
The reason why prime numbers are used is to minimize collisions when the data exhibits some particular patterns.
First things first: if the data is random, then there's no need for a prime number; you can do a mod operation against any number and you will have the same number of collisions for each possible value of the modulus.
But when data is not random then strange things happen. For example consider numeric data that is always a multiple of 10.
If we use mod 4 we find:
10 mod 4 = 2
20 mod 4 = 0
30 mod 4 = 2
40 mod 4 = 0
50 mod 4 = 2
So of the 4 possible remainders (0, 1, 2, 3), only 0 and 2 ever occur, and all the collisions pile up there; that is bad.
If we use a prime number like 7:
10 mod 7 = 3
20 mod 7 = 6
30 mod 7 = 2
40 mod 7 = 4
50 mod 7 = 1
etc
We also note that 5 would not be a good choice even though 5 is prime; the reason is that all our keys are multiples of 5. This means we have to choose a prime number that doesn't divide our keys; choosing a large prime number is usually enough.
So, at the risk of being repetitive: the reason prime numbers are used is to neutralize the effect of patterns in the keys on the distribution of collisions of a hash function.
31 is also specific to Java's HashMap, which uses an int as the hash data type, hence the maximum capacity of 2^32. There is no point in using larger Fermat or Mersenne primes.
It generally helps achieve a more even spread of your data among the hash buckets, especially for low-entropy keys.
Simplified (i.e., leaving concurrency out), Random.next(int bits) looks like
protected int next(int bits) {
    seed = (seed * multiplier + addend) & mask;
    return (int) (seed >>> (48 - bits));
}
where masking gets used to reduce the seed to 48 bits. Why is it better than just
protected int next(int bits) {
    seed = seed * multiplier + addend;
    return (int) (seed >>> (64 - bits));
}
? I've read quite a lot about random numbers, but see no reason for this.
The reason is that the lower bits tend to have a lower period (at least with the algorithm Java uses).
From Wikipedia - Linear Congruential Generator:
As shown above, LCG's do not always use all of the bits in the values they produce. The Java implementation produces 48 bits with each iteration but only returns the 32 most significant bits from these values. This is because the higher-order bits have longer periods than the lower order bits (see below). LCG's that use this technique produce much better values than those that do not.
Edit:
after further reading (conveniently, on Wikipedia), the values of a, c, and m must satisfy these conditions to achieve the full period:
c and m must be relatively prime
a - 1 must be divisible by all prime factors of m
a - 1 must be a multiple of 4 if m is a multiple of 4
The only one that I can clearly tell is still satisfied is #3. #1 and #2 need to be checked, and I have a feeling that one (or both) of these fail.
From the docs at the top of java.util.Random:
The algorithm is described in The Art of Computer Programming,
Volume 2 by Donald Knuth in Section 3.2.1. It is a 48-bit seed,
linear congruential formula.
So the entire algorithm is designed to operate on 48-bit seeds, not 64-bit ones. I guess you can take it up with Mr. Knuth ;p
From wikipedia (the quote alluded to by the quote that #helloworld922 posted):
A further problem of LCGs is that the lower-order bits of the generated sequence have a far shorter period than the sequence as a whole if m is set to a power of 2. In general, the n-th least significant digit in the base-b representation of the output sequence, where b^k = m for some integer k, repeats with at most period b^n.
And furthermore, it continues (my italics):
The low-order bits of LCGs when m is a power of 2 should never be relied on for any degree of randomness whatsoever. Indeed, simply substituting 2^n for the modulus term reveals that the low-order bits go through very short cycles. In particular, any full-cycle LCG when m is a power of 2 will produce alternately odd and even results.
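To see those short cycles concretely, here is a tiny experiment using Java's own LCG constants (m = 2^48, a = 0x5DEECE66D, c = 0xB); the harness itself is mine:

// Iterates a raw LCG with m = 2^48 and prints the lowest bit of the
// state: the parity alternates every step, i.e. the bottom bit has
// period 2, exactly as the quote predicts for a full-cycle LCG.
public static void main(String[] args) {
    long seed = 42;
    final long a = 0x5DEECE66DL, c = 0xBL, mask = (1L << 48) - 1;
    for (int i = 0; i < 8; i++) {
        seed = (seed * a + c) & mask;
        System.out.print((seed & 1) + " ");  // prints: 1 0 1 0 1 0 1 0
    }
}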
In the end, the reason is probably historical: the folks at Sun wanted something to work reliably, and the Knuth formula gave 32 significant bits. Note that the java.util.Random API says this (my italics):
If two instances of Random are created with the same seed, and the same sequence of method calls is made for each, they will generate and return identical sequences of numbers. In order to guarantee this property, particular algorithms are specified for the class Random. Java implementations must use all the algorithms shown here for the class Random, for the sake of absolute portability of Java code. However, subclasses of class Random are permitted to use other algorithms, so long as they adhere to the general contracts for all the methods.
So we're stuck with it as a reference implementation. However that doesn't mean you can't use another generator (and subclass Random or create a new class):
from the same Wikipedia page:
MMIX by Donald Knuth: m = 2^64, a = 6364136223846793005, c = 1442695040888963407
There's a 64-bit formula for you.
Random numbers are tricky (as Knuth notes) and depending on your needs, you might be fine with just calling java.util.Random twice and concatenating the bits if you need a 64-bit number. If you really care about the statistical properties, use something like Mersenne Twister, or if you care about information leakage / unpredictability use java.security.SecureRandom.
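For illustration, here is a minimal sketch of that subclassing route with the MMIX constants (the class name is mine; note this forfeits the cross-implementation reproducibility quoted above):

import java.util.Random;

// Hypothetical 64-bit LCG using Knuth's MMIX constants (m = 2^64 is
// implicit in long overflow). Overriding next(int) redirects all of
// Random's derived methods (nextInt, nextDouble, ...) to this formula.
public class Mmix64Random extends Random {
    private long seed;

    public Mmix64Random(long seed) {
        this.seed = seed;
    }

    @Override
    protected int next(int bits) {
        seed = seed * 6364136223846793005L + 1442695040888963407L;
        return (int) (seed >>> (64 - bits));  // keep only the high, long-period bits
    }
}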
It doesn't look like there was a compelling reason for doing this.
Applying the mask is a conservative approach using a proven design.
Leaving it out would most probably lead to a better generator; however, without knowing the math well, it's a risky step.
Another small advantage of masking is a speed gain on 8-bit architectures, since it uses 6 bytes instead of 8.