Simplified (i.e., leaving concurrency out) Random.next(int bits) looks like
protected int next(int bits) {
    seed = (seed * multiplier + addend) & mask;
    return (int) (seed >>> (48 - bits));
}
where masking gets used to reduce the seed to 48 bits. Why is it better than just
protected int next(int bits) {
    seed = seed * multiplier + addend;
    return (int) (seed >>> (64 - bits));
}
? I've read quite a lot about random numbers, but see no reason for this.
The reason is that the lower bits tend to have a lower period (at least with the algorithm Java uses)
From Wikipedia - Linear Congruential Generator:
As shown above, LCG's do not always use all of the bits in the values they produce. The Java implementation produces 48 bits with each iteration but only returns the 32 most significant bits from these values. This is because the higher-order bits have longer periods than the lower order bits (see below). LCG's that use this technique produce much better values than those that do not.
edit:
after further reading (conveniently, on Wikipedia), the values of a, c, and m must satisfy these conditions to have the full period:
c and m must be relatively prime
a-1 is divisible by all prime factors of m
a-1 is a multiple of 4 if m is a multiple of 4
The only one that I can clearly tell is still satisfied is #3. #1 and #2 need to be checked, and I have a feeling that one (or both) of these fail.
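As a quick sanity check, these conditions can be tested directly against Java's constants (a = 25214903917, c = 11, m = 2^48, taken from the java.util.Random source). A minimal sketch:

import java.math.BigInteger;

public class HullDobellCheck {
    public static void main(String[] args) {
        BigInteger a = BigInteger.valueOf(25214903917L);
        BigInteger c = BigInteger.valueOf(11L);
        BigInteger m = BigInteger.ONE.shiftLeft(48);
        BigInteger aMinus1 = a.subtract(BigInteger.ONE);

        // #1: c and m relatively prime
        System.out.println(c.gcd(m).equals(BigInteger.ONE));
        // #2: a-1 divisible by all prime factors of m (m = 2^48, so the only prime factor is 2)
        System.out.println(!aMinus1.testBit(0));
        // #3: a-1 a multiple of 4, since m is a multiple of 4
        System.out.println(aMinus1.mod(BigInteger.valueOf(4)).signum() == 0);
    }
}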
From the docs at the top of java.util.Random:
The algorithm is described in The Art of Computer Programming,
Volume 2 by Donald Knuth in Section 3.2.1. It is a 48-bit seed,
linear congruential formula.
So the entire algorithm is designed to operate on 48-bit seeds, not 64-bit ones. I guess you can take it up with Mr. Knuth ;p
From wikipedia (the quote alluded to by the quote that #helloworld922 posted):
A further problem of LCGs is that the lower-order bits of the generated sequence have a far shorter period than the sequence as a whole if m is set to a power of 2. In general, the nth least significant digit in the base b representation of the output sequence, where b^k = m for some integer k, repeats with at most period b^n.
And furthermore, it continues (my italics):
The low-order bits of LCGs when m is a power of 2 should never be relied on for any degree of randomness whatsoever. Indeed, simply substituting 2^n for the modulus term reveals that the low order bits go through very short cycles. In particular, any full-cycle LCG when m is a power of 2 will produce alternately odd and even results.
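That alternating odd/even behaviour is easy to see by iterating the raw 48-bit state and watching its least significant bit. A minimal sketch (constants taken from the java.util.Random source, starting state chosen arbitrarily):

public class LowBits {
    public static void main(String[] args) {
        final long a = 0x5DEECE66DL, c = 0xBL, mask = (1L << 48) - 1;
        long seed = 42;                        // arbitrary starting state
        StringBuilder lowestBits = new StringBuilder();
        for (int i = 0; i < 16; i++) {
            seed = (seed * a + c) & mask;
            lowestBits.append(seed & 1);       // least significant bit of the state
        }
        System.out.println(lowestBits);        // a strictly alternating 0/1 pattern
    }
}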
In the end, the reason is probably historical: the folks at Sun wanted something to work reliably, and the Knuth formula gave 32 significant bits. Note that the java.util.Random API says this (my italics):
If two instances of Random are created with the same seed, and the same sequence of method calls is made for each, they will generate and return identical sequences of numbers. In order to guarantee this property, particular algorithms are specified for the class Random. Java implementations must use all the algorithms shown here for the class Random, for the sake of absolute portability of Java code. However, subclasses of class Random are permitted to use other algorithms, so long as they adhere to the general contracts for all the methods.
So we're stuck with it as a reference implementation. However that doesn't mean you can't use another generator (and subclass Random or create a new class):
from the same Wikipedia page:
MMIX by Donald Knuth: m = 2^64, a = 6364136223846793005, c = 1442695040888963407
There's a 64-bit formula for you.
Random numbers are tricky (as Knuth notes) and depending on your needs, you might be fine with just calling java.util.Random twice and concatenating the bits if you need a 64-bit number. If you really care about the statistical properties, use something like Mersenne Twister, or if you care about information leakage / unpredictability use java.security.SecureRandom.
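For the "call it twice and concatenate" option, a minimal sketch (this is essentially what java.util.Random.nextLong() already does internally, as quoted further down):

import java.util.Random;

public class TwoCalls {
    public static void main(String[] args) {
        Random rng = new Random();
        long hi = rng.nextInt();                          // 32 pseudo-random bits
        long lo = rng.nextInt();                          // 32 more
        long combined = (hi << 32) | (lo & 0xFFFFFFFFL);  // mask off the sign extension
        System.out.println(Long.toHexString(combined));
    }
}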
It doesn't look like there was a good reason for doing this.
Applying the mask is a conservative approach using a proven design.
Leaving it out would most probably lead to a better generator; however, without knowing the math well, it's a risky step.
Another small advantage of masking is a speed gain on 8-bit architectures, since it uses 6 bytes instead of 8.
Related
I need to check whether X is divisible by Y or not. I don't need the actual remainder in the other cases.
I'm using the "mod" operator.
if (X.mod(Y).equals(BigInteger.ZERO)) {
    // do something
}
Again, I'm only interested in the case where X is divisible by Y; I don't need the actual remainder otherwise. What I'm after is a faster way to check divisibility when the dividend is fixed: more precisely, checking a large number (a potential prime) against many smaller primes before going to the Lucas-Lehmer test.
I was wondering whether we could make some look-ahead decision based on the last one or two digits of X and Y, so that we only go for the mod when there is actually a chance of getting zero.
Java BigIntegers (like most numbers in computers since about 1980) are binary, so the only moduli that can be checked by looking at the last 'digits' (binary digits = bits) are powers of 2, and the only power of 2 that is prime is 2^1 = 2. BigInteger.testBit(0) tests that directly. However, most software that generates large should-be primes is for cryptography (like RSA, Rabin, DH, DSA) and ensures never to test an even candidate in the first place; see e.g. FIPS186-4 A.1.1.2 (or earlier).
Since your actual goal is not as stated in the title, but to test whether one (large) integer is divisible by any of several small primes, the mathematically fastest way is to form their product -- in general any common multiple, preferably the least, but for distinct primes the product is the LCM -- and compute its GCD with the candidate using Euclid's algorithm. If the GCD is 1, none of the prime factors in the product divides the candidate. This requires several BigInteger divideAndRemainder operations, but it handles all of your tests in one swoop.
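A minimal sketch of that idea (the class name and the particular prime list below are just for illustration):

import java.math.BigInteger;

public class SmallPrimeScreen {

    // product of the odd primes we want to screen against (chosen arbitrarily here)
    static final BigInteger SMALL_PRIME_PRODUCT =
            product(3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41);

    static BigInteger product(long... primes) {
        BigInteger p = BigInteger.ONE;
        for (long q : primes) p = p.multiply(BigInteger.valueOf(q));
        return p;
    }

    // true if the candidate shares no factor with any of the small primes above
    static boolean passesScreen(BigInteger candidate) {
        return candidate.gcd(SMALL_PRIME_PRODUCT).equals(BigInteger.ONE);
    }

    public static void main(String[] args) {
        System.out.println(passesScreen(new BigInteger("104729"))); // 104729 is prime, prints true
        System.out.println(passesScreen(new BigInteger("104727"))); // divisible by 3, prints false
    }
}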
A middle way is to 'bunch' several small primes whose product is less than 2^31 or 2^63, take BigInteger.mod (or .remainder) by that product, as .intValue() or .longValue() respectively, and test that (if nonzero) for divisibility by each of the small primes using int or long operations, which are much faster than the BigInteger ones. Repeat for several bunches if needed. BigInteger.probablePrime and related routines do exactly this (primes 3..41 against a long) for candidates up to 95 bits, above which they consider an Eratosthenes-style sieve more efficient. (In either case followed by Miller-Rabin and Lucas-Lehmer.)
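A minimal sketch of that 'bunching' idea, again with an arbitrary prime list (the product 3*5*...*41 = 152125131763605 fits comfortably in a long):

import java.math.BigInteger;

public class BunchedTrialDivision {

    static final long[] SMALL_PRIMES = {3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41};
    static final BigInteger BUNCH_PRODUCT;
    static {
        BigInteger p = BigInteger.ONE;
        for (long q : SMALL_PRIMES) p = p.multiply(BigInteger.valueOf(q));
        BUNCH_PRODUCT = p;
    }

    // true if the candidate is divisible by at least one of the small primes
    static boolean hasSmallFactor(BigInteger candidate) {
        long r = candidate.mod(BUNCH_PRODUCT).longValue(); // one expensive BigInteger op
        for (long q : SMALL_PRIMES) {
            if (r % q == 0) return true;                   // cheap primitive ops from here on
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasSmallFactor(new BigInteger("104729"))); // prime, prints false
        System.out.println(hasSmallFactor(new BigInteger("104727"))); // divisible by 3, prints true
    }
}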
When measuring things like this in Java, remember that if you execute some method 'a lot', where the exact definition of 'a lot' can vary and be hard to pin down, all common JVMs will JIT-compile the code, changing the performance radically. If you are doing it a lot be sure to measure the compiled performance, and if you aren't doing it a lot the performance usually doesn't matter. There are many existing questions here on SO about pitfalls in 'microbenchmark(s)' for Java.
There are algorithms to check divisibility, but it's plural, and each algorithm covers a particular group of numbers, e.g. divisible by 3, divisible by 4, etc. A list of some algorithms can be found e.g. at Wikipedia. There is no general, high-performance algorithm that could be used for any given number; otherwise the one who found it would be famous, and every divisible-by implementation out there would use it.
The Javadoc of the nextLong() method of the Random class states that
Because class Random uses a seed with only 48 bits, this algorithm will not return all possible long values. (Random javadoc)
The implementation is:
return ((long)next(32) << 32) + next(32);
The way I see it is as follows: to create any possible long, we should generate any possible bit pattern of 64 bits with equal likelihood. Assuming the calls to next(int) give us 32 random bits, then the concatenation of these bits will be a sequence of 64 random bits and hence we generate each 64 bit pattern with equal likelihood. And therefore all possible long values.
I suppose that the person who wrote the javadoc knows better and that my reasoning is flawed somehow. Can anyone explain where my reasoning is incorrect and what kind of longs will be returned then?
Since Random is pseudo-random, we know that given the same seed it will return the same values. Taking the docs at their word, there are 48 bits of seed. This means there are at most 2^48 unique values that can be printed out. If there were more, that would mean some value that we used before at a position < 2^48 gives us a different value this time than it did last time.
If we try to join up two results what do we see?
|a|b|c|d|e|f|...|(2^48)-1|
Above are some values. How many pairs are there? a-b, b-c, c-d,... (2^48)-1-a. There are also 2^48 pairs. We can't fill all values of 2^64 with only the 2^48 pairs.
Pseudo-random number generators are like giant rings of numbers. You start somewhere, and then move around the ring step by step as you pull numbers out. This means that with a given seed - an initial internal state - all subsequent numbers are predetermined. Therefore, since the internal state is only 48 bits wide, only 2 to the power 48 distinct random numbers are possible. Since each next number is given by the previous one, it is now clear why that implementation of nextLong will not generate all possible long values.
Let's say a perfect pseudo-random K-bit generator is one that creates all possible 2^K seed values in 2^K tries. We can't do better, as there are only 2^K states, and every state is completely determined by the previous state and itself determines the next state.
Assume we write the output of the 48-bit generator down in binary. We get 2^48 * 48 bits that way.
And now we can say exactly how many 64-bit sequences we can get by going through the list and noting the next 64 bits (wrapping to the start when needed). It is exactly the number of bits we have: 13510798882111488.
Even if we assume that all those 64-bit sequences are pairwise different (which is not at all obvious), we have a long way to go until 2^64: 18446744073709551616.
I write the numbers again:
18446744073709551616 pairwise different 64 bit sequences we need
13510798882111488 64 bit sequences we can get with a 48 bit seed.
This proves that the javadoc writer was right: only about 1/1365th of all long values (13510798882111488 out of 18446744073709551616) can be produced by this generator.
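The same counting argument can be seen with a toy generator small enough to enumerate. The sketch below uses an arbitrary 8-bit LCG (the constants are illustrative, not Java's); no matter how many times you draw, concatenating two consecutive 8-bit outputs can never cover more than 2^8 of the 2^16 possible "double-width" values, for exactly the reason described above:

import java.util.HashSet;
import java.util.Set;

public class ToyPairCount {
    public static void main(String[] args) {
        final int a = 137, c = 187, m = 256;        // toy LCG with an 8-bit state
        Set<Integer> pairs = new HashSet<>();
        int state = 0;
        for (int i = 0; i < 10_000; i++) {          // far more draws than states
            int out1 = state;
            state = (a * state + c) % m;
            int out2 = state;
            pairs.add((out1 << 8) | out2);          // the 16-bit "concatenated" value
        }
        System.out.println(pairs.size());           // at most 256, nowhere near 65536
    }
}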
I have been reviewing the source code of Arrays.hashCode(char[] c).
I am not convinced that the algorithm applied here will work well in all cases.
public static int hashCode(int a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (int element : a)
        result = 31 * result + element;

    return result;
}
Does the hash function implemented here really distribute all the input arrays uniformly? And why do we use the prime 31 here?
Why use the prime number 31?
This can be split into two parts:
Why a prime number?
Here we need to understand that our goal is to get a unique HashCode for an object which will help us to find that object in O(1) time.
The key word here is unique.
Primes
Primes are unique numbers. They are unique in that, the product of a
prime with any other number has the best chance of being unique (not
as unique as the prime itself of-course) due to the fact that a prime
is used to compose it. This property is used in hashing functions.
Why number 31?
From Effective Java
Because it's an odd prime, and it's "traditional" to use primes.
It's also one less than a power of two, which permits bitwise optimization.
Here's the full quote,
from Item 9: Always override
hashCode when you override equals:
The value 31 was chosen because it's an odd prime. If it were even and
multiplication overflowed, information would be lost, as
multiplication by 2 is equivalent to shifting. The advantage of using
a prime is less clear, but it is traditional.
A nice property of 31 is that the multiplication can be replaced by a
shift (§15.19) and a subtraction for better performance:
31 * i == (i << 5) - i. Modern VMs do this sort of optimization
automatically.
While the recipe in this item yields reasonably good hash functions,
it does not yield state-of-the-art hash functions, nor do Java
platform libraries provide such hash functions as of release 1.6.
Writing such hash functions is a research topic, best left to
mathematicians and theoretical computer scientists.
Perhaps a later release of the platform will provide state-of-the-art
hash functions for its classes and utility methods to allow average
programmers to construct such hash functions. In the meantime, the
techniques described in this item should be adequate for most
applications.
This is a very good source.
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
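The identity itself is easy to check; here's a minimal sketch (sample values chosen arbitrarily, including ones that overflow an int):

public class ThirtyOne {
    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, 123456789, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : samples) {
            // holds even when 31 * i overflows, because both sides wrap modulo 2^32
            System.out.println(31 * i == (i << 5) - i);   // prints true for every sample
        }
    }
}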
See this post: Why does Java's hashCode() in String use 31 as a multiplier?
That's where TheEwook's answer is from.
Generally, you use primes because they don't have any factors and will distribute better modulo N where N is the size of the range you are binning into. 31 is a small, odd prime so it works well. However, as the various sources you will find on the Internets will indicate, a small prime like 31 may lead to more collisions than a larger prime (especially if the values being hashed are not well distributed to begin with), so you could pick a larger prime if you found the performance to not be as good as you'd like.
Why were 181783497276652981 and 8682522807148012 chosen in Random.java?
Here's the relevant source code from Java SE JDK 1.7:
/**
 * Creates a new random number generator. This constructor sets
 * the seed of the random number generator to a value very likely
 * to be distinct from any other invocation of this constructor.
 */
public Random() {
    this(seedUniquifier() ^ System.nanoTime());
}

private static long seedUniquifier() {
    // L'Ecuyer, "Tables of Linear Congruential Generators of
    // Different Sizes and Good Lattice Structure", 1999
    for (;;) {
        long current = seedUniquifier.get();
        long next = current * 181783497276652981L;
        if (seedUniquifier.compareAndSet(current, next))
            return next;
    }
}

private static final AtomicLong seedUniquifier
    = new AtomicLong(8682522807148012L);
So, invoking new Random() without any seed parameter takes the current "seed uniquifier" and XORs it with System.nanoTime(). Then it uses 181783497276652981 to create another seed uniquifier to be stored for the next time new Random() is called.
The literals 181783497276652981L and 8682522807148012L are not placed in constants, but they don't appear anywhere else.
At first the comment gives me an easy lead. Searching online for that article yields the actual article. 8682522807148012 doesn't appear in the paper, but 181783497276652981 does appear -- as a substring of another number, 1181783497276652981, which is 181783497276652981 with a 1 prepended.
The paper claims that 1181783497276652981 is a number that yields good "merit" for a linear congruential generator. Was this number simply mis-copied into Java? Does 181783497276652981 have an acceptable merit?
And why was 8682522807148012 chosen?
Searching online for either number yields no explanation, only this page that also notices the dropped 1 in front of 181783497276652981.
Could other numbers have been chosen that would have worked as well as these two numbers? Why or why not?
Was this number simply mis-copied into Java?
Yes, seems to be a typo.
Does 181783497276652981 have an acceptable merit?
This could be determined using the evaluation algorithm presented in the paper. But the merit of the "original" number is probably higher.
And why was 8682522807148012 chosen?
Seems to be random. It could be the result of System.nanoTime() when the code was written.
Could other numbers have been chosen that would have worked as well as these two numbers?
Not every number would be equally "good". So, no.
Seeding Strategies
There are differences in the default seeding scheme between different versions and implementations of the JRE.
public Random() { this(System.currentTimeMillis()); }
public Random() { this(++seedUniquifier + System.nanoTime()); }
public Random() { this(seedUniquifier() ^ System.nanoTime()); }
The first one is not acceptable if you create multiple RNGs in a row. If their creation times fall in the same millisecond range, they will give completely identical sequences. (same seed => same sequence)
The second one is not thread safe. Multiple threads can get identical RNGs when initializing at the same time. Additionally, seeds of subsequent initializations tend to be correlated. Depending on the actual timer resolution of the system, the seed sequence could be linearly increasing (n, n+1, n+2, ...). As stated in How different do random seeds need to be? and the referenced paper Common defects in initialization of pseudorandom number generators, correlated seeds can generate correlation among the actual sequences of multiple RNGs.
The third approach creates randomly distributed and thus uncorrelated seeds, even across threads and subsequent initializations.
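A minimal sketch of the first pitfall (simulated here by passing the same millisecond value explicitly, instead of relying on the System.currentTimeMillis()-based constructor shown first above):

import java.util.Random;

public class MillisSeedPitfall {
    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        Random r1 = new Random(now);              // "created in the same millisecond"
        Random r2 = new Random(now);
        System.out.println(r1.nextInt() == r2.nextInt());   // true
        System.out.println(r1.nextInt() == r2.nextInt());   // true again: identical sequences
    }
}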
So the current java docs:
This constructor sets the seed of the random number generator to a
value very likely to be distinct from any other invocation of this
constructor.
could be extended by "across threads" and "uncorrelated"
Seed Sequence Quality
But the randomness of the seeding sequence is only as good as the underlying RNG.
The RNG used for the seed sequence in this java implementation uses a multiplicative linear congruential generator (MLCG) with c=0 and m=2^64. (The modulus 2^64 is implicitly given by the overflow of 64bit long integers)
Because of the zero c and the power-of-2-modulus, the "quality" (cycle length, bit-correlation, ...) is limited. As the paper says, besides the overall cycle length, every single bit has an own cycle length, which decreases exponentially for less significant bits. Thus, lower bits have a smaller repetition pattern. (The result of seedUniquifier() should be bit-reversed, before it is truncated to 48-bits in the actual RNG)
But it is fast! And to avoid unnecessary compare-and-set-loops, the loop body should be fast. This probably explains the usage of this specific MLCG, without addition, without xoring, just one multiplication.
And the mentioned paper presents a list of good "multipliers" for c=0 and m=2^64, such as 1181783497276652981.
All in all: A for effort @ JRE-developers ;) But there is a typo.
(But who knows, unless someone evaluates it, there is the possibility that the missing leading 1 actually improves the seeding RNG.)
But some multipliers are definitely worse:
"1" leads to a constant sequence.
"2" leads to a single-bit-moving sequence (somehow correlated)
...
The inter-sequence-correlation for RNGs is actually relevant for (Monte Carlo) Simulations, where multiple random sequences are instantiated and even parallelized. Thus a good seeding strategy is necessary to get "independent" simulation runs. Therefore the C++11 standard introduces the concept of a Seed Sequence for generating uncorrelated seeds.
If you consider that the equation used for the random number generator is:
X(n+1) = (a * X(n) + c) mod m
where X(n+1) is the next number, a is the multiplier, X(n) is the current number, c is the increment and m is the modulus.
If you look further into Random, a, c and m are defined in the header of the class
private static final long multiplier = 0x5DEECE66DL; //= 25214903917 -- 'a'
private static final long addend = 0xBL; //= 11 -- 'c'
private static final long mask = (1L << 48) - 1; //= 2 ^ 48 - 1 -- 'm'
and looking at the method protected int next(int bits), this is where the equation is implemented:
nextseed = (oldseed * multiplier + addend) & mask;
//X(n+1) = (X(n) * a + c ) mod m
This implies that the method seedUniquifier() is actually producing X(n), or, in the first case at initialisation, X(0), which is actually 8682522807148012 * 181783497276652981; this value is then modified further by the value of System.nanoTime(). This algorithm is consistent with the equation above, but with X(0) = 8682522807148012, a = 181783497276652981, m = 2^64 and c = 0. And since the mod m is performed by the long overflow, the equation above simply becomes X(n+1) = X(n) * a (mod 2^64).
Looking at the paper, the value a = 1181783497276652981 is listed for m = 2^64, c = 0. So it appears to just be a typo, and the value 8682522807148012 for X(0) appears to be a seemingly randomly chosen number from legacy code for Random, as seen here. But the merit of these chosen numbers could still be valid; as mentioned by Thomas B., it is probably just not as "good" as the one in the paper.
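For reference, here's a minimal single-threaded sketch of that recurrence (the real seedUniquifier() uses an AtomicLong, and the result is additionally XORed with System.nanoTime(); the constants are copied from the JDK source):

public class SeedUniquifierSketch {
    public static void main(String[] args) {
        final long a = 181783497276652981L;   // the multiplier (note the dropped leading 1)
        long x = 8682522807148012L;           // X(0)
        for (int n = 1; n <= 3; n++) {
            x *= a;                           // X(n) = X(n-1) * a, mod 2^64 via long overflow
            System.out.println("X(" + n + ") = " + x);
        }
    }
}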
EDIT - Below original thoughts have since been clarified so can be disregarded but leaving it for reference
This leads me to the following conclusions:
The reference to the paper is not for the value itself but for the methods used to obtain the values due to the different values of a, c and m
It is mere coincidence that the value is otherwise the same other than the leading 1 and the comment is misplaced (still struggling to believe this though)
OR
There has been a serious misunderstanding of the tables in the paper, and the developers have just chosen a value at random; after all, by the time it is multiplied out, what was the point of using the table value in the first place, especially since you can provide your own seed value anyway, in which case these values are not even taken into account?
So to answer your question
Could other numbers have been chosen that would have worked as well as these two numbers? Why or why not?
Yes, any number could have been used; in fact, if you specify a seed value when you instantiate Random, you are using some other value. This value does not have any effect on the performance of the generator; that is determined by the values of a, c and m, which are hard-coded within the class.
As per the link you provided, they have chosen (after adding the missing 1 :) ) the best multiplier from the m = 2^64 table, because a long can't hold a number from the 2^128 table.
I've simplified a bug I'm experiencing down to the following lines of code:
int[] vals = new int[8];
for (int i = 0; i < 1500; i++)
    vals[new Random(i).nextInt(8)]++;
System.out.println(Arrays.toString(vals));
The output is: [0, 0, 0, 0, 0, 1310, 190, 0]
Is this just an artifact of choosing consecutive numbers to seed Random and then using nextInt with a power of 2? If so, are there other pitfalls like this I should be aware of, and if not, what am I doing wrong? (I'm not looking for a solution to the above problem, just some understanding about what else could go wrong)
Dan, well-written analysis. As the javadoc is pretty explicit about how the numbers are calculated, the mystery is not so much why this happened as whether there are other anomalies like this to watch out for. I didn't see any documentation about consecutive seeds, and I'm hoping someone with some experience with java.util.Random can point out other common pitfalls.
As for the code, the need is for several parallel agents that have repeatable random behavior and happen to choose from an enum 8 elements long as their first step. Once I discovered this behavior, the seeds all come from a master Random object created from a known seed. In the former (sequentially-seeded) version of the program, all behavior quickly diverged after that first call to nextInt, so it took quite a while for me to narrow the program's behavior down to the RNG library, and I'd like to avoid that situation in the future.
As much as possible, the seed for an RNG should itself be random. The seeds that you are using are only going to differ in one or two bits.
There's very rarely a good reason to create two separate RNGs in the one program. Your code is not one of those situations where it makes sense.
Just create one RNG and reuse it, then you won't have this problem.
In response to comment from mmyers:
Do you happen to know java.util.Random
well enough to explain why it picks 5
and 6 in this case?
The answer is in the source code for java.util.Random, which is a linear congruential RNG. When you specify a seed in the constructor, it is manipulated as follows.
seed = (seed ^ 0x5DEECE66DL) & mask;
Where the mask simply retains the lower 48 bits and discards the others.
When generating the actual random bits, this seed is manipulated as follows:
randomBits = (seed * 0x5DEECE66DL + 0xBL) & mask;
Now if you consider that the seeds used by Parker were sequential (0-1499), and they were used once and then discarded, the first four seeds generated the following four sets of random bits:
101110110010000010110100011000000000101001110100
101110110001101011010101011100110010010000000111
101110110010110001110010001110011101011101001110
101110110010011010010011010011001111000011100001
Note that the top 10 bits are identical in each case. This is a problem because he only wants to generate values in the range 0-7 (which only requires a few bits), and the RNG implementation does this by shifting the higher bits to the right and discarding the low bits. It does this because in the general case the high bits are more random than the low bits. In this case they are not, because the seed data was poor.
Finally, to see how these bits convert into the decimal values that we get, you need to know that java.util.Random makes a special case when n is a power of 2. It requests 31 random bits (the top 31 bits from the above 48), multiplies that value by n and then shifts it 31 bits to the right.
Multiplying by 8 (the value of n in this example) is the same as shifting left 3 places. So the net effect of this procedure is to shift the 31 bits 28 places to the right. In each of the 4 examples above, this leaves the bit pattern 101 (or 5 in decimal).
If we didn't discard the RNGs after just one value, we would see the sequences diverge. While the four sequences above all start with 5, the second values of each are 6, 0, 2 and 4 respectively. The small differences in the initial seeds start to have an influence.
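Here's a minimal sketch that reproduces the bit patterns and the resulting value 5 for seeds 0..3 (the constants are taken from the java.util.Random source; this is a hand-rolled reconstruction, not the library API itself):

public class SeedBits {
    public static void main(String[] args) {
        final long multiplier = 0x5DEECE66DL;
        final long addend = 0xBL;
        final long mask = (1L << 48) - 1;
        for (long seed = 0; seed < 4; seed++) {
            long scrambled = (seed ^ multiplier) & mask;          // what the constructor stores
            long bits = (scrambled * multiplier + addend) & mask; // first call to next()
            // for nextInt(8), a power of two, the result is the top 3 of these 48 bits
            System.out.println(Long.toBinaryString(bits) + " -> nextInt(8) = " + (bits >>> 45));
        }
    }
}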
In response to the updated question: java.util.Random is thread-safe, you can share one instance across multiple threads, so there is still no need to have multiple instances. If you really have to have multiple RNG instances, make sure that they are seeded completely independently of each other, otherwise you can't trust the outputs to be independent.
As to why you get these kind of effects, java.util.Random is not the best RNG. It's simple, pretty fast and, if you don't look too closely, reasonably random. However, if you run some serious tests on its output, you'll see that it's flawed. You can see that visually here.
If you need a more random RNG, you can use java.security.SecureRandom. It's a fair bit slower, but it works properly. One thing that might be a problem for you though is that it is not repeatable. Two SecureRandom instances with the same seed won't give the same output. This is by design.
So what other options are there? This is where I plug my own library. It includes 3 repeatable pseudo-RNGs that are faster than SecureRandom and more random than java.util.Random. I didn't invent them, I just ported them from the original C versions. They are all thread-safe.
I implemented these RNGs because I needed something better for my evolutionary computation code. In line with my original brief answer, this code is multi-threaded but it only uses a single RNG instance, shared between all threads.