The Javadoc of the nextLong() method of the Random class states that
Because class Random uses a seed with only 48 bits, this algorithm will not return all possible long values. (Random javadoc)
The implementation is:
return ((long)next(32) << 32) + next(32);
The way I see it is as follows: to create any possible long, we should generate any possible bit pattern of 64 bits with equal likelihood. Assuming the calls to next(int) give us 32 random bits, then the concatenation of these bits will be a sequence of 64 random bits and hence we generate each 64 bit pattern with equal likelihood. And therefore all possible long values.
I suppose that the person who wrote the javadoc knows better and that my reasoning is flawed somehow. Can anyone explain where my reasoning is incorrect, and what kind of longs will be returned then?
Since Random is pseudo-random, we know that given the same seed it will return the same values. Taking the docs at their word, there are 48 bits of seed. This means there are at most 2^48 unique values in the output sequence: if there were more, then some seed state encountered before position 2^48 would have to produce a different value the second time around, contradicting determinism.
If we try to join up two results what do we see?
|a|b|c|d|e|f|...|(2^48)-1|
Above are some values. How many adjacent pairs are there? a-b, b-c, c-d, ..., and finally (2^48)-1 wrapping back around to a. That is exactly 2^48 pairs. We can't cover all 2^64 long values with only 2^48 pairs.
Pseudo-random number generators are like giant rings of numbers. You start somewhere, and then move around the ring step by step as you pull numbers out. This means that with a given seed - an initial internal state - all subsequent numbers are predetermined. Therefore, since the internal state is only 48 bits wide, only 2 to the power 48 random numbers are possible. And since the next number is fully determined by the previous one, it is now clear why that implementation of nextLong will not generate all possible long values.
Let's say a perfect pseudo-random K-bit generator is one that produces all possible 2^K seed values in 2^K tries. We can't do better, as there are only 2^K states, and every state is completely determined by the previous state and itself determines the next state.
Assume we write the output of the 48-bit generator down in binary. We get 2^48 * 48 bits that way.
And now we can say exactly how many 64-bit sequences we can get by going through the list and noting the next 64 bits (wrapping to the start when needed). It is exactly the number of bits we have: 13510798882111488.
Even if we assume that all those 64-bit sequences are pairwise different (which is not at all obvious), we have a long way to go until 2^64: 18446744073709551616.
I write the numbers again:
18446744073709551616 pairwise different 64 bit sequences we need
13510798882111488 64 bit sequences we can get with a 48 bit seed.
This proves that the javadoc writer was right: at most about 1/1365th of all long values can be produced by this generator.
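The arithmetic is easy to verify directly. Here is a quick sanity check using BigInteger (my own sketch, not part of the original answer):
import java.math.BigInteger;

public class SeedCountCheck {
    public static void main(String[] args) {
        // 2^48 states, each contributing 48 output bits to one full cycle.
        BigInteger reachable = BigInteger.valueOf(2).pow(48).multiply(BigInteger.valueOf(48));
        BigInteger allLongs = BigInteger.valueOf(2).pow(64);
        System.out.println(reachable);                  // 13510798882111488
        System.out.println(allLongs);                   // 18446744073709551616
        System.out.println(allLongs.divide(reachable)); // 1365, so roughly 1/1365th
    }
}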
Related
The Requirement:
I need to generate 4-digit numbers without duplicates; even if my application gets closed and restarted, newly generated numbers must not duplicate earlier ones.
I don't want to store all previously generated numbers in any storage.
Is there any algorithm with a high likelihood of producing mostly unique numbers over the course of a day?
Thank you
Don't generate a random number. Instead, generate a sequential number from 0000 to 9999, and then obfuscate it using the technique described in https://stackoverflow.com/a/34420445/56778.
That way, the only thing you have to save is the next sequential number.
That example uses a multiplicative inverse to map the numbers from 0 to 100 to other numbers within the same range. Every number from 0 to 100 will be mapped to a unique number between 0 and 100. It's quick and easy, and you can change the mapping by changing the constants.
More information at http://blog.mischel.com/2017/06/20/how-to-generate-random-looking-keys/
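As a rough illustration of the technique (the constant and names here are my own, not taken from the linked posts): pick a multiplier coprime to 10,000 and run the sequential counter through it. Because such a multiplier has a multiplicative inverse modulo 10,000, the mapping is a bijection, so no two counter values collide.
public class ObfuscatedCounter {
    // 4999 is coprime to 10000, so n -> (n * 4999) % 10000 is a bijection
    // on 0..9999. Any multiplier coprime to 10000 would work.
    private static final int MULTIPLIER = 4999;
    private int next = 0; // the only value you need to persist across restarts

    public int nextId() {
        int id = (next * MULTIPLIER) % 10_000;
        next = (next + 1) % 10_000; // wraps after all 10,000 ids are issued
        return id;
    }
}
Calling nextId() repeatedly yields 0000, 4999, 9998, 4997, 9996, ... - sequential underneath, but not obviously so to an observer.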
Ten thousand is too few
Generating more than a few random numbers within a range of 0 to 9,999 is not likely to go well. Depending on how many such numbers you need, you are quite likely to run into duplicates. That seems rather obvious.
Cryptographically-strong random number generator
As for generating the most randomness, you need a cryptographically-strong random number generator that produces non-deterministic output.
java.security.SecureRandom
Java provides such a beast in the java.security.SecureRandom class. Study the class documentation for options.
SecureRandom secureRandom = new SecureRandom();
Notice that a SecureRandom is a java.util.Random. On that class you will find convenient methods such as nextInt(int bound), which gives you a pseudorandom, uniformly distributed int value between 0 (inclusive) and the specified value (exclusive).
int r = secureRandom.nextInt( 10_000 ) ;
Let's try it.
SecureRandom secureRandom = new SecureRandom();
for( int i = 1 ; i <= 20 ; i ++ )
{
System.out.println( secureRandom.nextInt( 10_000 ) ) ;
}
7299
3581
7106
1195
8713
9517
6954
5558
6136
1623
7006
2910
5855
6273
1691
588
5629
7347
7123
6973
If you need more than a few such numbers, or if they absolutely must be distinct (no duplicates), then you must change your constraints.
For a small range like 10,000, keep a collection of generated numbers, checking each newly generated number against it (see the sketch after this list).
Dramatically expand your limit far beyond 10,000.
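A minimal sketch of the first option, keeping a Set of already-issued numbers (my own illustration, assuming the whole 0-9999 range may eventually be consumed):
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

public class DistinctRandomIds {
    private final SecureRandom random = new SecureRandom();
    private final Set<Integer> used = new HashSet<>();

    // Returns a value in 0-9999 that this instance has never returned before.
    public int nextDistinct() {
        if (used.size() == 10_000) {
            throw new IllegalStateException("all 10,000 values have been issued");
        }
        int candidate;
        do {
            candidate = random.nextInt(10_000);
        } while (!used.add(candidate)); // Set.add returns false on duplicates
        return candidate;
    }
}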
UUID
If you need a universally unique identifier, use, well, a Universally Unique Identifier (UUID). Java represents a UUID value with the java.util.UUID class.
UUIDs were designed for systems to be able to independently generate a virtually unique identifier without coordinating with a central authority. They are 128-bit values, quadruple the 32 bits of an int primitive in Java. UUIDs are often displayed to humans as a canonically-formatted 36-character hexadecimal string grouped with hyphens. Do not confuse this textual representation with a UUID value itself; a UUID is a 128-bit value, not text.
The Version 1 type of UUID is ideal, as it represents a point in both space and time by combining a date-time, a MAC address, and a small arbitrary number. This makes it virtually impossible to have duplicates. By “virtually impossible”, I mean literally astronomically-large numbers. Java does not bundle a Version 1 generator because of security/privacy concerns. You can add a library, make a web services call, or ask your database such as Postgres to generate one for you.
Or, for most cases where we need a relatively small number of instances, use the Version 4 UUID where 122 of the 128 bits are randomly generated. Java bundles a Version 4 generator. Simply call UUID.randomUUID.
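For example:
UUID uuid = UUID.randomUUID();  // a Version 4 UUID: 122 random bits
System.out.println( uuid );     // e.g. 7f0104f5-ba39-42fc-bd32-7b3a7f77c204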
I agree with Basil Bourque and others about whether what you propose is the "right" approach. However, if you really want to do what you are proposing, then here is one way to achieve your goal (generating as many numbers as possible within the range, in pseudorandom order, without having to store all previously generated values):
find a random number generator that has a period roughly within the range that you require;
to generate the next ID, take the next number generated, discarding ones that are not within range.
So for four-digit numbers, one option would be an XORShift generator configured to generate 14-bit numbers (i.e. in the range 1-16383, near enough to 1-9999):
private int seed = 1; // any value in 1..16383; zero is a fixed point and must be avoided

public int nextJobID() {
    do {
        // one XORShift step, masked to 14 bits (0x3fff == 16383)
        seed = (seed ^ (seed << 5)) & 0x3fff;
        seed = (seed ^ (seed >>> 3)) & 0x3fff;
        seed = (seed ^ (seed << 7)) & 0x3fff;
    } while (seed >= 10000); // reject values outside the 4-digit range
    return seed;
}
To generate a new sequence each "day", set 'seed' to any number between 1 and 16383. [It'll be the same sequence, just starting at a different point. You could vary the sequence a little e.g. by taking every nth job ID where n is a small number, reversing the pattern of shifts (do >>>, <<, >>> instead of <<, >>>, <<) or finding some other combination of shifts (not 5/3/7) that produce a complete period.]
This technique is guaranteed to produce all numbers in range in "random" order. It doesn't offer any other guarantee, e.g. that it will be the most efficient or produce the "highest quality" of randomness achievable. It's probably good enough -- if not as good as you can get -- given the requirements you set out.
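If you want to convince yourself of the full-period claim for a particular shift triple, a quick check (my own test harness, not part of the answer) is to count distinct outputs:
import java.util.HashSet;
import java.util.Set;

public class PeriodCheck {
    private int seed = 1;

    public int nextJobID() {
        do {
            seed = (seed ^ (seed << 5)) & 0x3fff;
            seed = (seed ^ (seed >>> 3)) & 0x3fff;
            seed = (seed ^ (seed << 7)) & 0x3fff;
        } while (seed >= 10000);
        return seed;
    }

    public static void main(String[] args) {
        PeriodCheck gen = new PeriodCheck();
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < 9999; i++) {
            seen.add(gen.nextJobID());
        }
        // If the generator has a full period over 1..16383, the 9999
        // in-range values are all distinct: this prints 9999.
        System.out.println(seen.size());
    }
}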
I am reading the implementation details of the Java 8 HashMap. Can anyone let me know why the Java HashMap initial array size is 16 specifically? What is so special about 16? And why is it always a power of two? Thanks
The reason why powers of 2 appear everywhere is that when expressing numbers in binary (as they are in circuits), certain math operations on powers of 2 are simpler and faster to perform (just think about how easy math with powers of 10 is in the decimal system we use). For example, multiplication is not a very efficient process in computers - circuits use a method similar to the one you use when multiplying two numbers that each have multiple digits. Multiplying or dividing by a power of 2 requires the computer to just move bits to the left for multiplying or to the right for dividing.
And as for why 16 for HashMap? 10 is a commonly used default for dynamically growing structures (arbitrarily chosen), and 16 is not far off - but is a power of 2.
You can do modulus very efficiently for a power of 2. n % d = n & (d-1) when d is a power of 2, and modulus is used to determine which index an item maps to in the internal array - which means it occurs very often in a Java HashMap. Modulus requires division, which is also much less efficient than using the bitwise and operator. You can convince yourself of this by reading a book on Digital Logic.
The reason why bitwise and works this way for powers of two is because every power of 2 is expressed as a single bit set to 1. Let's say that bit is t. When you subtract 1 from a power of 2, you set every bit below t to 1, and every bit above t (as well as t) to 0. Bitwise and therefore saves the values of all bits below position t from the number n (as expressed above), and sets the rest to 0.
But how does that help us? Remember that when dividing by a power of 10, you can count the number of zeroes following the 1, and take that number of digits starting from the least significant of the dividend in order to find the remainder. Example: 637989 % 1000 = 989. A similar property applies to binary numbers with only one bit set to 1, and the rest set to 0. Example: 100101 % 001000 = 000101
There's one more thing about choosing hash & (n - 1) versus modulo, and that is negative hashes. hashCode() is of type int, which of course can be negative. Modulo on a negative number (in Java) is negative too, while & is not.
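Both points are easy to demonstrate (a small illustration of the claims above, not HashMap's actual source):
public class PowerOfTwoIndexDemo {
    public static void main(String[] args) {
        int capacity = 16; // a power of 2, like HashMap's default

        int hash = 637989;
        // For powers of 2, bitwise AND with (capacity - 1) equals modulo.
        System.out.println(hash % capacity);         // 5
        System.out.println(hash & (capacity - 1));   // 5

        int negativeHash = -5;
        // With a negative hash, % stays negative but & never does.
        System.out.println(negativeHash % capacity);       // -5
        System.out.println(negativeHash & (capacity - 1)); // 11
    }
}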
Another reason is that you want all of the slots in the array to be equally likely to be used. Since hash() is evenly distributed over 32 bits, if the array size didn't divide into the hash space, then there would be a remainder causing lower indexes to have a slightly higher chance of being used. Ideally, not just the hash, but (hash() % array_size) is random and evenly distributed.
But this only really matters for data with a small hash range (like a byte or character).
Simplified (i.e., leaving concurrency out) Random.next(int bits) looks like
protected int next(int bits) {
seed = (seed * multiplier + addend) & mask;
return (int) (seed >>> (48 - bits));
}
where masking gets used to reduce the seed to 48 bits. Why is it better than just
protected int next(int bits) {
seed = seed * multiplier + addend;
return (int) (seed >>> (64 - bits));
}
? I've read quite a lot about random numbers, but see no reason for this.
The reason is that the lower bits tend to have a shorter period (at least with the algorithm Java uses).
From Wikipedia - Linear Congruential Generator:
As shown above, LCG's do not always use all of the bits in the values they produce. The Java implementation produces 48 bits with each iteration but only returns the 32 most significant bits from these values. This is because the higher-order bits have longer periods than the lower order bits (see below). LCG's that use this technique produce much better values than those that do not.
edit:
after further reading (conveniently, on Wikipedia), the values of a, c, and m must satisfy these conditions to have the full period:
c and m must be relatively prime
a-1 is divisible by all prime factors of m
a-1 is a multiple of 4 if m is a multiple of 4
The only one that I can clearly tell is still satisfied is #3. #1 and #2 need to be checked, and I have a feeling that one (or both) of these fail.
From the docs at the top of java.util.Random:
The algorithm is described in The Art of Computer Programming,
Volume 2 by Donald Knuth in Section 3.2.1. It is a 48-bit seed,
linear congruential formula.
So the entire algorithm is designed to operate on 48-bit seeds, not 64-bit ones. I guess you can take it up with Mr. Knuth ;p
From wikipedia (the quote alluded to by the quote that #helloworld922 posted):
A further problem of LCGs is that the lower-order bits of the generated sequence have a far shorter period than the sequence as a whole if m is set to a power of 2. In general, the n-th least significant digit in the base b representation of the output sequence, where b^k = m for some integer k, repeats with at most period b^n.
And furthermore, it continues (my italics):
The low-order bits of LCGs when m is a power of 2 should never be relied on for any degree of randomness whatsoever. Indeed, simply substituting 2^n for the modulus term reveals that the low order bits go through very short cycles. In particular, any full-cycle LCG when m is a power of 2 will produce alternately odd and even results.
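That alternating odd/even behavior is easy to see with Java's own constants (which also appear in an answer further down). A quick sketch of the raw 48-bit LCG, watching only the least significant bit:
public class LowBitDemo {
    public static void main(String[] args) {
        long seed = 12345; // arbitrary starting seed
        StringBuilder lowBits = new StringBuilder();
        for (int i = 0; i < 16; i++) {
            // Java's LCG step: a = 0x5DEECE66D, c = 0xB, m = 2^48.
            seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
            lowBits.append(seed & 1);
        }
        System.out.println(lowBits); // 0101010101010101 - period 2
    }
}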
In the end, the reason is probably historical: the folks at Sun wanted something to work reliably, and the Knuth formula gave 32 significant bits. Note that the java.util.Random API says this (my italics):
If two instances of Random are created with the same seed, and the same sequence of method calls is made for each, they will generate and return identical sequences of numbers. In order to guarantee this property, particular algorithms are specified for the class Random. Java implementations must use all the algorithms shown here for the class Random, for the sake of absolute portability of Java code. However, subclasses of class Random are permitted to use other algorithms, so long as they adhere to the general contracts for all the methods.
So we're stuck with it as a reference implementation. However that doesn't mean you can't use another generator (and subclass Random or create a new class):
from the same Wikipedia page:
MMIX by Donald Knuth: m = 2^64, a = 6364136223846793005, c = 1442695040888963407
There's a 64-bit formula for you.
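A minimal sketch of that generator in Java (my own translation; not a drop-in java.util.Random subclass). Java's long arithmetic wraps modulo 2^64, which supplies the m for free:
public class MmixRandom {
    // Knuth's MMIX constants: a = 6364136223846793005, c = 1442695040888963407.
    private long seed;

    public MmixRandom(long seed) {
        this.seed = seed;
    }

    public long nextLong() {
        // Overflowing long multiplication/addition is exactly arithmetic mod 2^64.
        seed = seed * 6364136223846793005L + 1442695040888963407L;
        return seed;
    }

    public int nextInt() {
        // Prefer the high-order bits; as quoted above, the low bits of a
        // power-of-2 LCG have very short periods.
        return (int) (nextLong() >>> 32);
    }
}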
Random numbers are tricky (as Knuth notes) and depending on your needs, you might be fine with just calling java.util.Random twice and concatenating the bits if you need a 64-bit number. If you really care about the statistical properties, use something like Mersenne Twister, or if you care about information leakage / unpredictability use java.security.SecureRandom.
It doesn't look like there was a good reason for doing this.
Applying the mask is a conservative approach using a proven design.
Leaving it out would most probably lead to a better generator; however, without knowing the math well, it's a risky step.
Another small advantage of masking is a speed gain on 8-bit architectures, since it uses 6 bytes instead of 8.
I need to generate a hash value used for uniqueness across many billions of records in Java. Trouble is, I only have 16 numeric digits to play with. In researching this, I have found algorithms for 32-bit hashes, which return Java integers. But this is too small: it only has a range of +/- 2 billion, and I will have more records than that. I cannot go to a 64-bit hash, as that would give me numeric values that are too large (+/- 9 quintillion, or 19 digits). Trouble is, I am dealing with a legacy system that forces me into a static key length of 16 digits.
Suggestions? I know no hash function will guarantee uniqueness, but I need a good one that will fit into these restrictions.
Thanks
If your generated hash is too large you can just mod it with your keyspace max to make it fit.
myhash = hash64bitvalue % 10_000_000_000_000_000L; // 10^16 written out, since ^ means XOR in Java
If you are limited to 16 decimal digits, your key space contains 10^16 values.
Even if you find a hash that gives uniform distribution on your data set, due to Birthday Paradox you will have a 50% chance of collision on ~10^8 items of data, which is an order of magnitude less than your billions of records.
This means that you cannot use any kind of hash alone and rely on uniqueness.
A straightforward solution is to use a global counter instead. If a global counter is infeasible, counters with preallocated ranges can be used. For example, the 6 most significant digits denote a fixed data source index, and the 10 least significant digits contain a monotonic counter maintained by that data source.
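A sketch of that partitioned-counter layout (the 6 + 10 digit split is the answer's example; the code itself is mine):
public class PartitionedIdGenerator {
    private static final long COUNTER_RANGE = 10_000_000_000L; // 10 digits

    private final long sourceIndex; // 0..999999, fixed per data source
    private long counter = 0;       // must be persisted per source

    public PartitionedIdGenerator(long sourceIndex) {
        this.sourceIndex = sourceIndex;
    }

    // 6 most significant digits identify the source,
    // 10 least significant digits hold the counter: at most 16 digits total.
    public long nextId() {
        if (counter == COUNTER_RANGE) {
            throw new IllegalStateException("counter range exhausted");
        }
        return sourceIndex * COUNTER_RANGE + counter++;
    }
}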
So your restriction is 53 bits? (2^53 is the largest power of two below 10^16.)
As I understand it, the position of a bit in a hash code says nothing special about its value (bit position and bit value are fully independent of each other). So you could take a 64-bit hash function and use only the low 53 bits from it. And you must use bitwise operations for this, not arithmetic: hash64 & ((1L << 53) - 1).
You don't have to store your hashes in a human-readable form. Just store the 64-bit long datatype (generated by a 64-bit hash function) in your database, which is only 8 bytes - not the 19 bytes of the decimal string that scared you off.
If that isn't a solution, improve the legacy system.
Edit: Wait!
64-bit: 2^64 = 18446744073709551616
16 hex digits: 16^16 = 18446744073709551616
An exact fit! So make a hex representation of your 64-bit hash, and there you are.
If you can save 16 alphanumeric characters, then you can use a hexadecimal representation and pack 64 bits into 16 chars: 16^16 is 2^64.
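In Java the conversion is a one-liner (illustrative; the hash value here is made up):
long hash = 0x1FB7D2C4A90E35L;              // any 64-bit hash value
String key = String.format("%016x", hash);  // zero-padded to exactly 16 hex chars
System.out.println(key);                    // 001fb7d2c4a90e35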
I've simplified a bug I'm experiencing down to the following lines of code:
int[] vals = new int[8];
for (int i = 0; i < 1500; i++)
vals[new Random(i).nextInt(8)]++;
System.out.println(Arrays.toString(vals));
The output is: [0, 0, 0, 0, 0, 1310, 190, 0]
Is this just an artifact of choosing consecutive numbers to seed Random and then using nextInt with a power of 2? If so, are there other pitfalls like this I should be aware of, and if not, what am I doing wrong? (I'm not looking for a solution to the above problem, just some understanding about what else could go wrong)
Dan, well-written analysis. As the javadoc is pretty explicit about how the numbers are calculated, the question is not so much why this happened as whether there are other anomalies like this to watch out for. I didn't see any documentation about consecutive seeds, and I'm hoping someone with experience with java.util.Random can point out other common pitfalls.
As for the code, the need is for several parallel agents with repeatably random behavior, which happen to choose from an enum 8 elements long as their first step. Once I discovered this behavior, I made all the seeds come from a master Random object created from a known seed. In the former (sequentially-seeded) version of the program, all behavior quickly diverged after that first call to nextInt, so it took quite a while to narrow the program's behavior down to the RNG library, and I'd like to avoid that situation in the future.
As much as possible, the seed for an RNG should itself be random. The seeds that you are using are only going to differ in one or two bits.
There's very rarely a good reason to create two separate RNGs in the one program. Your code is not one of those situations where it makes sense.
Just create one RNG and reuse it, then you won't have this problem.
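Rerunning the snippet from the question with one reused Random (the seed value is arbitrary) shows the difference:
int[] vals = new int[8];
Random random = new Random(42); // a single instance, seeded once
for (int i = 0; i < 1500; i++)
    vals[random.nextInt(8)]++;
System.out.println(Arrays.toString(vals));
// counts now cluster around 1500 / 8 = 187.5 instead of piling into two buckets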
In response to comment from mmyers:
Do you happen to know java.util.Random well enough to explain why it picks 5 and 6 in this case?
The answer is in the source code for java.util.Random, which is a linear congruential RNG. When you specify a seed in the constructor, it is manipulated as follows.
seed = (seed ^ 0x5DEECE66DL) & mask;
Where the mask simply retains the lower 48 bits and discards the others.
When generating the actual random bits, this seed is manipulated as follows:
randomBits = (seed * 0x5DEECE66DL + 0xBL) & mask;
Now if you consider that the seeds used by Parker were sequential (0-1499), and they were used once and then discarded, the first four seeds generated the following four sets of random bits:
101110110010000010110100011000000000101001110100
101110110001101011010101011100110010010000000111
101110110010110001110010001110011101011101001110
101110110010011010010011010011001111000011100001
Note that the top 10 bits are identical in each case. This is a problem because he only wants to generate values in the range 0-7 (which requires only a few bits), and the RNG implementation does this by shifting the higher bits to the right and discarding the low bits. It does this because in the general case the high bits are more random than the low bits. In this case they are not, because the seed data was poor.
Finally, to see how these bits convert into the decimal values that we get, you need to know that java.util.Random makes a special case when n is a power of 2. It requests 31 random bits (the top 31 bits from the above 48), multiplies that value by n and then shifts it 31 bits to the right.
Multiplying by 8 (the value of n in this example) is the same as shifting left 3 places. So the net effect of this procedure is to shift the 31 bits 28 places to the right. In each of the 4 examples above, this leaves the bit pattern 101 (or 5 in decimal).
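For reference, the power-of-two branch of nextInt looks like this (condensed from the algorithm shown in the java.util.Random documentation):
// Inside java.util.Random.nextInt(int n):
if ((n & -n) == n) { // n is a power of 2
    // Take 31 random bits, multiply by n (a left shift for powers of 2),
    // then keep only the top bits by shifting right 31.
    return (int) ((n * (long) next(31)) >> 31);
}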
If we didn't discard the RNGs after just one value, we would see the sequences diverge. While the four sequences above all start with 5, the second values of each are 6, 0, 2 and 4 respectively. The small differences in the initial seeds start to have an influence.
In response to the updated question: java.util.Random is thread-safe, you can share one instance across multiple threads, so there is still no need to have multiple instances. If you really have to have multiple RNG instances, make sure that they are seeded completely independently of each other, otherwise you can't trust the outputs to be independent.
As to why you get these kind of effects, java.util.Random is not the best RNG. It's simple, pretty fast and, if you don't look too closely, reasonably random. However, if you run some serious tests on its output, you'll see that it's flawed. You can see that visually here.
If you need a more random RNG, you can use java.security.SecureRandom. It's a fair bit slower, but it works properly. One thing that might be a problem for you though is that it is not repeatable. Two SecureRandom instances with the same seed won't give the same output. This is by design.
So what other options are there? This is where I plug my own library. It includes 3 repeatable pseudo-RNGs that are faster than SecureRandom and more random than java.util.Random. I didn't invent them, I just ported them from the original C versions. They are all thread-safe.
I implemented these RNGs because I needed something better for my evolutionary computation code. In line with my original brief answer, this code is multi-threaded but it only uses a single RNG instance, shared between all threads.