Java provides a cryptographically secure random number generator in the java.security package (the SecureRandom class).
Can I use this generator 'as it is', or do I have to consider things like seeding and cyclic re-instantiation of the RNG?
Does anyone have experience with this generator?
EDIT: the requirements are:
a) Be statistically independent
b) Be fairly distributed (within statistically expected bounds) over their range
c) Pass various recognized statistical tests
d) Be cryptographically strong.
As others say, the secure RNG can have limited throughput. To mitigate this
you can either stretch that secure randomness by seeding a CPRNG, or you can
try to optimise your usage of the bitstream.
To shuffle a pack of cards, for example, you need only 226 bits, but a naive
algorithm (calling nextInt(n) for each card) will likely use 1600 or 3200
bits, wasting 85% of your entropy and making you seven times more susceptible
to delays.
For this situation I think the Doctor Jacques
method would be appropriate.
To go with that, here's some performance analysis against progressively more
costly entropy sources (also contains code):
Bit recycling for scaling random number generators
I would lean towards efficient usage rather than stretching, because I think it
would be a lot easier to prove the fairness of an efficient consumer of a
trustworthy entropy stream, than to prove the fairness of any drawing method
with a well-seeded PRNG.
EDIT2:
I don't really know Java, but I put this together:
public class MySecureRandom extends java.security.SecureRandom {

    private long m = 1;
    private long r = 0;

    @Override
    public final int nextInt(int n) {
        while (true) {
            if (m < 0x80000000L) {
                m <<= 32;
                r <<= 32;
                r += (long) next(32) - Integer.MIN_VALUE;
            }
            long q = m / n;
            if (r < n * q) {
                int x = (int) (r % n);
                m = q;
                r /= n;
                return x;
            }
            m -= n * q;
            r -= n * q;
        }
    }
}
This does away with the greedy default uniform [0,n-1] generator and replaces it with a modified Doctor Jacques version. Timing a card-shuffle range of values shows almost a 6x speed-up over the SecureRandom.nextInt(n).
My previous version of this code (only 2x speed-up) assumed that SecureRandom.next(b) was efficient, but it turns out that call was discarding entropy and dragging the whole loop down. This version manages its own chunking.
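Purely as a usage sketch (my own, not part of the original answer), the class above can be dropped straight into a standard shuffle, since Collections.shuffle drives it through nextInt(bound):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShuffleDemo {
    public static void main(String[] args) {
        // A 52-card deck, represented here simply as the integers 0..51.
        List<Integer> deck = new ArrayList<>();
        for (int card = 0; card < 52; card++) {
            deck.add(card);
        }
        // Collections.shuffle calls the supplied RNG's nextInt(bound),
        // so every swap goes through the entropy-recycling override above.
        Collections.shuffle(deck, new MySecureRandom());
        System.out.println(deck);
    }
}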
The SecureRandom documentation says it may block while the system gathers more entropy (on Linux, for instance, it can read from /dev/random). If you are going to use it heavily, you may need some help from hardware: installing a random number generator card (a hardware device that produces true random numbers rather than pseudo-random ones) lets the system supply random numbers fast enough that your program does not block.
You can use java.security.SecureRandom as you can use java.util.Random for such things.
Only be aware that java.security.SecureRandom depends on entropy gathered by the machine running the program. If you draw too many random values from it, it may block until enough entropy has been gathered (e.g. on Linux it reads from /dev/random or /dev/urandom, depending on configuration).
So if you want to generate many random values and can live with a PRNG, use java.util.Random.
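A common middle ground (sketched here as my own example, not something the answer above prescribes) is to pay the SecureRandom cost only once, for a seed, and let a fast PRNG stretch it for non-security-critical values:

import java.security.SecureRandom;
import java.util.Random;

public class SeededPrngDemo {
    public static void main(String[] args) {
        // Potentially slow or blocking, but only consulted once for the seed.
        SecureRandom seeder = new SecureRandom();
        long seed = seeder.nextLong();

        // A fast PRNG stretches that seed for bulk, non-cryptographic use.
        Random fast = new Random(seed);
        for (int i = 0; i < 5; i++) {
            System.out.println(fast.nextInt(100));
        }
    }
}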
Related
I have a system which communicates with an external system via web services. We send random numbers as message IDs, and the same value is stored as the primary key of a table in our database. The problem is that, with roughly 80-90k calls per day, I see many exceptions complaining about duplicate primary keys. I am generating the random numbers in Java. How can I be sure that the random numbers I generate will not be duplicated?
Below is the code for generating the random numbers:
private static int getRandomNumberInRange(int min, int max) {
    if (min >= max) {
        throw new IllegalArgumentException("max must be greater than min");
    }
    Random r = new Random();
    return r.nextInt((max - min) + 1) + min;
}
There's nothing wrong with using a random number as a primary key. You just need to make sure the numbers are chosen from a range large enough that the chance of ever picking the same number twice is virtually zero.
If you generate 100k identifiers per day for 30 years, that's about 1 billion identifiers. So, using a 100-bit number will make a collision virtually impossible over that time. 13 bytes, or maybe 12 if you feel lucky.
I define "virtually zero" as 2^-40. There's not much point in defining it as less than 2^-50, because things like RAM and hard drives are more likely than that to suffer undetected errors. When you have to satisfy a uniqueness constraint, estimates involving a 50% chance of collision are useless.
There is nothing magic about UUIDs. They are just 122-bit numbers with a verbose encoding. They will work, but they are overkill for this application.
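As a rough illustration of the "100-bit number" suggestion (the class and method names here are mine, purely for illustration):

import java.math.BigInteger;
import java.security.SecureRandom;

public class IdGenerator {
    private static final SecureRandom RNG = new SecureRandom();

    /** Returns a 104-bit (13-byte) random identifier as a hex string. */
    public static String newId() {
        byte[] bytes = new byte[13];
        RNG.nextBytes(bytes);
        // Interpret the bytes as a non-negative integer and hex-encode it.
        return new BigInteger(1, bytes).toString(16);
    }

    public static void main(String[] args) {
        System.out.println(newId());
    }
}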
You need to use a large random number, and a good source of randomness. int is not large enough, and you're restricting your range to less than that with your min and max.
The rule of thumb is that you should expect a 50% chance of collision for every 2^(n/2) numbers, where n is the number of bits in your random number.
The Random class in java.util isn't a good source of truly random numbers (among other problems, it uses a 48-bit seed). You should use SecureRandom, and at least a long. You should also construct it outside your method to avoid the overhead of repeated initialisation.
As others have suggested, a UUID would solve your problem.
As part of a Monte Carlo simulation, I have to roll a group of dice until certain values show up a certain number of times. My code calls a dice class which generates a random number between 1 and 6 and returns it. Originally the code looked like
public void roll() {
    value = (int)(Math.random()*6) + 1;
}
and it wasn't very fast. By exchanging Math.random() for
ThreadLocalRandom.current().nextInt(1, 7);
the section, which calls this method roughly 250 million times, ran in about 60% of the original time.
As part of the full simulation it will call upon this method billions of times at the very least, so is there any faster way to do this?
Pick a random generator that is as fast and as good as you need it to be, and that isn't slowed down to a tiny fraction of its normal speed by thread-safety mechanisms. Then pick a method of generating the [1..6] integer distribution that is as fast and as precise as you need it to be.
The fastest simple generator that is of sufficiently high quality to beat standard tests for PRNGs like TestU01 (instead of failing systematically, like the Mersenne Twister) is Sebastiano Vigna's xorshift64*. I'm showing it as C code but Sebastiano has it in Java as well:
#include <stdint.h>

uint64_t xorshift64s(uint64_t *state)   /* the state must be seeded with a non-zero value */
{
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * 2685821657736338717ULL;
}
Sebastiano Vigna's site has lots of useful info, links and benchmark results. Including papers, for the mathematically inclined.
At that high resolution you can simply use 1 + xorshift64s(&state) % 6 and the bias will be immeasurably small. If that is not fast enough, implement the modulo division as a multiplication by the inverse. If that is not fast enough - if you cannot afford two MULs per variate - then it gets tricky and you need to come back here. xorshift1024* (Java) plus some bit trickery for the variate would be an option.
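Since the question is about Java, here is a rough Java sketch of the same recipe (my own port of the xorshift64* step plus the simple modulo mapping; treat it as illustrative rather than as Vigna's reference code):

public class XorShiftDice {
    private long state;

    public XorShiftDice(long seed) {
        // The xorshift state must never be zero.
        this.state = (seed == 0) ? 0x9E3779B97F4A7C15L : seed;
    }

    /** One xorshift64* step; returns a full 64-bit pseudorandom value. */
    public long nextLong() {
        long x = state;
        x ^= x >>> 12;
        x ^= x << 25;
        x ^= x >>> 27;
        state = x;
        return x * 2685821657736338717L;
    }

    /** Maps the 64-bit value to 1..6; at this width the modulo bias is negligible. */
    public int roll() {
        return 1 + (int) Long.remainderUnsigned(nextLong(), 6);
    }
}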
Batching - generating an array full of numbers and processing that, then refilling the array and so on - can unlock some speed reserves. Needlessly wrapping things in classes achieves the opposite.
P.S.: if ThreadLocalRandom and xorshift* are not fast enough for your purposes even with batching then you might be going about things in the wrong way, or you might be doing it in the wrong language. Or both.
P.P.S.: in languages like Java (or C#, or Delphi), abstraction is not free, it has a cost. In Java you also have to reckon with things like mandatory gratuitous array bounds checking, unless you have a compiler that can eliminate those checks. Teasing high performance out of a Java program can get very involved... In C++ you get abstraction and performance for free.
Darth is correct that Xorshift* is probably the best generator to use. Use it to fill a ring buffer of bytes, then fetch the bytes one at a time to roll your dice, refilling the buffer when you've fetched enough. To get the actual die roll, avoid division and bias by using rejection sampling. The rest of the code then looks something like this (in C):
do {
    if (bp >= buffer + sizeof buffer) {
        // refill buffer with xorshift* outputs and reset bp to the start of buffer
    }
    v = *bp++ & 7;      /* low 3 bits of the next byte: 0..7 */
} while (v > 5);        /* reject 6 and 7 to stay unbiased */
return v;               /* v is 0..5; add 1 for an actual die face */
This will allow you to get on average 6 die rolls per 64-bit random value.
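A hypothetical Java rendering of the same buffer-plus-rejection idea (the names, and the use of SplittableRandom as a stand-in for whatever fast generator you pick, are mine):

import java.util.SplittableRandom;

public class BufferedDice {
    // SplittableRandom stands in here for a fast generator such as xorshift*.
    private final SplittableRandom rng = new SplittableRandom();
    private final byte[] buffer = new byte[8192];
    private int pos = buffer.length;   // forces a refill on first use

    private void refill() {
        for (int i = 0; i < buffer.length; i += 8) {
            long v = rng.nextLong();
            for (int j = 0; j < 8; j++) {
                buffer[i + j] = (byte) (v >>> (8 * j));
            }
        }
        pos = 0;
    }

    /** Returns a die roll in 1..6 via rejection sampling on the low 3 bits of each byte. */
    public int roll() {
        int v;
        do {
            if (pos == buffer.length) {
                refill();
            }
            v = buffer[pos++] & 7;   // 0..7
        } while (v > 5);             // reject 6 and 7 to avoid bias
        return v + 1;
    }
}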
I am going through the source code of java.util.Random and I noticed that nextBoolean()
public boolean nextBoolean() {
    return next(1) != 0;
}
calls for a new draw of pseudorandom bits via next(int bits), which "iterates" the LC-PRNG to the next state, i.e. "draws" a whole set of new bits, even though only one bit is used in nextBoolean. This effectively means that the rest of the bits (47, to be exact) are wasted in that particular iteration.
I have looked at another PRNG which appears to do essentially the same thing, even though the underlying generator is different. Multiple bits from the same iteration are already used for other method calls (such as nextInt(), nextLong(), ...), so consecutive bits are evidently assumed to be "independent enough" from one another.
So my question is: why is only one bit used per draw of the PRNG in nextBoolean()? It should be possible to cache, say, 16 bits (if one wants to use only the highest-quality bits) for successive calls to nextBoolean(). Am I mistaken here?
Edit: What I mean by caching the results is something like this:
private long booleanBits = 0L;
private int c = 0;   // number of cached bits still unused

public boolean nextBoolean() {
    if (c == 0) {
        booleanBits = nextLong();   // one draw refills the 64-bit cache
        c = Long.SIZE;
    }
    boolean b = (booleanBits & 1) != 0;
    booleanBits >>>= 1;
    c--;
    return b;
    // return ( next() & 1 ) != 0;   // the original, uncached version
}
Sure, it's not as short and pretty as the commented-out version, but it results in 64x fewer draws, at the cost of one int comparison, one decrement and one right-shift per call to nextBoolean(). Am I mistaken?
Edit2: Ok, I had to test the timings, see the code here. The output is as follows:
Uncached time lapse: 13891
Cached time lapse: 8672
Testing if the order matters..:
Cached time lapse: 6751
Uncached time lapse: 8737
This suggests that caching the bits is not a computational burden but an improvement. A couple of things I should note about this test:
I use a custom implementation of xorshift* generators, heavily inspired by Sebastiano Vigna's work on them.
xorshift* generators are actually much faster than Java's native generator. So if I were to use java.util.Random to draw my bits, caching would make a larger impact. Or that's what I would expect at least.
A single-threaded application is assumed here, so there are no sync issues. But that is of course common to both conditions.
Conditionals of any kind can be quite expensive (see Why is it faster to process a sorted array than an unsorted array?), and next itself doesn't do that many more operations: I count five arithmetic operations plus a compareAndSet, which shouldn't cost much in a single-threaded context.
The compareAndSet points out another issue -- thread-safety -- which is much harder when you have two variables that need to be kept in sync, such as booleanBits and c. The synchronization overhead of keeping those in sync for multithreaded use would almost certainly exceed the cost of a next() call.
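If one still wanted the caching without shared mutable state, one purely illustrative option is to keep both variables per thread, for example:

import java.util.concurrent.ThreadLocalRandom;

public class CachedBooleanSource {
    // Per-thread cache: cache[0] holds the bits, cache[1] the count of unused bits.
    private static final ThreadLocal<long[]> CACHE =
            ThreadLocal.withInitial(() -> new long[2]);

    public static boolean nextBoolean() {
        long[] cache = CACHE.get();
        if (cache[1] == 0) {
            cache[0] = ThreadLocalRandom.current().nextLong();   // refill 64 bits
            cache[1] = Long.SIZE;
        }
        boolean b = (cache[0] & 1) != 0;
        cache[0] >>>= 1;
        cache[1]--;
        return b;
    }
}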
Why were 181783497276652981 and 8682522807148012 chosen in Random.java?
Here's the relevant source code from Java SE JDK 1.7:
/**
 * Creates a new random number generator. This constructor sets
 * the seed of the random number generator to a value very likely
 * to be distinct from any other invocation of this constructor.
 */
public Random() {
    this(seedUniquifier() ^ System.nanoTime());
}

private static long seedUniquifier() {
    // L'Ecuyer, "Tables of Linear Congruential Generators of
    // Different Sizes and Good Lattice Structure", 1999
    for (;;) {
        long current = seedUniquifier.get();
        long next = current * 181783497276652981L;
        if (seedUniquifier.compareAndSet(current, next))
            return next;
    }
}

private static final AtomicLong seedUniquifier
    = new AtomicLong(8682522807148012L);
So, invoking new Random() without any seed parameter takes the current "seed uniquifier" and XORs it with System.nanoTime(). Then it uses 181783497276652981 to create another seed uniquifier to be stored for the next time new Random() is called.
The literals 181783497276652981L and 8682522807148012L are not placed in constants, but they don't appear anywhere else.
At first, the comment seems to give an easy lead: searching online for that article turns up the actual paper. 8682522807148012 doesn't appear in the paper, but 181783497276652981 does appear -- as a substring of another number, 1181783497276652981, which is 181783497276652981 with a 1 prepended.
The paper claims that 1181783497276652981 is a number that yields good "merit" for a linear congruential generator. Was this number simply mis-copied into Java? Does 181783497276652981 have an acceptable merit?
And why was 8682522807148012 chosen?
Searching online for either number yields no explanation, only this page that also notices the dropped 1 in front of 181783497276652981.
Could other numbers have been chosen that would have worked as well as these two numbers? Why or why not?
Was this number simply mis-copied into Java?
Yes, seems to be a typo.
Does 181783497276652981 have an acceptable merit?
This could be determined using the evaluation algorithm presented in the paper. But the merit of the "original" number is probably higher.
And why was 8682522807148012 chosen?
Seems to be random. It could be the result of System.nanoTime() when the code was written.
Could other numbers have been chosen that would have worked as well as these two numbers?
Not every number would be equally "good". So, no.
Seeding Strategies
There are differences in the default seeding scheme between different versions and implementations of the JRE.
public Random() { this(System.currentTimeMillis()); }
public Random() { this(++seedUniquifier + System.nanoTime()); }
public Random() { this(seedUniquifier() ^ System.nanoTime()); }
The first one is not acceptable if you create multiple RNGs in a row. If their creation times fall in the same millisecond range, they will give completely identical sequences. (same seed => same sequence)
The second one is not thread safe. Multiple threads can get identical RNGs when initializing at the same time. Additionally, seeds of subsequent initializations tend to be correlated. Depending on the actual timer resolution of the system, the seed sequence could be linearly increasing (n, n+1, n+2, ...). As stated in How different do random seeds need to be? and the referenced paper Common defects in initialization of pseudorandom number generators, correlated seeds can generate correlation among the actual sequences of multiple RNGs.
The third approach creates randomly distributed and thus uncorrelated seeds, even across threads and subsequent initializations.
So the current java docs:
This constructor sets the seed of the random number generator to a
value very likely to be distinct from any other invocation of this
constructor.
could be extended by "across threads" and "uncorrelated"
Seed Sequence Quality
But the randomness of the seeding sequence is only as good as the underlying RNG.
The RNG used for the seed sequence in this Java implementation uses a multiplicative linear congruential generator (MLCG) with c = 0 and m = 2^64. (The modulus 2^64 is implicitly given by the overflow of 64-bit long integers.)
Because of the zero c and the power-of-2 modulus, the "quality" (cycle length, bit-correlation, ...) is limited. As the paper says, besides the overall cycle length, every single bit has its own cycle length, which decreases exponentially for less significant bits. Thus, lower bits have a smaller repetition pattern. (The result of seedUniquifier() should be bit-reversed before it is truncated to 48 bits in the actual RNG.)
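As a small illustration of that bit-reversal idea (my own sketch; the JDK does not do this), one could condition the seed as below. The uniquifier update is reproduced here only so the snippet runs on its own:

import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

public class ConditionedSeedDemo {
    // The same MLCG update as the JDK's seedUniquifier(), copied for the demo.
    private static final AtomicLong UNIQUIFIER = new AtomicLong(8682522807148012L);

    private static long seedUniquifier() {
        for (;;) {
            long current = UNIQUIFIER.get();
            long next = current * 181783497276652981L;
            if (UNIQUIFIER.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public static void main(String[] args) {
        long raw = seedUniquifier() ^ System.nanoTime();
        // Bit-reversing moves the strong high-order MLCG bits into the low
        // 48 bits that Random's internal seed truncation actually keeps.
        Random rng = new Random(Long.reverse(raw));
        System.out.println(rng.nextInt());
    }
}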
But it is fast! And to avoid unnecessary compare-and-set-loops, the loop body should be fast. This probably explains the usage of this specific MLCG, without addition, without xoring, just one multiplication.
And the mentioned paper presents a list of good "multipliers" for c=0 and m=2^64, as 1181783497276652981.
All in all: A for effort @ JRE developers ;) But there is a typo.
(But who knows, unless someone evaluates it, there is the possibility that the missing leading 1 actually improves the seeding RNG.)
But some multipliers are definitely worse:
"1" leads to a constant sequence.
"2" leads to a single-bit-moving sequence (somehow correlated)
...
The inter-sequence-correlation for RNGs is actually relevant for (Monte Carlo) Simulations, where multiple random sequences are instantiated and even parallelized. Thus a good seeding strategy is necessary to get "independent" simulation runs. Therefore the C++11 standard introduces the concept of a Seed Sequence for generating uncorrelated seeds.
If you consider that the equation used by the random number generator is:
X(n+1) = (a * X(n) + c) mod m
where X(n+1) is the next number, a is the multiplier, X(n) is the current number, c is the increment and m is the modulus.
If you look further into Random, a, c and m are defined in the header of the class
private static final long multiplier = 0x5DEECE66DL; //= 25214903917 -- 'a'
private static final long addend = 0xBL; //= 11 -- 'c'
private static final long mask = (1L << 48) - 1; //= 2 ^ 48 - 1 -- 'm'
and looking at the method protected int next(int bits), this is where the equation is implemented:
nextseed = (oldseed * multiplier + addend) & mask;
//X(n+1) = (X(n) * a + c ) mod m
This implies that the method seedUniquifier() is actually producing X(n) (the first value it returns is X(1) = 8682522807148012 * 181783497276652981); that value is then modified further by System.nanoTime(). The algorithm is consistent with the equation above, with X(0) = 8682522807148012, a = 181783497276652981, m = 2^64 and c = 0. And since the mod m is performed by the long overflow, the equation simply becomes
X(n+1) = (a * X(n)) mod 2^64
Looking at the paper, the value a = 1181783497276652981 is listed for m = 2^64, c = 0. So it appears to just be a typo, and the value 8682522807148012 for X(0) appears to be a seemingly randomly chosen number taken from legacy code for Random, as seen here. The merit of these chosen numbers could still be valid, but, as mentioned by Thomas B., probably not as "good" as the one in the paper.
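To see concretely that the long overflow really does implement the mod 2^64 (a throwaway check of my own, not part of the answer):

import java.math.BigInteger;

public class OverflowAsModulusDemo {
    public static void main(String[] args) {
        final long a = 181783497276652981L;
        final BigInteger M = BigInteger.ONE.shiftLeft(64);   // 2^64

        long x = 8682522807148012L;                 // X(0)
        BigInteger exact = BigInteger.valueOf(x);

        for (int n = 1; n <= 3; n++) {
            x *= a;                                 // long multiply, wraps mod 2^64
            exact = exact.multiply(BigInteger.valueOf(a)).mod(M);
            // Both print the same value: the exact product mod 2^64, viewed as a signed long.
            System.out.printf("X(%d): overflowed long = %d, exact mod 2^64 = %d%n",
                    n, x, exact.longValue());
        }
    }
}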
EDIT - The original thoughts below have since been clarified, so they can be disregarded, but I'm leaving them for reference.
This led me to the following conclusions:
The reference to the paper is not for the value itself but for the methods used to obtain the values due to the different values of a, c and m
It is mere coincidence that the value is otherwise the same other than the leading 1 and the comment is misplaced (still struggling to believe this though)
OR
There has been a serious misunderstanding of the tables in the paper and the developers simply chose a value at random; after all, once it is multiplied out, what was the point of using the table value in the first place, especially as you can provide your own seed value anyway, in which case these values are not taken into account at all?
So to answer your question
Could other numbers have been chosen that would have worked as well as these two numbers? Why or why not?
Yes, any number could have been used; in fact, if you specify a seed value when you instantiate Random, you are already using some other value. The seed does not affect the performance of the generator itself, which is determined by the values of a, c and m that are hard-coded within the class.
As per the link you provided, they have chosen (after adding the missing 1 :) ) the multiplier with the best merit for modulus 2^64, because a long can't hold a number from the 2^128 table.
For example if java produces the pseudorandom sequence: 9 3 2 5 6
by using 23 as a seed, how can I do the inverse? i.e. getting 23 out of the sequence 9 3 2 5 6.
Or how do I assign a seed for a certain sequence?
It is easy to do if there is a database - just assign a random key for the sequence
INSERT INTO SEQUENCE_TABLE VALUES (RANDOM_KEY, SEQUENCE)
However, if I'm not permitted to use a database, is there a formula to do such a thing?
Yes, it's absolutely easy to reverse engineer the number stream of a poorly designed pseudo random number generator, such as the Linear Congruential PRNG implementation in the Java programming language (java.util.Random).
In fact, with as few as TWO values from that particular generator, and the information on the order in which the values emerged, the entire stream can be predicted.
Random random = new Random();
int v1 = random.nextInt();
int v2 = random.nextInt();

final long multiplier = 0x5DEECE66DL;
final long addend = 0xBL;
final long mask = (1L << 48) - 1;

// nextInt() exposes the top 32 bits of the 48-bit seed, so only the
// low 16 bits need to be brute-forced.
for (int i = 0; i < 65536; i++) {
    long seed = ((v1 & 0xFFFFFFFFL) << 16) | i;
    if ((int) (((seed * multiplier + addend) & mask) >>> 16) == v2) {
        System.out.println("Seed found: " + seed);
        break;
    }
}
This is precisely why it's critical to use cryptographically secure random number generators that have been vetted by the community at large for implementations that require security.
There is much more information on reverse engineering PRNGs, including java.util.Random here. ...
The point of random number generators is that this is impossible. SecureRandom is designed to be especially cryptographically strong, but generally speaking, if you're writing a random number generator and this is possible or easy, you're doing it wrong.
That said, it's likely that it's not impossible with Java's built in Random class. (SecureRandom is another story, though.) But it will require staggering amounts of math.
To be more specific: if a polynomial-time algorithm existed to do what you want, for some particular pseudorandom number generator, then it would by definition fail the "next-bit test" described in the linked Wikipedia article, since you could predict the next elements that would be generated.
It is certainly possible to recover the seed used by java.util.Random. This post describes the math behind Random's linear congruential formula, and here is a function to discover the current seed from the last two integers returned from nextInt().
public static long getCurrentSeed(int i1, int i2) {
    final long multiplier = 0x5DEECE66DL;
    final long inv_mult = 0xDFE05BCB1365L;
    final long increment = 0xBL;
    final long mask = ((1L << 48) - 1);
    long suffix = 0L;
    long lastSeed;
    long currSeed;
    int lastInt;

    for (long i = 0; i < (1 << 16); i++) {
        suffix = i;
        currSeed = ((long) i2 << 16) | suffix;
        lastSeed = ((currSeed - increment) * inv_mult) & mask;
        lastInt = (int) (lastSeed >>> 16);
        if (lastInt == i1) {
            /* We've found the current seed, need to roll back 2 seeds */
            currSeed = lastSeed;
            lastSeed = ((currSeed - increment) * inv_mult) & mask;
            return lastSeed ^ multiplier;
        }
    }
    /* Error, current seed not found */
    System.err.println("current seed not found");
    return 0;
}
This function returns a value that can be used with rand.setSeed() to generate a pseudorandom sequence of numbers starting with i1 and i2.
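A quick way to sanity-check that (my own sketch; it assumes the getCurrentSeed method above is in the same class):

import java.util.Random;

public class SeedRecoveryDemo {
    public static void main(String[] args) {
        Random rand = new Random();
        int i1 = rand.nextInt();
        int i2 = rand.nextInt();

        // Recover a seed value that reproduces the observed pair.
        long seed = getCurrentSeed(i1, i2);

        Random replay = new Random();
        replay.setSeed(seed);
        System.out.println(replay.nextInt() == i1);   // expected: true
        System.out.println(replay.nextInt() == i2);   // expected: true
    }

    // getCurrentSeed(int, int) is the method shown above.
}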
If you're OK with using a String as your seed, you can use this:
String seed = "9 3 2 5 6";
Then your generator would look like:
String[] numbers = seed.split(" ");
If you truly want to reverse engineer the "random" number generator in Java, that's going to be quite difficult (I think).
It would be better to do it the other way around if you can: Start with a seed, produce the sequence, then work out from there.
You want to take arbitrary sequences of numbers, then determine a short (fixed length?) key which will allow you to regenerate that sequence of numbers, without storing the original? Unfortunately, what you want is technically impossible. Here's why:
This is a particular case of compression. You have a long sequence of data, which you want to be able to recreate losslessly from a smaller piece of information. If what you are requesting were possible, then I would be able to compress the whole of stack overflow into a single integer (since the entire website could be serialized into a sequence of numbers, albeit a very long one!)
Unfortunately, mathematics doesn't work that way. Any given sequence has a particular measure of entropy - the average amount of complexity in that sequence. In order to reproduce that sequence losslessly, you must be able to encode at least enough information to represent its entropy.
For certain sequences, there may in fact be a seed that is capable of generating a long, specific sequence, but that is only because there is a hard-coded mathematical function which takes that seed and produces a particular sequence of numbers. However, to take an arbitrary sequence of values and produce such a seed, you would need both a seed, and a function capable of producing that sequence from that seed. In order to encode both of these things, you'd find that you've got a lot more data than you'd expect!