How can I randomly generate letters according to their frequency of use in common speech?
Any pseudo-code appreciated, but an implementation in Java would be fantastic. Otherwise just a poke in the right direction would be helpful.
Note: I don't need to generate the frequencies of usage - I'm sure I can look that up easily enough.
I am assuming that you store the frequencies as floating point numbers between 0 and 1 that sum to 1.
First you should prepare a table of cumulative frequencies, i.e. the sum of the frequency of that letter and all letters before it.
To simplify, if you start with this frequency distribution:
A 0.1
B 0.3
C 0.4
D 0.2
Your cumulative frequency table would be:
A 0.1
B 0.4 (= 0.1 + 0.3)
C 0.8 (= 0.1 + 0.3 + 0.4)
D 1.0 (= 0.1 + 0.3 + 0.4 + 0.2)
Now generate a random number between 0 and 1 and see where in this list that number lies. Choose the letter that has the smallest cumulative frequency larger than your random number. Some examples:
Say you randomly pick 0.612. This lies between 0.4 and 0.8, i.e. between B and C, so you'd choose C.
If your random number was 0.039, that comes before 0.1, i.e. before A, so choose A.
I hope that makes sense, otherwise feel free to ask for clarifications!
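To make this concrete, here is a minimal Java sketch of the table-plus-scan approach, hard-coding the A/B/C/D example above (the class name and arrays are just illustrative):
import java.util.Random;

public class CumulativePicker {
    // letters[i] goes with cumulative[i], the sum of the frequencies
    // of letters[0..i] -- the cumulative table from the example above.
    private static final char[] letters = {'A', 'B', 'C', 'D'};
    private static final double[] cumulative = {0.1, 0.4, 0.8, 1.0};
    private static final Random rng = new Random();

    static char pick() {
        double r = rng.nextDouble(); // uniform in [0, 1)
        for (int i = 0; i < cumulative.length; i++) {
            if (r < cumulative[i]) {
                return letters[i];
            }
        }
        // Guard in case rounding error leaves the final total just under 1.
        return letters[letters.length - 1];
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20; i++) {
            System.out.print(pick());
        }
    }
}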
One quick way to do it would be to generate a list of letters, where each letter appeared in the list in accordance with its frequency. Say, if "e" was used 25.6% of the time, and your list had length 1000, it would have 256 "e"s.
Then you could just randomly pick spots from the list by using (int) (Math.random() * 1000) to generate random numbers between 0 and 999.
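A rough sketch of that pool idea, reusing the A/B/C/D distribution from the previous answer rather than real English frequencies (so A, at 0.1, gets 100 of the 1000 slots, B gets 300, and so on):
import java.util.Map;

public class LetterPool {
    public static void main(String[] args) {
        // The counts must sum to the pool length (here 1000).
        Map<Character, Integer> counts = Map.of('A', 100, 'B', 300, 'C', 400, 'D', 200);
        char[] pool = new char[1000];
        int idx = 0;
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                pool[idx++] = e.getKey();
            }
        }
        // Pick a random slot in [0, 999].
        char sample = pool[(int) (Math.random() * pool.length)];
        System.out.println(sample);
    }
}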
What I would do is scale the relative frequencies as floating point numbers such that their sum is 1.0. Then I would create an array of the cumulative totals per letter, i.e. the number that must be topped to get that letter and all those "below" it. Say the frequency of A is 10%, B is 2% and Z is 1%; then your table would look something like this:
0.000 A ; from 0% to 10% gets you an A
0.100 B ; above 10% is at least a B
0.120 C ; 12% for C...
...
0.990 Z ; if your number is >= 99% then you get a Z
Then you generate yourself a random number between 0.0 and 1.0 and do a binary search in the array for the largest entry that is less than or equal to your random number. Then pick the letter at that position. Done.
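A possible sketch of that lookup with java.util.Arrays.binarySearch, using only the four table rows shown above for brevity:
import java.util.Arrays;

public class BinarySearchPicker {
    // bounds[i] is the lower cumulative bound for letters[i],
    // as in the table above (truncated to four entries here).
    private static final char[] letters = {'A', 'B', 'C', 'Z'};
    private static final double[] bounds = {0.000, 0.100, 0.120, 0.990};

    static char pick() {
        double r = Math.random();
        int i = Arrays.binarySearch(bounds, r);
        // When the key is not found, binarySearch returns
        // -(insertionPoint) - 1; the entry just below the
        // insertion point is the letter we want.
        if (i < 0) {
            i = -i - 2;
        }
        return letters[i];
    }

    public static void main(String[] args) {
        System.out.println(pick());
    }
}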
Not even a pseudo-code, but a possible approach is as follows:
Let p1, p2, ..., pk be the frequencies that you want to match.
Calculate the cumulative frequencies: p1, p1+p2, p1+p2+p3, ... , 1
Generate a random uniform (0,1) number x
Check which interval of the cumulative frequencies x belongs to: if it is between, say, p1+..+pi and p1+...+pi+p(i+1), then output the (i+1)st letter
Depending on how you implement the interval-finding, the procedure is usually more efficient if the p1,p2,... are sorted in decreasing order, because you will usually find the interval containing x sooner.
Using a binary tree gives you a nice, clean way to find the right entry. Here, you start with a frequency map, where the keys are the symbols (English letters), and the values are the frequency of their occurrence. This gets inverted, and a NavigableMap is created where the keys are cumulative probability, and the values are symbols. That makes the lookup easy.
import java.util.Map;
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

public class Frequency {
    private final Random generator = new Random();
    private final NavigableMap<Float, Integer> table =
            new TreeMap<Float, Integer>();
    private final float max;

    public Frequency(Map<Integer, Float> frequency) {
        float total = 0;
        for (Map.Entry<Integer, Float> e : frequency.entrySet()) {
            total += e.getValue();
            table.put(total, e.getKey()); // key: cumulative weight, value: symbol
        }
        max = total;
    }

    /**
     * Choose a random symbol. The choices are weighted by frequency.
     */
    public int roll() {
        float key = generator.nextFloat() * max;
        return table.higherEntry(key).getValue();
    }
}
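Hypothetical usage of that class (the weights here are invented, and they need not sum to 1 since roll() scales by max; java.util.HashMap is assumed imported):
Map<Integer, Float> weights = new HashMap<Integer, Float>();
weights.put((int) 'a', 8.2f);
weights.put((int) 'b', 1.5f);
weights.put((int) 'e', 12.7f);
Frequency freq = new Frequency(weights);
char c = (char) freq.roll(); // weighted random symbol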
Related
I want to count all Pythagorean triples (primitive and non-primitive) whose hypotenuse is <= N for a given N. This OEIS link gives this count for powers of 10. A simple but somewhat efficient pseudocode can be easily derived from this link or from Wikipedia, using Euclid's famous formula. Interpreted in Java, this becomes:
public class countPTS {
    public static void main(String[] args) {
        long N = 1000000L;
        long count = 0;
        for (long m = 2; m * m + 1 <= N; m++)
            for (long n = 1 + m % 2; n < m && n * n + m * m <= N; n += 2)
                if (gcd(m, n) == 1)
                    count += N / (n * n + m * m);
        System.out.println(count);
    }

    public static long gcd(long a, long b) {
        if (a == 0)
            return b;
        return gcd(b % a, a);
    }
}
Which gives the correct results, but is a bit "slow". Regardless of the actual time complexity of this algorithm in big O notation, it seems to grow slightly worse than O(N). It is easy to check that almost the entire time is spent in the gcd calculations.
If C(N) is the count of such triples, then this piece of code, written and run on the latest version of the Eclipse IDE on a modern PC, single threaded, yields C(10^10) = 34465432859 in about 90 seconds. This makes me wonder how the high values in the OEIS link were obtained, the largest being C(10^17).
I suspect that an algorithm with a better time complexity was used, perhaps O(n^2/3) or O(n^3/4), maybe with some more advanced mathematical insight. It is somewhat plausible that the above algorithm was run with significant improvements and perhaps multithreaded. Can anyone shed light on this topic and the higher values?
TL;DR: There is an approach that is slightly worse than O(N^(2/3)). It consists of enumerating primitive triples with an efficient algorithm up to N^(2/3) and counting the Pythagorean triples they give rise to, and then using inclusion-exclusion to count the primitive triples past that point, from which a count of the rest follows. The rest of this is a detailed explanation of how this all works.
Your direct calculation over primitive triples cannot be made better than O(N). The reason why is that over 60% of pairs of integers are relatively prime. So we can put a lower bound on the number of pairs of relative primes as follows.
There are floor(sqrt(N/2)) choose 2 = O(N) pairs of integers at most sqrt(N/2). The ones that are relatively prime with one even give rise to a primitive Pythagorean triple.
60% of them are relatively prime.
This 60% is contained in the 75% that do not have both numbers even. Therefore when we get rid of pairs with both numbers odd, at least 60% - 25% = 35% of pairs are relatively prime with one of them even.
Therefore there are at least O(N) distinct primitive Pythagorean triples to find whose hypotenuse is less than N.
Given that there are O(N) primitive triples, you'll have to do at least O(N) work to enumerate them. So you don't want to enumerate them all. Instead we'll enumerate up to some point, and then we'll do something else with the rest. But let's make the enumeration more efficient.
Now as you noticed, your GCD checks are taking most of your time. Can we enumerate all of them with hypotenuse less than, say, L more efficiently?
Here is the most efficient approach that I found.
Enumerate the primes up to sqrt(L) with any decent technique. This is O(sqrt(L) log(log(L))) with the Sieve of Eratosthenes; that is good enough.
Using this list, we can produce the set of odd numbers up to sqrt(L), complete with their prime factorizations, fairly efficiently. (Just recursively produce prime factorizations that fit in the range.)
Given an odd number and its prime factorization, we can run through the even numbers and quickly find which are relatively prime to it. That gives us the primitive Pythagorean triples.
For step 3, create a PriorityQueue whose elements start off as the pairs (2*p, 2*p) for each distinct prime factor p. For example, for 15 we'd have a queue with (6, 6) and (10, 10). We then do this:
i = the odd number we started with
j = 2
while i*i + j*j < L:
    if queue.peek()[0] == j:
        while queue.peek()[0] == j:
            (x, y) = queue.pop()
            queue.add((x+y, y))
    else:
        i*i + j*j is the hypotenuse of a primitive Pythagorean triple
    j += 2
This is very slightly superlinear. In fact the expected complexity of finding all the numbers out to n that are relatively prime to i is O(n * log(log(log(i)))). Why? Because the average number of distinct prime factors is on average O(log(log(i))). The queue operations are O(log(THAT)) and we do on average O(n) queue operations. (I'm waving my hands past some number theory, but the result is correct.)
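Here is a hedged Java rendering of that queue-driven scan (the names are mine; it assumes the distinct prime factors of the odd number i are already known, e.g. from step 2 above):
import java.util.List;
import java.util.PriorityQueue;

public class PrimitiveScan {
    // For an odd i with the given distinct prime factors, visit every even j
    // with gcd(i, j) == 1 and i*i + j*j < L; each such pair gives the
    // hypotenuse i*i + j*j of a primitive Pythagorean triple.
    static void coprimeEvens(long i, List<Long> primeFactors, long L) {
        // Entries are {nextMultiple, step}, ordered by nextMultiple.
        PriorityQueue<long[]> queue =
                new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        for (long p : primeFactors) {
            queue.add(new long[] {2 * p, 2 * p});
        }
        for (long j = 2; i * i + j * j < L; j += 2) {
            if (!queue.isEmpty() && queue.peek()[0] == j) {
                // j shares a factor with i: advance every entry sitting at j.
                while (!queue.isEmpty() && queue.peek()[0] == j) {
                    long[] e = queue.poll();
                    queue.add(new long[] {e[0] + e[1], e[1]});
                }
            } else {
                System.out.println(i * i + j * j);
            }
        }
    }

    public static void main(String[] args) {
        coprimeEvens(15, List.of(3L, 5L), 1000L);
    }
}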
OK, what is our alternative to actually enumerating many of the primitive triples?
The answer is to use inclusion-exclusion to count the number below some bound L, which I will demonstrate by counting the number of primitive triples whose hypotenuse is at most 100.
The first step is to implement a function EvenOddPairCount(L) that counts how many pairs of even/odd numbers there are whose squares sum to at most L. To do this we traverse the fringe of maximal even/odd pairs, which takes O(sqrt(L)). Here is that calculation for 100:
1*1 + 8*8, 4 pairs
3*3 + 8*8, 4 pairs
5*5 + 8*8, 4 pairs
7*7 + 6*6, 3 pairs
9*9 + 4*4, 2 pairs
Which gives 17 pairs.
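A small sketch of EvenOddPairCount in Java (for very large L one would want an exact integer square root instead of Math.sqrt, so treat this as illustrative):
static long evenOddPairCount(long L) {
    long count = 0;
    for (long i = 1; i * i + 4 <= L; i += 2) { // smallest even j is 2
        long maxJ = (long) Math.sqrt((double) (L - i * i));
        count += maxJ / 2; // even j in [2, maxJ]
    }
    return count;
}
// evenOddPairCount(100) == 17, matching the tally above.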
Now the problem is that we have counted things like 3*3 + 6*6 = 45 that are not relatively prime. But we can subtract off EvenOddPairCount(L/9) to find those that are both divisible by 3. We can likewise subtract off EvenOddPairCount(L/25) to find the ones that are divisible by 5. But now the ones divisible by 15 have been added once, and subtracted twice. So we have to add back in EvenOddPairCount(L/(15*15)) to again not count those.
Keep going and you get a sum over all distinct products x of odd primes of EvenOddPairCount(L/(x*x)), where we add if x has an even number of prime factors (including 0) and subtract if it has an odd number. Since we already have the odd primes, we can easily produce the sequence of inclusion-exclusion terms. It starts off 1, -9, -25, -49, -121, -169, +225, -289, -361, +441 and so on.
But how much work does this take? If l = sqrt(L), it takes l + l/3 + l/5 + l/7 + l/11 + l/13 + l/15 + ... + l/l < l*(1 + 1/2 + 1/3 + 1/4 + ... + 1/l) = O(l log(l)) = O(sqrt(L) log(L)).
And so we can calculate PrimitiveTriples(L) in time O(sqrt(L) log(L)).
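Putting the two sketches together, a possible shape of PrimitiveTriples(L), given the signed inclusion-exclusion terms described above (producing and ordering those terms from the odd primes is assumed):
// terms holds the signed squares +1, -9, -25, -49, ..., +225, ...
// sorted by absolute value; evenOddPairCount is the sketch above.
static long primitiveTriples(long L, long[] terms) {
    long total = 0;
    for (long t : terms) {
        long x2 = Math.abs(t);
        if (x2 > L) break; // remaining terms contribute nothing
        total += Long.signum(t) * evenOddPairCount(L / x2);
    }
    return total;
}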
What does this buy us? Well, PrimitiveTriples(N) - PrimitiveTriples(N/2) gives us the number of primitive Pythagorean triples whose hypotenuse is in the range from N/2 to N. And 2*(PrimitiveTriples(N/2) - PrimitiveTriples(N/3)) gives us the number of Pythagorean triples which are multiples of primitive ones in the range from N/3 to N/2. And 3*(PrimitiveTriples(N/3) - PrimitiveTriples(N/4)) gives us the number of Pythagorean triples in the range from N/4 to N/3.
Therefore we can enumerate small primitive Pythagorean triples, and figure out how many Pythagorean triples are a multiple of those. And we can use inclusion-exclusion to count the multiples of large Pythagorean triples. Where is the cutoff between the two strategies?
Let's ignore the various log factors for the moment. If we cut off at N^c then we do roughly O(N^c) work on the small. And we need to calculate N^(1-c) thresholds of PrimitiveTriples, the last of which takes work roughly O(N^(c/2)). So let's try setting N^c = N^(1-c) * N^(c/2). That works out when c = (1-c) + c/2 which happens when c = 2/3.
And this is the final algorithm.
Find all primes up to sqrt(N).
Use those primes to enumerate all primitive triples up to O(N^(2/3)) and figure out how many Pythagorean triples they give.
Find all of the inclusion-exclusion terms up to N (there are O(sqrt(N)) of them and they can be produced directly from the factorizations).
Calculate PrimitiveTriples(N/d) for all d up to N^(1/3). Use that to count the rest of the triples.
And all of this runs in time O(N^(2/3 + epsilon)) for any epsilon greater than 0.
You can avoid the GCD calculation altogether:
Make a list of all primes < sqrt(N)
Iterate through all subsets of those primes with total product < sqrt(N)
For each subset, the primes in the set will be "m primes", while the other primes will be "n primes"
Let "m_base" be the product of all m primes
For m, iterate through all products of powers of the m primes with product < sqrt(N)/m_base, yielding m = m_base * that product, and:
Calculate the maximum value of n for each m
For n, iterate through all products of the n primes with product < max
I am studying a text generation example https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/advanced/modelling/charmodelling/generatetext/GenerateTxtCharCompGraphModel.java.
The output of the LSTM network is a probability distribution. As I understand it, this is a double array where each value is the probability of the character corresponding to that index in the array. So I cannot understand the following code, where we get the character index from the distribution:
/** Given a probability distribution over discrete classes, sample from the distribution
 * and return the generated class index.
 * @param distribution Probability distribution over classes. Must sum to 1.0
 */
static int sampleFromDistribution(double[] distribution, Random rng) {
    double d = 0.0;
    double sum = 0.0;
    for (int t = 0; t < 10; t++) {
        d = rng.nextDouble();
        sum = 0.0;
        for (int i = 0; i < distribution.length; i++) {
            sum += distribution[i];
            if (d <= sum) return i;
        }
        // If we haven't found the right index yet, maybe the sum is slightly
        // lower than 1 due to rounding error, so try again.
    }
    // Should be extremely unlikely to happen if distribution is a valid probability distribution
    throw new IllegalArgumentException("Distribution is invalid? d=" + d + ", sum=" + sum);
}
It seems that we are getting a random value. Why don't we just choose the index where the value is highest? What should I do if I want to select not one, but two or three most likely next characters?
This function samples from the distribution, instead of simply returning the most probable character class.
That also means that you aren't getting the most likely character; instead, you are getting a random character with the probability that the given probability distribution defines.
This works by first getting a random value between 0 and 1 from a uniform distribution (rng.nextDouble()) and then finding where that value falls in the given distribution.
You can imagine it to be something like this (if you had only a to f in your alphabet):
[ a | b | c | d | e | f ]
0.0 0.3 0.5 1.0
If the random value that is drawn is just over 0.5, it would produce an e, if it is just less than that it would be a d.
Each letter occupies a proportional amount of space on this line between 0 and 1 according to the weight it has in the distribution.
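The answer stops there; for the follow-up about taking the two or three most likely characters instead of sampling, one hypothetical approach (not from the original answer) is to sort the indices by probability and keep the top k:
import java.util.Comparator;
import java.util.stream.IntStream;

// Return the indices of the k largest probabilities in the distribution.
static int[] topK(double[] distribution, int k) {
    return IntStream.range(0, distribution.length)
            .boxed()
            .sorted(Comparator.comparingDouble((Integer i) -> -distribution[i]))
            .limit(k)
            .mapToInt(Integer::intValue)
            .toArray();
}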
We can easily get random floating point numbers within a desired range [X,Y) (note that X is inclusive and Y is exclusive) with the function listed below since Math.random() (and most pseudorandom number generators, AFAIK) produce numbers in [0,1):
function randomInRange(min, max) {
return Math.random() * (max-min) + min;
}
// Notice that we can get "min" exactly but never "max".
How can we get a random number in a desired range inclusive to both bounds, i.e. [X,Y]?
I suppose we could "increment" our value from Math.random() (or equivalent) by "rolling" the bits of an IEEE 754 double-precision floating point number to put the maximum possible value at 1.0 exactly, but that seems like a pain to get right, especially in languages poorly suited for bit manipulation. Is there an easier way?
(As an aside, why do random number generators produce numbers in [0,1) instead of [0,1]?)
[Edit] Please note that I have no need for this and I am fully aware that the distinction is pedantic. Just being curious and hoping for some interesting answers. Feel free to vote to close if this question is inappropriate.
I believe there is a much better solution, but this one should work :)
function randomInRange(min, max) {
    return Math.random() < 0.5 ? ((1-Math.random()) * (max-min) + min) : (Math.random() * (max-min) + min);
}
First off, there's a problem in your code: Try randomInRange(0,5e-324) or just enter Math.random()*5e-324 in your browser's JavaScript console.
Even without overflow/underflow/denorms, it's difficult to reason reliably about floating point ops. After a bit of digging, I can find a counterexample:
>>> a=1.0
>>> b=2**-54
>>> rand=a-2*b
>>> a
1.0
>>> b
5.551115123125783e-17
>>> rand
0.9999999999999999
>>> (a-b)*rand+b
1.0
It's easier to explain why this happens with a = 2^53 and b = 0.5: 2^53 - 1 is the next representable number down. The default rounding mode ("round to nearest even") rounds 2^53 - 0.5 up (because 2^53 is "even" [LSB = 0] and 2^53 - 1 is "odd" [LSB = 1]), so you subtract b and get 2^53, multiply to get 2^53 - 1, and add b to get 2^53 again.
To answer your second question: because the underlying PRNG almost always generates a random integer in the interval [0, 2^n - 1], i.e. it generates random bits. It's very easy to pick a suitable n (the bits of precision in your floating point representation) and divide by 2^n and get a predictable distribution. Note that there are some numbers in [0,1) that you will never generate using this method (anything in (0, 2^-53) with IEEE doubles).
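For concreteness, here is that construction in Java (java.util.Random.nextDouble effectively does this, though the exact bit-gathering differs between implementations):
import java.util.Random;

Random rng = new Random();
long bits53 = rng.nextLong() >>> 11; // keep 53 random bits
double u = bits53 * 0x1.0p-53;       // one of {0, 2^-53, 2*2^-53, ..., 1 - 2^-53}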
It also means that you can do a[Math.floor(Math.random()*a.length)] and not worry about overflow (homework: In IEEE binary floating point, prove that b < 1 implies a*b < a for positive integer a).
The other nice thing is that you can think of each random output x as representing an interval [x, x + 2^-53) (the not-so-nice thing is that the average value returned is slightly less than 0.5). If you return in [0,1], do you return the endpoints with the same probability as everything else, or should they only have half the probability because they only represent half the interval as everything else?
To answer the simpler question of returning a number in [0,1], the method below effectively generates an integer in [0, 2^n] (by generating an integer in [0, 2^(n+1) - 1] and throwing it away if it's too big) and dividing by 2^n:
function randominclusive() {
    // Generate a random "top bit". Is it set?
    while (Math.random() >= 0.5) {
        // Generate the rest of the random bits. Are they zero?
        // If so, then we've generated 2^n, and dividing by 2^n gives us 1.
        if (Math.random() == 0) { return 1.0; }
        // If not, generate a new random number.
    }
    // If the top bits are not set, just divide by 2^n.
    return Math.random();
}
The comments imply base 2, but I think the assumptions are thus:
0 and 1 should be returned equiprobably (i.e. the Math.random() doesn't make use of the closer spacing of floating point numbers near 0).
Math.random() >= 0.5 with probability 1/2 (should be true for even bases)
The underlying PRNG is good enough that we can do this.
Note that random numbers are always generated in pairs: the one in the while (a) is always followed by either the one in the if or the one at the end (b). It's fairly easy to verify that it's sensible by considering a PRNG that returns either 0 or 0.5:
a=0 b=0 : return 0
a=0 b=0.5: return 0.5
a=0.5 b=0 : return 1
a=0.5 b=0.5: loop
Problems:
The assumptions might not be true. In particular, a common PRNG is to take the top 32 bits of a 48-bit LCG (Firefox and Java do this). To generate a double, you take 53 bits from two consecutive outputs and divide by 2^53, but some outputs are impossible (you can't generate 2^53 outputs with 48 bits of state!). I suspect some of them never return 0 (assuming single-threaded access), but I don't feel like checking Java's implementation right now.
Math.random() is called twice for every potential output as a consequence of needing to get the extra bit, and this places more constraints on the PRNG (requiring us to reason about four consecutive outputs of the above LCG).
Math.random() is called on average about four times per output. A bit slow.
It throws away results deterministically (assuming single-threaded access), so is pretty much guaranteed to reduce the output space.
My solution to this problem has always been to use the following in place of your upper bound.
Math.nextAfter(upperBound,upperBound+1)
or
upperBound + Double.MIN_VALUE
So your code would look like this:
double myRandomNum = Math.random() * Math.nextAfter(upperBound,upperBound+1) + lowerBound;
or
double myRandomNum = Math.random() * (upperBound + Double.MIN_VALUE) + lowerBound;
This simply increments your upper bound by the smallest double (Double.MIN_VALUE) so that your upper bound will be included as a possibility in the random calculation.
This is a good way to go about it because it does not skew the probabilities in favor of any one number.
The only case this wouldn't work is where your upper bound is equal to Double.MAX_VALUE
Just pick your half-open interval slightly bigger, so that your chosen closed interval is a subset. Then, keep generating the random variable until it lands in said closed interval.
Example: If you want something uniform in [3,8], then repeatedly regenerate a uniform random variable in [3,9) until it happens to land in [3,8].
function randomInRangeInclusive(min, max) {
    var ret;
    for (;;) {
        ret = min + ( Math.random() * (max-min) * 1.1 );
        if ( ret <= max ) { break; }
    }
    return ret;
}
Note: The amount of times you generate the half-open R.V. is random and potentially infinite, but you can make the expected number of calls otherwise as close to 1 as you like, and I don't think there exists a solution that doesn't potentially call infinitely many times.
Given the "extremely large" number of values between 0 and 1, does it really matter? The chances of actually hitting 1 are tiny, so it's very unlikely to make a significant difference to anything you're doing.
What would be a situation where you would NEED a floating point value to be inclusive of the upper bound? For integers I understand, but for a float, the difference between inclusive and exclusive is, what, like 1.0e-32?
Think of it this way. If you imagine that floating-point numbers have arbitrary precision, the chances of getting exactly min are zero. So are the chances of getting max. I'll let you draw your own conclusion on that.
This 'problem' is equivalent to getting a random point on the real line between 0 and 1. There is no 'inclusive' and 'exclusive'.
The question is akin to asking, what is the floating point number right before 1.0? There is such a floating point number, but it is one in 2^24 (for an IEEE float) or one in 2^53 (for a double).
The difference is negligible in practice.
private static double random(double min, double max) {
    final double r = Math.random();
    // Math.random() is uniform over [0, 1); folding the upper half [0.5, 1)
    // onto (0.5, 1.0] via 1.5 - r makes the result inclusive of max.
    return (r >= 0.5d ? 1.5d - r : r) * (max - min) + min;
}
Math.round() will help to include the bound value. If you have 0 <= value < 1 (1 is exclusive), then Math.round(value * 100) / 100 returns 0 <= value <= 1 (1 is inclusive). A note here is that the value now has only 2 digits in its decimal place. If you want 3 digits, try Math.round(value * 1000) / 1000 and so on. The following function has one more parameter, the number of digits in the decimal place, which I call precision:
function randomInRange(min, max, precision) {
    return Math.round(Math.random() * Math.pow(10, precision)) /
        Math.pow(10, precision) * (max - min) + min;
}
How about this?
function randomInRange(min, max){
    var n = Math.random() * (max - min + 0.1) + min;
    return n > max ? randomInRange(min, max) : n;
}
If you get stack overflow on this I'll buy you a present.
--
EDIT: never mind about the present. I got wild with:
randomInRange(0, 0.0000000000000000001)
and got stack overflow.
I am fairly inexperienced, so I am also looking for solutions myself.
This is my rough thought:
Random number generators produce numbers in [0,1) instead of [0,1] because [0,1) is a unit-length interval that can be followed by [1,2) and so on without overlapping.
For random[x, y], you can do this:
float randomInclusive(x, y){
    float MIN = smallest_value_above_zero;
    float result;
    do {
        result = random(x, (y + MIN));
    } while (result > y);
    return result;
}
Where all values in [x, y] have the same probability of being picked, and you can now reach y.
Generating a "uniform" floating-point number in a range is non-trivial. For example, the common practice of multiplying or dividing a random integer by a constant, or by scaling a "uniform" floating-point number to the desired range, have the disadvantage that not all numbers a floating-point format can represent in the range can be covered this way, and may have subtle bias problems. These problems are discussed in detail in "Generating Random Floating-Point Numbers by Dividing Integers: a Case Study" by F. Goualard.
Just to show how non-trivial the problem is, the following pseudocode generates a random "uniform-behaving" floating-point number in the closed interval [lo, hi], where the number is of the form FPSign * FPSignificand * FPRADIX^FPExponent. The pseudocode below was reproduced from my section on floating-point number generation. Note that it works for any precision and any base (including binary and decimal) of floating-point numbers.
METHOD RNDRANGE(lo, hi)
  losgn = FPSign(lo)
  hisgn = FPSign(hi)
  loexp = FPExponent(lo)
  hiexp = FPExponent(hi)
  losig = FPSignificand(lo)
  hisig = FPSignificand(hi)
  if lo > hi: return error
  if losgn == 1 and hisgn == -1: return error
  if losgn == -1 and hisgn == 1
    // Straddles negative and positive ranges
    // NOTE: Changes negative zero to positive
    mabs = max(abs(lo),abs(hi))
    while true
      ret=RNDRANGE(0, mabs)
      neg=RNDINT(1)
      if neg==0: ret=-ret
      if ret>=lo and ret<=hi: return ret
    end
  end
  if lo == hi: return lo
  if losgn == -1
    // Negative range
    return -RNDRANGE(abs(lo), abs(hi))
  end
  // Positive range
  expdiff=hiexp-loexp
  if loexp==hiexp
    // Exponents are the same
    // NOTE: Automatically handles
    // subnormals
    s=RNDINTRANGE(losig, hisig)
    return s*1.0*pow(FPRADIX, loexp)
  end
  while true
    ex=hiexp
    while ex>MINEXP
      v=RNDINTEXC(FPRADIX)
      if v==0: ex=ex-1
      else: break
    end
    s=0
    if ex==MINEXP
      // Has FPPRECISION or fewer digits
      // and so can be normal or subnormal
      s=RNDINTEXC(pow(FPRADIX,FPPRECISION))
    else if FPRADIX != 2
      // Has FPPRECISION digits
      s=RNDINTEXCRANGE(
        pow(FPRADIX,FPPRECISION-1),
        pow(FPRADIX,FPPRECISION))
    else
      // Has FPPRECISION digits (bits), the highest
      // of which is always 1 because it's the
      // only nonzero bit
      sm=pow(FPRADIX,FPPRECISION-1)
      s=RNDINTEXC(sm)+sm
    end
    ret=s*1.0*pow(FPRADIX, ex)
    if ret>=lo and ret<=hi: return ret
  end
END METHOD
I'm working on an image editor and I'm about to implement filters. I decided to go with some kind of blur or noise.
Anyway, I decided I wanted a uniform filter so I read up on Random.nextGaussian().
However, I'm working with RGB values that range from 0 to 255. How can I scale this random double value to fit within 0 and 255?
The random value returned by nextGaussian() can range from -1 to 1, I believe.
So, I want to preserve the relative difference between the random values. "Scale" or "move the number range" if that makes sense.
I know it's possible but I can't figure it out. Thanks in advance.
Essentially, I need it to be a number between 0 and 1.
In that case you should use nextDouble().
The Gaussian distribution is a distribution that ranges over the entire collection of double values (mathematically speaking, from minus infinity to positive infinity) with a peak around zero. The Gaussian distribution is thus not uniform.
The nextDouble() method draws numbers uniformly between 0 and 1 (0 included, 1 excluded).
You can thus draw a number randomly between 0 and 255 (both inclusive) using the following code:
Random rand = new Random();
int value = (int) Math.floor(256.0 * rand.nextDouble());
A faster algorithm is however masking a random integer (Random.nextInt):
Random rand = new Random();
int value = rand.nextInt()&0xff;
This algorithm isn't faster in big-O terms, but it saves the more expensive nextDouble method call as well as a floating point multiplication and a float-to-int conversion.
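For completeness, java.util.Random also has a bounded variant that avoids both the floating-point math and the manual masking:
Random rand = new Random();
int value = rand.nextInt(256); // uniform over [0, 255]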
You can use nextGaussian() with Math.abs() so that you can obtain positive values of Gaussian distribution.
Random random = new Random();
double positiveRandomValue = Math.abs(random.nextGaussian());
You can fit a Normal (Gaussian) distribution to roughly [0,1] by adjusting the mean and standard deviation. For example, use mean = 0.5 and std = 0.15, and you will get a value in [0,1] with total probability 99.91%. Then clamp the result to ensure it is strictly within [0,1].
Supplier<Double> alphaSupplier = new Supplier<Double>() {
    @Override
    public Double get() {
        double value = new Random().nextGaussian() * 0.15 + 0.5;
        return Math.max(0, Math.min(1, value));
    }
};
double randomValue = alphaSupplier.get();
I have the following piece of code:
import java.util.Random;

public class Main {
    private static final Random rnd = new Random();

    private static int getRand(int n) {
        return (Math.abs(rnd.nextInt()) % n);
    }

    public static void main(String[] args) {
        int count = 0, n = 2 * (Integer.MAX_VALUE / 3);
        for (int i = 0; i < 1000000; i++) {
            if (getRand(n) < n / 2) {
                count++;
            }
        }
        System.out.print(count);
    }
}
This always gives me a number close to 666,666, meaning two-thirds of the numbers generated are below the lower half of n. Note that this is obtained when n = 2/3 * Integer.MAX_VALUE. 4/7 is another fraction that gives me a similarly skewed spread (~571,428). However, I get an even spread if n = Integer.MAX_VALUE or n = Integer.MAX_VALUE/2. How does this behavior vary with the fraction used? Can somebody throw some light on it?
PS: I got this problem from the book Effective Java by Joshua Bloch.
The problem is in the modulo (%) operator, which results in an uneven distribution of numbers.
For example, imagine MAX_INT is 10 and n = 7: the mod operator will map the values 7, 8, 9 and 10 to 0, 1, 2 and 3, respectively. Those four numbers will therefore have double the probability of all other numbers.
One way to solve this is by checking the output of rnd.nextInt() and try again while it's bigger than N.
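A sketch of that retry idea (in practice, rnd.nextInt(n) already performs an equivalent rejection internally, so this is illustrative):
private static int getRandUnbiased(Random rnd, int n) {
    // Largest multiple of n that fits in [0, 2^31); results at or above it
    // are the "overhang" that over-represents the small residues, so retry.
    long limit = (1L << 31) / n * n;
    long r;
    do {
        r = rnd.nextInt() & 0x7FFFFFFFL; // uniform over [0, 2^31 - 1]
    } while (r >= limit);
    return (int) (r % n);
}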
You would get 50-50 if you kept only values of Math.abs(rnd.nextInt()) in the range [0 .. 2/3 * Integer.MAX_VALUE]. For the remaining 1/3 * Integer.MAX_VALUE numbers, the modulo maps each to a smaller number in the range [0 .. 1/3 * Integer.MAX_VALUE].
All in all, numbers in the range [0 .. 1/3 * Integer.MAX_VALUE] have double the chance to appear.
The Random class is designed to generate pseudo-random numbers. That means they are elements of a defined sequence that has a uniform distribution. If you don't know the sequence, they seem to be random.
Having said that, the problem is that you mess up the uniform distribution by using the modulus operator. On Coding Horror there is a very nice article that explains this issue, although for a slightly different problem. You can find a solution to your problem, along with a proof, here.
As observed above, getRand does not generate uniformly distributed random numbers over the range [0, n).
In general, suppose that n = a * Integer.MAX_VALUE / b, where a/b > 0.5
For ease of writing, let M = Integer.MAX_VALUE
The Probability Density Function (PDF) of getRand(n) is given by:
PDF(x) = 2/M for 0 < x < (b-a)M/b
       = 1/M for (b-a)M/b < x < aM/b
n/2 corresponds to the mid-point of the range [0, aM/b], i.e. n/2 = aM/(2b)
Integrating the PDF over the 'first-half' range [0, n/2], we find that the probability (P) that getRand(n) is less than n/2 is given by:
P = a/b (this integration assumes a/b <= 2/3, which holds for both examples below)
Examples:
a=2, b=3. P = 2/3 = 0.66666... as computed by the questioner.
a=4, b=7. P = 4/7 = 0.5714... close to the questioner's computational result.
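As a quick empirical check of P = a/b, here is the questioner's loop adapted to a = 4, b = 7 (expected output around 571,428 out of 1,000,000):
Random rnd = new Random();
int n = (int) (4L * Integer.MAX_VALUE / 7);
int count = 0;
for (int i = 0; i < 1000000; i++) {
    if (Math.abs(rnd.nextInt()) % n < n / 2) {
        count++;
    }
}
System.out.println(count); // ~571428, i.e. 4/7 of the trials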