Understanding Java Random nextInt with bound algorithm

I'm looking at the way the Java Random library generates an integer given an upper bound, but I don't quite understand the algorithm. In the docs it says:
The algorithm is slightly tricky. It rejects values that would result
in an uneven distribution (due to the fact that 2^31 is not divisible
by n). The probability of a value being rejected depends on n. The
worst case is n=2^30+1, for which the probability of a reject is 1/2,
and the expected number of iterations before the loop terminates is 2.
But I really don't see how the implementation takes this into account, specifically the while condition in the code. To me it seems the loop would (almost) always succeed on the first try, especially for very low values of bound (which I would expect to be the common case when imposing a bound). The condition in the while looks like it is just checking the sign of bits, so why bother with the expression they use?
public int nextInt(int bound) {
    if (bound <= 0)
        throw new IllegalArgumentException("bound must be positive");
    if ((bound & -bound) == bound)  // i.e., bound is a power of 2
        return (int)((bound * (long)next(31)) >> 31);
    int bits, val;
    do {
        bits = next(31);
        val = bits % bound;
    } while (bits - val + (bound-1) < 0);
    return val;
}

Note that bits - val + (bound-1) < 0 is actually checking whether bits - val + (bound-1) overflows. bits is always greater than or equal to val, and bound is always positive, so there is no way for the LHS to be negative unless the addition overflows.
We can think of the < 0 as > Integer.MAX_VALUE.
Let's plot bits - val + (bound - 1) as a function of bits (I made a desmos graph of this). Let's say bound is 100 (a small bound):
The x axis is bits and the y axis is bits - val + (bound - 1), with lines on both axes marking Integer.MAX_VALUE. Note that bits itself is bounded by Integer.MAX_VALUE.
At this scale, bits - val + (bound - 1) seems never to overflow. But zoom in far enough and you'll see there is a tiny range of values of bits for which bits < Integer.MAX_VALUE but bits - val + (bound - 1) > Integer.MAX_VALUE.
For bound = (1 << 30) + 1, the graph shows the expression overflowing for roughly half of all values of bits: whenever bits >= bound, bits - val equals bound, and bound + (bound - 1) exceeds Integer.MAX_VALUE. Any bound greater than 1 << 30 can overflow this way; hence the worst-case 1/2 chance of rejection that the documentation describes.
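To see the worst case concretely, here is a small sketch (mine, not from the post) that samples 31-bit values and counts how often the rejection condition fires for bound = (1 << 30) + 1:
import java.util.Random;

public class RejectionRate {
    public static void main(String[] args) {
        Random r = new Random();
        int bound = (1 << 30) + 1;   // the documented worst case
        int trials = 10_000_000, rejected = 0;
        for (int i = 0; i < trials; i++) {
            int bits = r.nextInt() >>> 1;         // non-negative 31-bit value, like next(31)
            int val = bits % bound;
            if (bits - val + (bound - 1) < 0) {   // the overflow test from nextInt
                rejected++;
            }
        }
        // Prints roughly 0.5: about half of all 31-bit values are rejected.
        System.out.printf("rejection rate: %.4f%n", rejected / (double) trials);
    }
}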

Related

Generating random doubles in Java between 0 and 1 inclusively or [0..1] [duplicate]

We can easily get random floating point numbers within a desired range [X,Y) (note that X is inclusive and Y is exclusive) with the function listed below since Math.random() (and most pseudorandom number generators, AFAIK) produce numbers in [0,1):
function randomInRange(min, max) {
    return Math.random() * (max - min) + min;
}
// Notice that we can get "min" exactly but never "max".
How can we get a random number in a desired range inclusive to both bounds, i.e. [X,Y]?
I suppose we could "increment" our value from Math.random() (or equivalent) by "rolling" the bits of an IEEE-754 double-precision floating-point number to put the maximum possible value at exactly 1.0, but that seems like a pain to get right, especially in languages poorly suited to bit manipulation. Is there an easier way?
(As an aside, why do random number generators produce numbers in [0,1) instead of [0,1]?)
[Edit] Please note that I have no need for this and I am fully aware that the distinction is pedantic. Just being curious and hoping for some interesting answers. Feel free to vote to close if this question is inappropriate.
I believe there is a much better solution, but this one should work :)
function randomInRange(min, max) {
    return Math.random() < 0.5 ? ((1 - Math.random()) * (max - min) + min)
                               : (Math.random() * (max - min) + min);
}
First off, there's a problem in your code: Try randomInRange(0,5e-324) or just enter Math.random()*5e-324 in your browser's JavaScript console.
Even without overflow/underflow/denorms, it's difficult to reason reliably about floating point ops. After a bit of digging, I can find a counterexample:
>>> a=1.0
>>> b=2**-54
>>> rand=a-2*b
>>> a
1.0
>>> b
5.551115123125783e-17
>>> rand
0.9999999999999999
>>> (a-b)*rand+b
1.0
It's easier to explain why this happens with a=2^53 and b=0.5: 2^53-1 is the next representable number down. The default rounding mode ("round to nearest even") rounds 2^53-0.5 up (because 2^53 is "even" [LSB = 0] and 2^53-1 is "odd" [LSB = 1]), so you subtract b and get 2^53, multiply to get 2^53-1, and add b to get 2^53 again.
To answer your second question: because the underlying PRNG almost always generates a random number in the interval [0, 2^n - 1], i.e. it generates random bits. It's very easy to pick a suitable n (the bits of precision in your floating-point representation), divide by 2^n, and get a predictable distribution. Note that there are some numbers in [0,1) that you will never generate using this method (anything in (0, 2^-53) with IEEE doubles).
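As a sketch of that construction in Java (my illustration; real implementations similarly assemble 53 bits from consecutive PRNG outputs, as noted further down):
import java.util.Random;

// Build a double in [0, 1) from 53 random bits and a division by 2^53.
static double uniform53(Random r) {
    long hi = (long) (r.nextInt() >>> 6) << 27;  // top 26 bits
    long lo = r.nextInt() >>> 5;                 // remaining 27 bits
    return (hi | lo) * 0x1.0p-53;                // scale by 2^-53
}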
It also means that you can do a[Math.floor(Math.random()*a.length)] and not worry about overflow (homework: In IEEE binary floating point, prove that b < 1 implies a*b < a for positive integer a).
The other nice thing is that you can think of each random output x as representing an interval [x, x + 2^-53) (the not-so-nice thing is that the average value returned is slightly less than 0.5). If you return in [0,1], do you return the endpoints with the same probability as everything else, or should they have only half the probability, because they each represent only half as much of the interval as everything else?
To answer the simpler question of returning a number in [0,1], the method below effectively generates an integer in [0, 2^n] (by generating an integer in [0, 2^(n+1) - 1] and throwing it away if it's too big) and divides by 2^n:
function randominclusive() {
    // Generate a random "top bit". Is it set?
    while (Math.random() >= 0.5) {
        // Generate the rest of the random bits. Are they zero?
        // If so, then we've generated 2^n, and dividing by 2^n gives us 1.
        if (Math.random() == 0) { return 1.0; }
        // If not, generate a new random number.
    }
    // If the top bit is not set, just divide by 2^n.
    return Math.random();
}
The comments imply base 2, but I think the assumptions are thus:
0 and 1 should be returned equiprobably (i.e. the Math.random() doesn't make use of the closer spacing of floating point numbers near 0).
Math.random() >= 0.5 with probability 1/2 (should be true for even bases)
The underlying PRNG is good enough that we can do this.
Note that random numbers are always generated in pairs: the one in the while (a) is always followed by either the one in the if or the one at the end (b). It's fairly easy to verify that it's sensible by considering a PRNG that returns either 0 or 0.5:
a=0   b=0  : return 0
a=0   b=0.5: return 0.5
a=0.5 b=0  : return 1
a=0.5 b=0.5: loop
Problems:
The assumptions might not be true. In particular, a common PRNG design is to take the top 32 bits of a 48-bit LCG (Firefox and Java do this). To generate a double, you take 53 bits from two consecutive outputs and divide by 2^53, but some outputs are impossible (you can't generate 2^53 distinct outputs with 48 bits of state!). I suspect some such implementations never return 0 (assuming single-threaded access), but I don't feel like checking Java's implementation right now.
Math.random() is called twice for every potential output as a consequence of needing the extra bit, and this places more constraints on the PRNG (requiring us to reason about four consecutive outputs of the above LCG).
Math.random() is called on average about four times per output. A bit slow.
It throws away results deterministically (assuming single-threaded access), so is pretty much guaranteed to reduce the output space.
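For what it's worth, a Java rendering of the randominclusive function above (a sketch under the same assumptions, using java.util.Random):
import java.util.Random;

// Mirrors the JavaScript: one draw acts as a random "top bit", the next
// as the remaining bits, so exactly 1.0 is returned with the right probability.
static double randomInclusive(Random r) {
    while (r.nextDouble() >= 0.5) {    // top bit set?
        if (r.nextDouble() == 0.0) {   // remaining bits all zero: we drew 2^n
            return 1.0;
        }
        // otherwise start over with a fresh pair of draws
    }
    return r.nextDouble();             // top bit clear: ordinary [0, 1) draw
}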
My solution to this problem has always been to use the following in place of your upper bound.
Math.nextAfter(upperBound,upperBound+1)
or
upperBound + Double.MIN_VALUE
So your code would look like this:
double myRandomNum = Math.random() * Math.nextAfter(upperBound,upperBound+1) + lowerBound;
or
double myRandomNum = Math.random() * (upperBound + Double.MIN_VALUE) + lowerBound;
This simply increments your upper bound by the smallest double (Double.MIN_VALUE) so that your upper bound will be included as a possibility in the random calculation.
This is a good way to go about it because it does not skew the probabilities in favor of any one number.
The only case where this wouldn't work is when your upper bound is equal to Double.MAX_VALUE.
Just pick your half-open interval slightly bigger, so that your chosen closed interval is a subset. Then, keep generating the random variable until it lands in said closed interval.
Example: If you want something uniform in [3,8], then repeatedly regenerate a uniform random variable in [3,9) until it happens to land in [3,8].
function randomInRangeInclusive(min, max) {
    var ret;
    for (;;) {
        ret = min + (Math.random() * (max - min) * 1.1);
        if (ret <= max) { break; }
    }
    return ret;
}
Note: the number of times you generate the half-open R.V. is random and potentially unbounded, but you can make the expected number of calls as close to 1 as you like, and I don't think there exists a solution that doesn't potentially call infinitely many times.
Given the "extremely large" number of values between 0 and 1, does it really matter? The chances of actually hitting 1 are tiny, so it's very unlikely to make a significant difference to anything you're doing.
What would be a situation where you would NEED a floating-point value to be inclusive of the upper bound? For integers I understand, but for a float, the difference between inclusive and exclusive is something like 1.0e-32.
Think of it this way. If you imagine that floating-point numbers have arbitrary precision, the chances of getting exactly min are zero. So are the chances of getting max. I'll let you draw your own conclusion on that.
This 'problem' is equivalent to getting a random point on the real line between 0 and 1. There is no 'inclusive' and 'exclusive'.
The question is akin to asking, what is the floating point number right before 1.0? There is such a floating point number, but it is one in 2^24 (for an IEEE float) or one in 2^53 (for a double).
The difference is negligible in practice.
private static double random(double min, double max) {
    final double r = Math.random();
    return (r >= 0.5d ? 1.5d - r : r) * (max - min) + min;
}
Math.round() will help to include the bound value. If you have 0 <= value < 1 (1 exclusive), then Math.round(value * 100) / 100 returns 0 <= value <= 1 (1 inclusive). A caveat is that the value then has only 2 digits after the decimal point. If you want 3 digits, try Math.round(value * 1000) / 1000, and so on. The following function has one more parameter, the number of digits after the decimal point, which I call precision:
function randomInRange(min, max, precision) {
    return Math.round(Math.random() * Math.pow(10, precision)) /
           Math.pow(10, precision) * (max - min) + min;
}
How about this?
function randomInRange(min, max) {
    var n = Math.random() * (max - min + 0.1) + min;
    return n > max ? randomInRange(min, max) : n;
}
If you get stack overflow on this I'll buy you a present.
--
EDIT: never mind about the present. I got wild with:
randomInRange(0, 0.0000000000000000001)
and got stack overflow.
I am fairly inexperienced, so I am also looking for solutions myself.
This is my rough thought:
Random number generators produce numbers in [0,1) instead of [0,1]
because [0,1) is a unit-length interval that can be followed by [1,2) and so on without overlapping.
For random[x, y],
You can do this:
float randomInclusive(x, y){
    float MIN = smallest_value_above_zero;
    float result;
    do {
        result = random(x, (y + MIN));
    } while (result > y);
    return result;
}
Here every value in [x, y] has the same probability of being picked, and you can now reach y.
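In Java, the same idea can be written with Math.nextUp, which steps to the next representable double above y (my sketch; a fixed "smallest value above zero" constant would vanish at most magnitudes, as discussed above):
// Resample until the draw lands in [x, y]; widening the half-open
// interval to [x, Math.nextUp(y)) makes y itself reachable.
static double randomInclusive(double x, double y) {
    double result;
    do {
        result = x + Math.random() * (Math.nextUp(y) - x);
    } while (result > y);
    return result;
}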
Generating a "uniform" floating-point number in a range is non-trivial. For example, the common practice of multiplying or dividing a random integer by a constant, or of scaling a "uniform" floating-point number to the desired range, has the disadvantage that not all the numbers a floating-point format can represent in the range can be covered this way, and it may have subtle bias problems. These problems are discussed in detail in "Generating Random Floating-Point Numbers by Dividing Integers: a Case Study" by F. Goualard.
Just to show how non-trivial the problem is, the following pseudocode generates a random "uniform-behaving" floating-point number in the closed interval [lo, hi], where the number is of the form FPSign * FPSignificand * FPRADIX^FPExponent. The pseudocode below was reproduced from my section on floating-point number generation. Note that it works for any precision and any base (including binary and decimal) of floating-point numbers.
METHOD RNDRANGE(lo, hi)
    losgn = FPSign(lo)
    hisgn = FPSign(hi)
    loexp = FPExponent(lo)
    hiexp = FPExponent(hi)
    losig = FPSignificand(lo)
    hisig = FPSignificand(hi)
    if lo > hi: return error
    if losgn == 1 and hisgn == -1: return error
    if losgn == -1 and hisgn == 1
        // Straddles negative and positive ranges
        // NOTE: Changes negative zero to positive
        mabs = max(abs(lo), abs(hi))
        while true
            ret = RNDRANGE(0, mabs)
            neg = RNDINT(1)
            if neg == 0: ret = -ret
            if ret >= lo and ret <= hi: return ret
        end
    end
    if lo == hi: return lo
    if losgn == -1
        // Negative range
        return -RNDRANGE(abs(lo), abs(hi))
    end
    // Positive range
    expdiff = hiexp - loexp
    if loexp == hiexp
        // Exponents are the same
        // NOTE: Automatically handles subnormals
        s = RNDINTRANGE(losig, hisig)
        return s * 1.0 * pow(FPRADIX, loexp)
    end
    while true
        ex = hiexp
        while ex > MINEXP
            v = RNDINTEXC(FPRADIX)
            if v == 0: ex = ex - 1
            else: break
        end
        s = 0
        if ex == MINEXP
            // Has FPPRECISION or fewer digits
            // and so can be normal or subnormal
            s = RNDINTEXC(pow(FPRADIX, FPPRECISION))
        else if FPRADIX != 2
            // Has FPPRECISION digits
            s = RNDINTEXCRANGE(
                pow(FPRADIX, FPPRECISION - 1),
                pow(FPRADIX, FPPRECISION))
        else
            // Has FPPRECISION digits (bits), the highest
            // of which is always 1 because it's the
            // only nonzero bit
            sm = pow(FPRADIX, FPPRECISION - 1)
            s = RNDINTEXC(sm) + sm
        end
        ret = s * 1.0 * pow(FPRADIX, ex)
        if ret >= lo and ret <= hi: return ret
    end
END METHOD

Using bitwise operator to divide by 0 (Simulation of division by 0) [closed]

We know that we can use bitwise operators to divide any two numbers. For example:
int result = 10 >> 1; // result would be 5, as it's like dividing 10 by 2^1
Is there any chance we can divide a number by 0 using bits manipulation?
Edit 1: If I rephrase my question, I want to actually divide a number by zero and break my machine. How do I do that?
Edit 2: Let's forget about Java for a moment. Is it feasible for a machine to divide a number by 0 regardless of the programming language used?
Edit 3: As it's practically impossible to do this, is there a way we can simulate this using a really small number that approaches 0?
Another edit: Some people mentioned that CPU hardware prevents division by 0. I agree, there won't be a direct way to do it. Let's see this code for example:
i = 1;
while (0 * i != 10) {
    i++;
}
Let's assume that there is no cap on the maximum value of i. In this case there would be no compiler error, nor would the CPU resist. Now I want my machine to find the number that, when multiplied by 0, gives me 10 (which is obviously never going to happen), or die trying.
So there is a way to do this. How can I achieve it by directly manipulating bits?
Final Edit: How to perform binary division in Java without using bitwise operators? (I'm sorry, it purely contradicts the title).
Note: I've tried simulating division by 0 and posted my answer. However, I'm looking for a faster way of doing it.
If what you want is a division method faster than division by repeated subtraction (which you posted), and that will run indefinitely when you try to divide by zero, you can implement your own version of the Goldschmidt division, and not throw an error when the divisor is zero.
The algorithm works like this:
1. Create an estimate for the factor f
2. Multiply both the dividend and the divisor by f
3. If the divisor is close enough to 1,
   return the dividend
4. Else,
   go back to step 1
Normally, we would need to scale down the dividend and the divisor before starting, so that 0 < divisor < 1 is satisfied. In this case, since we are going to divide by zero, there's no need for this step. We also need to choose an arbitrary precision beyond which we consider the result good enough.
The code, with no check for divisor == 0, would be like this:
static double goldschmidt(double dividend, double divisor) {
    double epsilon = 0.0000001;
    while (Math.abs(1.0 - divisor) > epsilon) {
        double f = 2.0 - divisor;
        dividend *= f;
        divisor *= f;
    }
    return dividend;
}
This is much faster than the division by repeated subtraction method, since it converges to the result quadratically instead of linearly. When dividing by zero, it would not really matter, since both methods won't converge. But if you try to divide by a small number, such as 10^(-9), you can clearly see the difference.
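To make the convergence speed concrete, here is the same loop instrumented with an iteration counter (my sketch, not part of the original answer):
// Same Goldschmidt loop as above, counting iterations.
static double goldschmidtCounted(double dividend, double divisor) {
    double epsilon = 0.0000001;
    int iterations = 0;
    while (Math.abs(1.0 - divisor) > epsilon) {
        double f = 2.0 - divisor;
        dividend *= f;
        divisor *= f;
        iterations++;
    }
    System.out.println("converged in " + iterations + " iterations");
    return dividend;
}
Calling goldschmidtCounted(1.0, 1e-9) converges in a few dozen iterations and returns approximately 1e9, whereas repeated subtraction would need on the order of 10^9 steps.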
If you don't want the code to run indefinitely, but to return Infinity when dividing by zero, you can modify it to stop when dividend reaches Infinity:
static double goldschmidt(double dividend, double divisor) {
    double epsilon = 0.0000001;
    while (Math.abs(1.0 - divisor) > epsilon && !Double.isInfinite(dividend)) {
        double f = 2.0 - divisor;
        dividend *= f;
        divisor *= f;
    }
    return dividend;
}
If the starting values for dividend and divisor are such that dividend >= 1.0 and divisor == 0.0, you will get Infinity as a result after, at most, 2^10 iterations. That's because the worst case is when dividend == 1 and you need to multiply it by two (f = 2.0 - 0.0) 1024 times to get to 2^1024, which is greater than Double.MAX_VALUE.
The Goldschmidt division was implemented in AMD Athlon CPUs. If you want to read about some lower-level details, you can check this article: "Floating Point Division and Square Root Algorithms and Implementation in the AMD-K7™ Microprocessor".
Edit:
Addressing your comments:
Note that the code for the Restoring Division method you posted iterates 2048 (2^11) times. I lowered the value of n in your code to 1024, so we could compare both methods doing the same number of iterations.
I ran both implementations 100000 times with dividend == 1, which is the worst case for Goldschmidt, and measured the running time like this:
long begin = System.nanoTime();
for (int i = 0; i < 100000; i++) {
    goldschmidt(1.0, 0.0); // changed this to restoringDivision(1) to test your code
}
long end = System.nanoTime();
System.out.println(TimeUnit.NANOSECONDS.toMillis(end - begin) + "ms");
The running time was ~290ms for Goldschmidt division and ~23000ms (23 seconds) for your code. So this implementation was about 80x faster in this test. This is expected, since in one case we are doing double multiplications and in the other we are working with BigInteger.
The advantage of your implementation is that, since you are using BigInteger, you can make your result as large as BigInteger supports, while the result here is limited by Double.MAX_VALUE.
In practice, when dividing by zero, the Goldschmidt division is doubling the dividend, which is equivalent to a shift left, at each iteration, until it reaches the maximum possible value. So the equivalent using BigInteger would be:
static BigInteger divideByZero(int dividend) {
    return BigInteger.valueOf(dividend)
                     .shiftLeft(Integer.MAX_VALUE - 1 - ceilLog2(dividend));
}

static int ceilLog2(int n) {
    return (int) Math.ceil(Math.log(n) / Math.log(2));
}
The function ceilLog2() is necessary, so that the shiftLeft() will not cause an overflow. Depending on how much memory you have allocated, this will probably result in a java.lang.OutOfMemoryError: Java heap space exception. So there is a compromise to be made here:
You can get the division simulation to run really fast, but with a result upper bound of Double.MAX_VALUE,
or
You can get the result to be as big as 2^(Integer.MAX_VALUE - 1), but it would probably take too much memory and time to get to that limit.
Edit 2:
Addressing your new comments:
Please note that no division is happening in your updated code. It's just trying to find the biggest possible BigInteger
First, let us show that the Goldschmidt division degenerates into a shift left when divisor == 0:
static double goldschmidt(double dividend, double divisor) {
    double epsilon = 0.0000001;
    while (Math.abs(1.0 - 0.0) > epsilon && !Double.isInfinite(dividend)) {
        double f = 2.0 - 0.0;
        dividend *= f;
        divisor = 0.0 * f;
    }
    return dividend;
}
The factor f will always be equal to 2.0 and the first while condition will always be true. So if we eliminate the redundancies:
static double goldschmidt(double dividend /* , divisor == 0 */) {
    while (!Double.isInfinite(dividend)) {
        dividend *= 2.0;
    }
    return dividend;
}
Assuming dividend is an Integer, we can do the same multiplication using a shift left:
static int goldschmidt(int dividend) {
    while (...) {
        dividend = dividend << 1;
    }
    return dividend;
}
If the maximum value we can reach is 2^n, we need to loop n times. When dividend == 1, this is equivalent to:
static int goldschmidt(int dividend) {
    return 1 << n;
}
When the dividend > 1, we need to subtract ceil(log2(dividend)) to prevent an overflow:
static int goldschmidt(int dividend) {
    return dividend << (n - ceil(log2(dividend)));
}
Thus showing that the Goldschmidt division is equivalent to a shift left if divisor == 0.
However, shifting the bits to the left would pad bits on the right with 0. Try running this with a small dividend and left shift it (once or twice to check the results). This thing will never get to 2^(Integer.MAX_VALUE - 1).
Now that we've seen that a shift left by n is equivalent to a multiplication by 2^n, let's see how the BigInteger version works. Consider the following examples that show we will get to 2^(Integer.MAX_VALUE - 1) if there is enough memory available and the dividend is a power of 2:
For dividend = 1:
BigInteger.valueOf(dividend).shiftLeft(Integer.MAX_VALUE - 1 - ceilLog2(dividend))
    = BigInteger.valueOf(1).shiftLeft(Integer.MAX_VALUE - 1 - 0)
    = 1 * 2^(Integer.MAX_VALUE - 1)
    = 2^(Integer.MAX_VALUE - 1)
For dividend = 1024:
BigInteger.valueOf(dividend).shiftLeft(Integer.MAX_VALUE - 1 - ceilLog2(dividend))
    = BigInteger.valueOf(1024).shiftLeft(Integer.MAX_VALUE - 1 - 10)
    = 1024 * 2^(Integer.MAX_VALUE - 1 - 10)
    = 2^10 * 2^(Integer.MAX_VALUE - 1 - 10)
    = 2^(Integer.MAX_VALUE - 1)
If dividend is not a power of 2, we will get as close as we can to 2^(Integer.MAX_VALUE - 1) by repeatedly doubling the dividend.
Your requirement is impossible.
Division by 0 is mathematically undefined; the concept just doesn't exist, so there is no way to simulate it.
If you were actually trying to take a limit (divide by 0+ or 0-), there is still no way to do it with bitwise operators, as they only let you divide by powers of two.
Here is an example using only a bitwise operation to divide by a power of 2:
10 >> 1 = 5
Looking at the comments you posted, if what you want is simply to exit your program when a user tries to divide by 0, you can simply validate it:
if (dividend == 0)
    System.exit(/* enter the exit code here */);
That way you will avoid the ArithmeticException.
After exchanging a couple of comments with you, it seems like what you are trying to do is crash the operating system by dividing by 0.
Unfortunately for you, as far as I know, every language that runs on a computer is validated enough to handle division by 0.
Just think of a simple $1 calculator: try to divide by 0 and it won't even crash, it will simply show an error message. This is probably validated at the processor level anyway.
Edit
After multiple edits/comments to your question, it seems like you are trying to obtain Infinity by dividing by a 0+ or 0- that is very close to 0.
You can achieve this with double/float division.
System.out.println(1.0f / 0.0f);//prints infinity
System.out.println(1.0f / -0.0f);//prints -Infinity
System.out.println(1.0d / 0.0d);//prints infinity
System.out.println(1.0d / -0.0d);//prints -Infinity
Note that 0.0 and -0.0 here really are zero; IEEE-754 floating-point division simply defines a nonzero number divided by zero to be a correspondingly signed infinity rather than an error.
No, there isn't, since you can only divide by a power of 2 using right shift.
One way to simulate division of unsigned integers (irrespective of the divisor used) is division by repeated subtraction:
BigInteger result = new BigInteger("0");
int divisor = 0;
int dividend = 2;
while (dividend >= divisor) {
    dividend = dividend - divisor;
    result = result.add(BigInteger.ONE);
}
A second way to do this is to use the Restoring Division algorithm (thanks @harold), which is much faster than the first one:
int num = 10;
BigInteger den = new BigInteger("0");
BigInteger p = new BigInteger(new Integer(num).toString());
int n = 2048; // Can be changed to find the biggest possible number
              // (i.e. up to 2^2147483647 - 1); currently it shows 2^2048 - 1 as output
den = den.shiftLeft(n);
BigInteger q = new BigInteger("0");
for (int i = n; i > 0; i -= 1) {
    q = q.shiftLeft(1);
    p = p.multiply(new BigInteger("2"));
    p = p.subtract(den);
    if (p.compareTo(new BigInteger("0")) == 1
            || p.compareTo(new BigInteger("0")) == 0) {
        q = q.add(new BigInteger("1"));
    } else {
        p = p.add(den);
    }
}
System.out.println(q);
As others have indicated, you cannot mathematically divide by 0.
However if you want methods to divide by 0, there are some constants in Double you could use. For example you could have a method
public static double divide(double a, double b) {
    return b == 0 ? Double.NaN : a / b;
}
or
public static double posLimitDivide(double a, double b) {
    if (a == 0 && b == 0)
        return Double.NaN;
    return b == 0 ? (a > 0 ? Double.POSITIVE_INFINITY : Double.NEGATIVE_INFINITY) : a / b;
}
This returns the limit of a/x as x approaches b from the positive side.
These should be OK, as long as you account for them in whatever methods use them. And by OK I mean bad: they could cause indeterminate behavior later if you're not careful. But it is a clear way to indicate the result with an actual value rather than an exception.
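A quick usage sketch of the two helpers above:
System.out.println(divide(10, 0));           // NaN
System.out.println(posLimitDivide(10, 0));   // Infinity
System.out.println(posLimitDivide(-10, 0));  // -Infinity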

Confusing implementation of java.util.Random nextInt(int n)

I am trying to understand how java.util.Random.nextInt(int n) works, and despite all my searching and even debugging I cannot completely understand the implementation.
It is the while loop that causes confusion:
http://docs.oracle.com/javase/7/docs/api/java/util/Random.html#nextInt(int)
int bits, val;
do {
    bits = next(31);
    val = bits % n;
} while (bits - val + (n-1) < 0);
I realize this is supposed to address the modulo bias, but I'm struggling to see how.
Question: how can the expression
bits - val + (n-1)
possibly be negative, given that bits is a 31-bit value and therefore always non-negative? If bits is non-negative and val is never greater than bits, doesn't the while condition always stay >= 0?
The question has already been addressed in implementation-of-java-util-random-nextint.
Basically, we have to drop some of the topmost values of bits in the range [0..2^31) because they induce a non-uniform distribution.
Mathematically we check:
bits - val + (n-1) >= 2^31
and could have written it exactly that way if Java had unsigned 32-bit integer arithmetic.
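One way to see this in valid Java (my illustration, not the JDK's code): do the arithmetic in long, where nothing overflows, and the two forms of the test coincide:
// The int expression bits - val + (n-1) is negative exactly when the
// exact (long) value of the same expression exceeds Integer.MAX_VALUE.
static boolean rejected(int bits, int n) {
    int val = bits % n;
    return (long) bits - val + (n - 1) > Integer.MAX_VALUE;
}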
You are missing the effect of n in your calculation: either n is negative (ruled out here by the bound check), or n is big enough that bits - val + (n-1) overflows the int range and becomes a negative number.

What does <<= operator mean in Java?

Can you please explain this code snippet from the HashMap constructor, specifically the line
capacity <<= 1:
// Find a power of 2 >= initialCapacity
int capacity = 1;
while (capacity < initialCapacity)
    capacity <<= 1;
It is equivalent to capacity = capacity << 1;.
That operation shifts capacity's bits one position to the left, which is equivalent to multiplying by 2.
The specific code you posted finds the smallest power of 2 that is greater than or equal to initialCapacity.
So if initialCapacity is 27, for example, capacity will be 32 (2^5) after the loop.
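Pulled out as a standalone helper (hypothetical name, same loop as the constructor):
// Rounds up to the next power of two, as the HashMap constructor does.
static int smallestPowerOfTwoAtLeast(int initialCapacity) {
    int capacity = 1;
    while (capacity < initialCapacity)
        capacity <<= 1;
    return capacity;
}
// smallestPowerOfTwoAtLeast(27) == 32, smallestPowerOfTwoAtLeast(32) == 32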
Just like var += 1 is about equivalent to var = var + 1, what you see here (var <<= 1) is about equivalent to var = var << 1, which is "set var to be the result of a 1-position binary left-shift of var."
In this very specific case, it's actually a slightly (runtime) faster way of expressing capacity *= 2 (because a bitwise left-shift of 1 position is equivalent to a multiplication by 2).
It is the equivalent of
capacity = capacity << 1;
which shifts the bits in capacity one position to the left (so, for example, 00011011 becomes 00110110).
Each time through the loop, capacity rises to the next power of 2: initially it is 1, i.e. 2^0; the first capacity <<= 1 makes it 2^1, then 2^2, and so on.
You may like to read more at http://www.tutorialspoint.com/java/java_basic_operators.htm

signed to positive near-perfect hash

I have an integer type, say long, whose values are between Long.MIN_VALUE = 0x80...0 (-2^63) and Long.MAX_VALUE = 0x7f...f (2^63 - 1). I want to hash it with ~50% collision to a positive integer of the same type (i.e. between 1 and Long.MAX_VALUE) in a clean and efficient manner.
My first attempts were something like:
Math.abs(x) + 1
(x & Long.MAX_VALUE) + 1
but those and similar approaches always have problems with certain values, i.e. when x is 0 / Long.MIN_VALUE / Long.MAX_VALUE. Of course, the naive solution is to use 2 if statements, but I'm looking for something cleaner / shorter / faster. Any ideas?
Note: Assume that I'm working in Java where there is no implicit conversion to boolean and shift semantics is defined.
The simplest approach is to zero the sign bit and then map zero to some other value:
long y = x & Long.MAX_VALUE;
return (y == 0) ? 42 : y;
This is simple, uses only one if/ternary operator, and gives ~50% collision rate on average. There is one disadvantage: it maps 4 different values (0, 42, MIN_VALUE, MIN_VALUE+42) to one value (42). So for this value we have 75% collisions, while for other values - exactly 50%.
It may be preferable to distribute collisions more evenly:
return (x == 0)? 42: (x == Long.MIN_VALUE) ? 142: x & Long.MAX_VALUE;
This code gives 67% collisions for 2 values and 50% for other values. You cannot distribute collisions more evenly, but it is possible to choose these 2 most colliding values. Disadvantage is that this code uses two ifs/ternary operators.
It is possible to avoid 75% collisions on single value while using only one if/ternary operator:
long y = x & Long.MAX_VALUE;
return (y == 0) ? 42 - (x >> 7) : y;
This code gives 67% collisions for 2 values and 50% collisions for other values. There is less freedom choosing these most colliding values: 0 maps to 42 (and you can choose almost any value instead); MIN_VALUE maps to 42 - (MIN_VALUE >> 7) (and you can shift MIN_VALUE by any value from 1 to 63, only make sure that A - (MIN_VALUE >> B) does not overflow).
It is possible to get the same result (67% collisions for 2 values and 50% collisions for other values) without conditional operators (but with more complicated code):
long y = x - 1 - ((x >> 63) << 1);
long z = y + 1 + (y >> 63);
return z & Long.MAX_VALUE;
This gives 67% collisions for values '1' and 'MAX_VALUE'. If it is more convenient to get most collisions for some other values, just apply this algorithm to x + A, where 'A' is any number.
An improved variant of this solution:
long y = x + 1 + ((x >> 63) << 1);
long z = y - (y >> 63);
return z & Long.MAX_VALUE;
Assuming you want to collapse all values into the positive space, why not just zero the sign bit?
You can do this with a single bitwise op by taking advantage of the fact that MAX_VALUE is just a zero sign bit followed by ones e.g.
int positive = value & Integer.MAX_VALUE;
Or for longs:
long positive = value & Long.MAX_VALUE;
If you want a "better" hash with pseudo-random qualities, you probably want to pass the value through another hash function first. My favourite fast hashes are the XORshift family by George Marsaglia. These have the nice property that they map the entire int / long number space perfectly onto itself, so you will still get exactly 50% collisions after zeroing the sign bit.
Here's a quick XORshift implementation in Java:
public static final long xorShift64(long a) {
    a ^= (a << 21);
    a ^= (a >>> 35);
    a ^= (a << 4);
    return a;
}

public static final int xorShift32(int a) {
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
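A usage sketch combining the two steps of this answer:
// Given some long input `value`: scramble first, then clear the sign bit.
// Remap 0 only if the result must be strictly positive (0 occurs only for
// the inputs that hash to 0 or to Long.MIN_VALUE).
long h = xorShift64(value) & Long.MAX_VALUE;
long positive = (h == 0) ? 1 : h;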
I would opt for the simplest version that doesn't totally waste time:
public static long positiveHash(final long hash) {
    final long result = hash & Long.MAX_VALUE;
    return (result != 0) ? result : (hash == 0 ? 1 : 2);
}
This implementation pays one conditional operation for all but two possible inputs: 0 and MIN_VALUE. Those two are assigned different value mappings with the second condition. I doubt you get a better combination of (code) simplicity and (computational) complexity.
Of course if you can live with a worse distribution, it gets a lot simpler. By restricting the space to 1/4 instead of to 1/2 -1 you can get:
public static long badDistribution(final long hash) {
    return (hash >>> 2) + 1;
}
You can do it without any conditionals and in a single expression by using the unsigned shift operator:
public static int makePositive(int x) {
return (x >>> 1) + (~x >>> 31);
}
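To unpack that expression: x >>> 1 halves the input into the non-negative range, and ~x >>> 31 adds 1 exactly when x is non-negative, so the result is never 0 and nearly every output has two preimages. The same trick widened to long (my sketch, since the question uses long):
// Branch-free mapping of any long into [1, Long.MAX_VALUE].
public static long makePositive(long x) {
    return (x >>> 1) + (~x >>> 63);
}
// makePositive(0L) == 1; makePositive(-1L) == Long.MAX_VALUE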
If the value is positive, it can probably be used directly; otherwise, flip the sign bit:
hash = x >= 0 ? x : x ^ Long.MIN_VALUE;
However, you should scramble this value a bit more if the values of x are correlated (meaning: similar objects produce similar values for x), maybe with
hash = a * (hash + b) % Long.MAX_VALUE + 1;
for some positive constants a and b, where a should be quite large and b prevents 0 from always being mapped to 1. This also maps the whole thing to [1, Long.MAX_VALUE] instead of [0, Long.MAX_VALUE]. By altering the values of a and b you could also implement more complex hash functionality such as cuckoo hashing, which needs two different hash functions.
Such a solution should definitely be preferred over one that delivers a "strange collision distribution" for the same values each time it is used.
From the information theoretic view, you have 2^64 values to map into 2^63-1 values.
As such, mapping is nearly trivial with the modulus operator. Note that in Java the result of % takes the sign of the dividend, so a negative x must be folded back into the positive range (see the sketch below):
y = 1 + x % 0x7fffffffffffffffL; // the constant is 2^63 - 1
This could be pretty expensive, so what else is possible?
The simple math 2^64 = 2 * (2^63 - 1) + 2 says we will have two source values mapping to one target value, except in two special cases where three will go to one. Think of these as two special 64-bit values, call them x1 and x2, that each share a target with two other source values. In the mod expression above, this occurs by "wrapping". The target values y = 2^63 - 2 and y = 2^63 - 3 have three mappings; all others have two. Since we have to use something more complex than mod anyway, let's seek a way to map the special values wherever we like at low cost.
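For reference, a fold that makes the mod idea valid in Java (my sketch, as referenced above):
// Map any long into [1, Long.MAX_VALUE] using a proper non-negative modulus.
static long positiveMod(long x) {
    long m = Long.MAX_VALUE;          // 2^63 - 1
    long r = x % m;                   // in (-m, m); Java's % keeps the dividend's sign
    return (r < 0 ? r + m : r) + 1;   // fold into [0, m-1], then shift to [1, m]
}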
For illustration let's work with mapping a 4-bit signed int x in [-8..7] to y in [1..7], rather than the 64-bit space.
An easy course is to have x values in [1..7] map to themselves, then the problem reduces to mapping x in [-8..0] to y in [1..7]. Note there are 9 source values here and only 7 targets as discussed above.
There are obviously many strategies; at this point you can probably see a gazillion. I'll describe only one that's particularly simple.
Let y = 1 - x for all values except special cases x1 == -8 and x2 == -7. The whole hash function thus becomes
y = x <= -7 ? S(x) : x <= 0 ? 1 - x : x;
Here S(x) is a simple function that says where x1 and x2 are mapped. Choose S based on what you know about the data. For example if you think high target values are unlikely, map them to 6 and 7 with S(x) = -1 - x.
The final mapping is:
-8: 7 -7: 6 -6: 7 -5: 6 -4: 5 -3: 4 -2: 3 -1: 2
0: 1 1: 1 2: 2 3: 3 4: 4 5: 5 6: 6 7: 7
Taking this logic up to the 64-bit space, you'd have
y = (x <= Long.MIN_VALUE + 1) ? -1 - x : x <= 0 ? 1 - x : x;
Many other kinds of tuning are possible within this framework.
Just to make sure, you have a long and want to hash it to an int?
You could do...
(int) x                  // This results in a meaningless number, but it works
(int) (x & 0xffffffffL)  // This will give you just the low-order bits
(int) (x >> 32)          // This will give you just the high-order bits
((Long) x).hashCode()    // This is the high and low order bits XORed together
If you want to keep a long you could do...
x & 0x7fffffffffffffffL // This will just ignore the sign; Long.MIN_VALUE -> 0
x & Long.MAX_VALUE      // Should be the same, I think
If getting a 0 is no good...
(x & 0x7ffffffffffffffeL) + 1 // This has a 75% collision rate.
Just thinking out loud...
((x & Long.MAX_VALUE) << 1) + 1 // I think this is also 75%
I think you're going to need to either be ok with 75% or get a little ugly:
(x > 0) ? x : (x < 0) ? x & Long.MAX_VALUE : 7
This seems the simplest of all:
(x % Long.MAX_VALUE) + 1
I would be interested in speed comparisons of all the methods given.
Just AND your input value with Long.MAX_VALUE and OR it with 1. Nothing else needed.
Ex:
long hash = (input & Long.MAX_VALUE) | 1;
