Faster mod operation for large numbers in Java

I need to check whether X is divisible by Y or not; I don't need the actual remainder when it isn't.
I'm using the "mod" operator.
if (X.mod(Y).equals(BigInteger.ZERO)) {
    // do something
}
Again, I'm interested only in the case where X is divisible by Y. I'm looking for a faster way to check divisibility when the dividend is fixed: checking a large number (a potential prime) against many smaller primes before going to the Lucas-Lehmer test.
I was just wondering whether, by looking at the last one or two digits of X and Y, we can make some look-ahead decision about whether it is even worth computing the mod (i.e. skip it when there is no chance of getting zero).

Java BigIntegers (like most numbers in computers since about 1980) are binary, so the only moduli that can be checked by looking at the last 'digits' (binary digits, i.e. bits) are powers of 2, and the only power of 2 that is prime is 2 itself. BigInteger.testBit(0) tests that directly. However, most software that generates large should-be primes is for cryptography (like RSA, Rabin, DH, DSA) and ensures it never tests an even candidate in the first place; see e.g. FIPS 186-4 A.1.1.2 (or earlier).
Since your actual goal is not as stated in the title, but to test that one (large) integer is not divisible by any of several small primes, the mathematically fastest way is to form their product -- in general any common multiple, preferably the least, but for distinct primes the product is the LCM -- and compute its GCD with the candidate using Euclid's algorithm. If the GCD is 1, none of the primes in the product divides the candidate. This requires several BigInteger divideAndRemainder operations, but it handles all of your tests in one swoop.
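To make the GCD idea concrete, here is a rough Java sketch (untested; the class name, method name and the particular list of small primes are only made up for illustration):

import java.math.BigInteger;

class SmallPrimeSieve {
    // any handful of small odd primes will do; this list is only an example
    private static final int[] SMALL_PRIMES = {3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41};
    private static final BigInteger PRODUCT = product();

    private static BigInteger product() {
        BigInteger p = BigInteger.ONE;
        for (int q : SMALL_PRIMES) {
            p = p.multiply(BigInteger.valueOf(q));
        }
        return p;
    }

    // true if none of SMALL_PRIMES divides the candidate (candidate assumed larger than 41)
    static boolean passesSmallPrimes(BigInteger candidate) {
        return candidate.gcd(PRODUCT).equals(BigInteger.ONE);
    }
}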
A middle way is to 'bunch' several small primes whose product is less than 2^31 or 2^63, take the BigInteger.mod (or .remainder) by that product as an .intValue() or .longValue() respectively, and test that (if nonzero) for divisibility by each of the small primes using int or long operations, which are much faster than the BigInteger ones. Repeat for several bunches if needed. BigInteger.probablePrime and related routines do exactly this (primes 3..41 against a long) for candidates up to 95 bits, above which they consider an Eratosthenes-style sieve more efficient. (In either case followed by Miller-Rabin and Lucas-Lehmer.)
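And a sketch of the bunching variant, assuming the candidate is odd and larger than the primes being tested; the product of the primes 3..41 is 152125131763605, which fits comfortably in a long:

import java.math.BigInteger;

class BunchedTrialDivision {
    private static final long[] PRIMES = {3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41};
    private static final BigInteger BUNCH = BigInteger.valueOf(productOf(PRIMES));

    private static long productOf(long[] primes) {
        long p = 1;
        for (long q : primes) p *= q;        // 152125131763605, well below 2^63
        return p;
    }

    // true if none of PRIMES divides the candidate
    static boolean passesSmallPrimes(BigInteger candidate) {
        long r = candidate.mod(BUNCH).longValue();   // one slow BigInteger op
        for (long q : PRIMES) {
            if (r % q == 0) return false;            // cheap long ops from here on
        }
        return true;
    }
}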
When measuring things like this in Java, remember that if you execute some method 'a lot', where the exact definition of 'a lot' can vary and be hard to pin down, all common JVMs will JIT-compile the code, changing the performance radically. If you are doing it a lot be sure to measure the compiled performance, and if you aren't doing it a lot the performance usually doesn't matter. There are many existing questions here on SO about pitfalls in 'microbenchmark(s)' for Java.

There are algorithms to check divisibility, but it's plural: each algorithm covers a particular group of divisors, e.g. divisible by 3, divisible by 4, etc. A list of some of them can be found e.g. at Wikipedia. There is no general, high-performance algorithm that works for any given number; otherwise whoever found it would be famous and every divisibility-check implementation out there would use it.

Related

Java HashMap array size

I am reading the implementation details of Java 8 HashMap. Can anyone let me know why the Java HashMap initial array size is 16 specifically? What is so special about 16? And why is it always a power of two? Thanks
The reason why powers of 2 appear everywhere is that when expressing numbers in binary (as they are in circuits), certain math operations on powers of 2 are simpler and faster to perform (just think about how easy math with powers of 10 is in the decimal system we use). For example, multiplication is not a very efficient process in computers - circuits use a method similar to the one you use when multiplying two numbers each with multiple digits. Multiplying or dividing by a power of 2 requires the computer to just shift bits to the left for multiplying or to the right for dividing.
And as for why 16 for HashMap? 10 is a commonly used default for dynamically growing structures (arbitrarily chosen), and 16 is not far off - but is a power of 2.
You can do modulus very efficiently for a power of 2. n % d = n & (d-1) when d is a power of 2, and modulus is used to determine which index an item maps to in the internal array - which means it occurs very often in a Java HashMap. Modulus requires division, which is also much less efficient than using the bitwise and operator. You can convince yourself of this by reading a book on Digital Logic.
The reason why bitwise and works this way for powers of two is because every power of 2 is expressed as a single bit set to 1. Let's say that bit is t. When you subtract 1 from a power of 2, you set every bit below t to 1, and every bit above t (as well as t) to 0. Bitwise and therefore saves the values of all bits below position t from the number n (as expressed above), and sets the rest to 0.
But how does that help us? Remember that when dividing by a power of 10, you can count the number of zeroes following the 1, and take that number of digits starting from the least significant of the dividend in order to find the remainder. Example: 637989 % 1000 = 989. A similar property applies to binary numbers with only one bit set to 1, and the rest set to 0. Example: 100101 % 001000 = 000101
There's one more thing about choosing hash & (n - 1) versus modulo, and that is negative hashes. hashCode() is of type int, which of course can be negative. Modulo on a negative number (in Java) is negative also, while & is not.
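A tiny demo of both points (the mask trick for power-of-two table sizes, and how it differs from % for negative hashes); the numbers here are arbitrary:

public class MaskDemo {
    public static void main(String[] args) {
        int n = 16;                                  // power of two, like HashMap's default capacity
        int positiveHash = 637989;
        System.out.println(positiveHash % n);        // 5
        System.out.println(positiveHash & (n - 1));  // 5 -- same result for non-negative values

        int negativeHash = -7;
        System.out.println(negativeHash % n);        // -7, Java's % keeps the sign
        System.out.println(negativeHash & (n - 1));  // 9, always a valid index 0..15
    }
}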
Another reason is that you want all of the slots in the array to be equally likely to be used. Since hash() is evenly distributed over 32 bits, if the array size didn't divide into the hash space, then there would be a remainder causing lower indexes to have a slightly higher chance of being used. Ideally, not just the hash, but (hash() % array_size) is random and evenly distributed.
But this only really matters for data with a small hash range (like a byte or character).

What is a possible use case of BigInteger's .isProbablePrime()?

The method BigInteger.isProbablePrime() is quite strange; from the documentation, this will tell whether a number is prime with a probability of 1 - 1 / 2^arg, where arg is the integer argument.
It has been present in the JDK for quite a long time, so it means it must have uses. My limited knowledge in computer science and algorithms (and maths) tells me that it does not really make sense to know whether a number is "probably" a prime but not exactly a prime.
So, what is a possible scenario where one would want to use this method? Cryptography?
Yes, this method can be used in cryptography. RSA encryption involves the finding of huge prime numbers, sometimes on the order of 1024 bits (about 300 digits). The security of RSA depends on the fact that factoring a number that consists of 2 of these prime numbers multiplied together is extremely difficult and time consuming. But for it to work, they must be prime.
It turns out that proving these numbers prime is difficult too. But the Miller-Rabin primality test, one of the primality tests used by isProbablePrime, either detects that a number is composite or gives no conclusion. Running this test n times allows you to conclude that there is at most a 1 in 2^n chance that the number is really composite. Running it 100 times yields the acceptable risk of 1 in 2^100 that this number is composite.
If the test tells you an integer is not prime, you can certainly believe that 100%.
It is only the other side of the question, if the test tells you an integer is "a probable prime", that you may entertain doubt. Repeating the test with varying "bases" allows the probability of falsely succeeding at "imitating" a prime (being a strong pseudo-prime with respect to multiple bases) to be made as small as desired.
The usefulness of the test lies in its speed and simplicity. One would not necessarily be satisfied with the status of "probable prime" as a final answer, but one would definitely avoid wasting time on almost all composite numbers by using this routine before bringing in the big guns of primality testing.
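As a sketch of that filter-first idea (provePrimeExactly is a hypothetical placeholder for whatever definitive test you bring in afterwards):

import java.math.BigInteger;

class PrimeFilter {
    static boolean isPrimeForSure(BigInteger n) {
        if (!n.isProbablePrime(64)) {
            return false;             // a false return from isProbablePrime means definitely composite
        }
        return provePrimeExactly(n);  // only the rare survivors reach the expensive test
    }

    // stand-in for a deterministic test such as AKS or ECPP
    static boolean provePrimeExactly(BigInteger n) {
        throw new UnsupportedOperationException("plug in a deterministic primality test here");
    }
}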
The comparison to the difficulty of factoring integers is something of a red herring. It is known that the primality of an integer can be determined in polynomial time, and indeed there is a proof that an extension of the Miller-Rabin test to sufficiently many bases is definitive (in detecting primes, as opposed to probable primes), but this assumes the Generalized Riemann Hypothesis, so it is not quite so certain as the (more expensive) AKS primality test.
The standard use case for BigInteger.isProbablePrime(int) is in cryptography. Specifically, certain cryptographic algorithms, such as RSA, require randomly chosen large primes. Importantly, however, these algorithms don't really require these numbers to be guaranteed to be prime — they just need to be prime with a very high probability.
How high is very high? Well, in a crypto application, one would typically call .isProbablePrime() with an argument somewhere between 128 and 256. Thus, the probability of a non-prime number passing such a test is less than one in 2^128 or 2^256.
Let's put that in perspective: if you had 10 billion computers, each generating 10 billion probable prime numbers per second (which would mean less than one clock cycle per number on any modern CPU), and the primality of those numbers was tested with .isProbablePrime(128), you would, on average, expect one non-prime number to slip in once in every 100 billion years.
That is, that would be the case, if those 10 billion computers could somehow all run for hundreds of billions of years without experiencing any hardware failures. In practice, though, it's a lot more likely for a random cosmic ray to strike your computer at just the right time and place to flip the return value of .isProbablePrime(128) from false to true, without causing any other detectable effects, than it is for a non-prime number to actually pass the probabilistic primality test at that certainty level.
Of course, the same risk of random cosmic rays and other hardware faults also applies to deterministic primality tests like AKS. Thus, in practice, even these tests have a (very small) baseline false positive rate due to random hardware failures (not to mention all other possible sources of errors, such as implementation bugs).
Since it's easy to push the intrinsic false positive rate of the Miller–Rabin primality test used by .isProbablePrime() far below this baseline rate, simply by repeating the test sufficiently many times, and since, even repeated so many times, the Miller–Rabin test is still much faster in practice than the best known deterministic primality tests like AKS, it remains the standard primality test for cryptographic applications.
(Besides, even if you happened to accidentally select a strong pseudoprime as one of the factors of your RSA modulus, it would not generally lead to a catastrophic failure. Typically, such pseudoprimes would be products of two (or rarely more) primes of approximately half the length, which means that you'd end up with a multi-prime RSA key. As long as none of the factors were too small (and if they were, the primality test should've caught them), the RSA algorithm will still work just fine, and the key, although somewhat weaker against certain types of attacks than normal RSA keys of the same length, should still be reasonably secure if you didn't needlessly skimp on the key length.)
A possible use case is in testing primality of a given number (a test which in itself has many uses). The isProbablePrime algorithm will run much faster than an exact algorithm, so if the number fails isProbablePrime, then one need not go to the expense of running the more expensive exact algorithm.
Finding probable primes is an important problem in cryptography. It turns out that a reasonable strategy for finding a probable k-bit prime is to repeatedly select a random k-bit number, and test it for probable primality using a method like isProbablePrime().
For further discussion, see section 4.4.1 of the Handbook of Applied Cryptography.
Also see On generation of probable primes by incremental search by Brandt and Damgård.
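A minimal sketch of that strategy (untested; uses SecureRandom as the entropy source). Note the library already packages essentially this loop as BigInteger.probablePrime(bitLength, rnd):

import java.math.BigInteger;
import java.security.SecureRandom;

class ProbablePrimeSearch {
    static BigInteger randomProbablePrime(int bits, SecureRandom rnd) {
        while (true) {
            // force the top bit (so the number really has `bits` bits) and the low bit (odd)
            BigInteger candidate = new BigInteger(bits, rnd).setBit(bits - 1).setBit(0);
            if (candidate.isProbablePrime(100)) {
                return candidate;
            }
        }
    }
}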
Algorithms such as RSA key generation rely on being able to determine whether a number is prime or not.
However, at the time that the isProbablePrime method was added to the JDK (February 1997), there was no proven way to deterministically decide whether a number was prime in a reasonable amount of time. The best known approach at that time was the Miller-Rabin algorithm - a probabilistic algorithm that would sometimes give false positives (i.e., would report non-primes as primes), but could be tuned to reduce the likelihood of false positives, at the expense of modest increases in runtime.
Since then, algorithms have been discovered that can deterministically decide whether a number is prime reasonably quickly, such as the AKS algorithm that was discovered in August 2002. However, it should be noted that these algorithms are still not as fast as Miller-Rabin.
Perhaps a better question is why no isPrime method has been added to the JDK since 2002.

Calculating geometric mean of a long list of random doubles

So, I came across a problem today in my construction of a restricted Boltzmann machine; it should be trivial, but seems to be troublingly difficult. Basically I'm initializing 2k values to random doubles between 0 and 1.
What I would like to do is calculate the geometric mean of this data set. The problem I'm running into is that since the data set is so long, multiplying everything together will always underflow to zero, and taking the proper root at every step will just drive everything toward 1.
I could potentially chunk the list up, but I think that's really gross. Any ideas on how to do this in an elegant way?
In theory I would like to extend my current RBM code to closer to 15k+ entries, and be able to run the RBM across multiple threads. Sadly this rules out Apache Commons Math (its geometric mean method is not synchronized), and longs.
Wow, using a big decimal type is way overkill!
Just take the logarithm of everything, find the arithmetic mean, and then exponentiate.
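Something like this (a minimal sketch; assumes all values are strictly positive):

static double geometricMean(double[] values) {
    double logSum = 0.0;
    for (double v : values) {
        logSum += Math.log(v);                 // log of the product = sum of the logs
    }
    return Math.exp(logSum / values.length);   // exp of the mean log
}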
Mehrdad's logarithm solution certainly works. You can do it faster (and possibly more accurately), though:
Compute the sum of the exponents of the numbers, say S.
Slam all of the exponents to zero so that each number is between 1/2 and 1.
Group the numbers into bunches of at most 1000.
For each group, compute the product of the numbers. This will not underflow.
Add the exponent of the product to S and slam the exponent to zero.
You now have about 1/1000 as many numbers. Repeat the grouping and multiplying steps until only one number remains.
Call the one remaining number T. The geometric mean is T^(1/N) * 2^(S/N), where N is the size of the input.
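A hedged Java sketch of those steps, using Math.getExponent and Math.scalb for the exponent bookkeeping (assumes all inputs are positive, normal doubles, e.g. values from Random.nextDouble()):

static double geometricMeanByExponents(double[] values) {
    long exponentSum = 0;                          // S in the description above
    double[] work = new double[values.length];
    for (int i = 0; i < values.length; i++) {
        int e = Math.getExponent(values[i]) + 1;   // chosen so the scaled value lands in [0.5, 1)
        exponentSum += e;
        work[i] = Math.scalb(values[i], -e);
    }

    int count = values.length;
    while (count > 1) {
        int groups = (count + 999) / 1000;
        double[] next = new double[groups];
        for (int g = 0; g < groups; g++) {
            double product = 1.0;
            for (int i = g * 1000; i < Math.min(count, (g + 1) * 1000); i++) {
                product *= work[i];                // at worst 2^-1000, so no underflow
            }
            int e = Math.getExponent(product) + 1;
            exponentSum += e;                      // fold the group's exponent back into S
            next[g] = Math.scalb(product, -e);
        }
        work = next;
        count = groups;
    }

    double n = values.length;
    return Math.pow(work[0], 1.0 / n) * Math.pow(2.0, exponentSum / n);   // T^(1/N) * 2^(S/N)
}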
It looks like after a sufficient number of multiplications the double precision is not sufficient anymore. Too many leading zeros, if you will.
The wiki page on arbitrary precision arithmetic shows a few ways to deal with the problem. In Java, BigDecimal seems the way to go, though at the expense of speed.

Convert string to a large integer?

I have an assignment (I think a pretty common one) where the goal is to develop a LargeInteger class that can do calculations with very large integers.
I am obviously not allowed to use the java.math.BigInteger class at all.
Right off the top I am stuck. I need to take 2 Strings from the user (the long digits) and then I will be using these strings to perform the various calculation methods (add, divide, multiply etc.)
Can anyone explain to me the theory behind how this is supposed to work? After I take the string from the user (since it is too large to store in an int), am I supposed to break it up, maybe into 10-digit blocks stored as longs (I think 10 digits is the max, or maybe 9)?
any help is appreciated.
First off, think about what a convenient data structure to store the number would be. Think about how you would store an N digit number into an int[] array.
Now let's take addition for example. How would you go about adding two N digit numbers?
Using our grade-school addition, first we look at the least significant digit (in standard notation, this would be the right-most digit) of both numbers. Then add them up.
So if the right-most digits were 7 and 8, we would obtain 15. Take the right-most digit of this result (5) and that's the least significant digit of the answer. The 1 is carried over to the next calculation. So now we look at the 2nd least significant digit and add those together along with the carry (if there is no carry, it is 0). And repeat until there are no digits left to add.
The basic idea is to translate how you add, multiply, etc by hand into code when the numbers are stored in some data structure.
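For example, a minimal sketch of that addition, with each decimal digit stored in an int[] least significant digit first (digits[0] is the ones place):

static int[] add(int[] a, int[] b) {
    int n = Math.max(a.length, b.length);
    int[] result = new int[n + 1];         // one extra slot for a final carry
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int da = i < a.length ? a[i] : 0;  // treat missing digits as 0
        int db = i < b.length ? b[i] : 0;
        int sum = da + db + carry;
        result[i] = sum % 10;              // this column's digit of the answer
        carry = sum / 10;                  // 0 or 1, carried into the next column
    }
    result[n] = carry;                     // may leave a leading zero digit
    return result;
}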
I'll give you a few pointers as to what I might do with a similar task, but let you figure out the details.
Look at how addition is done in simple electronic adder circuits. Specifically, they use small blocks of addition combined together. These principles will help: you can add the blocks, just remember to carry over from one block to the next.
Your idea of breaking it up into smaller blocks is an excellent one. Just remember to do the correct conversions. I suspect 9 digits is just about right, for the purpose of carry-overs, etc.
These tips will help you with addition and subtraction. Multiplication and division are a bit trickier, but again, a few tips.
Multiplication is the easier of the tasks, just remember to multiply each block of one number with the other, and carry the zeros.
Integer division could basically be approached like long division, only using whole blocks at a time.
I've never actually built such a class, so hopefully there will be something in here you can use.
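To illustrate the multiplication tip, a sketch using the same little-endian digit representation (one decimal digit per slot for readability; real code would use bigger blocks):

static int[] multiply(int[] a, int[] b) {
    int[] result = new int[a.length + b.length];    // enough room for every possible digit
    for (int i = 0; i < a.length; i++) {
        int carry = 0;
        for (int j = 0; j < b.length; j++) {
            int cur = result[i + j] + a[i] * b[j] + carry;
            result[i + j] = cur % 10;
            carry = cur / 10;
        }
        result[i + b.length] += carry;              // the place-value shift handles the 'zeros'
    }
    return result;                                  // may contain leading zero digits
}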
Look at the source code for MPI 1.8.6 by Michael Bromberger (a C library). It uses a simple data structure for bignums and simple algorithms. It's C, not Java, but straightforward.
Its division performs poorly (and results in slow conversion of very large bignums to text), but you can follow the code.
There is a function mpi_read_radix to read a number in an arbitrary radix (up to base 36, where the letter Z is 35) with an optional leading +/- sign, and produce a bignum.
I recently chose that code for a programming language interpreter because although it is not the fastest performer out there, nor the most complete, it is very hackable. I've been able to rewrite the square root myself to a faster version, fix some coding bugs affecting a port to 64 bit digits, and add some missing operations that I needed. Plus the licensing is BSD compatible.

Why does java.util.Random use the mask?

Simplified (i.e., leaving concurrency out) Random.next(int bits) looks like
protected int next(int bits) {
    seed = (seed * multiplier + addend) & mask;
    return (int) (seed >>> (48 - bits));
}
where masking gets used to reduce the seed to 48 bits. Why is it better than just
protected int next(int bits) {
    seed = seed * multiplier + addend;
    return (int) (seed >>> (64 - bits));
}
? I've read quite a lot about random numbers, but see no reason for this.
The reason is that the lower bits tend to have a lower period (at least with the algorithm Java uses)
From Wikipedia - Linear Congruential Generator:
As shown above, LCG's do not always use all of the bits in the values they produce. The Java implementation produces 48 bits with each iteration but only returns the 32 most significant bits from these values. This is because the higher-order bits have longer periods than the lower order bits (see below). LCG's that use this technique produce much better values than those that do not.
edit:
after further reading (conveniently, on Wikipedia), the values of a, c, and m must satisfy these conditions to have the full period:
c and m must be relatively prime
a-1 is divisible by all prime factors of m
a-1 is a multiple of 4 if m is a multiple of 4
The only one that I can clearly tell is still satisfied is #3. #1 and #2 need to be checked, and I have a feeling that one (or both) of these fail.
From the docs at the top of java.util.Random:
The algorithm is described in The Art of Computer Programming,
Volume 2 by Donald Knuth in Section 3.2.1. It is a 48-bit seed,
linear congruential formula.
So the entire algorithm is designed to operate on 48-bit seeds, not 64-bit ones. I guess you can take it up with Mr. Knuth ;p
From wikipedia (the quote alluded to by the quote that #helloworld922 posted):
A further problem of LCGs is that the lower-order bits of the generated sequence have a far shorter period than the sequence as a whole if m is set to a power of 2. In general, the n-th least significant digit in the base b representation of the output sequence, where b^k = m for some integer k, repeats with at most period b^n.
And furthermore, it continues (my italics):
The low-order bits of LCGs when m is a power of 2 should never be relied on for any degree of randomness whatsoever. Indeed, simply substituting 2^n for the modulus term reveals that the low order bits go through very short cycles. In particular, any full-cycle LCG when m is a power of 2 will produce alternately odd and even results.
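A quick way to see that last point with Java's own LCG constants (this iterates the raw recurrence directly rather than going through java.util.Random, whose constructor also scrambles the initial seed):

public class LowBitDemo {
    public static void main(String[] args) {
        long seed = 12345L;
        long multiplier = 0x5DEECE66DL;       // java.util.Random's multiplier
        long addend = 0xBL;                   // java.util.Random's addend
        long mask = (1L << 48) - 1;           // modulus 2^48
        for (int i = 0; i < 16; i++) {
            seed = (seed * multiplier + addend) & mask;
            System.out.print(seed & 1);       // the lowest bit simply alternates: 0101010101...
        }
    }
}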
In the end, the reason is probably historical: the folks at Sun wanted something to work reliably, and the Knuth formula gave 32 significant bits. Note that the java.util.Random API says this (my italics):
If two instances of Random are created with the same seed, and the same sequence of method calls is made for each, they will generate and return identical sequences of numbers. In order to guarantee this property, particular algorithms are specified for the class Random. Java implementations must use all the algorithms shown here for the class Random, for the sake of absolute portability of Java code. However, subclasses of class Random are permitted to use other algorithms, so long as they adhere to the general contracts for all the methods.
So we're stuck with it as a reference implementation. However that doesn't mean you can't use another generator (and subclass Random or create a new class):
from the same Wikipedia page:
MMIX by Donald Knuth: m = 2^64, a = 6364136223846793005, c = 1442695040888963407
There's a 64-bit formula for you.
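If you wanted to try it, subclassing Random and overriding next(int) is the documented extension point (nextInt, nextDouble, etc. are built on it). This is only a sketch of the idea, not a vetted generator:

import java.util.Random;

class Mmix64Random extends Random {
    private long state;

    Mmix64Random(long seed) {
        this.state = seed;
    }

    @Override
    protected int next(int bits) {
        // Knuth's MMIX constants; the mod 2^64 happens implicitly via long overflow
        state = state * 6364136223846793005L + 1442695040888963407L;
        return (int) (state >>> (64 - bits));    // return only the high-order bits
    }
}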
Random numbers are tricky (as Knuth notes) and depending on your needs, you might be fine with just calling java.util.Random twice and concatenating the bits if you need a 64-bit number. If you really care about the statistical properties, use something like Mersenne Twister, or if you care about information leakage / unpredictability use java.security.SecureRandom.
It doesn't look like there was a good reason for doing this.
Applying the mask is a conservative approach using a proven design.
Leaving it out would most probably lead to a better generator; however, without knowing the math well, it's a risky step.
Another small advantage of masking is a speed gain on 8-bit architectures, since it uses 6 bytes instead of 8.
