The method BigInteger.isProbablePrime() is quite strange; from the documentation, this will tell whether a number is prime with a probability of 1 - 1 / 2^arg, where arg is the integer argument.
It has been present in the JDK for quite a long time, so it must have its uses. My limited knowledge of computer science and algorithms (and maths) tells me that it does not really make sense to know whether a number is "probably" a prime rather than exactly a prime.
So, what is a possible scenario where one would want to use this method? Cryptography?
Yes, this method can be used in cryptography. RSA encryption involves the finding of huge prime numbers, sometimes on the order of 1024 bits (about 300 digits). The security of RSA depends on the fact that factoring a number that consists of 2 of these prime numbers multiplied together is extremely difficult and time consuming. But for it to work, they must be prime.
It turns out that proving these numbers prime is difficult too. But the Miller-Rabin primality test, one of the primality tests used by isProbablePrime, either detects that a number is composite or gives no conclusion. Running this test n times lets you conclude that there is at most a 1 in 2^n chance that the number is really composite. Running it 100 times yields the acceptable risk of 1 in 2^100 that this number is composite.
If the test tells you an integer is not prime, you can certainly believe that 100%.
It is only the other side of the question, if the test tells you an integer is "a probable prime", that you may entertain doubt. Repeating the test with varying "bases" allows the probability of falsely succeeding at "imitating" a prime (being a strong pseudo-prime with respect to multiple bases) to be made as small as desired.
The usefulness of the test lies in its speed and simplicity. One would not necessarily be satisfied with the status of "probable prime" as a final answer, but one would definitely avoid wasting time on almost all composite numbers by using this routine before bringing in the big guns of primality testing.
The comparison to the difficulty of factoring integers is something of a red herring. It is known that the primality of an integer can be determined in polynomial time, and indeed there is a proof that an extension of the Miller-Rabin test to sufficiently many bases is definitive (in detecting primes, as opposed to probable primes), but that proof assumes the Generalized Riemann Hypothesis, so it is not quite as certain as the (more expensive) AKS primality test.
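For a sense of what this looks like in code, here is a minimal sketch; the 1024-bit size and the certainty of 100 are just example values echoing the figures above, not a recommendation:

import java.math.BigInteger;
import java.security.SecureRandom;

public class ProbablePrimeDemo {
    public static void main(String[] args) {
        SecureRandom rng = new SecureRandom();
        // Ask the JDK for a random 1024-bit value that already passes its internal tests...
        BigInteger p = BigInteger.probablePrime(1024, rng);
        // ...and/or re-check any candidate with an explicit certainty parameter:
        boolean passes = p.isProbablePrime(100);   // error probability at most 1 in 2^100
        System.out.println(p.bitLength() + "-bit candidate passes: " + passes);
    }
}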
The standard use case for BigInteger.isProbablePrime(int) is in cryptography. Specifically, certain cryptographic algorithms, such as RSA, require randomly chosen large primes. Importantly, however, these algorithms don't really require these numbers to be guaranteed to be prime — they just need to be prime with a very high probability.
How high is very high? Well, in a crypto application, one would typically call .isProbablePrime() with an argument somewhere between 128 and 256. Thus, the probability of a non-prime number passing such a test is less than one in 2^128 or 2^256.
Let's put that in perspective: if you had 10 billion computers, each generating 10 billion probable prime numbers per second (which would mean less than one clock cycle per number on any modern CPU), and the primality of those numbers was tested with .isProbablePrime(128), you would, on average, expect one non-prime number to slip in once in every 100 billion years.
That is, that would be the case, if those 10 billion computers could somehow all run for hundreds of billions of years without experiencing any hardware failures. In practice, though, it's a lot more likely for a random cosmic ray to strike your computer at just the right time and place to flip the return value of .isProbablePrime(128) from false to true, without causing any other detectable effects, than it is for a non-prime number to actually pass the probabilistic primality test at that certainty level.
Of course, the same risk of random cosmic rays and other hardware faults also applies to deterministic primality tests like AKS. Thus, in practice, even these tests have a (very small) baseline false positive rate due to random hardware failures (not to mention all other possible sources of errors, such as implementation bugs).
Since it's easy to push the intrinsic false positive rate of the Miller–Rabin primality test used by .isProbablePrime() far below this baseline rate, simply by repeating the test sufficiently many times, and since, even repeated so many times, the Miller–Rabin test is still much faster in practice than the best known deterministic primality tests like AKS, it remains the standard primality test for cryptographic applications.
(Besides, even if you happened to accidentally select a strong pseudoprime as one of the factors of your RSA modulus, it would not generally lead to a catastrophic failure. Typically, such pseudoprimes would be products of two (or rarely more) primes of approximately half the length, which means that you'd end up with a multi-prime RSA key. As long as none of the factors were too small (and if they were, the primality test should've caught them), the RSA algorithm will still work just fine, and the key, although somewhat weaker against certain types of attacks than normal RSA keys of the same length, should still be reasonably secure if you didn't needlessly skimp on the key length.)
A possible use case is in testing the primality of a given number (a test which in itself has many uses). The isProbablePrime algorithm will run much faster than an exact algorithm, so if the number fails isProbablePrime, then one need not go to the expense of running the exact algorithm at all.
Finding probable primes is an important problem in cryptography. It turns out that a reasonable strategy for finding a probable k-bit prime is to repeatedly select a random k-bit number, and test it for probable primality using a method like isProbablePrime().
For further discussion, see section 4.4.1 of the Handbook of Applied Cryptography.
Also see On generation of probable primes by incremental search by Brandt and Damgård.
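A minimal sketch of that random-selection strategy; the method name, the forcing of the top and bottom bits, and the certainty value are illustrative choices, not the algorithm from the references above:

import java.math.BigInteger;
import java.security.SecureRandom;

public class RandomPrimeSearch {
    // Repeatedly draw random k-bit odd integers until one passes the probabilistic test.
    static BigInteger randomProbablePrime(int k, int certainty, SecureRandom rng) {
        while (true) {
            BigInteger candidate = new BigInteger(k, rng)
                    .setBit(k - 1)   // force exactly k bits
                    .setBit(0);      // force the candidate to be odd
            if (candidate.isProbablePrime(certainty)) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(randomProbablePrime(512, 100, new SecureRandom()));
    }
}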
Algorithms such as RSA key generation rely on being able to determine whether a number is prime or not.
However, at the time that the isProbablePrime method was added to the JDK (February 1997), there was no proven way to deterministically decide whether a number was prime in a reasonable amount of time. The best known approach at that time was the Miller-Rabin algorithm - a probabilistic algorithm that would sometimes give false positives (i.e., would report non-primes as primes), but could be tuned to reduce the likelihood of false positives, at the expense of modest increases in runtime.
Since then, algorithms have been discovered that can deterministically decide whether a number is prime reasonably quickly, such as the AKS algorithm that was discovered in August 2002. However, it should be noted that these algorithms are still not as fast as Miller-Rabin.
Perhaps a better question is why no isPrime method has been added to the JDK since 2002.
A few weeks ago I wrote an exam. The first task was to find the right approximation of a function with the given properties: [image: Properties of the function]
I had to check every approximation with the tests I wrote for the properties. Properties 2, 3 and 4 were no problem, but I don't know how to check for property 1 using a JUnit test written in Java. My approach was to do it this way:
@Test
void test1() {
    double x = 0.01;
    double res = underTest.apply(x);
    for (int i = 0; i < 100000; ++i) {
        x = rnd(0, x);              // pick a new random x in (0, x]
        double lastRes = res;
        res = underTest.apply(x);
        assertTrue(res <= lastRes); // the result must not grow as x shrinks
    }
}
Where rnd(0, x) is a function call that generates a random number within (0,x].
But this can't be the right way, because it only checks that, as the x values get smaller, each result is smaller than (or equal to) the previous one. I.e. the test would also succeed if the first res were -5 and each subsequent result were just a little smaller than the previous one for all 100,000 iterations, so that the result after the 100,000th iteration is -5.5 (or something else). That means the test would also succeed for a function whose right-hand limit at 0 is -6. Is there a way to check for property 1?
Is there a way to check for property 1?
Trivially? No, of course not. Calculus is all about what happens at infinity / at infinitely close to any particular point – where basic arithmetic gets stuck trying to divide a number that grows ever tinier by a range that is ever smaller, and basic arithmetic can't do '0 divided by 0'.
Computers are like basic arithmetic here, or at least, you are applying basic arithmetic in the code you have pasted, and making the computer do actual calculus is considerably more complicated than what you do here. As in, 'years of programming experience required' more complicated; I very much doubt you were supposed to write a few hundred thousand lines of code and build an entire computerized math platform just for this test.
More generally, you seem to be labouring under the notion that a test proves anything.
They do not. A test cannot prove correctness, ever. Just like science can never prove anything, only disprove (while building up ever more solid foundations and confirmations, which any rational person usually considers more than enough to, say, take it as given that the law of gravity will hold, even though there is no proof of this and there never will be).
A test can only prove that you have a bug. It can never prove that you don't.
Therefore, anything you can do here is merely a shades of gray scenario: You can make this test catch more scenarios where you know the underTest algorithm is incorrect, but it is impossible to catch all of them.
What you have pasted is already useful: this test checks that, as you feed in double values ever closer to 0, the output of the function keeps getting smaller (or at least never grows). That's worth something. You seem to think this particular shade of gray isn't dark enough for you. You want to show that it really heads off towards (negative) infinity.
That's... your choice. You cannot PROVE that it gets to infinity, because computers are like basic arithmetic with some approximation sprinkled in.
There isn't a heck of a lot you can do here. Yes, you can test whether the final res value is, for example, 'less than -1000000'. You can even test whether it is literally negative infinity, but there is no guarantee that this is even correct: a function defined as 'as the input goes to 0 from the positive side, the output tends towards negative infinity' is free to do so only for inputs so incredibly tiny that double cannot represent them at all. Computers are not magical; doubles take 64 bits, which means there are at most 2^64 unique values a double can represent. 2^64 is a very large number, but it is nothing compared to the doubly-dimensioned infinity that is 'all numbers imaginable' (there are an infinite amount of numbers between 0 and 1, and an infinite amount of such intervals across the whole number line, after all). Thus, there are plenty of very tiny numbers that double just cannot represent. At all.
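If you do want a stronger (but still arbitrary) check, about the best you can do is assert that the value at the smallest representable positive double is below some hand-picked threshold. A sketch, reusing the question's underTest; the threshold is an example choice only, and it has to respect double's limits (the next answer shows that even -1000 is out of reach for a plain natural logarithm):

@Test
void test1_threshold() {
    // Sketch only: an arbitrary cut-off cannot prove the limit is minus infinity,
    // but it does reject functions that level off at, say, -6.
    double nearZero = Double.MIN_VALUE;             // smallest positive double, about 4.9E-324
    assertTrue(underTest.apply(nearZero) < -500.0); // -500 is a hand-picked example threshold
}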
For your own sanity, using randomness in a unit test is a bad idea, and for some definitions of the word 'unit test' it is literally broken: some labour under the notion that unit tests must be deterministic or they cannot be considered unit tests, and this is not such a crazy notion if you look at what 'unit test' pragmatically ends up being used for, namely having test environments run the unit tests automatically, repeatedly and near continuously, in order to flag up as soon as possible when someone breaks one. If the CI server runs a unit test 1819 times a day, it would get quite annoying if, by sheer random chance, one in 20k runs fails; the server would then blame the most recent commit, and no framework I know of re-runs a unit test a few more times just in case. In the end, programming works best if you move away from the notion of proofs and cold hard definitions, and towards 'do what the community thinks things mean'. For unit tests, that means: don't use randomness.
Firstly: you cannot reliably test ANY of the properties
function f could break one of the properties at a point x which is not even representable as a double, due to limited precision
there are too many points to test; realistically you need to pick a subset of the domain
Secondly:
your definition of a limit is wrong. You check that the function is monotonically decreasing, but this is not required by the definition of a limit: the function could fluctuate while approaching the limit. In the general case, I would probably follow the Weierstrass definition
But:
By eyeballing the conditions you can quickly notice that a logarithmic function (with any base) meets the criteria (so the function is indeed monotonically decreasing).
Let's pick natural logarithm, and check its value at smallest x that can be represented as a double:
System.out.println(Double.MIN_VALUE);
System.out.println(Math.log(Double.MIN_VALUE));
System.out.println(Math.log(Double.MIN_VALUE) < -1000);
// prints
4.9E-324
-744.4400719213812
false
As you can see, the value is circa -744, which is really far from minus infinity. You cannot get any closer with a double represented in 64 bits.
Recently I have been working on DNA sequence matching algorithms and their comparison. I have implemented the standard Naive, KMP, and Rabin-Karp algorithms for this purpose.
After executing them in Java (8 GB RAM, Intel i5 processor, 1 GB hard disk), I noted that naive works faster than KMP and RK.
But I was astonished to find that for DNA sequences of up to 100,000 characters and a pattern of 4 characters, naive (6 ms) still outperforms KMP (11 ms) and RK (17 ms). I am confused as to why this is happening and how it can be possible.
Does naive really work that fast, or is the JVM producing garbage timings, or am I placing my timing calls in the wrong places?
Any help is much appreciated.
There are a number of factors that might be contributing to this. Here's a few things to think about:
A four-character search string is pretty short - in fact, that's so small that a naive search would likely be extremely fast. The reason that KMP and Rabin-Karp are considered "fast" string searching algorithms is that they scan each character of the input strings, on average, at most a constant number of times. If you have a four-character string, you'll also scan each character of the input at most a constant number of times as well, and it's a low constant (4). So this may simply be the constant factor terms from the KMP and Rabin-Karp outweighing the cost of a naive search. (This is similar to why many sorting algorithms switch to insertion sort for small input sizes - while insertion sort is worse for large sequences, it's much, much faster than the "fast" sorting algorithms on small inputs.) You may want to mix things up a bit and try different string lengths, nonrandom inputs, etc.
With a four-character sequence drawn from a genome there are 4^4 = 256 possible random strings to search for. Therefore, on expectation you'll find that sequence after reading about 256 four-character windows of the string being searched. That means you'd need, on expectation, about 1024 characters to be read in order to find your string, so the fact that the genomes are 100,000 characters long is likely irrelevant. You are probably dealing with "effective" string lengths closer to 1000, and, as in point (1), with smaller inputs the benefits of KMP and Rabin-Karp diminish relative to their increased constant factors.
This could also be an artifact of how the code is implemented. Which versions of KMP and Rabin-Karp are you using? String.indexOf is likely heavily optimized by the JVM implementers; are you similarly using highly-optimized versions of KMP and Rabin-Karp?
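To make the constant-factor point concrete, here is a minimal naive search, a sketch rather than the code that was actually benchmarked: with a 4-character pattern the inner loop does at most four character comparisons per position and needs no preprocessing at all.

// Naive substring search: O(n * m) worst case, but for m = 4 the per-position cost
// is a handful of comparisons with zero preprocessing or extra memory.
static int naiveSearch(String text, String pattern) {
    int n = text.length(), m = pattern.length();
    for (int i = 0; i + m <= n; i++) {
        int j = 0;
        // For m = 4 this loop runs at most 4 times per starting position.
        while (j < m && text.charAt(i + j) == pattern.charAt(j)) {
            j++;
        }
        if (j == m) {
            return i;    // match found at index i
        }
    }
    return -1;           // no match
}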
I need to check whether X is divisible by Y or not; I don't need the actual remainder in the other cases.
I'm using the "mod" operator.
if (X.mod(Y).equals(BigInteger.ZERO)) {
    // do something
}
To repeat: I'm interested only in whether X is divisible by Y, not in the actual remainder.
What I'm after is a faster way to check divisibility when the dividend is fixed - more precisely, to check a large number (a potential prime) against many smaller primes before going on to the Lucas-Lehmer test.
I was just wondering: can we make some look-ahead assumption based on the last one or two digits of X and Y and decide whether to do the mod at all (i.e. skip it when there is no chance of getting zero)?
Java BigIntegers (like most numbers in computers since about 1980) are binary, so the only moduli that can be checked by looking at the last 'digits' (binary digits = bits) are powers of 2, and the only power of 2 that is prime is 2^1. BigInteger.testBit(0) tests that directly. However, most software that generates large should-be primes is for cryptography (like RSA, Rabin, DH, DSA) and ensures never to test an even candidate in the first place; see e.g. FIPS 186-4 A.1.1.2 (or earlier).
Since your actual goal is not as stated in the title, but to test that one (large) integer is not divisible by any of several small primes, the mathematically fastest way is to form their product -- in general any common multiple, preferably the least, but for distinct primes the product is the LCM -- and compute its GCD with the candidate using Euclid's algorithm. If the GCD is 1, no prime in the product divides the candidate. This requires several BigInteger divideAndRemainder operations, but it handles all of your tests in one fwoop.
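A minimal sketch of that GCD trick (the prime list and method name are illustrative):

import java.math.BigInteger;

static final int[] SMALL_PRIMES = {3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41};

// Returns true if none of the small primes divides the candidate.
static boolean coprimeToSmallPrimes(BigInteger candidate) {
    BigInteger product = BigInteger.ONE;
    for (int p : SMALL_PRIMES) {
        product = product.multiply(BigInteger.valueOf(p));
    }
    // gcd == 1 means the candidate shares no factor with the product,
    // so no prime in the product divides it.
    return candidate.gcd(product).equals(BigInteger.ONE);
}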
A middle way is to 'bunch' several small primes whose product is less than 2^31 or 2^63, take BigInteger.mod (or .remainder) by that product as .intValue() or .longValue() respectively, and test that value (if nonzero) for divisibility by each of the small primes using int or long operations, which are much faster than the BigInteger ones. Repeat for several bunches if needed. BigInteger.probablePrime and related routines do exactly this (primes 3..41 against a long) for candidates up to 95 bits, above which they consider an Eratosthenes-style sieve more efficient. (In either case followed by Miller-Rabin and Lucas-Lehmer.)
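And a sketch of that 'bunching' idea; the primes 3..41 multiply to 152125131763605, comfortably inside a long (names are again illustrative):

import java.math.BigInteger;

static final int[] BUNCH = {3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41};

// One BigInteger mod by the product of the bunch, then cheap long arithmetic per prime.
static boolean hasFactorInBunch(BigInteger candidate) {
    long product = 1L;
    for (int p : BUNCH) {
        product *= p;                       // 3 * 5 * ... * 41 fits easily in a long
    }
    long r = candidate.mod(BigInteger.valueOf(product)).longValue();
    for (int p : BUNCH) {
        if (r % p == 0) {
            return true;                    // divisible by p (assuming candidate > 41)
        }
    }
    return false;
}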
When measuring things like this in Java, remember that if you execute some method 'a lot', where the exact definition of 'a lot' can vary and be hard to pin down, all common JVMs will JIT-compile the code, changing the performance radically. If you are doing it a lot be sure to measure the compiled performance, and if you aren't doing it a lot the performance usually doesn't matter. There are many existing questions here on SO about pitfalls in 'microbenchmark(s)' for Java.
There are algorithms to check divisibility, but it's plural and each algorithm covers a particular group of numbers, e.g. divisible by 3, divisible by 4, etc. A list of some algorithms can be found e.g. at Wikipedia. There is no general, high-performance algorithm that works for any given divisor; otherwise, whoever found it would be famous and every divisible-by implementation out there would use it.
I'm implementing the Sieve of Eratosthenes; for an explanation of this, see http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes. However, I would like to adapt it to generate the first M primes, not the primes up to N. My method of doing this is simply to choose an N large enough that all M primes are contained in that range. Does anyone have any good heuristics for modeling the growth of primes? In case you wish to post code snippets, I am implementing this in Java and C++.
To generate M primes, you need to go up to about M log M. See Approximations for the nth prime number in this Wikipedia article about the Prime Number Theorem. To be on the safe side, you might want to overestimate -- say N = M (log M + 1).
Edited to add: As David Hammen points out, this overestimate is not always good enough. The Wikipedia article gives M (log M + log log M) as a safe upper bound for M >= 6.
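A sketch of turning that bound into a sieve size, assuming the M >= 6 estimate quoted above (the method name and the small-case fallback are illustrative):

// Upper bound on the M-th prime for M >= 6: M * (ln M + ln ln M).
static int sieveLimitFor(int m) {
    if (m < 6) {
        return 15;                          // the 5th prime is 11, so 15 covers the small cases
    }
    double logM = Math.log(m);
    return (int) Math.ceil(m * (logM + Math.log(logM)));
}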
This approximation of the nth prime is taken from Wikipedia; you'll therefore need to allocate an array of size m*log(m) + m*log(log(m)); an array of size m*log(m) will be insufficient.
Another alternative is a segmented sieve. Sieve the numbers to a million. Then the second million. Then the third. And so on. Stop when you have enough.
It's not hard to reset the sieve for the next segment. See my blog for details.
Why not grow the sieve dynamically? Whenever you need more primes, re-allocate the sieve memory and run the sieve algorithm, just over the new space, using the primes that you have previously found.
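A simplified sketch of that idea; for brevity it re-sieves the whole enlarged range from scratch rather than sieving only the new space with the previously found primes, as the answer above suggests:

import java.util.ArrayList;
import java.util.List;

// Plain Sieve of Eratosthenes up to a given limit.
static List<Integer> primesUpTo(int limit) {
    boolean[] composite = new boolean[limit + 1];
    List<Integer> primes = new ArrayList<>();
    for (int i = 2; i <= limit; i++) {
        if (!composite[i]) {
            primes.add(i);
            for (long j = (long) i * i; j <= limit; j += i) {
                composite[(int) j] = true;    // cross off multiples of i
            }
        }
    }
    return primes;
}

static List<Integer> firstMPrimes(int m) {
    int limit = 32;                           // arbitrary starting bound
    List<Integer> primes = primesUpTo(limit);
    while (primes.size() < m) {
        limit *= 2;                           // not enough primes yet: double the bound and re-sieve
        primes = primesUpTo(limit);
    }
    return primes.subList(0, m);
}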
Lazy evaluation comes to mind (Haskell and other functional languages do this for you). Although you are writing in an imperative language, I think you can apply the concept.
Consider the operation of removing the remaining composite numbers from the candidate set. Without actually touching the real algorithm (and, more importantly, without guessing how many numbers you will create), execute this operation in a lazy manner (which you will have to implement yourself, because you're in an imperative language), when and if you try to take the smallest remaining number.
What complexity are the methods multiply, divide and pow in BigInteger currently? There is no mention of the computational complexity in the documentation (nor anywhere else).
If you look at the code for BigInteger (provided with the JDK), it appears to me that multiply(..) is O(n^2) (the actual work is done in multiplyToLen(..)). The code for the other methods is a bit more complex, but you can see for yourself.
Note: this is for Java 6. I assume it won't differ in Java 7.
As noted in the comments on @Bozho's answer, Java 8 and onwards use more efficient algorithms to implement multiplication and division than the naive O(N^2) algorithms in Java 7 and earlier.
Java 8 multiplication adaptively uses either the naive O(N^2) long multiplication algorithm, the Karatsuba algorithm or the 3-way Toom-Cook algorithm, depending on the sizes of the numbers being multiplied. The latter two are (respectively) O(N^1.58) and O(N^1.46).
Java 8 division adaptively uses either Knuth's O(N^2) long division algorithm or the Burnikel-Ziegler algorithm. (According to the research paper, the latter is 2K(N) + O(N log N) for a division of a 2N-digit number by an N-digit number, where K(N) is the Karatsuba multiplication time for two N-digit numbers.)
Likewise some other operations have been optimized.
There is no mention of the computational complexity in the documentation (nor anywhere else).
Some details of the complexity are mentioned in the Java 8 source code. The reason that the javadocs do not mention complexity is that it is implementation specific, both in theory and in practice. (As illustrated by the fact that the complexity of some operations is significantly different between Java 7 and 8.)
There is a new "better" BigInteger class that is not being used by the Sun JDK, out of conservatism and for lack of useful regression tests (huge data sets). The guy who wrote the better algorithms may have discussed the old BigInteger in the comments.
Here you go http://futureboy.us/temp/BigInteger.java
Measure it. Do operations with linearly increasing operands and draw the times on a diagram.
Don't forget to warm up the JVM (several runs) to get valid benchmark results.
Whether the operations are linear O(n), quadratic O(n^2), polynomial or exponential should then be obvious.
EDIT: While you can give algorithms theoretical bounds, they may not be that useful in practice. First of all, the complexity does not give you the constant factor. Some linear or subquadratic algorithms are simply not useful because they eat up so much time and resources that they are not adequate for the problem at hand (e.g. Coppersmith-Winograd matrix multiplication).
Then your computation may have quirks you can only detect by experiment. There are preparatory algorithms which do nothing to solve the problem itself but speed up the real solver (matrix conditioning). There are suboptimal implementations. With longer lengths, your speed may drop dramatically (cache misses, memory shuffling etc.). So for practical purposes, I advise doing experiments.
The best thing is to double the length of the input each time and compare the times.
And yes, you can find out whether an algorithm has n^1.5 or n^1.8 complexity. Simply quadruple the input length: an n^1.5 algorithm then needs only half the time an n^2 algorithm would (a factor of 8 instead of 16). Multiply the length by 256 and an n^1.8 algorithm needs only about a third of the time an n^2 algorithm would.
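A rough sketch of that doubling experiment for BigInteger.multiply; the bit lengths and the crude warm-up loop are arbitrary choices, and a proper harness such as JMH would give more trustworthy numbers:

import java.math.BigInteger;
import java.util.Random;

public class MultiplyScaling {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int bits = 1 << 14; bits <= 1 << 20; bits <<= 1) {
            BigInteger a = new BigInteger(bits, rnd);
            BigInteger b = new BigInteger(bits, rnd);
            for (int i = 0; i < 20; i++) {
                a.multiply(b);                          // crude warm-up so the JIT kicks in
            }
            long start = System.nanoTime();
            BigInteger c = a.multiply(b);
            long elapsed = System.nanoTime() - start;
            // For an O(n^2) algorithm the time should roughly quadruple per doubling;
            // Karatsuba (~n^1.58) should roughly triple.
            System.out.printf("%8d bits: %12d ns (%d-bit result)%n", bits, elapsed, c.bitLength());
        }
    }
}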