Recently I have been working on DNA sequence matching algorithms and comparing them. I have implemented the standard Naive, KMP, and Rabin-Karp algorithms for this purpose.
After running them in Java (8 GB RAM, Intel i5 processor, 1 GB hard disk), I noted that the naive algorithm runs faster than KMP and RK.
But I was astonished to find that for DNA sequences of up to 100,000 characters and a pattern of 4 characters, naive (6 ms) still outperforms KMP (11 ms) and RK (17 ms). I am confused as to why this is happening and how it can be possible.
Is the naive algorithm really that fast, is the JVM giving me garbage timing values, or am I placing my Java timing calls in the wrong places?
Any help is much appreciated.
There are a number of factors that might be contributing to this. Here are a few things to think about:
A four-character search string is pretty short - in fact, that's so small that a naive search would likely be extremely fast. The reason that KMP and Rabin-Karp are considered "fast" string searching algorithms is that they scan each character of the input, on average, at most a constant number of times. With a four-character pattern, the naive search also scans each character of the input at most a constant number of times, and it's a low constant (4). So this may simply be the constant-factor overhead of KMP and Rabin-Karp outweighing the cost of a naive search. (This is similar to why many sorting algorithms switch to insertion sort for small input sizes - while insertion sort is worse for large sequences, it's much, much faster than the "fast" sorting algorithms on small inputs.) You may want to mix things up a bit and try different string lengths, nonrandom inputs, etc.
With a four-character sequence drawn from a genome there are 4^4 = 256 possible combinations of random strings to search for. Therefore, on expectation you'll find that sequence after reading at most 256 four-character blocks from the string being searched. That means that you'd need, on expectation, at most 1024 characters to be read in order to find your string, so the fact that the genomes are 100,000 characters long is likely irrelevant. You probably are more accurately dealing with "effective" string lengths closer to 1000, and, like in part (1), if you have smaller inputs to the algorithms the benefits of KMP and Rabin-Karp diminish relative to their increased constant factors.
This could also be an artifact of how the code is implemented. Which versions of KMP and Rabin-Karp are you using? String.indexOf is likely heavily optimized by the JVM implementers; are you similarly using highly-optimized versions of KMP and Rabin-Karp?
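If part of the concern is where the timing calls go, here is a minimal, hypothetical benchmark harness; naiveSearch is just a placeholder for your own implementation (swap in your KMP and Rabin-Karp the same way), and the main point is to warm up the JIT before taking any measurements:

import java.util.Random;
import java.util.concurrent.TimeUnit;

public class SearchBenchmark {
    // Placeholder naive search: index of the first occurrence, or -1 if absent.
    static int naiveSearch(String text, String pattern) {
        outer:
        for (int i = 0; i + pattern.length() <= text.length(); i++) {
            for (int j = 0; j < pattern.length(); j++) {
                if (text.charAt(i + j) != pattern.charAt(j)) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) sb.append("ACGT".charAt(rnd.nextInt(4)));
        String genome = sb.toString();
        String pattern = "ACGT";

        // Warm up the JIT before timing anything.
        for (int i = 0; i < 10_000; i++) naiveSearch(genome, pattern);

        long start = System.nanoTime();
        int pos = naiveSearch(genome, pattern);   // time your KMP / Rabin-Karp the same way
        long elapsed = System.nanoTime() - start;
        System.out.printf("found at %d in %d us%n", pos, TimeUnit.NANOSECONDS.toMicros(elapsed));
    }
}

Averaging many repetitions (or using a harness such as JMH) will give far more stable numbers than a single timed call.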
The method BigInteger.isProbablePrime() is quite strange; from the documentation, this will tell whether a number is prime with a probability of 1 - 1 / 2^arg, where arg is the integer argument.
It has been present in the JDK for quite a long time, so it must have its uses. My limited knowledge of computer science and algorithms (and maths) tells me that it does not really make sense to know whether a number is "probably" prime rather than exactly prime.
So, what is a possible scenario where one would want to use this method? Cryptography?
Yes, this method can be used in cryptography. RSA encryption involves the finding of huge prime numbers, sometimes on the order of 1024 bits (about 300 digits). The security of RSA depends on the fact that factoring a number that consists of 2 of these prime numbers multiplied together is extremely difficult and time consuming. But for it to work, they must be prime.
It turns out that proving these numbers prime is difficult too. But the Miller-Rabin primality test, one of the primality tests used by isProbablePrime, either detects that a number is composite or gives no conclusion. Running this test n times (and having it pass every time) lets you conclude that there is at most a 1 in 2^n chance that the number is actually composite. Running it 100 times yields the acceptable risk of 1 in 2^100 that the number is composite.
If the test tells you an integer is not prime, you can certainly believe that 100%.
It is only the other side of the question, if the test tells you an integer is "a probable prime", that you may entertain doubt. Repeating the test with varying "bases" allows the probability of falsely succeeding at "imitating" a prime (being a strong pseudo-prime with respect to multiple bases) to be made as small as desired.
The usefulness of the test lies in its speed and simplicity. One would not necessarily be satisfied with the status of "probable prime" as a final answer, but one would definitely avoid wasting time on almost all composite numbers by using this routine before bringing in the big guns of primality testing.
The comparison to the difficulty of factoring integers is something of a red herring. It is known that the primality of an integer can be determined in polynomial time, and indeed there is a proof that an extension of the Miller-Rabin test to sufficiently many bases is definitive (in detecting primes, as opposed to probable primes), but this assumes the Generalized Riemann Hypothesis, so it is not quite so certain as the (more expensive) AKS primality test.
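As a rough sketch of the "filter before bringing in the big guns" idea above: the cheap probabilistic test weeds out almost all composites, and only the rare survivors reach an exact test (plain trial division stands in for the exact test here, purely for illustration):

import java.math.BigInteger;

public class PrimeFilter {
    static final BigInteger TWO = BigInteger.valueOf(2);

    static boolean isPrime(BigInteger n) {
        if (n.compareTo(TWO) < 0) return false;
        // Miller-Rabin-based filter: rejects almost every composite very quickly.
        if (!n.isProbablePrime(50)) return false;
        // Only the rare survivors reach the expensive exact test.
        for (BigInteger d = TWO; d.multiply(d).compareTo(n) <= 0; d = d.add(BigInteger.ONE)) {
            if (n.mod(d).signum() == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrime(new BigInteger("104729"))); // the 10000th prime -> true
    }
}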
The standard use case for BigInteger.isProbablePrime(int) is in cryptography. Specifically, certain cryptographic algorithms, such as RSA, require randomly chosen large primes. Importantly, however, these algorithms don't really require these numbers to be guaranteed to be prime — they just need to be prime with a very high probability.
How high is very high? Well, in a crypto application, one would typically call .isProbablePrime() with an argument somewhere between 128 and 256. Thus, the probability of a non-prime number passing such a test is less than one in 2^128 or 2^256.
Let's put that in perspective: if you had 10 billion computers, each generating 10 billion probable prime numbers per second (which would mean less than one clock cycle per number on any modern CPU), and the primality of those numbers was tested with .isProbablePrime(128), you would, on average, expect one non-prime number to slip in once in every 100 billion years.
That is, that would be the case, if those 10 billion computers could somehow all run for hundreds of billions of years without experiencing any hardware failures. In practice, though, it's a lot more likely for a random cosmic ray to strike your computer at just the right time and place to flip the return value of .isProbablePrime(128) from false to true, without causing any other detectable effects, than it is for a non-prime number to actually pass the probabilistic primality test at that certainty level.
Of course, the same risk of random cosmic rays and other hardware faults also applies to deterministic primality tests like AKS. Thus, in practice, even these tests have a (very small) baseline false positive rate due to random hardware failures (not to mention all other possible sources of errors, such as implementation bugs).
Since it's easy to push the intrinsic false positive rate of the Miller–Rabin primality test used by .isProbablePrime() far below this baseline rate, simply by repeating the test sufficiently many times, and since, even repeated so many times, the Miller–Rabin test is still much faster in practice than the best known deterministic primality tests like AKS, it remains the standard primality test for cryptographic applications.
(Besides, even if you happened to accidentally select a strong pseudoprime as one of the factors of your RSA modulus, it would not generally lead to a catastrophic failure. Typically, such pseudoprimes would be products of two (or rarely more) primes of approximately half the length, which means that you'd end up with a multi-prime RSA key. As long as none of the factors were too small (and if they were, the primality test should've caught them), the RSA algorithm will still work just fine, and the key, although somewhat weaker against certain types of attacks than normal RSA keys of the same length, should still be reasonably secure if you didn't needlessly skimp on the key length.)
A possible use case is in testing the primality of a given number (a test which in itself has many uses). The isProbablePrime algorithm will run much faster than an exact algorithm, so if the number fails isProbablePrime, then one need not go to the expense of running the more expensive exact algorithm.
Finding probable primes is an important problem in cryptography. It turns out that a reasonable strategy for finding a probable k-bit prime is to repeatedly select a random k-bit number, and test it for probable primality using a method like isProbablePrime().
For further discussion, see section 4.4.1 of the Handbook of Applied Cryptography.
Also see On generation of probable primes by incremental search by Brandt and Damgård.
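A minimal sketch of that strategy; the bit length, the certainty parameter 100, and the way the candidate is forced to be a full-length odd number are arbitrary illustrative choices:

import java.math.BigInteger;
import java.security.SecureRandom;

public class PrimeSearch {
    // Repeatedly draw random k-bit odd numbers until one passes isProbablePrime.
    static BigInteger randomProbablePrime(int bits, SecureRandom rnd) {
        while (true) {
            BigInteger candidate = new BigInteger(bits, rnd).setBit(bits - 1).setBit(0);
            if (candidate.isProbablePrime(100)) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(randomProbablePrime(512, new SecureRandom()));
    }
}

(The JDK also ships BigInteger.probablePrime(bitLength, rnd), which performs essentially this search for you.)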
Algorithms such as RSA key generation rely on being able to determine whether a number is prime or not.
However, at the time that the isProbablePrime method was added to the JDK (February 1997), there was no proven way to deterministically decide whether a number was prime in a reasonable amount of time. The best known approach at that time was the Miller-Rabin algorithm - a probabilistic algorithm that would sometimes give false positives (i.e., would report non-primes as primes), but could be tuned to reduce the likelihood of false positives, at the expense of modest increases in runtime.
Since then, algorithms have been discovered that can deterministically decide whether a number is prime reasonably quickly, such as the AKS algorithm that was discovered in August 2002. However, it should be noted that these algorithms are still not as fast as Miller-Rabin.
Perhaps a better question is why no isPrime method has been added to the JDK since 2002.
I am caching a list of Long indexes in my Java program and it is causing memory to overflow.
So I decided to cache only the start and end indexes of each continuous run of indexes and rewrite the required ArrayList APIs. Now, what data structure would be best for implementing the start-end index cache? Is it better to go for a TreeMap and keep the start index as key and the end index as value?
If I were you, I would use some variation of bit string storage.
In Java bit strings are implemented by BitSet.
For example, to represent an arbitrary set of unique 32-bit integers, you could store it as a single bit string 4 billion bits long, which will take 4 billion / 8 bits = 512 MB of memory. This is a lot, but it is the worst possible case.
But you can be a lot smarter than that. For example, you could store it as a list or binary tree of smaller fixed- (or dynamically-) sized bit strings, say 65,536 bits (8 KB) or less each. In other words, each leaf object in this tree will have a small header representing its start offset and length (probably a power of 2 for simplicity, but it does not have to be), plus a bit string storing the actual array members. For efficiency, you could optionally compress these bit strings using gzip or a similar algorithm - it will make access slower, but could improve memory efficiency by a factor of 10 or more.
If your 20 million index elements are almost consecutive (not very sparse), it should take only around 20 million bits / 8 ≈ 2.5 MB to represent them in memory. If you gzip that, it will probably be under 1 MB overall.
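For indexes that fit in an int, a rough sketch with java.util.BitSet; note that BitSet is int-indexed, so indexes beyond Integer.MAX_VALUE would need the chunked scheme described above, and the numbers below are only illustrative:

import java.util.BitSet;

public class IndexCache {
    public static void main(String[] args) {
        // One bit per possible index; memory grows with the highest index set.
        BitSet indexes = new BitSet();

        // Mark a continuous run of indexes, e.g. 1,000,000 .. 20,999,999.
        indexes.set(1_000_000, 21_000_000);

        System.out.println(indexes.get(5_000_000));               // true
        System.out.println(indexes.get(42));                      // false
        System.out.println("bits set: " + indexes.cardinality()); // 20,000,000
        System.out.println("first index: " + indexes.nextSetBit(0));
    }
}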
The most compact representation will depend greatly on the distribution of indices in your specific application.
If your indices are densely clustered, the range-based representation suggested by mvp will probably work well (you might look at implementations of run-length encoding for raster graphics, since they're similar problems).
If your indices aren't clustered in dense runs, that encoding will actually increase memory consumption. For sparsely-populated lists, you might look into primitive data structures such as LongArrayList or LongOpenHashSet in FastUtil, or similar structures in Gnu Trove or Colt. In most VMs, each Long object in your ArrayList consumes 20+ bytes, whereas a primitive long consumes only 8. So you can often realize a significant memory savings with type-specific primitive collections instead of the standard Collections framework.
I've been very pleased with FastUtil, but you might find another solution suits you better. A little simulation and memory profiling should help you determine the most effective representation for your own data.
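For illustration, a minimal sketch with fastutil's LongOpenHashSet, assuming the fastutil jar is on the classpath:

import it.unimi.dsi.fastutil.longs.LongOpenHashSet;

public class SparseIndexes {
    public static void main(String[] args) {
        // Stores primitive longs directly: roughly 8 bytes per element plus table
        // overhead, instead of a separately boxed Long object per entry.
        LongOpenHashSet indexes = new LongOpenHashSet();
        indexes.add(3_000_000_000L);
        indexes.add(7L);
        System.out.println(indexes.contains(7L));   // true
        System.out.println(indexes.contains(8L));   // false
        System.out.println("size: " + indexes.size());
    }
}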
Most BitSet (compressed or uncompressed) implementations are for integers. Here's one for longs: http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset which works like an ordered primitive long hash set or long to long hash map.
I have read about the time complexity of modular arithmetic in many books. There is something I don't understand.
I read the following in some books:
For any a mod N, a has a multiplicative inverse modulo N if
and only if it is relatively prime to N. When this inverse exists, it can be found in time O(n^3) (where as usual n denotes the number of bits of N) by running the extended Euclid algorithm.
My question revolves around the *extended Euclid algorithm* and its *O(n^3)* bound.
When I write the following line in a Java program (in NetBeans), or the equivalent in C# or C++:
A = B.modInverse(N) // Java syntax
can I generally say that this line has time complexity O(n^3), or do I need to implement the steps of the extended Euclid algorithm myself to get that bound?
Unless the documentation of the modInverse method makes an explicit guarantee about its time complexity, you generally can't make any assumptions about its running time. The implementation could be completely different depending on the runtime/library or even the version of the runtime.
If you have access to the source code, you can verify which algorithm is used. You can also run your own benchmarks for different input sizes and you'll get a pretty good picture about the asymptotic behavior of the concrete implementation.
That said, it's highly probable that popular libraries for arbitrary-precision arithmetic use the best known algorithms for basic operations like modInverse.
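If you want to be certain about the algorithm rather than relying on whatever the library happens to do, here is a sketch of the extended Euclidean algorithm itself, with the JDK's modInverse shown alongside for comparison:

import java.math.BigInteger;

public class ModInverse {
    // Returns x with (a * x) mod n == 1, or throws if gcd(a, n) != 1.
    static BigInteger modInverse(BigInteger a, BigInteger n) {
        BigInteger r0 = n, r1 = a.mod(n);
        BigInteger t0 = BigInteger.ZERO, t1 = BigInteger.ONE;
        while (r1.signum() != 0) {
            BigInteger q = r0.divide(r1);
            BigInteger r = r0.subtract(q.multiply(r1)); r0 = r1; r1 = r;
            BigInteger t = t0.subtract(q.multiply(t1)); t0 = t1; t1 = t;
        }
        if (!r0.equals(BigInteger.ONE)) throw new ArithmeticException("not invertible");
        return t0.mod(n);
    }

    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(97), a = BigInteger.valueOf(35);
        System.out.println(modInverse(a, n)); // 61, since 35 * 61 mod 97 == 1
        System.out.println(a.modInverse(n));  // same result from the JDK
    }
}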
I am working on probabilistic models, and when doing inference on those models, the estimated probabilities can become very small. In order to avoid underflow, I am currently working in the log domain (I store the log of the probabilities). Multiplying probabilities is equivalent to an addition, and summing is done by using the formula:
log(exp(a) + exp(b)) = log(exp(a - m) + exp(b - m)) + m
where m = max(a, b).
I use some very large matrices, and I have to take the element-wise exponential of those matrices to compute matrix-vector multiplications. This step is quite expensive, and I was wondering if there exist other methods to deal with underflow, when working with probabilities.
Edit: for efficiency reasons, I am looking for a solution using primitive types and not objects storing arbitrary-precision representation of real numbers.
Edit 2: I am looking for a faster solution than the log domain trick, not a more accurate solution. I am happy with the accuracy I currently get, but I need a faster method. Particularly, summations happen during matrix-vector multiplications, and I would like to be able to use efficient BLAS methods.
Solution: after a discussion with Jonathan Dursi, I decided to divide each matrix and vector by its largest element, and to store that factor in the log domain. Multiplications are straightforward. Before additions, I have to rescale one of the added matrices/vectors by the ratio of the two factors. I update the factor every ten operations.
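For reference, a rough sketch of that scale-factor idea; the class and method names are made up for illustration, and real code would renormalize only every ten operations or so rather than after every multiplication:

public class ScaledVector {
    // Real value of entry i is values[i] * exp(logScale).
    double[] values;
    double logScale;

    ScaledVector(double[] values, double logScale) {
        this.values = values;
        this.logScale = logScale;
    }

    // Divide all entries by the current maximum and absorb it into logScale.
    void renormalize() {
        double max = 0;
        for (double v : values) max = Math.max(max, v);
        if (max > 0) {
            for (int i = 0; i < values.length; i++) values[i] /= max;
            logScale += Math.log(max);
        }
    }

    // y = M * x with plain (BLAS-friendly) arithmetic; the log-scale factors just add.
    static ScaledVector multiply(double[][] m, double mLogScale, ScaledVector x) {
        double[] y = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            double s = 0;
            for (int j = 0; j < x.values.length; j++) s += m[i][j] * x.values[j];
            y[i] = s;
        }
        ScaledVector result = new ScaledVector(y, mLogScale + x.logScale);
        result.renormalize();
        return result;
    }

    public static void main(String[] args) {
        // Entries stay near 1; the tiny true magnitudes live in the log-scale factors.
        double[][] m = {{0.5, 0.2}, {0.1, 0.9}};
        ScaledVector x = new ScaledVector(new double[]{0.3, 0.7}, -300.0);
        ScaledVector y = multiply(m, -400.0, x); // true values ~ y.values[i] * exp(-700)
        System.out.println(java.util.Arrays.toString(y.values) + " * exp(" + y.logScale + ")");
    }
}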
This issue has come up recently on the Computational Science Stack Exchange site as well, and although the immediate worry there was overflow, the issues are more or less the same.
Transforming into log space is certainly one reasonable approach. Whatever space you're in, to do a large number of sums correctly, there are a couple of methods you can use to improve the accuracy of your summations. Compensated summation approaches, most famously Kahan summation, keep both a sum and what's effectively a "remainder"; this gives you some of the advantages of using higher-precision arithmetic without all of the cost (and only using primitive types). The remainder term also gives you some indication of how well you're doing.
In addition to improving the actual mechanics of your addition, changing the order in which you add your terms can make a big difference. Sorting your terms so that you're summing from smallest to largest can help, since then you're less often adding terms of very different magnitude (which can cause significant roundoff problems); in some cases, doing log2(N) levels of repeated pairwise sums can also be an improvement over just doing the straight linear sum, depending on what your terms look like.
The usefulness of all these approaches depends a lot on the properties of your data. The arbitrary-precision math libraries, while enormously expensive in compute time (and possibly memory), have the advantage of being a fairly general solution.
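A minimal sketch of Kahan (compensated) summation in Java, since it needs nothing beyond primitive doubles:

public class KahanDemo {
    // Sums doubles with Kahan compensation to reduce roundoff error.
    static double kahanSum(double[] xs) {
        double sum = 0.0, c = 0.0;      // c carries the lost low-order bits
        for (double x : xs) {
            double y = x - c;           // correct the next term by the carried error
            double t = sum + y;         // low-order bits of y may be lost here...
            c = (t - sum) - y;          // ...and are recovered into c
            sum = t;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] xs = new double[1_000_000];
        java.util.Arrays.fill(xs, 0.1);
        System.out.println(kahanSum(xs)); // noticeably closer to 100000 than a naive loop
    }
}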
I ran into a similar problem years ago. The solution was to develop an approximation of log(1+exp(-x)). The range of the approximation does not need to be all that large (x from 0 to 40 will more than suffice), and at least in my case the accuracy didn't need to be particularly high, either.
In your case, it looks like you need to compute log(1+exp(-x1)+exp(-x2)+...). Throw out those large negative values. For example, suppose a, b, and c are three log probabilities, with 0>a>b>c. You can ignore c if a-c>38. It's not going to contribute to your joint log probability at all, at least not if you are working with doubles.
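A sketch combining that cutoff with the usual max trick, over an array of log probabilities; the threshold of 38 is the one suggested above for doubles:

public class LogSum {
    // log(sum_i exp(a[i])) computed stably, skipping terms too small to matter.
    static double logSumExp(double[] a) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : a) max = Math.max(max, v);
        if (max == Double.NEGATIVE_INFINITY) return max; // every term is log(0)
        double sum = 0.0;
        for (double v : a) {
            double d = max - v;
            if (d < 38) sum += Math.exp(-d); // terms with d >= 38 cannot affect a double
        }
        return max + Math.log(sum);
    }

    public static void main(String[] args) {
        System.out.println(logSumExp(new double[]{-1000, -1001, -1100}));
        // about -999.69, even though exp(-1000) underflows to 0 if computed directly
    }
}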
Option 1: Commons Math - The Apache Commons Mathematics Library
Commons Math is a library of lightweight, self-contained mathematics and statistics components addressing the most common problems not available in the Java programming language or Commons Lang.
Note: The API protects the constructors to force a factory pattern while naming the factory DfpField (rather than the somewhat more intuitive DfpFac or DfpFactory). So you have to use
new DfpField(numberOfDigits).newDfp(myNormalNumber)
to instantiate a Dfp, then you can call .multiply or whatever on this. I thought I'd mention this because it's a bit confusing.
Option 2: GNU Scientific Library or Boost C++ Libraries.
In these cases you should use JNI in order to call these native libraries.
Option 3: If you are free to use other programs and/or languages, you could consider using programs/languages for numerical computations such as Octave, Scilab, and similar.
Option 4: BigDecimal of Java.
Rather than storing values in logarithmic form, I think you'd probably be better off using the same concept as doubles, namely, floating-point representation. For example, you might store each value as two longs, one for sign-and-mantissa and one for the exponent. (Real floating-point has a carefully tuned design to support lots of edge cases and avoid wasting a single bit; but you probably don't need to worry so much about any of those, and can focus on designing it in a way that's simple to implement.)
I don't understand why this works, but this formula seems to work and is simpler:
c = a + log(1 + exp(b - a))
where c = log(exp(a) + exp(b)).
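For what it's worth, a direct Java sketch of that formula using Math.log1p (which is more accurate than log(1 + x) when x is tiny); it swaps the arguments first so that the exponent is never positive:

public class LogAdd {
    // Returns log(exp(a) + exp(b)) without computing exp(a) or exp(b) directly.
    static double logAdd(double a, double b) {
        if (b > a) { double t = a; a = b; b = t; }   // ensure a >= b, so exp(b - a) <= 1
        if (a == Double.NEGATIVE_INFINITY) return a; // both terms are log(0)
        return a + Math.log1p(Math.exp(b - a));      // log1p(x) = log(1 + x)
    }

    public static void main(String[] args) {
        System.out.println(logAdd(Math.log(2), Math.log(3))); // ~ log(5) = 1.609...
    }
}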
What complexity are the methods multiply, divide and pow in BigInteger currently? There is no mention of the computational complexity in the documentation (nor anywhere else).
If you look at the code for BigInteger (provided with the JDK), it appears to me that
multiply(..) is O(n^2) (the actual method is multiplyToLen(..)). The code for the other methods is a bit more complex, but you can see for yourself.
Note: this is for Java 6. I assume it won't differ in Java 7.
As noted in the comments on #Bozho's answer, Java 8 and onwards use more efficient algorithms to implement multiplication and division than the naive O(N^2) algorithms in Java 7 and earlier.
Java 8 multiplication adaptively uses either the naive O(N^2) long multiplication algorithm, the Karatsuba algorithm, or the 3-way Toom-Cook algorithm, depending on the sizes of the numbers being multiplied. The latter two are (respectively) O(N^1.58) and O(N^1.46).
Java 8 division adaptively uses either Knuth's O(N^2) long division algorithm or the Burnikel-Ziegler algorithm. (According to the research paper, the latter is 2·K(N) + O(N log N) for a division of a 2N-digit number by an N-digit number, where K(N) is the Karatsuba multiplication time for two N-digit numbers.)
Likewise some other operations have been optimized.
There is no mention of the computational complexity in the documentation (nor anywhere else).
Some details of the complexity are mentioned in the Java 8 source code. The reason that the javadocs do not mention complexity is that it is implementation specific, both in theory and in practice. (As illustrated by the fact that the complexity of some operations is significantly different between Java 7 and 8.)
There is a new "better" BigInteger class that is not being used by the sun jdk for conservateism and lack of useful regression tests (huge data sets). The guy that did the better algorithms might have discussed the old BigInteger in the comments.
Here you go http://futureboy.us/temp/BigInteger.java
Measure it. Do operations with linearly increasing operands and draw the times on a diagram.
Don't forget to warm up the JVM (several runs) to get valid benchmark results.
Whether operations are linear O(n), quadratic O(n^2), polynomial, or exponential should then be obvious.
EDIT: While you can give algorithms theoretical bounds, they may not be that useful in practice. First of all, the complexity does not give the constant factor. Some linear or subquadratic algorithms are simply not useful because they eat so much time and resources that they are not adequate for the problem at hand (e.g. Coppersmith-Winograd matrix multiplication).
Then your computation may have all kinds of quirks you can only detect by experiment. There are preparatory algorithms which do nothing to solve the problem themselves but speed up the real solver (matrix preconditioning). There are suboptimal implementations. With longer input lengths, your speed may drop dramatically (cache misses, memory moving etc.). So for practical purposes, I advise doing experiments.
The best thing is to double the length of the input each time and compare the times.
And yes, you can find out whether an algorithm has n^1.5 or n^1.8 complexity. Simply quadruple the input length: with n^2 the time goes up 16x, with n^1.5 only 8x, i.e. half as much. Distinguishing 1.8 from 2 needs larger inputs; if you multiply the length by 256, an n^1.8 algorithm takes only about a third of the time an n^2 algorithm would.
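A rough harness for that doubling experiment, applied to BigInteger.multiply; the sizes and repetition counts are arbitrary, and the warm-up caveat above still applies:

import java.math.BigInteger;
import java.util.Random;

public class MultiplyTiming {
    public static void main(String[] args) {
        Random rnd = new Random(1);
        long sink = 0; // consume results so the JIT cannot discard the work
        for (int bits = 1 << 14; bits <= 1 << 20; bits <<= 1) {
            BigInteger a = new BigInteger(bits, rnd);
            BigInteger b = new BigInteger(bits, rnd);
            for (int i = 0; i < 10; i++) sink += a.multiply(b).bitLength(); // warm-up
            long start = System.nanoTime();
            for (int i = 0; i < 20; i++) sink += a.multiply(b).bitLength();
            long ns = (System.nanoTime() - start) / 20;
            // Doubling the size should roughly triple the time under Karatsuba (2^1.58 ~ 3),
            // versus quadrupling it under the schoolbook O(n^2) algorithm.
            System.out.printf("%8d bits: %12d ns%n", bits, ns);
        }
        System.out.println(sink);
    }
}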