Confusion Regarding Rolling Hash in Rabin-Karp Algorithm in Java

I was trying to understand the Rabin-Karp algorithm here: http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html.
I have looked through various articles and I now know that the general form of a polynomial hash is C1*A^(k-1) + C2*A^(k-2) + C3*A^(k-3). Looking at the code, I understand how they add and subtract the digits in the string.
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
txtHash = (txtHash*R + txt.charAt(i)) % Q;
Here the program is subtracting the leading digit, multiplying the entire hash and then adding the new digit. However, when I was looking through the function that calculates the hash, it didn't follow the general form of a polynomial hash. It looked like this:
private long hash(String key, int M) {
    long h = 0;
    for (int j = 0; j < M; j++)
        h = (R * h + key.charAt(j)) % Q;
    return h;
}
In this function they are multiplying the hash and the radix and then adding the key.charAt(). I would figure the function would be multiplying the key.charAt() with a base that starts out at R^k-1. Then as the for loop continues, the base would divided by R to provide for the decreasing power in the polynomial. Can someone please explain how this function works and how does it generate a hash in the form that I mentioned above? Thanks!

Suppose the hash function needs to process 3 digits.
It would look like:
{digits[0]*R^2 + digits[1]*R^1 + digits[2]} % Q
= {(digits[0]*R + digits[1])*R + digits[2]} % Q
This nested (Horner) form makes the hash much easier to calculate.
Then, applying it to the Rabin-Karp algorithm, you can see
RM = R^2 % Q; (M = 3)
When you want to move to the next digit to validate, you need to delete the leftmost digit and add the next digit:
txtHash = {[txtHash - R^2 * leftmost_digit (i.e. charAt(i-M))] * R + next_digit (i.e. charAt(i))} % Q
This is the same as
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
txtHash = (txtHash*R + txt.charAt(i)) % Q;
Taking mod Q at every step prevents overflow.
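To see that Horner's loop and the rolling update really compute the same thing, here is a small self-contained sketch; R and Q here are small illustrative values of my choosing, not the constants from the Princeton code:

```java
public class RollingHashDemo {
    static final int R = 256;   // radix
    static final long Q = 997;  // a small prime modulus, for illustration only

    // Horner's method: h = (...((c0*R + c1)*R + c2)...) % Q
    static long hash(String key, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + key.charAt(j)) % Q;
        return h;
    }

    public static void main(String[] args) {
        String txt = "abcd";
        int m = 3;
        long rm = 1;                          // RM = R^(M-1) mod Q
        for (int i = 1; i < m; i++)
            rm = (rm * R) % Q;

        long h = hash(txt, m);                // hash of "abc"
        // slide the window: drop 'a', shift everything up one power, add 'd'
        h = (h + Q - rm * txt.charAt(0) % Q) % Q;
        h = (h * R + txt.charAt(m)) % Q;

        // the rolled hash equals a fresh hash of "bcd"
        System.out.println(h == hash(txt.substring(1), m)); // true
    }
}
```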

Related

Difference in calculating middle index of array?

I have been working through QuickSort, and I came across a scenario where we select the pivot element as the middle one.
http://www.java2novice.com/java-sorting-algorithms/quick-sort/
// calculate pivot number, I am taking pivot as middle index number
int pivot = array[lowerIndex+(higherIndex-lowerIndex)/2];
How different is this from the below way to get the middle index?
int pivot = array[(lowerIndex+higherIndex)/2];
I remember I have seen this many times before. And I am sure I am missing a scenario where this is helpful, perhaps when we get an odd number or something.
I tried a few sample values, but I get the same response both ways.
What am I missing?
Thanks for your response.
It is more likely that
(lowerIndex+higherIndex)/2
overflows rather than
lowerIndex+(higherIndex-lowerIndex)/2.
For example for lowerIndex == higherIndex == Integer.MAX_VALUE / 2 + 1.
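A minimal sketch makes the difference visible by pushing both forms past the halfway point of the int range:

```java
public class MidpointOverflow {
    public static void main(String[] args) {
        int lo = Integer.MAX_VALUE / 2 + 1;
        int hi = lo;
        // Naive form: lo + hi wraps around before the division happens.
        System.out.println((lo + hi) / 2);      // -1073741824 (overflow!)
        // Safe form: hi - lo cannot overflow when lo <= hi.
        System.out.println(lo + (hi - lo) / 2); // 1073741824
    }
}
```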
Edit:
Mathematical proof of equivalence of the expressions
l + (r - l)/2 (in java notation)
= l + round_towards_zero((r - l) / 2) (in math notation)
= round_towards_zero(l + (r - l) / 2) (since l is an integer)
= round_towards_zero((2 * l + r - l) / 2)
= round_towards_zero((r + l) / 2)
= (l + r) / 2 (in java notation)

How to use Euclid's algorithm to find GCF/GCD?

I have created a method that allows me to find the GCF/GCD of two numbers, and although I have a working code, I don't know how or why it works. I understand Euclid's algorithm, but am not sure how the following snippet uses it.
private int gcd(int a, int b)
{
    if (b == 0)
        return a;
    else if (a == 0)
        return b;
    else
        return gcd(b, a % b);
}
I am especially confused about what it is returning, because why are we returning two values? And what does the a % b do? How does this use Euclid's algorithm?
"the greatest common divisor of two numbers does not change if the
larger number is replaced by its difference with the smaller number."
(wikipedia - Euclidean algorithm)
So, modulo:
Modulo returns the remainder of the integer division between two integers. Integer division is division without fractions or floating points. Let's denote integer division as m /\ n.
m /\ n = o;
m % n = p;
o * n + p = m;
As an example,
29 /\ 3 = 9; (3 goes wholly into 29 9 times)
29 % 3 = 2; (the integer division of 29 by 3 leaves a remainder of 2)
9 * 3 + 2 = 29; (3 goes into 29 9 times, with a remainder of 2)
Note that if m is smaller than n (i.e. m < n), then n goes into m 0 times (m /\ n = 0), so the remainder of the integer division will be m (m % n = m, because o * n + p = m and so (0*n) + p = 0 + p = p = m);
So, how does the function work? let's try using it.
1 - gcd(m, n), m < n
So, if we start out gcd(m, n) with an m that is smaller than n, the only thing that happens on the next nested call to gcd is that the order of the arguments changes: gcd(n, m % n) = gcd(n, m);
2 - gcd(n, m), m < n
Okay, so now the first argument is larger than the second.
According to euclid's algorithm, we want to do something to the larger of the two numbers. We want to replace it with the difference between it and the smaller number. We could do m - n a bunch of times. But what m % n does is the exact same as subtracting n from m as many times as possible before doing so would result in a negative number. Doing a subtraction would look like (((m - n) - n) - n) - n) and so on. But if we expand that out, we get:
m - (n * o). Because o * n + p = m, we can see that m - (n * o) = p and p = m % n. So, repeatedly subtracting the smaller from the larger is the same as doing modulo of the larger with the smaller.
In the next step, we call gcd(m, n % m). The second argument may be 0 (if m was a divisor of n). In this case, the function returns m. This is correct because m is a divisor of itself and also, as we've seen, a divisor of n.
Or, the second argument may be smaller than m. That is because the remainder of the integer division of n by m must be smaller than m: if the remainder were at least m, then m could have fit into n one more time, which it didn't; this is an absurd result. Assuming the remainder is not 0, the second argument (let's call it p) is smaller than m.
So, we are now calling gcd(m, p), where p < m.
3 - gcd(m, p), p < m
What happens now? Well, we are in exactly the same place as we were in the previous paragraph. Now we just repeat that step, i.e. we continue to call gcd(a, b) until the smaller of the two numbers passed into gcd(a, b) is a divisor of the larger of the two (meaning that a % b = 0), in which case we simply return the smaller of the two numbers.
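The walkthrough above can be traced on a concrete pair, say gcd(1071, 462):

```java
public class GcdTrace {
    // Same recursive shape as the function in the question.
    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    public static void main(String[] args) {
        // gcd(1071, 462) -> gcd(462, 147) -> gcd(147, 21) -> gcd(21, 0) -> 21
        System.out.println(gcd(1071, 462)); // 21
    }
}
```

Each step replaces the larger number with its remainder modulo the smaller one, exactly as the proof describes.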
1) What does the a % b do?
% is the modulus or remainder operator in Java. It returns the remainder of dividing one number by another. For example, 8 % 3 is 2 because 8 divided by 3 leaves a remainder of 2.
2) The Euclidean algorithm is based on the principle that the greatest common divisor of two numbers does not change if the larger number is replaced by its difference with the smaller number. Your gcd function actually uses a more efficient version of the Euclidean algorithm: instead of replacing the larger of the two numbers by its difference, it replaces it by its remainder when divided by the smaller of the two (with this version, the algorithm stops when reaching a zero remainder). The efficiency of this version was analyzed by Gabriel Lamé in 1844 (https://en.wikipedia.org/wiki/Euclidean_algorithm).
3) Your gcd function is not returning two values; it's a recursive function, that is, a function which calls itself. In the case of your gcd function, it repeats until one of the two parameters becomes zero, and the gcd is the remaining parameter.
You could learn more about recursive function at this link.
http://pages.cs.wisc.edu/~calvin/cs110/RECURSION.html
Given that your question has a few components, I’ll discuss the evolution of Euclid’s classical algorithm into the recursive method you presented. Please note that the methods presented here assume that a >= b
The method below most likely implements the algorithm that you are familiar with, which repeatedly subtracts b (the smaller number) from a (the larger number) until a is no longer larger than or equal to b. If a == 0, there is no remainder, giving b as the GCD. Otherwise, the values of a and b are swapped and the repeated subtraction continues.
public int classic_gcd(int a, int b) {
    while (true) {
        while (a >= b)
            a = a - b;
        if (a == 0)
            return b;
        int c = b;
        b = a;
        a = c;
    }
}
Since the inner while loop essentially calculates the remainder of a divided by b, it can be replaced with the modulus operator. This greatly improves the convergence rate of the algorithm, replacing a potentially large number of subtractions with a single modulus operation. Consider finding the GCD of 12,288 and 6, which would require over 2,000 subtractions. This improvement is shown in the modified method below.
public int mod_gcd(int a, int b) {
    while (true) {
        int c = a % b;
        if (c == 0)
            return b;
        a = b;
        b = c;
    }
}
Lastly, the modified algorithm can be expressed as a recursive algorithm, that is, it calls upon itself, as follows:
public int recurse_gcd(int a, int b) {
    if (b == 0)
        return a;
    else
        return recurse_gcd(b, a % b);
}
This method accomplishes the same as before. However, rather than looping, the method calls itself (which, if not checked, is an endless loop too). The swapping of values is accomplished by changing the order of the arguments passed to the method.
Mind you, the methods above are purely for demonstration and should not be used in production code.
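As a quick sanity check, the three stages can be run side by side on the 12,288 and 6 example from above (the method names here are mine):

```java
public class GcdVariants {
    // Stage 1: repeated subtraction, as in Euclid's original formulation.
    static int classicGcd(int a, int b) {
        while (true) {
            while (a >= b) a -= b;
            if (a == 0) return b;
            int c = b; b = a; a = c;
        }
    }

    // Stage 2: the inner subtraction loop collapsed into one modulus.
    static int modGcd(int a, int b) {
        while (true) {
            int c = a % b;
            if (c == 0) return b;
            a = b;
            b = c;
        }
    }

    // Stage 3: the same loop expressed recursively.
    static int recurseGcd(int a, int b) {
        return b == 0 ? a : recurseGcd(b, a % b);
    }

    public static void main(String[] args) {
        // All three agree (a >= b, as the answer assumes).
        System.out.println(classicGcd(12288, 6)); // 6
        System.out.println(modGcd(12288, 6));     // 6
        System.out.println(recurseGcd(12288, 6)); // 6
    }
}
```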

Solve harmonic-factorial sequence with java recursion

I'm trying to understand recursion, but I have found one task that I couldn't solve for a few days already.
X = 1/1 + 1/(1*2) + 1/(1*2*3) + 1/(1*2*3*4) + 1/(1*2*3*4*5) .....
how can I solve it for 100 repeats without conditional operators?
Can it be solved without recursion?
I've tried this code, but it doesn't work correctly and it contains "If".
public static double harFac(double n) {
    if (n == 1) return 1;
    return (1.0 / (n * harFac(n - 1))) + harFac(n - 1);
}
I believe you could do something like this:
double result = 0;
int div = 1;
for (int i = 1; i <= 100; i++) {
    result += 1.0 / div; /* the division needs to take place in floating point */
    div *= i + 1;
}
You'll very quickly run into trouble if you evaluate the denominator like that, as it will overflow very quickly (an int overflows past 12!). When working with floating point, it's also a good idea to evaluate the smaller terms first.
Fortunately you can solve both of these problems by recasting the expression to
1 * (1 + 1/2 * ( 1 + 1/3 * (1 + 1/4 * ( ... ) ) ) )
So the final term in the recursion is foo = 1 + 1.0/100, the penultimate term in the recursion is 1 + 1/99 * foo, and so on.
I personally wouldn't use recursion to solve this, rather use a loop in a single function.
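A minimal loop version of that inside-out evaluation (assuming 100 terms, as in the question) could look like this:

```java
public class FactorialSeries {
    public static void main(String[] args) {
        // Evaluate 1*(1 + 1/2*(1 + 1/3*(... (1 + 1/100) ...))) from the inside out,
        // so the smallest terms are accumulated first.
        double x = 1.0 + 1.0 / 100;   // innermost factor
        for (int k = 99; k >= 2; k--)
            x = 1.0 + x / k;          // wrap one more level of the nesting
        System.out.println(x);        // ≈ 1.7182818... (the series converges to e - 1)
    }
}
```

No factorials are ever computed explicitly, so nothing overflows.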
You're along the right lines but you shouldn't be calling harFac twice. You need to instead calculate the divisor. I can't see how you would do this without an if condition, though.
public static double harFac(double n)
{
    if (n == 1) return 1;
    int divisor = 1;
    for (int i = 2; i <= n; ++i) divisor *= i;
    return (1.0 / divisor) + harFac(n - 1);
}
This doesn't work beyond around n = 12 because the int divisor overflows (13! already exceeds Integer.MAX_VALUE).

Most efficient way to calculate nCr modulo 142857

I want to calculate nCr modulo 142857. Following is my code in Java:
private static int nCr2(int n, int r) {
    if (n == r || r == 0) {
        return 1;
    }
    double l = 1;
    if (n - r < r) {
        r = n - r;
    }
    for (int i = 0; i < r; i++) {
        l *= (n - i);
        l /= (i + 1);
    }
    return (int) (l % 142857);
}
This gives nCr in O(r) time. I want an algorithm to get the result in less time than this. Is there such an algorithm?
You can precompute results for given n and r pairs and hard-code them in a table int[][] t.
Later, at run time, when you need nCr(n, r), you just look it up in this table: t[n][r].
This is O(1) at run time.
As your number is not prime, this answer doesn't apply directly. But you could easily decompose 142857 into primes (142857 = 3^3 * 11 * 13 * 37), compute the result modulo each prime power, and use the Chinese Remainder Theorem to combine them. This may or may not make sense for the numbers you're working with.
In any case you must avoid double, unless you can be sure that all your intermediate results can be represented exactly within 53 bits (otherwise you lose precision and get nonsense out).
You already have most of the answer in the function that you mention. If n is fixed and r is variable, you can use nCr = nC(r-1) * (n - r + 1) / r. So you can use a table for nCr and build it incrementally (unlike what the other answer mentions where precomputation is not incremental).
So your new function can be made recursive with a table being passed.
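Since 142857 is composite, the division by r in that recurrence cannot simply be replaced by a modular inverse. One safe, if slower (O(n*r)), alternative is Pascal's rule, which avoids division entirely; a minimal sketch:

```java
public class BinomMod {
    // C(n, r) mod m via Pascal's rule: C(i, j) = C(i-1, j-1) + C(i-1, j).
    // Works for any modulus, prime or composite, because it never divides.
    static int nCrMod(int n, int r, int m) {
        int[] row = new int[r + 1];
        row[0] = 1;
        for (int i = 1; i <= n; i++)
            // Walk j downward so row[j-1] still holds the previous row's value.
            for (int j = Math.min(i, r); j >= 1; j--)
                row[j] = (row[j] + row[j - 1]) % m;
        return row[r];
    }

    public static void main(String[] args) {
        System.out.println(nCrMod(10, 3, 142857)); // 120
    }
}
```

This trades the O(r) running time for exactness: there is no floating-point rounding and no division by a number that may share factors with the modulus.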

Calculating Extremely Large Powers of 2

I have made a program in Java that calculates powers of two, but it seems very inefficient. For smaller powers (2^4000, say), it does it in less than a second. However, I am looking at calculating 2^43112609, which is one greater than the largest known prime number. With over 12 million digits, it will take a very long time to run. Here's my code so far:
import java.io.*;

public class Power
{
    private static byte x = 2;
    private static int y = 43112609;
    private static byte[] a = {x};
    private static byte[] b = {1};
    private static byte[] product;
    private static int size = 2;
    private static int prev = 1;
    private static int count = 0;
    private static int delay = 0;

    public static void main(String[] args) throws IOException
    {
        File f = new File("number.txt");
        FileOutputStream output = new FileOutputStream(f);
        for (int z = 0; z < y; z++)
        {
            product = new byte[size];
            for (int i = 0; i < a.length; i++)
            {
                for (int j = 0; j < b.length; j++)
                {
                    product[i+j] += (byte) (a[i] * b[j]);
                    checkPlaceValue(i + j);
                }
            }
            b = product;
            for (int i = product.length - 1; i > product.length - 2; i--)
            {
                if (product[i] != 0)
                {
                    size++;
                    if (delay >= 500)
                    {
                        delay = 0;
                        System.out.print(".");
                    }
                    delay++;
                }
            }
        }
        String str = "";
        for (int i = (product[product.length-1] == 0) ?
                product.length - 2 : product.length - 1; i >= 0; i--)
        {
            System.out.print(product[i]);
            str += product[i];
        }
        output.write(str.getBytes());
        output.flush();
        output.close();
        System.out.println();
    }

    public static void checkPlaceValue(int placeValue)
    {
        if (product[placeValue] > 9)
        {
            byte remainder = (byte) (product[placeValue] / 10);
            product[placeValue] -= 10 * remainder;
            product[placeValue + 1] += remainder;
            checkPlaceValue(placeValue + 1);
        }
    }
}
This isn't for a school project or anything; just for the fun of it. Any help as to how to make this more efficient would be appreciated! Thanks!
Kyle
P.S. I failed to mention that the output should be in base-10, not binary.
The key here is to notice that:
2^2 = 4
2^4 = (2^2)*(2^2)
2^8 = (2^4)*(2^4)
2^16 = (2^8)*(2^8)
2^32 = (2^16)*(2^16)
2^64 = (2^32)*(2^32)
2^128 = (2^64)*(2^64)
... and in a total of 25 steps ...
2^33554432 = (2^16777216)*(2^16777216)
Then since:
2^43112609 = (2^33554432) * (2^9558177)
you can find the remaining (2^9558177) using the same method, and since (2^9558177 = 2^8388608 * 2^1169569), you can find 2^1169569 using the same method, and since (2^1169569 = 2^1048576 * 2^120993), you can find 2^120993 using the same method, and so on...
EDIT: previously there was a mistake in this section, now it's fixed:
Also, further simplification and optimization by noticing that:
2^43112609 = 2^(0b10100100011101100010100001)
2^43112609 =
(2^(1*33554432))
* (2^(0*16777216))
* (2^(1*8388608))
* (2^(0*4194304))
* (2^(0*2097152))
* (2^(1*1048576))
* (2^(0*524288))
* (2^(0*262144))
* (2^(0*131072))
* (2^(1*65536))
* (2^(1*32768))
* (2^(1*16384))
* (2^(0*8192))
* (2^(1*4096))
* (2^(1*2048))
* (2^(0*1024))
* (2^(0*512))
* (2^(0*256))
* (2^(1*128))
* (2^(0*64))
* (2^(1*32))
* (2^(0*16))
* (2^(0*8))
* (2^(0*4))
* (2^(0*2))
* (2^(1*1))
Also note that 2^(0*n) = 2^0 = 1
Using this algorithm, you can calculate the table of 2^1, 2^2, 2^4, 2^8, 2^16 ... 2^33554432 in 25 multiplications. Then you can convert 43112609 into its binary representation, and easily find 2^43112609 using less than 25 multiplications. In total, you need to use less than 50 multiplications to find any 2^n where n is between 0 and 67108864.
Displaying it in binary is easy and fast - as quickly as you can write to disk! 100000...... :D
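The 25-step table above is the classic square-and-multiply idea; a hedged sketch of it with BigInteger (not the original poster's byte-array code) might look like this:

```java
import java.math.BigInteger;

public class PowBySquaring {
    // Square-and-multiply: scan the exponent's bits from least significant,
    // squaring the base at each step and multiplying it into the result
    // whenever the current bit is set.
    static BigInteger pow(BigInteger base, int exp) {
        BigInteger result = BigInteger.ONE;
        while (exp > 0) {
            if ((exp & 1) == 1)
                result = result.multiply(base);
            base = base.multiply(base);
            exp >>= 1;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pow(BigInteger.valueOf(2), 20)); // 1048576
    }
}
```

The loop runs once per bit of the exponent, so 2^43112609 needs only about 26 squarings plus one multiply per set bit, matching the fewer-than-50-multiplications count above.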
Let n = 43112609.
Assumption: You want to print 2^n in decimal.
While filling a bit vector that represents 2^n in binary is trivial, converting that number to decimal notation will take a while. For instance, the implementation of java.math.BigInteger.toString takes O(n^2) operations. And that's probably why
BigInteger.ONE.shiftLeft(43112609).toString()
still hasn't terminated after an hour of execution time ...
Let's start with an asymptotic analysis of your algorithm. Your outer loop will execute n times. For each iteration, you'll do another O(n^2) operations. That is, your algorithm is O(n^3), so poor scalability is expected.
You can reduce this to O(n^2 log n) by making use of
x^64 = x^(2*2*2*2*2*2) = (((((x^2)^2)^2)^2)^2)^2
(which requires only 6 multiplications) rather than the 63 multiplications of
x^64 = x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x
(Generalizing to arbitrary exponents is left as exercise for you. Hint: Write the exponent as binary number - or look at Lie Ryan's answer).
For speeding up multiplication, you might employ the Karatsuba Algorithm, reducing the overall runtime to O(n^((log 3)/(log 2)) log n).
As mentioned, powers of two correspond to binary digits. Binary is base 2, so each digit is double the value of the previous one.
For example:
1 = 2^0 = b1
2 = 2^1 = b10
4 = 2^2 = b100
8 = 2^3 = b1000
...
Binary is base 2 (that's why it's called "base 2": 2 is the base of the exponents). The shift operator ('<<' in most languages) shifts each binary digit to the left, each shift being equivalent to a multiply by two.
For example:
1 << 6 = 2^6 = 64
Being such a simple binary operation, most processors can do this extremely quickly for numbers which can fit in a register (8 - 64 bits, depending on the processor). Doing it with larger numbers requires some type of abstraction (Bignum for example), but it still should be an extremely quick operation. Nevertheless, doing it to 43112609 bits will take a little work.
To give you a little context, 2 << 4311260 (the exponent missing its last digit) is 1297181 digits long. Make sure you have enough RAM to handle the output number; if you don't, your computer will be swapping to disk, which will cripple your execution speed.
Since the program is so simple, also consider switching to a language which compiles directly into assembly, such as C.
In truth, generating the value is trivial (we already know the answer, a one followed by 43112609 zeros). It will take quite a bit longer to convert it into decimal.
As @John Smith suggests, you can try 2^4000:
System.out.println(new BigInteger("1").shiftLeft(4000));
EDIT: Turning a binary number into a decimal is an O(n^2) problem. When you double the number of bits, you double the length of each operation and you double the number of digits produced.
2^100,000 takes 0.166 s
2^1,000,000 takes 11.7 s
2^10,000,000 should take 1200 seconds.
NOTE: The time taken is entirely in the toString(), not the shiftLeft, which takes < 1 ms even for 10 million.
The other key is to notice that your CPU is much faster at multiplying ints and longs than you are at doing long multiplication in Java. Get that number split up into long (64-bit) chunks, and multiply and carry the chunks instead of individual digits. Coupled with the previous answer (using squaring instead of sequential multiplication by 2), this will probably speed it up by a factor of 100x or more.
Edit
I attempted to write a chunking and squaring method and it runs slightly slower than BigInteger (13.5 seconds vs 11.5 seconds to calculate 2^524288). After doing some timings and experiments, the fastest method seems to be repeated squaring with the BigInteger class:
public static String pow3(int n) {
    BigInteger bigint = new BigInteger("2");
    while (n > 1) {
        bigint = bigint.pow(2);
        n /= 2;
    }
    return bigint.toString();
}
Some timing results for power of 2 exponents (2^(2^n) for some n)
131072 - 0.83 seconds
262144 - 3.02 seconds
524288 - 11.75 seconds
1048576 - 49.66 seconds
At this rate of growth, it would take approximately 77 hours to calculate 2^33554432, let alone the time storing and adding all the powers together to make the final result of 2^43112609.
Edit 2
Actually, for really large exponents, the BigInteger.shiftLeft method is the fastest. I estimate that for 2^33554432 with shiftLeft, it would take approximately 28-30 hours. Wonder how fast a C or assembly version would be...
Because one actually wants all the digits of the result (unlike, e.g., RSA, where one is only interested in the residue mod a number that's much smaller than the numbers we have here), I think the best approach is probably to extract nine decimal digits at a time using long division implemented via multiplication. Start with a residue of zero, and apply the following to each 32-bit word in turn (MSB first):
residue = (residue SHL 32) + data
result = 0
temp = (residue >> 30)
temp += (temp * 316718722) >> 32
result += temp;
residue -= temp * 1000000000;
while (residue >= 1000000000)   /* I don't think this loop ever runs more than twice */
{
    result++;
    residue -= 1000000000;
}
Then store the result in the 32 bits just read, and loop through each lower word. The residue after the last step will be the nine bottom decimal digits of the result. Since the computation of a power of two in binary will be quick and easy, I think dividing out to convert to decimal may be the best approach.
BTW, this computes 2^640000 in about 15 seconds in vb.net, so 2^43112609 should be about five hours to compute all 12,978,188 digits.
