I recently stumbled upon a paper on a parallelization of Pollard's Rho algorithm. Given my specific application, and the fact that I haven't attained the required level of math, I'm wondering whether this particular parallelization method helps in my case.
I'm trying to find the two prime factors of a very large semiprime. My assumption, based on what little I can understand of the paper, is that this parallelization works well on a number with lots of smaller factors, rather than on one with just two very large factors.
Is this true? Should I use this parallelization or use something else? Should I even use Pollard's Rho, or is there a better parallelization of a different factorization algorithm?
The Wikipedia article gives two concrete examples:
Number                  Original code    Brent's modification
18446744073709551617    26 ms            5 ms
10023859281455311421    109 ms           31 ms
First of all, run these two numbers through your program and look at your times. If they are similar to these ("hard" numbers taking 4-6 times longer), ask yourself whether you can live with that. Or, even better, try other algorithms, like simple classic "brute force" trial division, and look at the times they give. I'd guess they have a hard-to-easy ratio closer to 1, but worse absolute times, so it's a simple trade-off.
Side note: Of course, parallelization is the way to go here. I guess you know that, but I think it's important to emphasize. It would also help if another approach's timings lay between the Pollard rho timings (e.g. Pollard rho 5-31 ms, different approach 15-17 ms); in that case, consider running the two algorithms in separate threads as a "factorization race".
In case you don't have an actual implementation of the algorithm yet, here are Python implementations.
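If a rough reference implementation helps, here is a minimal Java sketch of Pollard's rho with Brent's cycle-finding modification (the variant the timings above compare against). The class name, the polynomial x^2 + c, and the batch size m are my own illustrative choices, not anything taken from the paper:

import java.math.BigInteger;
import java.util.concurrent.ThreadLocalRandom;

public class BrentRho {
    // Returns a non-trivial factor of composite n, or n itself if this run fails
    // (in which case the caller should retry with a different random constant c).
    static BigInteger brentRho(BigInteger n) {
        if (!n.testBit(0)) return BigInteger.TWO;   // even numbers: 2 is a factor
        BigInteger c = BigInteger.valueOf(ThreadLocalRandom.current().nextLong(1, 1_000_000));
        BigInteger y = BigInteger.valueOf(ThreadLocalRandom.current().nextLong(1, 1_000_000));
        BigInteger x = y, ys = y, q = BigInteger.ONE, g = BigInteger.ONE;
        int r = 1, m = 128;
        while (g.equals(BigInteger.ONE)) {
            x = y;
            for (int i = 0; i < r; i++) y = y.multiply(y).add(c).mod(n);   // advance the "hare"
            for (int k = 0; k < r && g.equals(BigInteger.ONE); k += m) {
                ys = y;
                for (int i = 0; i < Math.min(m, r - k); i++) {
                    y = y.multiply(y).add(c).mod(n);
                    q = q.multiply(x.subtract(y).abs()).mod(n);   // batch differences, one gcd per block
                }
                g = q.gcd(n);
            }
            r *= 2;
        }
        if (g.equals(n)) {   // the batched gcd overshot; backtrack one step at a time
            do {
                ys = ys.multiply(ys).add(c).mod(n);
                g = x.subtract(ys).abs().gcd(n);
            } while (g.equals(BigInteger.ONE));
        }
        return g;
    }

    public static void main(String[] args) {
        // The "easy" number from the table above; expect a factor such as 274177.
        System.out.println(brentRho(new BigInteger("18446744073709551617")));
    }
}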
The basic idea in factoring large integers is to use a variety of methods, each with its own pluses and minuses. The usual plan is to start with trial division by primes up to 1000 or 10000, followed by a few million Pollard rho steps; that ought to get you factors up to about twelve digits. At that point a few tests are in order: is the number a prime power or a perfect power? (There are simple tests for those properties.) If you still haven't factored the number, you know that it will be hard, so you will need heavy-duty tools. A useful next step is Pollard's p-1 method, followed by its close cousin the elliptic curve method. After a while, if that doesn't work, the only methods left are the quadratic sieve or the number field sieve, which are inherently parallel.
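To make the first couple of stages concrete, here is a hedged Java sketch of trial division followed by a perfect-power check (the later stages, p-1, ECM, and the sieves, really need dedicated libraries). The bounds and helper names are my own choices:

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class FactorPipeline {
    // Stage 1: divide out all prime factors up to 'bound'; returns the remaining cofactor.
    // (Trial divisors are 2 and then every odd number; composite divisors never divide
    // once their prime factors have been removed, so the result is still correct.)
    static BigInteger trialDivide(BigInteger n, int bound, List<BigInteger> factors) {
        for (long p = 2; p <= bound; p = (p == 2 ? 3 : p + 2)) {
            BigInteger bp = BigInteger.valueOf(p);
            while (n.mod(bp).signum() == 0) {
                factors.add(bp);
                n = n.divide(bp);
            }
        }
        return n;
    }

    // Stage 2: is n a perfect power a^k? Returns the base a, or null if there is none.
    static BigInteger perfectPowerBase(BigInteger n) {
        for (int k = 2; k <= n.bitLength(); k++) {
            BigInteger root = kthRoot(n, k);
            if (root.pow(k).equals(n)) return root;
        }
        return null;
    }

    // Largest integer r with r^k <= n, by binary search.
    static BigInteger kthRoot(BigInteger n, int k) {
        BigInteger lo = BigInteger.ONE, hi = BigInteger.ONE.shiftLeft(n.bitLength() / k + 1);
        while (lo.compareTo(hi) < 0) {
            BigInteger mid = lo.add(hi).add(BigInteger.ONE).shiftRight(1);
            if (mid.pow(k).compareTo(n) <= 0) lo = mid; else hi = mid.subtract(BigInteger.ONE);
        }
        return lo;
    }

    public static void main(String[] args) {
        List<BigInteger> small = new ArrayList<>();
        BigInteger rest = trialDivide(new BigInteger("600851475143"), 10000, small);
        System.out.println(small + ", cofactor " + rest);   // all factors of this example are below 10000
    }
}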
The parallel rho method that you asked about isn't widely used today. As you suggested, Pollard rho is better suited to finding small factors rather than large ones. For a semi-prime, it's better to spend parallel cycles on one of the sieves than on Pollard rho.
I recommend the factoring forum at mersenneforum.org for more information.
Related
I'm trying to compare two algorithms and their Big O efficiencies, and to find the value of n at which one algorithm becomes more efficient than the other. Any helpful examples or resources would be a huge help.
You really need to know more than just the Big O complexity of an algorithm in order to determine exactly at what point one algorithm becomes more efficient than another, assuming that they have different lower-order terms and constants, and that the one with the worse Big O characteristics has better lower-order terms/constants. But usually the approximation is enough.
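As a toy illustration (all constants made up): suppose algorithm A costs roughly 0.5*n^2 steps and algorithm B costs roughly 100*n*log2(n) steps. A quick loop over n finds where B takes over:

public class Crossover {
    // Hypothetical cost models: A is O(n^2) with a small constant,
    // B is O(n log n) with a much larger constant. Replace with your own.
    static double costA(long n) { return 0.5 * n * n; }
    static double costB(long n) { return 100.0 * n * (Math.log(n) / Math.log(2)); }

    public static void main(String[] args) {
        for (long n = 2; n <= 1_000_000; n++) {
            if (costB(n) < costA(n)) {
                // With these particular constants B wins from around n = 2224 onwards.
                System.out.println("B becomes cheaper than A at n = " + n);
                return;
            }
        }
        System.out.println("No crossover in the tested range");
    }
}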
Runtime complexity of algorithms is the tool to use when dealing with problems of growing scale of input size.
Empirical performance profiling is the tool to use when dealing with high frequency, repetitive problems that generally involve small inputs*
(*) What constitutes small inputs depends on the complexity of the algorithms involved. For example, for the travelling salesman problem, an input of size 5 is small while an input of size 15 is huge. For sorting, 20 elements would be considered small, 20000 large and 2000000 would be huge.
My computer science teacher has assigned this problem to us, and just about everyone in our class was in an uproar over its complexity. We are only in an Advanced Topics of Computer Science class in high school, and none of us really has any idea where to start or what algorithms to use. We have determined that going straight through every possible combination would mean 2^50 combinations, which is way, WAY too big for any of us to search through. I'm just curious whether this is even possible at our low computer-science skill level, and whether anyone personally thinks this is a fair problem, because our teacher still hasn't found a solution to his own problem.
Thanks!
The solution space is not really 2^50. A tie (assuming only two candidates) means 269-269. You can't get to 269 with only one state (or even only a handful of states) so you can immediately throw out all small subsets and all large subsets (winning every state also doesn't work). Furthermore, you only need to look for subsets that total 269 (because there are 538 total, that means that the complement of each of those sets is also 269).
That said, this still boils down to the subset sum problem (https://en.wikipedia.org/wiki/Subset_sum_problem), so any solution will not scale well (unless you figure out how to do it in polynomial time, in which case you can claim $1,000,000). However, your problem is not to scale it; for the US electoral college configuration (including vote splits in some states) it is not too large to figure out in a reasonable (< 10 minutes, as you say) amount of time.
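Concretely, a standard 0/1-knapsack style dynamic program over the 51 vote totals counts the 269-269 ties directly. Here is a rough Java sketch, using the same electoral-vote distribution as the PARI/GP answer below (treating every state as winner-take-all, no splits):

public class ElectoralTie {
    public static void main(String[] args) {
        // Distinct electoral-vote values and how many states/jurisdictions carry each
        // (51 entries in total, 538 electoral votes).
        int[] ev    = {55, 38, 29, 20, 18, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3};
        int[] count = { 1,  1,  2,  2,  1,  2,  1,  1,  1,  1,  4,  4, 3, 2, 3, 6, 3, 5, 8};

        long[] ways = new long[539];   // ways[s] = number of subsets of the states seen so far summing to s
        ways[0] = 1;
        for (int i = 0; i < ev.length; i++)
            for (int dup = 0; dup < count[i]; dup++)        // one knapsack pass per individual state
                for (int s = 538; s >= ev[i]; s--)
                    ways[s] += ways[s - ev[i]];

        System.out.println("Subsets summing to 269: " + ways[269]);
    }
}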
The solution space is smaller than it seems, since some states have the same number of electoral votes. For example, Florida and New York both have 29 electoral votes, so there are really just three cases, not four: both on the left, both on the right, and one on each side (which should be double-counted since this can happen in two ways). This reduces the number of cases to 6.2 * 10^9, over five orders of magnitude smaller than 2^51 (although, in exchange, there's a slight amount of extra work determining how many cases you're representing). Even without further optimization this is small enough to iterate over fairly quickly.
This PARI/GP script
\\ distinct electoral-vote values and how many states carry each (19 groups, 51 states, 538 EV total)
EV=[55,38,29,20,18,16,15,14,13,12,11,10,9,8,7,6,5,4,3]~;
count=[1,1,2,2,1,2,1,1,1,1,4,4,3,2,3,6,3,5,8];
\\ choose v[i] of the count[i] states worth EV[i] each; whenever one side reaches 269 EV,
\\ add the number of ways of making that choice
s=0; forvec(v=vector(#count,i,[0,count[i]]), if(v*EV==269, s+=prod(i=1,#count, binomial(count[i],v[i])))); s
yields an answer within milliseconds.
This version doesn't attempt to handle third-party candidates, split state votes, etc.
I have a few questions about my genetic algorithm and GAs overall.
I have created a GA that, when given points on a curve, tries to figure out what function produced that curve.
An example is the following
Points
{{-2, 4},{-1, 1},{0, 0},{1, 1},{2, 4}}
Function
x^2
Sometimes I give it points for which it never manages to produce a function, and sometimes it does produce one. It can even depend on how deep the initial trees are.
Some questions:
Why does the tree depth matter in trying to evaluate the points and
produce a satisfactory function?
Why do I sometimes get premature convergence, with the GA never breaking out of the cycle?
What can I do to prevent a premature convergence?
What about annealing? How can I use it?
Can you take a quick look at my code and tell me if anything is obviously wrong with it? (This is test code, I need to do some code clean up.)
https://github.com/kevkid/GeneticAlgorithmTest
Source: http://www.gp-field-guide.org.uk/
EDIT:
Looks like Thomas's suggestions worked well: I get very fast results and less premature convergence. I feel like increasing the gene pool gives better results, but I am not exactly sure whether it is actually getting better with every generation or whether the fact that it is random simply allows it to find a correct solution.
EDIT 2:
Following Thomas's suggestions I was able to get it to work properly; it seems I had an issue with selecting survivors and with expanding my gene pool. Also, I recently added constants to my GA test if anyone else wants to look at it.
In order to avoid premature convergence you can also use multiple subpopulations. Each subpopulation evolves independently, and at the end of each generation you can exchange some individuals between subpopulations.
I did an implementation with multiple subpopulations for a Genetic Programming variant: http://www.mepx.org/source_code.html
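A stripped-down illustration of the idea (the class name, the toy fitness function, and all parameters are mine, not taken from mepx): a few sub-populations evolve independently, and every few generations the best individual of each island is copied into the next one.

import java.util.Random;

public class IslandModel {
    static final Random RNG = new Random();

    // Toy fitness: higher is better, maximum at x = 3.
    static double fitness(double x) { return -(x - 3) * (x - 3); }

    public static void main(String[] args) {
        int islands = 4, size = 20, generations = 200, migrateEvery = 10;
        double[][] pop = new double[islands][size];
        for (double[] island : pop)
            for (int i = 0; i < size; i++) island[i] = RNG.nextDouble() * 20 - 10;

        for (int g = 1; g <= generations; g++) {
            for (double[] island : pop) evolve(island);            // islands evolve independently
            if (g % migrateEvery == 0)
                for (int i = 0; i < islands; i++)                  // migration: the best of island i
                    pop[(i + 1) % islands][RNG.nextInt(size)] = best(pop[i]);   // joins island i+1
        }
        for (int i = 0; i < islands; i++)
            System.out.printf("island %d best: %.4f%n", i, best(pop[i]));
    }

    // One crude generation: each individual is replaced by a mutated copy if the copy is fitter.
    static void evolve(double[] island) {
        for (int i = 0; i < island.length; i++) {
            double mutant = island[i] + RNG.nextGaussian() * 0.5;
            if (fitness(mutant) > fitness(island[i])) island[i] = mutant;
        }
    }

    static double best(double[] island) {
        double b = island[0];
        for (double x : island) if (fitness(x) > fitness(b)) b = x;
        return b;
    }
}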
I don't have the time to dig into your code, but I'll try to answer from what I remember about GAs:
Sometimes I give it points for which it never manages to produce a function, and sometimes it does produce one. It can even depend on how deep the initial trees are.
I'm not sure what the question is here, but if you need a result you could select the function that yields the least distance to the given points (the distance could be a sum, a mean, a count of points, etc., depending on your needs).
Why does the tree depth matter in trying to evaluate the points and produce a satisfactory function?
I'm not sure what tree depth you mean but it could affect two things:
accuracy: i.e. the higher the depth, the more accurate the solution might be, or the more possibilities for mutations there are
performance: depending on what tree you mean, a higher depth might increase performance (allowing for more educated guesses at the function) or decrease it (requiring more solutions to be generated and compared).
Why do I sometimes get premature convergence, with the GA never breaking out of the cycle?
That might be due to too little mutation. If you have a set of solutions that all converge around a local optimum, slight mutations alone might not move the resulting solutions far enough away from that local optimum to break out.
What can I do to prevent a premature convergence?
You could allow for bigger mutations, e.g. when solutions start to converge. Alternatively, you could throw entirely new solutions into the mix (think of it as "immigration").
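As a sketch of what that might look like in code (the class name, the diversity threshold, and the 20% immigration fraction are arbitrary choices of mine): once the fitness spread collapses, replace part of the population with fresh random individuals and raise the mutation rate.

import java.util.List;
import java.util.Random;
import java.util.function.Supplier;
import java.util.function.ToDoubleFunction;

public class DiversityGuard {
    private final Random rng = new Random();

    // Call once per generation; returns the mutation rate to use next generation.
    <T> double guard(List<T> population, ToDoubleFunction<T> fitness,
                     Supplier<T> randomIndividual, double mutationRate) {
        double mean = population.stream().mapToDouble(fitness).average().orElse(0);
        double variance = population.stream()
                .mapToDouble(ind -> Math.pow(fitness.applyAsDouble(ind) - mean, 2))
                .average().orElse(0);
        if (Math.sqrt(variance) < 1e-3) {                      // fitness values have collapsed: intervene
            for (int i = 0; i < population.size() / 5; i++)    // replace ~20% with random "immigrants"
                population.set(rng.nextInt(population.size()), randomIndividual.get());
            return Math.min(1.0, mutationRate * 2);            // and mutate more aggressively for a while
        }
        return mutationRate;
    }
}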
What about annealing? How can I use it?
Annealing could be used to gradually improve your solutions once they start to converge on a certain point/optimum, i.e. you'd improve the solutions in a more controlled way than "random" mutations.
You can also use it to break out of a local optimum, depending on how those are distributed. As an example, you could run your GA until solutions start to converge, then use annealing and/or larger mutations and/or completely new solutions (you could generate several sets of solutions with different approaches and compare them at the end), create your new population, and, if the convergence is broken, start a new iteration with the GA. If the solutions still converge at the same optimum, you can stop, since no bigger improvement is to be expected.
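One hedged way to wire that in (the starting temperature and the 0.95 decay factor are made-up values): accept a worse child with a probability that shrinks as a "temperature" decays, so early generations explore and later ones only refine.

import java.util.Random;

public class AnnealedAcceptance {
    private final Random rng = new Random();
    private double temperature = 1.0;   // start hot: almost any change is accepted

    // Decide whether a child should replace its parent (assumes higher fitness is better).
    boolean accept(double parentFitness, double childFitness) {
        if (childFitness >= parentFitness) return true;                    // never reject an improvement
        double p = Math.exp((childFitness - parentFitness) / temperature); // Metropolis-style acceptance
        return rng.nextDouble() < p;
    }

    // Call once per generation to cool down.
    void cool() { temperature = Math.max(1e-6, temperature * 0.95); }
}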
Besides all that, heuristic algorithms may still hit a local optimum but that's the tradeoff they provide: performance vs. accuracy.
I've run the implementation available at http://www.apl.jhu.edu/~hall/java/NQueens.java, which solves the N-queens problem with O(n) time complexity. It's amazingly fast and finds one solution without searching. However, I'm not really clear on the logic behind it.
Why do they split the problem into 3 cases: odd, even (but not of the form 6k), and even (but not of the form 6k+2)?
Can anyone check the code and explain it in more detail for me (logic only)?
They split the problem because neither construction covers all cases. Probably if you try to prove that they work in the bad cases, you'll find that a certain number is not a unit modulo n. This is a pretty typical state of affairs when constructing constrained combinatorial objects. For example, there exist Steiner triple systems of orders 6k+1 and 6k+3, but the two residues mod 6 require different constructions.
Two Questions:
Will I get different sequences of numbers for every seed I put into it?
Are there some "dead" seeds? (Ones that produce zeros or repeat very quickly.)
By the way, which, if any, other PRNGs should I use?
Solution: Since I'm going to be using the PRNG to make a game, I don't need it to be cryptographically secure. I'm going with the Mersenne Twister, both for its speed and its huge period.
To some extent, random number generators are horses for courses. The Random class implements an LCG with reasonably chosen parameters. But it still exhibits the following features:
fairly short period (2^48)
bits are not equally random (see my article on randomness of bit positions)
will only generate a small fraction of combinations of values (the famous problem of "falling in the planes")
If these things don't matter to you, then Random has the redeeming feature of being provided as part of the JDK. It's good enough for things like casual games (but not ones where money is involved). There are no weak seeds as such.
Another alternative is the XORShift generator, which can be implemented in Java as follows:
private long x; // generator state: must be seeded with a non-zero value

public long randomLong() {
    x ^= (x << 21);
    x ^= (x >>> 35);
    x ^= (x << 4);
    return x;
}
At the cost of a few very cheap operations, this has a period of 2^64-1 (zero is not permitted), and it is simple enough to be inlined when you're generating values repeatedly. Various shift values are possible: see George Marsaglia's paper on XORShift Generators for more details. You can consider bits in the numbers generated as being equally random. One main weakness is that occasionally it will get into a "rut" where not many bits are set in the number, and it then takes a few generations to get out of this rut.
Other possibilities are:
combine different generators (e.g. feed the output from an XORShift generator into an LCG, then add the result to the output of an XORShift generator with different parameters): this generally allows the weaknesses of the different methods to be "smoothed out", and can give a longer period if the periods of the combined generators are carefully chosen (a rough sketch follows after this list)
add a "lag" (to give a longer period): essentially, where a generator would normally transform the last number generated, store a "history buffer" and transform, say, the (n-1023)th.
I would say avoid generators that use a stupid amount of memory to give you a period longer than you really need (some have a period greater than the number of atoms in the universe -- you really don't usually need that). And note that "long period" doesn't necessarily mean "high quality generator" (though 2^48 is still a little bit low!).
As zvrba said, that JavaDoc explains the normal implementation. The Wikipedia page on pseudo-random number generators has a fair amount of information and mentions the Mersenne twister, which is not deemed cryptographically secure, but is very fast and has various implementations in Java. (The last link has two implementations - there are others available, I believe.)
If you need cryptographically secure generation, read the Wikipedia page - there are various options available.
As RNGs go, Sun's implementation is definitely not state of the art, but it's good enough for most purposes. If you need random numbers for cryptography purposes, there's java.security.SecureRandom; if you just want something faster and better than java.util.Random, it's easy to find Java implementations of the Mersenne Twister on the net.
This is described in the documentation. Linear congruential generators are theoretically well understood, and a lot of material on them is available in the literature and on the internet. A linear congruential generator with the same parameters always outputs the same periodic sequence, and the only thing the seed decides is where in that sequence you begin. So the answer to your first question is "yes, if you generate enough random numbers."
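You can see this with the state-update recurrence documented for java.util.Random (multiplier 0x5DEECE66D, increment 0xB, modulo 2^48): a seed obtained by stepping another seed forward just produces the same stream, shifted. The class name and the seed values here are arbitrary.

public class LcgDemo {
    // The single-step state update documented for java.util.Random.
    static long next(long s) { return (s * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1); }

    public static void main(String[] args) {
        long a = 42;                  // any seed
        long b = next(next(next(a))); // a second seed, three steps further along the same cycle
        for (int i = 0; i < 5; i++) {
            a = next(a);
            b = next(b);
            System.out.println(a + "  " + b);   // b's column is a's column shifted by three rows
        }
    }
}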
See the answer in my blog post:
http://code-o-matic.blogspot.com/2010/12/how-long-is-period-of-random-numbers.html
Random has a maximal period for its state (a long, i.e. a 2^64 period). This can be directly generalized to 2^k: invest as many state bits as you want, and you get the maximal period. The Mersenne Twister actually has a very short period, comparatively (see the comments in said blog post).
Oops -- Random restricts itself to 48 bits instead of using the full 64 bits of a long, so correspondingly its period is 2^48 after all, not 2^64.
If RNG quality really matters to you, I'd recommend using your own RNG. Maybe java.util.Random is just great, in this version, on your operating system, etc. It probably is. But that could change. It's happened before that a library writer made things worse in a later version.
It's very simple to write your own, and then you know exactly what's going on. It won't change on upgrade, etc. Here's a generator you could port to Java in 10 minutes. And if you start writing in some new language a week from now, you can port it again.
If you don't implement your own, you can grab code for a well-known RNG from a reputable source and use it in your projects. Then nobody will change your generator out from under you.
(I'm not advocating that people come up with their own algorithms, only their own implementation. Most people, myself included, have no business developing their own algorithm. It's easy to write a bad generator that you think is wonderful. That's why people need to ask questions like this one, wondering how good the library generator is. The algorithm in the generator I referenced has been through the wringer of much peer review.)