Multithreaded Segmented Sieve of Eratosthenes in Java

I am trying to create a fast prime generator in Java. It is (more or less) accepted that the fastest way for this is the segmented sieve of Eratosthenes: https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes. Lots of optimizations can be further implemented to make it faster. As of now, my implementation generates 50847534 primes below 10^9 in about 1.6 seconds, but I am looking to make it faster and at least break the 1 second barrier. To increase the chance of getting good replies, I will include a walkthrough of the algorithm as well as the code.
Still, as a TL;DR: I am looking to incorporate multithreading into the code.
For the purposes of this question, I want to distinguish between the 'segmented' and the 'traditional' sieves of Eratosthenes. The traditional sieve requires O(n) space and is therefore very limited in the range of its input (the limit n). The segmented sieve, however, only requires O(n^0.5) space and can operate on much larger limits. (A main speed-up comes from cache-friendly segmentation that takes the L1 & L2 cache sizes of the specific computer into account.) Finally, the main difference that concerns my question is that the traditional sieve is sequential, meaning it can only continue once the previous steps are completed. The segmented sieve, however, is not: each segment is independent and is 'processed' individually against the sieving primes (the primes not larger than n^0.5). This means that, theoretically, once I have the sieving primes, I can divide the work between multiple computers, each processing a different segment; the work of each is independent of the others. Assuming (wrongly) that each segment requires the same amount of time t to complete, and there are k segments, one computer would require a total time of T = k * t, whereas k computers, each working on a different segment, would require only T = t to complete the entire process. (Practically this is wrong, but it serves the simplicity of the example.)
This brought me to reading about multithreading - dividing the work among a few threads, each processing a smaller amount of work, for better usage of the CPU. To my understanding, the traditional sieve cannot be multithreaded precisely because it is sequential: each thread would depend on the previous one, rendering the entire idea unfeasible. But a segmented sieve may indeed (I think) be multithreaded.
Instead of jumping straight into my question, I think it is important to introduce my code first, so I am hereby including my current fastest implementation of the segmented sieve. I have worked quite hard on it. It took quite some time, slowly tweaking and adding optimizations to it. The code is not simple. It is rather complex, I would say. I therefore assume the reader is familiar with the concepts I am introducing, such as wheel factorization, prime numbers, segmentation and more. I have included notes to make it easier to follow.
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;

public class primeGen {

    public static long x = (long) Math.pow(10, 9); // limit
    public static int sqrtx;
    public static boolean[] sievingPrimes; // the sieving primes, <= sqrtx
    public static int[] wheels = new int[] {2, 3, 5, 7, 11, 13, 17, 19}; // base wheel primes
    public static int[] gaps; // the gaps, according to the wheel. will enable skipping multiples of the wheel primes
    public static int nextp; // the first prime > wheel primes
    public static int l; // the amount of gaps in the wheel

    public static void main(String[] args)
    {
        long startTime = System.currentTimeMillis();
        preCalc(); // creating the sieving primes and calculating the list of gaps
        int segSize = Math.max(sqrtx, 32768 * 8); // size of each segment
        long u = nextp; // 'u' is the running index of the program. will continue from one segment to the next
        int wh = 0; // this will be the gap index, indicating by how much we increment 'u' each time, skipping the multiples of the wheel primes
        long pi = pisqrtx(); // the primes count. initialize with the number of primes <= sqrtx
        for (long low = 0; low < x; low += segSize) // the heart of the code. enumerating the primes through segmentation. enumeration will begin at p > sqrtx
        {
            long high = Math.min(x, low + segSize);
            boolean[] segment = new boolean[(int) (high - low + 1)];
            int g = -1;
            for (int i = nextp; i <= sqrtx; i += gaps[g])
            {
                if (sievingPrimes[(i + 1) / 2])
                {
                    long firstMultiple = low / i * i;
                    if (firstMultiple < low)
                        firstMultiple += i;
                    if (firstMultiple % 2 == 0) // start with the first odd multiple of the current prime in the segment
                        firstMultiple += i;
                    for (long j = firstMultiple; j < high; j += i * 2)
                        segment[(int) (j - low)] = true;
                }
                g++;
                //if (g == l) // due to segment size, the full list of gaps is never used **within just one segment**, and therefore this check is redundant.
                //            // should be used with bigger segment sizes or smaller lists of gaps
                //    g = 0;
            }
            while (u <= high)
            {
                if (!segment[(int) (u - low)])
                    pi++;
                u += gaps[wh];
                wh++;
                if (wh == l)
                    wh = 0;
            }
        }
        System.out.println(pi);
        long endTime = System.currentTimeMillis();
        System.out.println("Solution took " + (endTime - startTime) + " ms");
    }

    public static boolean[] simpleSieve(int limit)
    {
        long sqrtl = (long) Math.sqrt(limit);
        boolean[] primes = new boolean[limit / 2 + 2];
        Arrays.fill(primes, true);
        int g = -1;
        for (int i = nextp; i <= sqrtl; i += gaps[g])
        {
            if (primes[(i + 1) / 2])
                for (int j = i * i; j <= limit; j += i * 2)
                    primes[(j + 1) / 2] = false;
            g++;
            if (g == l) // wrap around the gaps list (l is the static gaps length)
                g = 0;
        }
        return primes;
    }

    public static long pisqrtx()
    {
        int pi = wheels.length;
        if (x < wheels[wheels.length - 1])
        {
            if (x < 2)
                return 0;
            int k = 0;
            while (wheels[k] <= x)
                k++;
            return k;
        }
        int g = -1;
        for (int i = nextp; i <= sqrtx; i += gaps[g])
        {
            if (sievingPrimes[(i + 1) / 2])
                pi++;
            g++;
            if (g == l)
                g = 0;
        }
        return pi;
    }

    public static void preCalc()
    {
        sqrtx = (int) Math.sqrt(x);
        int prod = 1;
        for (long p : wheels)
            prod *= p; // primorial
        nextp = BigInteger.valueOf(wheels[wheels.length - 1]).nextProbablePrime().intValue(); // the first prime that comes after the wheel
        int lim = prod + nextp; // circumference of the wheel
        boolean[] marks = new boolean[lim + 1];
        Arrays.fill(marks, true);
        for (int j = 2 * 2; j <= lim; j += 2)
            marks[j] = false;
        for (int i = 1; i < wheels.length; i++)
        {
            int p = wheels[i];
            for (int j = p * p; j <= lim; j += 2 * p)
                marks[j] = false; // removing all integers that are NOT coprime with the base wheel primes
        }
        ArrayList<Integer> gs = new ArrayList<Integer>(); // list of the gaps between the integers that are coprime with the base wheel primes
        int d = nextp;
        for (int p = d + 2; p < marks.length; p += 2)
        {
            if (marks[p]) // d is coprime with the wheel. if p is too, then a gap is identified, and is noted.
            {
                gs.add(p - d);
                d = p;
            }
        }
        gaps = new int[gs.size()];
        for (int i = 0; i < gs.size(); i++)
            gaps[i] = gs.get(i); // arrays are faster than lists, so moving the list of gaps to an array
        l = gaps.length;
        sievingPrimes = simpleSieve(sqrtx); // initializing the sieving primes
    }
}
Currently, it produces 50847534 primes below 10^9 in about 1.6 seconds. This is very impressive, at least by my standards, but I am looking to make it faster, possibly break the 1 second barrier. Even then, I believe it can be made much faster still.
The whole program is based on wheel factorization: https://en.wikipedia.org/wiki/Wheel_factorization. I have noticed I am getting the fastest results using a wheel of all primes up to 19.
public static int[] wheels = new int[] {2, 3, 5, 7, 11, 13, 17, 19}; // base wheel primes
This means that the multiples of those primes are skipped, resulting in a much smaller search range. The gaps between the numbers that remain are calculated in the preCalc method; if we make those jumps between the numbers in the search range, we skip the multiples of the base primes.
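To illustrate with a smaller wheel than the one in the code (my own example): for the wheel {2, 3, 5}, the numbers coprime to 2 * 3 * 5 = 30, starting from 7, are 7, 11, 13, 17, 19, 23, 29, 31, 37, ... The gaps are therefore 4, 2, 4, 2, 4, 6, 2, 6, and this pattern of 8 gaps (summing to 30) repeats forever. The 19-wheel used here works the same way, just with a much longer gap list whose period is the primorial 2 * 3 * 5 * 7 * 11 * 13 * 17 * 19 = 9,699,690.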
public static void preCalc()
{
    sqrtx = (int) Math.sqrt(x);
    int prod = 1;
    for (long p : wheels)
        prod *= p; // primorial
    nextp = BigInteger.valueOf(wheels[wheels.length - 1]).nextProbablePrime().intValue(); // the first prime that comes after the wheel
    int lim = prod + nextp; // circumference of the wheel
    boolean[] marks = new boolean[lim + 1];
    Arrays.fill(marks, true);
    for (int j = 2 * 2; j <= lim; j += 2)
        marks[j] = false;
    for (int i = 1; i < wheels.length; i++)
    {
        int p = wheels[i];
        for (int j = p * p; j <= lim; j += 2 * p)
            marks[j] = false; // removing all integers that are NOT coprime with the base wheel primes
    }
    ArrayList<Integer> gs = new ArrayList<Integer>(); // list of the gaps between the integers that are coprime with the base wheel primes
    int d = nextp;
    for (int p = d + 2; p < marks.length; p += 2)
    {
        if (marks[p]) // d is coprime with the wheel. if p is too, then a gap is identified, and is noted.
        {
            gs.add(p - d);
            d = p;
        }
    }
    gaps = new int[gs.size()];
    for (int i = 0; i < gs.size(); i++)
        gaps[i] = gs.get(i); // arrays are faster than lists, so moving the list of gaps to an array
    l = gaps.length;
    sievingPrimes = simpleSieve(sqrtx); // initializing the sieving primes
}
At the end of the preCalc method, the simpleSieve method is called, efficiently sieving all the sieving primes mentioned before, the primes <= sqrtx. This is a simple Eratosthenes sieve, rather than a segmented one, but it is still based on the wheel factorization computed previously.
public static boolean[] simpleSieve(int limit)
{
    long sqrtl = (long) Math.sqrt(limit);
    boolean[] primes = new boolean[limit / 2 + 2];
    Arrays.fill(primes, true);
    int g = -1;
    for (int i = nextp; i <= sqrtl; i += gaps[g])
    {
        if (primes[(i + 1) / 2])
            for (int j = i * i; j <= limit; j += i * 2)
                primes[(j + 1) / 2] = false;
        g++;
        if (g == l) // wrap around the gaps list (l is the static gaps length)
            g = 0;
    }
    return primes;
}
Finally, we reach the heart of the algorithm. We start by enumerating all primes <= sqrtx, with the following call:
long pi = pisqrtx();
which uses the following method:
public static long pisqrtx()
{
    int pi = wheels.length;
    if (x < wheels[wheels.length - 1])
    {
        if (x < 2)
            return 0;
        int k = 0;
        while (wheels[k] <= x)
            k++;
        return k;
    }
    int g = -1;
    for (int i = nextp; i <= sqrtx; i += gaps[g])
    {
        if (sievingPrimes[(i + 1) / 2])
            pi++;
        g++;
        if (g == l)
            g = 0;
    }
    return pi;
}
Then, after initializing the pi variable which keeps track of the enumeration of primes, we perform the mentioned segmentation, starting the enumeration from the first prime > sqrtx:
int segSize = Math.max(sqrtx, 32768 * 8); // size of each segment
long u = nextp; // 'u' is the running index of the program. will continue from one segment to the next
int wh = 0; // this will be the gap index, indicating by how much we increment 'u' each time, skipping the multiples of the wheel primes
long pi = pisqrtx(); // the primes count. initialize with the number of primes <= sqrtx
for (long low = 0; low < x; low += segSize) // the heart of the code. enumerating the primes through segmentation. enumeration will begin at p > sqrtx
{
    long high = Math.min(x, low + segSize);
    boolean[] segment = new boolean[(int) (high - low + 1)];
    int g = -1;
    for (int i = nextp; i <= sqrtx; i += gaps[g])
    {
        if (sievingPrimes[(i + 1) / 2])
        {
            long firstMultiple = low / i * i;
            if (firstMultiple < low)
                firstMultiple += i;
            if (firstMultiple % 2 == 0) // start with the first odd multiple of the current prime in the segment
                firstMultiple += i;
            for (long j = firstMultiple; j < high; j += i * 2)
                segment[(int) (j - low)] = true;
        }
        g++;
        //if (g == l) // due to segment size, the full list of gaps is never used **within just one segment**, and therefore this check is redundant.
        //            // should be used with bigger segment sizes or smaller lists of gaps
        //    g = 0;
    }
    while (u <= high)
    {
        if (!segment[(int) (u - low)])
            pi++;
        u += gaps[wh];
        wh++;
        if (wh == l)
            wh = 0;
    }
}
I noted this in the code as well, but it is worth explaining: because the segment size is relatively small, we will not go through the entire list of gaps within just one segment, so checking for that is redundant (assuming we use a 19-wheel). But in the broader scope of the program, we do make use of the entire array of gaps, so the variable u has to follow it and not accidentally run past its end:
while (u <= high)
{
    if (!segment[(int) (u - low)])
        pi++;
    u += gaps[wh];
    wh++;
    if (wh == l)
        wh = 0;
}
Using higher limits will eventually require a bigger segment, which might make it necessary to check that we don't run past the gaps list even within a single segment. This, or tweaking the wheel's prime base, might have that effect on the program. Switching to bit-sieving can largely raise the segment limit, though.
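To give an idea of what that could look like, here is a minimal sketch of bit-packing a segment (my own illustration, not part of the program above; java.util.BitSet offers the same functionality ready-made):
public class BitSegment {
    final long[] bits;

    BitSegment(int size) {
        bits = new long[(size + 63) >>> 6]; // 64 flags per long word
    }

    void markComposite(int i) {
        bits[i >>> 6] |= 1L << (i & 63); // set bit i
    }

    boolean isComposite(int i) {
        return (bits[i >>> 6] & (1L << (i & 63))) != 0; // test bit i
    }

    public static void main(String[] args) {
        BitSegment seg = new BitSegment(262144);
        seg.markComposite(9);
        System.out.println(seg.isComposite(9));  // true
        System.out.println(seg.isComposite(11)); // false
    }
}
This cuts the segment's memory footprint by a factor of 8 compared to boolean[], so a segment covering 8 times as many candidates still fits in the same cache.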
As an important side-note, I am aware that efficient segmentation is one that takes the L1 & L2 cache sizes into account. I get the fastest results using a segment size of 32,768 * 8 = 262,144 = 2^18. I am not sure what the cache size of my computer is, but I do not think it can be that big, as I see most cache sizes are <= 32,768. Still, this produces the fastest run time on my computer, so this is why it's the chosen segment size.
As I mentioned, I am still looking to improve this by a lot. I believe, according to my introduction, that multithreading can result in a speed-up factor of 4, using 4 threads (corresponding to 4 cores). The idea is that each thread will still use the segmented sieve, but work on a different portion: divide n into 4 equal portions, each thread in turn performing the segmentation on the n/4 elements it is responsible for, using the above program. My question is, how do I do that? Reading about multithreading and looking at examples unfortunately did not give me any insight into how to implement it efficiently in the case above. It seemed to me, contrary to the logic behind the idea, that the threads I tried were running sequentially rather than simultaneously, which is why I excluded them from the code to keep it readable. I will really appreciate a code sample showing how to do it in this specific code, but a good explanation and reference will maybe do the trick too.
Additionally, I would like to hear about more ways of speeding up this program even further; any ideas you have, I would love to hear! I really want to make it very fast and efficient. Thank you!

An example like this should help you get started.
An outline of a solution:
Define a data structure ("Task") that encompasses a specific segment; you can put all the immutable shared data into it for extra neatness, too. If you're careful enough, you can pass a common mutable array to all tasks, along with the segment limits, and only update the part of the array within these limits. This is more error-prone, but can simplify the step of joining the results (AFAICT; YMMV).
Define a data structure ("Result") that stores the result of a Task computation. Even if you just update a shared resulting structure, you may need to signal which part of that structure has been updated so far.
Create a Runnable that accepts a Task, runs a computation, and puts the results into a given result queue.
Create a blocking input queue for Tasks, and a queue for Results.
Create a ThreadPoolExecutor with the number of threads close to the number of machine cores.
Submit all your Tasks to the thread pool executor. They will be scheduled to run on the threads from the pool, and will put their results into the output queue, not necessarily in order.
Wait for all the tasks in the thread pool to finish.
Drain the output queue and join the partial results into the final result.
Extra speedup may (or may not) be achieved by joining the results in a separate task that reads the output queue, or even by updating a mutable shared output structure under synchronized, depending on how much work the joining step involves.
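To make the outline concrete, here is a minimal runnable sketch (my own simplification: ExecutorService futures stand in for the explicit task and result queues, and a plain byte-per-candidate sieve replaces the question's wheel and odds-only tricks; all names are illustrative):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSegmentedSieve {
    public static void main(String[] args) throws Exception {
        long limit = 1_000_000_000L;
        int sqrt = (int) Math.sqrt(limit);

        // shared and immutable after construction: the sieving primes <= sqrt(limit)
        int[] sievingPrimes = simpleSieve(sqrt);

        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int segSize = 1 << 18;

        // each submitted lambda is a "Task"; its Future<Long> is the "Result"
        List<Future<Long>> results = new ArrayList<>();
        for (long low = sqrt + 1; low <= limit; low += segSize) {
            final long lo = low;
            final long hi = Math.min(limit, low + segSize - 1);
            results.add(pool.submit(() -> countInSegment(lo, hi, sievingPrimes)));
        }

        long pi = sievingPrimes.length; // primes <= sqrt(limit)
        for (Future<Long> f : results)
            pi += f.get(); // join the partial results
        pool.shutdown();
        System.out.println(pi); // 50847534 for 10^9
    }

    // the "Task" body: sieve [low, high] against the shared primes, return the local count
    static long countInSegment(long low, long high, int[] primes) {
        boolean[] composite = new boolean[(int) (high - low + 1)];
        for (int p : primes) {
            long first = Math.max((long) p * p, (low + p - 1) / p * p); // first multiple of p in range
            for (long j = first; j <= high; j += p)
                composite[(int) (j - low)] = true;
        }
        long count = 0;
        for (boolean c : composite)
            if (!c) count++;
        return count;
    }

    // plain sieve for the small primes up to n
    static int[] simpleSieve(int n) {
        boolean[] composite = new boolean[n + 1];
        List<Integer> primes = new ArrayList<>();
        for (int i = 2; i <= n; i++) {
            if (!composite[i]) {
                primes.add(i);
                for (long j = (long) i * i; j <= n; j += i)
                    composite[(int) j] = true;
            }
        }
        return primes.stream().mapToInt(Integer::intValue).toArray();
    }
}
Each segment array is written only by the task that owns it, so there is no contention on mutable shared state; only the immutable sievingPrimes array is read by every worker.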
Hope this helps.

Are you familiar with the work of Tomas Oliveira e Silva? He has a very fast implementation of the Sieve of Eratosthenes.

How interested in speed are you? Would you consider using C++?
$ time ../c_code/segmented_bit_sieve 1000000000
50847534 primes found.
real 0m0.875s
user 0m0.813s
sys 0m0.016s
$ time ../c_code/segmented_bit_isprime 1000000000
50847534 primes found.
real 0m0.816s
user 0m0.797s
sys 0m0.000s
(on my newish laptop with an i5)
The first is from Kim Walisch, using a bit array of odd prime candidates.
https://github.com/kimwalisch/primesieve/wiki/Segmented-sieve-of-Eratosthenes
The second is my tweak to Kim's with IsPrime[] also implemented as bit array, which is slightly less clear to read, although a little faster for big N due to the reduced memory footprint.
I will read your post carefully as I am interested in primes and performance no matter what language is used. I hope this isn't too far off topic or premature. But I noticed I was already beyond your performance goal.

Related

How to calculate the probability of getting the sum X using N six-sided dice

The Challenge:
For example, what is the probability of getting the sum of 15 when using 3 six-sided dice? This can happen, for example, by rolling 5-5-5, 6-6-3, 3-6-6, or many other options.
A brute force solution for 2 dice - with complexity of 6^2:
Assuming we had only 2 six-sided dice, we can write very basic code like this:
public static void main(String[] args) {
    System.out.println(whatAreTheOdds(7));
}

public static double whatAreTheOdds(int wantedSum) {
    if (wantedSum < 2 || wantedSum > 12) {
        return 0;
    }
    int wantedFound = 0;
    int totalOptions = 36;
    for (int i = 1; i <= 6; i++) {
        for (int j = 1; j <= 6; j++) {
            int sum = i + j;
            if (sum == wantedSum) {
                System.out.println("match: " + i + " " + j);
                wantedFound += 1;
            }
        }
    }
    System.out.println("combinations count:" + wantedFound);
    return (double) wantedFound / totalOptions;
}
And the output for 7 will be:
match: 1 6
match: 2 5
match: 3 4
match: 4 3
match: 5 2
match: 6 1
combinations count:6
0.16666666666666666
The question is how to generalize the algorithm to support N dice:
public static double whatAreTheOdds(int wantedSum, int numberOfDices)
Because we can't dynamically create nested for loops, we must come up with a different approach.
I thought of something like this:
public static double whatAreTheOdds(int sum, int numberOfDices) {
    int sum;
    for (int i = 0; i < numberOfDices; i++) {
        for (int j = 1; j <= 6; j++) {
        }
    }
}
but failed to come up with the right algorithm.
Another challenge here is - is there a way to do it efficiently, and not in a complexity of 6^N?
Here is a recursive solution with memoization to count the combinations.
import java.util.Arrays;

class Dices {
    public static final int DICE_FACES = 6;

    public static void main(String[] args) {
        System.out.println(whatAreTheOdds(40, 10));
    }

    public static double whatAreTheOdds(int sum, int dices) {
        if (dices < 1 || sum < dices || sum > DICE_FACES * dices) return 0;
        long[][] mem = new long[dices][sum];
        for (long[] mi : mem) {
            Arrays.fill(mi, 0L);
        }
        long n = whatAreTheOddsRec(sum, dices, mem);
        return n / Math.pow(DICE_FACES, dices);
    }

    private static long whatAreTheOddsRec(int sum, int dices, long[][] mem) {
        if (dices <= 1) {
            return 1;
        }
        long n = 0;
        int dicesRem = dices - 1;
        int minFace = Math.max(sum - DICE_FACES * dicesRem, 1);
        int maxFace = Math.min(sum - dicesRem, DICE_FACES);
        for (int i = minFace; i <= maxFace; i++) {
            int sumRem = sum - i;
            long ni = mem[dicesRem][sumRem];
            if (ni <= 0) {
                ni = whatAreTheOddsRec(sumRem, dicesRem, mem);
                mem[dicesRem][sumRem] = ni;
            }
            n += ni;
        }
        return n;
    }
}
Output:
0.048464367913724195
EDIT: For the record, with memoization the complexity is no longer O(6^n): there are O(dices * sum) distinct subproblems, each solved once with at most 6 branches, so the count takes roughly O(n^2) time. This answer just aims to give a possible implementation for the general case that is better than the simplest implementation, using memoization and search-space pruning (exploring only feasible solutions).
As Alex's answer notes, there is a combinatorial formula for this: the number of ways is
N(p, n, s) = sum for k = 0 to floor((p - n) / s) of (-1)^k * C(n, k) * C(p - s*k - 1, n - 1)
In this formula, p is the sum of the numbers rolled (X in your question), n is the number of dice, and s is the number of sides each die has (6 in your question); dividing N by s^n gives the probability. Whether the binomial coefficients are evaluated using loops or precomputed using Pascal's triangle, either way the time complexity is O(n^2) if we take s = 6 to be a constant and X - n to be O(n).
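For illustration, here is a direct Java sketch of that formula (identifiers are my own):
public class DiceFormula {
    public static void main(String[] args) {
        System.out.println(combinations(15, 3, 6));                    // 10
        System.out.println(combinations(15, 3, 6) / Math.pow(6, 3));   // ~0.0463
    }

    // number of ways to roll a total of p with n s-sided dice (inclusion-exclusion)
    static long combinations(int p, int n, int s) {
        long total = 0;
        for (int k = 0; k <= (p - n) / s; k++) {
            long term = binomial(n, k) * binomial(p - s * k - 1, n - 1);
            total += (k % 2 == 0) ? term : -term;
        }
        return total;
    }

    // binomial coefficient via the multiplicative formula; exact for small arguments
    static long binomial(int a, int b) {
        if (b < 0 || b > a) return 0;
        long r = 1;
        for (int i = 1; i <= b; i++)
            r = r * (a - b + i) / i;
        return r;
    }
}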
Here is an alternative algorithm, which computes all of the probabilities at once. The idea is to use discrete convolution to compute the distribution of the sum of two random variables given their distributions. By using a divide and conquer approach as in the exponentiation by squaring algorithm, we only have to do O(log n) convolutions.
The pseudocode is below; sum_distribution(v, n) returns an array where the value at index X - n is the number of combinations where the sum of n dice rolls is X.
// for exact results using integers, let v = [1, 1, 1, 1, 1, 1]
// and divide the result through by 6^n afterwards
let v = [1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0]

sum_distribution(distribution, n)
    if n == 0
        return [1]
    else if n == 1
        return distribution
    else
        let r = convolve(distribution, distribution)
        // the division here rounds down
        let d = sum_distribution(r, n / 2)
        if n is even
            return d
        else
            return convolve(d, distribution)
Convolution cannot be done in linear time, so the running time is dominated by the last convolution on two arrays of length 3n, since the other convolutions are on sufficiently shorter arrays.
This means if you use a simple convolution algorithm, it should take O(n2) time to compute all of the probabilities, and if you use a fast Fourier transform then it should take O(n log n) time.
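Here is a Java sketch of the pseudocode above, using exact integer counts and the simple quadratic convolution (class and method names are mine):
public class DiceConvolution {
    public static void main(String[] args) {
        int n = 3;
        long[] die = {1, 1, 1, 1, 1, 1}; // exact counts: one way per face
        long[] dist = sumDistribution(die, n);
        // index X - n holds the number of combinations whose sum is X
        System.out.println(dist[15 - n]);                  // 10
        System.out.println(dist[15 - n] / Math.pow(6, n)); // ~0.0463
    }

    // distribution of the sum of n dice, by exponentiation by squaring
    static long[] sumDistribution(long[] distribution, int n) {
        if (n == 0) return new long[]{1};
        if (n == 1) return distribution.clone();
        long[] r = convolve(distribution, distribution);
        long[] d = sumDistribution(r, n / 2); // the division rounds down
        return (n % 2 == 0) ? d : convolve(d, distribution);
    }

    // simple O(a.length * b.length) discrete convolution
    static long[] convolve(long[] a, long[] b) {
        long[] result = new long[a.length + b.length - 1];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b.length; j++)
                result[i + j] += a[i] * b[j];
        return result;
    }
}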
You might want to take a look at the Wolfram article for a completely different approach, which calculates the desired probability with a single loop.
The idea is to have an array storing the current "state" of each die, starting with every die at one, and counting upwards. For example, with three dice you would generate the combinations:
111
112
...
116
121
122
...
126
...
665
666
Once you have a state, you can easily find if the sum is the one you are looking for.
I leave the details to you, as it seems a useful learning exercise :)

Parallelizing Sieve of Eratosthenes in Java

I am trying to make a parallel implementation of the Sieve of Eratosthenes. I made a boolean list which gets filled up with true's for the given size. Whenever a prime is found, all multiples of that prime are marked false in the boolean list.
The way I am trying to make this algorithm parallel is by firing up a new thread while still filtering the initial prime number. For example, the algorithm starts with prime = 2. In the filter's for loop, when i reaches prime * prime, I start another for loop in which every number between the prime (2) and prime * prime (4) is checked. If the value at that index in the boolean list is still true, I fire up another thread to filter that prime number.
The nested for loop creates more and more overhead as the prime numbers to filter progress, so I limited it to run only when the prime number is < 100. I am assuming that by that time, the 100 million numbers will be somewhat filtered. The problem is that this way, the count of primes to be filtered stays just under 9,500, while the algorithm should stop at about 10,000 primes (prime * prime < size(100m)). I also think this is not at all the correct way to go about it. I have searched a lot online, but didn't manage to find any examples of parallel Java implementations of the sieve.
My code looks like this:
Main class:
public class Main {
    private static ListenableQueue<Integer> queue = new ListenableQueue<>(new LinkedList<>());
    private static ArrayList<Integer> primes = new ArrayList<>();
    private static boolean serialList[];
    private static ArrayList<Integer> serialPrimes = new ArrayList<>();
    private static ExecutorService exec = Executors.newFixedThreadPool(10);
    private static int size = 100000000;
    private static boolean list[] = new boolean[size];
    private static int lastPrime = 2;

    public static void main(String[] args) {
        Arrays.fill(list, true);
        parallel();
    }

    public static void parallel() {
        Long startTime = System.nanoTime();
        int firstPrime = 2;
        exec.submit(new Runner(size, list, firstPrime));
    }

    public static void parallelSieve(int size, boolean[] list, int prime) {
        int queuePrimes = 0;
        for (int i = prime; i * prime <= size; i++) {
            try {
                list[i * prime] = false;
                if (prime < 100) {
                    if (i == prime * prime && queuePrimes <= 1) {
                        for (int j = prime + 1; j < i; j++) {
                            if (list[j] && j % prime != 0 && j > lastPrime) {
                                lastPrime = j;
                                startNewThread(j);
                                queuePrimes++;
                            }
                        }
                    }
                }
            } catch (ArrayIndexOutOfBoundsException ignored) { }
        }
    }

    private static void startNewThread(int newPrime) {
        if ((newPrime * newPrime) < size) {
            exec.submit(new Runner(size, list, newPrime));
        } else {
            exec.shutdown();
            for (int i = 2; i < list.length; i++) {
                if (list[i]) {
                    primes.add(i);
                }
            }
        }
    }
}
Runner class:
public class Runner implements Runnable {
    private int arraySize;
    private boolean[] list;
    private int k;

    public Runner(int arraySize, boolean[] list, int k) {
        this.arraySize = arraySize;
        this.list = list;
        this.k = k;
    }

    @Override
    public void run() {
        Main.parallelSieve(arraySize, list, k);
    }
}
I feel like there is a much simpler way to solve this...
Do you guys have any suggestions as to how I can make this parallelization working and maybe a bit simpler?
Creating a performant concurrent implementation of an algorithm like the Sieve of Eratosthenes is somewhat more difficult than creating a performant single-threaded implementation. The reason is that you need to find a way to partition the work in a way that minimises communication and interference between the parallel worker threads.
If you achieve complete isolation then you can hope for a speed increase approaching the number of logical processors available, or about one order of magnitude on a typical modern PC. By contrast, using a decent single-threaded implementation of the sieve will give you a speedup of at least two to three orders of magnitude. One simple cop-out would be to simply load the data from a file when needed, or to shell out to a decent prime-sieving program like Kim Walisch's PrimeSieve.
Even if we only want to look at the parallelisation problem, it is still necessary to have some insight into the algorithm itself and into the machine it runs on.
The most important aspect is that modern computers have deep cache hierarchies where only the L1 cache - typically 32 KB - is accessible at full speed and all other memory accesses incur significant penalties. Translated to the Sieve of Eratosthenes this means that you need to sieve your target range one 32 KB window at a time, instead of striding each prime over many megabytes. The small primes up to the square root of the target range end must be sieved before the parallel dance begins, but then each segment or window can be sieved independently.
Sieving a given window or segment necessitates determining the start offsets for the small primes that you want to sieve by, which means at least one modulo division per small prime per window, and division is an extremely slow operation. However, if you sieve consecutive segments instead of arbitrary windows placed anywhere in the range, then you can keep the end offsets for each prime in a vector and use them as start offsets for the next segment, thus eliminating the expensive computation of the start offset.
Thus, one promising parallelisation strategy for the Sieve of Eratosthenes would be to give each worker thread a contiguous group of 32 KB blocks to sieve, so that the start offset calculation needs to happen only once per worker. This way there cannot be memory access contention between workers, since each has its own independent subrange of the target range.
However, before you begin to parallelise - i.e., make your code more complex - you should first slim it down and reduce the work to be done to the absolute essentials. For example, take a look at this fragment from your code:
for (int i = prime; i * prime <= size; i++)
    list[i * prime] = false;
Instead of recomputing loop bounds in every iteration and indexing with a multiplication, check the loop variable against a precomputed, loop-invariant value and reduce the multiplication to iterated addition:
for (int o = prime * prime; o <= size; o += prime)
    list[o] = false;
There are two simple sieve-specific optimisations that can give significant speed boosts.
1) Leave the even numbers out of your sieve and pull the prime 2 out of thin air when needed. Bingo, you just doubled your performance.
2) Instead of sieving each segment by the small odd primes 3, 5, 7 and so on, blast a precomputed pattern over the segment (or even the whole range). This saves time because these small primes make many, many steps in each segment and account for the lion's share of sieving time.
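To illustrate 2), here is a minimal sketch of my own; it ignores the odds-only layout from 1), and the demo values are arbitrary:
public class Presieve {
    static final int PERIOD = 3 * 5 * 7; // 105
    static final boolean[] PATTERN = buildPattern();

    static boolean[] buildPattern() {
        boolean[] p = new boolean[PERIOD];
        for (int prime : new int[] {3, 5, 7})
            for (int j = 0; j < PERIOD; j += prime)
                p[j] = true; // p[j] == true: any number congruent to j mod 105 is divisible by 3, 5 or 7
        return p;
    }

    // stamp the pattern over a segment representing [low, low + segment.length);
    // assumes low > 7, so the primes 3, 5 and 7 themselves are not in the segment
    // (production code would System.arraycopy a pre-rotated pattern instead of
    // taking a remainder per element)
    static void presieve(boolean[] segment, long low) {
        int phase = (int) (low % PERIOD);
        for (int i = 0; i < segment.length; i++)
            segment[i] = PATTERN[(phase + i) % PERIOD];
    }

    public static void main(String[] args) {
        boolean[] segment = new boolean[32];
        presieve(segment, 100);
        System.out.println(segment[5]); // 105 = 3*5*7, marked: true
        System.out.println(segment[9]); // 109 is prime, not marked: false
    }
}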
There are more possible optimisations including a couple more low-hanging fruit but either the returns are diminishing or the effort curve rises steeply. Try searching Code Review for 'sieve'. Also, don't forget that you're fighting a Java compiler in addition to the algorithmic problem and the machine architecture, i.e. things like array bounds checking which your compiler may or may not be able to hoist out of loops.
To give you a ballpark figure: a single-threaded segmented odds-only sieve with precomputed patterns can sieve the whole 32-bit range in 2 to 4 seconds in C#, depending on how much TLC you apply in addition to things mentioned above. Your much smaller problem of primes up to 100000000 (1e8) is solved in less than 100 ms on my aging notebook.
Here's some code that shows how windowed sieving works. For clarity I left off all optimisations like odds-only representation or wheel-3 stepping when reading out the primes and so on. It's C# but that should be similar enough to Java to be readable.
Note: I called the sieve array eliminated because a true value indicates a crossed-off number (saves filling the array with all true at the beginning and it is more logical anyway).
static List<uint> small_primes_between (uint m, uint n)
{
    m = Math.Max(m, 2);
    if (m > n)
        return new List<uint>();
    Trace.Assert(n - m < int.MaxValue);
    uint sieve_bits = n - m + 1;
    var eliminated = new bool[sieve_bits];
    foreach (uint prime in small_primes_up_to((uint)Math.Sqrt(n)))
    {
        uint start = prime * prime, stride = prime;
        if (start >= m)
            start -= m;
        else
            start = (stride - 1) - (m - start - 1) % stride;
        for (uint j = start; j < sieve_bits; j += stride)
            eliminated[j] = true;
    }
    return remaining_numbers(eliminated, m);
}
//---------------------------------------------------------------------------------------------
static List<uint> remaining_numbers (bool[] eliminated, uint sieve_base)
{
    var result = new List<uint>();
    for (uint i = 0, e = (uint)eliminated.Length; i < e; ++i)
        if (!eliminated[i])
            result.Add(sieve_base + i);
    return result;
}
//---------------------------------------------------------------------------------------------
static List<uint> small_primes_up_to (uint n)
{
    Trace.Assert(n < int.MaxValue); // size_t is int32_t in .Net (!)
    var eliminated = new bool[n + 1]; // +1 because indexed by numbers
    eliminated[0] = true;
    eliminated[1] = true;
    for (uint i = 2, sqrt_n = (uint)Math.Sqrt(n); i <= sqrt_n; ++i)
        if (!eliminated[i])
            for (uint j = i * i; j <= n; j += i)
                eliminated[j] = true;
    return remaining_numbers(eliminated, 0);
}

O(log n) Programming

I am trying to prepare for a contest but my program speed is always dreadfully slow as I use O(n). First of all, I don't even know how to make it O(log n), or I've never heard about this paradigm. Where can I learn about this?
For example,
If you had an integer array of zeroes and ones, such as [0, 0, 0, 1, 0, 1], and you wanted to replace every 0 with 1 only if one of its neighbors has the value 1, what is the most efficient way to go about it if this must occur t times? (The program must repeat this operation t times.)
EDIT:
Here's my inefficient solution:
import java.util.Scanner;

public class Main {
    static Scanner input = new Scanner(System.in);

    public static void main(String[] args) {
        int n;
        long t;
        n = input.nextInt();
        t = input.nextLong();
        input.nextLine();
        int[] units = new int[n + 2];
        String inputted = input.nextLine();
        input.close();
        for (int i = 1; i <= n; i++) {
            units[i] = Integer.parseInt(("" + inputted.charAt(i - 1)));
        }
        int[] original;
        for (int j = 0; j <= t - 1; j++) {
            units[0] = units[n];
            units[n + 1] = units[1];
            original = units.clone();
            for (int i = 1; i <= n; i++) {
                if (((original[i - 1] == 0) && (original[i + 1] == 1)) || ((original[i - 1] == 1) && (original[i + 1] == 0))) {
                    units[i] = 1;
                } else {
                    units[i] = 0;
                }
            }
        }
        for (int i = 1; i <= n; i++) {
            System.out.print(units[i]);
        }
    }
}
This is an elementary cellular automaton. Such a dynamical system has properties that you can use to your advantage. In your case, for example, you can set to 1 every cell at distance at most t from any initial 1 (the "cone of light" property). Then you may do something like:
get a 1 in the original sequence, say it is located at position p.
set to 1 every position from p-t to p+t.
You may then use to your advantage, in the next step, the fact that you've already set positions p-t to p+t... This can let you compute the final step t without computing the intermediary steps (a good acceleration factor, isn't it?).
You can also use tricks such as HashLife.
As I was saying in the comments, I'm fairly sure you can do away with the array and clone operations.
You can modify a StringBuilder in place, so there is no need to convert back and forth between int[] and String.
For example, (note: This is on the order of an O(n) operation for all T <= N)
public static void main(String[] args) {
    System.out.println(conway1d("0000001", 7, 1));
    System.out.println(conway1d("01011", 5, 3));
}

private static String conway1d(CharSequence input, int N, long T) {
    System.out.println("Generation 0: " + input);
    StringBuilder sb = new StringBuilder(input); // Will update this for all generations
    StringBuilder copy = new StringBuilder(); // store a copy to reference current generation
    for (int gen = 1; gen <= T; gen++) {
        // Copy over next generation string
        copy.setLength(0);
        copy.append(input);
        for (int i = 0; i < N; i++) {
            conwayUpdate(sb, copy, i, N);
        }
        input = sb.toString(); // next generation string
        System.out.printf("Generation %d: %s\n", gen, input);
    }
    return input.toString();
}

private static void conwayUpdate(StringBuilder nextGen, final StringBuilder currentGen, int charPos, int N) {
    int prev = (N + (charPos - 1)) % N;
    int next = (charPos + 1) % N;
    // **Exactly one** adjacent '1'
    boolean adjacent = currentGen.charAt(prev) == '1' ^ currentGen.charAt(next) == '1';
    nextGen.setCharAt(charPos, adjacent ? '1' : '0'); // set cell as alive or dead
}
For the two samples in the problem you posted in the comments, this code generates this output.
Generation 0: 0000001
Generation 1: 1000010
1000010
Generation 0: 01011
Generation 1: 00011
Generation 2: 10111
Generation 3: 10100
10100
The BigO notation is a simplification for understanding the complexity of an algorithm. Basically, two O(n) algorithms can have very different execution times. Why? Let's unroll your example:
You have two nested loops. The outer loop will run t times.
The inner loop will run n times.
Each time the loop body executes, it takes a constant time k.
So, in essence, your algorithm is O(k * t * n). If t is of the same order of magnitude as n, then you can consider the complexity to be O(k * n^2).
There are two approaches to optimizing this algorithm:
Reduce the constant time k. For example, do not clone the whole array on each loop, because it is very time consuming (clone needs a full pass over the array).
The second optimization in this case is to use Dynamic Programming (https://en.wikipedia.org/wiki/Dynamic_programming), which can cache information between two loops and optimize the execution; that can lower k, or even lower the complexity from O(n^2) to O(n * log n).

collatz sequence - optimising code

As an additional question to an assignment, we were asked to find the 10 starting numbers (n) that produce the longest collatz sequence. (Where 0 < n < 10,000,000,000) I wrote code that would hopefully accomplish this, but I estimate that it would take a full 11 hours to compute an answer.
I have noticed a couple of small optimisations like starting from biggest to smallest so adding to the array is done less, and only computing between 10,000,000,000/2^10 (=9765625) and 10,000,000,000 because there has to be 10 sequences of longer length, but I can't see anything more I could do. Can anyone help?
Relevant Code
The Sequence Searching Alg
long[][] longest = new long[2][10]; // terms / starting number
long max = 10000000000L; // 10 billion
for (long i = max; i >= 9765625; i--) {
    long n = i;
    long count = 1; // terms in the sequence
    while (n > 1) {
        if ((n & 1) == 0) n /= 2; // checks if the last bit is a 0
        else {
            n = (3 * n + 1) / 2;
            count++;
        }
        count++;
    }
    if (count > longest[0][9]) {
        longest = addToArray(count, i, longest);
        currentBest(longest); // prints the currently stored top 10
    }
}
The storage alg
public static long[][] addToArray(long count, long i, long[][] longest) {
    int pos = 0;
    while (count < longest[0][pos]) {
        pos++;
    }
    long TEMP = count; // terms
    long TEMPb = i; // starting number
    for (int a = pos; a < longest[0].length; a++) {
        long TEMP2 = longest[0][a];
        longest[0][a] = TEMP;
        TEMP = TEMP2;
        long TEMP2b = longest[1][a];
        longest[1][a] = TEMPb;
        TEMPb = TEMP2b;
    }
    return longest;
}
You can do something like
while (true) {
    int ntz = Long.numberOfTrailingZeros(n);
    count += ntz;
    n >>>= ntz; // Using unsigned shift allows to work with bigger numbers.
    if (n == 1) break;
    n = 3 * n + 1;
    count++;
}
which should be faster, as it does multiple steps at once and avoids unpredictable branches. numberOfTrailingZeros is a JVM intrinsic taking just one cycle on modern desktop CPUs. However, it's not very efficient on its own, as the average number of trailing zeros is only 2.
The Wikipedia article explains how to do multiple steps at once. This is based on the observation that knowing the k least significant bits is sufficient to determine the future steps up to the point when the k-th halving happens. My best result based on this (with k=17), and filtering out some non-promising values, is 57 seconds for the determination of the maximum in the range 1 .. 1e10.
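A complementary trick that is often combined with the above (my own sketch, not the k-bit table method from the article): cache the total sequence length for small starting values, so most chains can stop as soon as they fall below the cache bound:
public class CollatzCache {
    static final int CACHE_SIZE = 1 << 20;
    static final int[] cache = new int[CACHE_SIZE]; // 0 means "not computed yet"
    static { cache[1] = 1; } // the sequence "1" has one term

    // counts terms the same way as the question's code (3n+1 and each halving
    // are separate steps, plus one for the starting term)
    static long collatzLength(long start) {
        long n = start;
        long steps = 0;
        while (n >= CACHE_SIZE || cache[(int) n] == 0) {
            if ((n & 1) == 0) n >>>= 1;
            else n = 3 * n + 1;
            steps++;
        }
        long total = steps + cache[(int) n];
        if (start < CACHE_SIZE) cache[(int) start] = (int) total;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(collatzLength(27)); // 112 terms, a classic test value
    }
}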

How can I make this Java snippet more efficient?

In a computer contest, I was given a problem where I had to manipulate input data. The input has been split() into an array where data[0] is the number of repetitions. There can be up to 10^18 repetitions. My program returned Exception in thread "main" java.lang.OutOfMemoryError: Java heap space and I failed the contest.
Here's a piece of my code that's eating up memory and CPU:
long product[][] = new long[data[0]][2];
product[0][0] = data[1];
product[0][1] = data[2];
for (int a = 1; a < data[0]; a++) {
    product[a][0] = ((data[5] * product[a - 1][0] + data[6]) % data[3]) + 1; // P_i = ((A*P_(i-1) + B) mod M) + 1 (for all i = 2..N)
    product[a][1] = ((data[7] * product[a - 1][1] + data[8]) % data[4]) + 1; // W_i = ((C*W_(i-1) + D) mod K) + 1 (for all i = 2..N)
}
Here's some of the input data:
980046644627629799 9 123456 18 10000000 831918484 451864686 840000324 650000765
972766173386786486 123 1 10000000 10000000 590000001 680000000 610000001 970000002
299896237124947938 681206 164538 2280874 981991 416793690 904023823 813682336 774801135
My program can only work up to about 7 or 8 digits, then it takes minutes to run. With 18 digits, it crashed almost as soon as I clicked "Run" in Eclipse.
I'm curious as to how is it possible to manipulate that much data on a normal computer. Please let me know if my question is unclear or you need more information. Thanks!
You can't have, and don't need, an array of such a huge length. You just need to track the most recent 2 values, e.g., just have product1 and product2.
Also, consider testing if either number is a NaN after each iteration. If so, throw an Exception and give the iteration number.
Because once you get a NaN they will all be NaN. Except you are using long, so scratch that. "Nevermind". :-)
long product[][]=new long[data[0]][2];
This is the only line in the code you pasted that allocates memory. You allocate an array whose length is data[0]! As data[0] grows, so does the array. What is the formula you're trying to apply here?
The first input data you provide :
980046644627629799
is already too large to even declare an array for. Try creating a single dimension array with that as its length and see what happens....
Are you sure you don't just want a 1 x 2 matrix that you accumulate over? Explain your intended algorithm clearly and we can help you with a more optimal solution.
Let's put the numbers into perspective.
Memory: One long takes 8 bytes. 10^18 pairs of longs take 16,000,000 terabytes. Way too much.
Time: 10,000,000 operations ≈ 1 second. 10^18 steps ≈ 30 centuries. Also way too much.
You can solve the memory problem by realising that you only need the most recent values at any time, and that the entire array is redundant:
long currentP = data[1];
long currentW = data[2];
for (int a = 1; a < data[0]; a++)
{
    currentP = ((data[5] * currentP + data[6]) % data[3]) + 1;
    currentW = ((data[7] * currentW + data[8]) % data[4]) + 1;
}
The time problem is a bit trickier to solve. Since modulus is used, you can observe that the numbers must enter a cycle at some point. Once you find the cycle, you can predict what the value will be after n iterations without having to do each iteration manually.
The simplest method for finding cycles is to keep track of whether or not you visited each element, and then go through until you encounter an element you've seen before. In this situation, the amount of memory required is proportional to M and K (data[3] and data[4]). If they are too large, a more space-efficient cycle detection algorithm must be used (a constant-memory sketch follows the example below).
Here is an example which finds the value for P:
public static void main(String[] args)
{
    // value = (A * prevValue + B) % M + 1
    final long NOT_SEEN = -1; // the code used for values not visited before
    long[] data = { 980046644627629799L, 9, 123456, 18, 10000000, 831918484, 451864686, 840000324, 650000765 };
    long N = data[0]; // the number of iterations
    long S = data[1]; // the initial value of the sequence
    long M = data[3]; // the modulus divisor
    long A = data[5]; // multiply by this
    long B = data[6]; // add this
    int max = (int) Math.max(M, S); // all the numbers (except the first) must be less than or equal to M
    long[] seenTime = new long[max + 1]; // whether or not a value was seen and how many iterations it took
    // initialize the values of 'seenTime' to 'not seen'
    for (int i = 0; i < seenTime.length; i++)
    {
        seenTime[i] = NOT_SEEN;
    }
    // find the cycle
    long count = 0;
    long cycleValue = S; // the current value in the series
    while (seenTime[(int) cycleValue] == NOT_SEEN)
    {
        seenTime[(int) cycleValue] = count;
        cycleValue = (A * cycleValue + B) % M + 1;
        count++;
    }
    long cycleLength = count - seenTime[(int) cycleValue];
    long cycleOffset = seenTime[(int) cycleValue];
    long result;
    if (N < cycleOffset)
    {
        // Special case: requested iteration occurs before the cycle starts
        // Straightforward simulation
        long value = S;
        for (long i = 0; i < N; i++)
        {
            value = (A * value + B) % M + 1;
        }
        result = value;
    }
    else
    {
        // Normal case: requested iteration occurs inside the cycle
        // Simulate just the relevant part of one cycle
        long positionInCycle = (N - cycleOffset) % cycleLength;
        long value = cycleValue;
        for (long i = 0; i < positionInCycle; i++)
        {
            value = (A * value + B) % M + 1;
        }
        result = value;
    }
    System.out.println(result);
}
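If M or K were too large for the seenTime array, a constant-memory alternative along these lines could be used instead (my sketch of Floyd's cycle detection for the same recurrence; the demo values come from the P-recurrence of the first sample line):
public class FloydCycle {
    // the recurrence from the question; watch A * x for long overflow if A and M are both huge
    static long f(long x, long A, long B, long M) {
        return (A * x + B) % M + 1;
    }

    // returns { mu, lambda } for the sequence S, f(S), f(f(S)), ...
    static long[] findCycle(long S, long A, long B, long M) {
        long tortoise = f(S, A, B, M);
        long hare = f(f(S, A, B, M), A, B, M);
        while (tortoise != hare) { // advance 1x and 2x until they meet inside the cycle
            tortoise = f(tortoise, A, B, M);
            hare = f(f(hare, A, B, M), A, B, M);
        }
        long mu = 0; // distance from S to the start of the cycle
        tortoise = S;
        while (tortoise != hare) {
            tortoise = f(tortoise, A, B, M);
            hare = f(hare, A, B, M);
            mu++;
        }
        long lambda = 1; // cycle length
        hare = f(tortoise, A, B, M);
        while (tortoise != hare) {
            hare = f(hare, A, B, M);
            lambda++;
        }
        return new long[] { mu, lambda };
    }

    public static void main(String[] args) {
        long[] cycle = findCycle(9L, 831918484L, 451864686L, 18L);
        System.out.println("mu = " + cycle[0] + ", lambda = " + cycle[1]);
    }
}
The resulting mu (cycle offset) and lambda (cycle length) plug into the same "position in cycle" arithmetic as the example above.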
I am only giving you the solution because it looks like the contest is over. The important lesson to learn from this is that you should always check the bounds to see whether your solution is practical before you start coding it up.
