Parallelizing Sieve of Eratosthenes Algorithm for finding Prime Number - java

Parallelize Sieve of Eratosthenes method in two ways
using Java and
using C/C++ with Pthreads
Find the best values for THRESHOLD for 2 and 4 Core CPUs.
Can anyone help me with how to do this? I am learning threads in Java and C/C++. What will I need to parallelize this algorithm?

Note that using the Sieve of Eratosthenes method to find the prime number table, once you find a prime number i, you mark i*n as non-prime for each n.
Note that for 2 elements that you know are prime numbers, i and j, you can do this in parallel: i does not require any information from j and vice versa.
The same of course holds for each k prime numbers.
Note however that finding whether a number is prime depends on the previous calculations! Thus, a barrier is needed between marking numbers as non-prime and finding the next prime number to work with.
A good place to start could be:
repeat until finished filling the table:
1. Find k prime numbers [serially]
2. Fill the table for these k numbers [in parallel]
3. After all threads have finished step 2 [barrier] - return to 1
Since it seems homework, I'll only give these hints, and let you do the rest of the work. If you later have any specific problem - ask a new question and show us what you already did.
EDIT: one more important hint: note that if i is prime, the calculation of all the non-prime numbers that derive from i will not affect the fact that j is prime for each i < j < 2i. Use this fact to find the k prime numbers; you don't want, for example, to take 2, 3, 4 as prime numbers on your first iteration of the algorithm.
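The three steps above can be sketched in Java. This is a rough illustration, not a tuned implementation: it uses ExecutorService.invokeAll as the barrier of step 3, picks a single prime per round (k = 1) for simplicity, and all names are mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelSieve {
    // composite[i] == true means i has been marked as non-prime
    public static boolean[] sieve(int n, int nThreads) throws InterruptedException {
        boolean[] composite = new boolean[n + 1];
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int p = 2; (long) p * p <= n; p++) {
            if (composite[p]) continue;              // step 1: find the next prime [serially]
            final int prime = p;
            long lo = (long) prime * prime;
            long span = (n - lo) / nThreads + 1;
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int t = 0; t < nThreads; t++) {     // step 2: mark multiples [in parallel]
                final long start = lo + t * span;
                final long end = Math.min(n, start + span - 1);
                tasks.add(() -> {
                    long first = ((start + prime - 1) / prime) * prime; // round up to a multiple
                    for (long m = first; m <= end; m += prime) {
                        composite[(int) m] = true;
                    }
                    return null;
                });
            }
            pool.invokeAll(tasks);                   // step 3: barrier - waits for all markers
        }
        pool.shutdown();
        return composite;
    }
}
```

Each thread's section is disjoint, so no two threads write the same index for a given prime, and invokeAll blocks until every marking task finishes, which plays the role of the barrier. The THRESHOLD tuning from the assignment would decide below which n a plain serial loop beats the thread overhead.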

Prime sieves use a lot of space.
In the case of 1 bit per number, you'd need 1 gigabit (2^30) to store the sieve for N ≈ 10^9, and 1 terabit (2^40) to store the sieve for N ≈ 10^12, which is 2^37 bytes = 2^7 GB = 128 GB.
If only storing odd numbers (16 numbers per byte), 64 GB is needed for N = 10^12. With your laptop's 16 GB you'll be limited to N < 2^35. Good to start cracking 70-bit RSA.
If you make the algorithm much more complex and don't store multiples of 3, 5, or 7 in the sieve, 1 byte would hold sieve bits for 30 numbers, which would let N reach ~2^36 = 2^(30+6) ≈ 64 × 10^9, dangerous for RSA-72 :lol: To crack RSA-1024, you'd need (1024/2 = 512, 512 - 36 = 476)
2^476 (~10^143) times more memory.
So memory usage is the main problem.
After all, even fast RAM is hundreds of times slower than the CPU's L1 cache, so you want to fit the data into 32 KB or up to 256 KB instead of zillions of TBs.
As the sieve is accessed pretty randomly, it is not automagically loaded into cache and you'd get continuous cache misses. You need to go through the range of numbers finishing the whole job chunk by chunk, where the chunk size matches your fastest memory (L1/L2 of CPU or GPU).
Let's skip the complex part...
You'll have a list of primes that are needed to unmark composites in the current chunk. If already sieving big numbers, say 10^30 up to 10^30 + 10^8, the range of 100M numbers has around 1M primes waiting in the list just to unmark one of their composites; each prime of that size would need (I guess) around 128 B to be stored, so you'd need 128 MB for it. Luckily this list is accessed sequentially and it may already be in cache when needed. But anyway you'll need zillions of bytes to store the prime lists for the next sieves.
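Going through the range chunk by chunk like this is what a segmented sieve does. Below is a minimal single-threaded Java sketch of the idea (class name and segment size are mine; a real implementation would size the segment to fit L1/L2 cache):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SegmentedSieve {
    // Sieve [2, n] one fixed-size segment at a time, reusing a small list
    // of "base" primes up to sqrt(n) to unmark composites in each segment.
    public static List<Integer> primes(int n, int segmentSize) {
        int limit = (int) Math.sqrt(n);
        boolean[] small = new boolean[limit + 1];
        List<Integer> basePrimes = new ArrayList<>();
        for (int i = 2; i <= limit; i++) {            // ordinary sieve up to sqrt(n)
            if (small[i]) continue;
            basePrimes.add(i);
            for (long j = (long) i * i; j <= limit; j += i) small[(int) j] = true;
        }
        List<Integer> result = new ArrayList<>(basePrimes);
        boolean[] segment = new boolean[segmentSize]; // reused for every chunk
        for (long lo = limit + 1; lo <= n; lo += segmentSize) {
            long hi = Math.min(n, lo + segmentSize - 1);
            Arrays.fill(segment, false);
            for (int p : basePrimes) {
                // first multiple of p inside [lo, hi], never below p*p
                long first = Math.max((long) p * p, ((lo + p - 1) / p) * p);
                for (long m = first; m <= hi; m += p) segment[(int) (m - lo)] = true;
            }
            for (long v = lo; v <= hi; v++)
                if (!segment[(int) (v - lo)]) result.add((int) v);
        }
        return result;
    }
}
```

The base-prime list is scanned sequentially once per segment, which matches the point above that it tends to stay cache-friendly even when the sieve itself does not fit.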
About multithreading, scalability to GPUs, thousands of threads e.g.
Each prime in the list unmarks 1 bit in the sieve in the fastest memory, and accesses it basically randomly. A GPU doesn't write bit by bit; instead it writes something like 128 B per operation, which would be 1024 bits. While one thread tries to unmark one bit, another thread unmarks a different bit in the same word, and the first thread's bit ends up with value 1 again - a lost update. To fence/barrier/lock the memory would stop all the threads, giving no speed increase although lots of threads are running.
So threads should not share the sieve. They would need to share the sieving-prime lists instead, so that each thread would have its own chunk of a chunk of the sieve and would use the same primes for unmarking. But after unmarking, they need to "schedule" that prime to the shared primes list for its future sieve, which will be executed after millions of sievings, at the same time that other threads are changing the same list. Pretty much stuck again.
It is very easy to make it parallel while slowing it down. Rather often it would be faster to recalculate something in each thread than to ask it over a slow bus like RAM, PCIe, SSD, GBit-net... but it is possible and not very complex if you have only some threads.

An approach is to let a single thread find the next prime number in the sieve, and then let all threads mark the multiples concurrently. Every thread is assigned a different section of the array to avoid memory sharing as much as possible, so every thread needs to determine which range of multiples it will handle.
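The per-thread range computation this implies can be sketched like so (the helper name is mine): given the start of a thread's section and the current prime p, the first index the thread must mark is the smallest multiple of p that is at least both the section start and p*p.

```java
public class RangeSplit {
    // First multiple of prime p that a thread owning [start, end] must mark.
    // The caller still has to check the returned value against its `end`.
    static long firstMultipleInRange(long start, long p) {
        long rounded = ((start + p - 1) / p) * p; // round start up to a multiple of p
        return Math.max(rounded, p * p);          // nothing below p*p needs marking
    }
}
```

With disjoint sections, each thread writes only its own slice of the array, so the marking phase needs no locks at all - only a barrier at the end.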

First, should the parallelization affect the creation of the table, or
simply using it to determine whether a number is prime or not. In a
real application, I'd do the first off line, so the table would appear
in the final C++ code as a statically initialized C style array. So the
question of parallelization would be irrelevant. And since nothing
should be modified in the second, you can access from as many threads as
you want, without concern.
I suspect, however, that the purpose of the exercise is for you to use
multiple threads to construct the table. (Otherwise, it doesn't make
sense.) In this case: the table is constructed by a series of loops,
with steps 2, 3, 5... Each of these loops can be executed in a separate
thread, but... some sort of synchronization will be needed for
concurrent accesses. If you treat the table as a single object, with
just one lock, you'll end up either running the loops sequentially
(because you're acquiring the lock outside of the loop), or spending
more time acquiring and releasing the lock than doing any real work.
(Acquiring and releasing an uncontested lock can be very fast. But not
as fast as just setting a bool. And in this case, the lock is going
to be very, very contested, since all of the threads want it most of the
time.) If you create a lock per bool, that's going to be an awful lot
of locks—it will probably take less time to construct the table in
a single thread than to create all of the mutexes.
Of course, in C++ (and perhaps in Java as well), you'll want to use a
bitmap, rather than one bool per entry; the larger the table, the
larger the maximum number you can handle. (Something like bool
sieve[INT_MAX]; is almost certain to fail; you might be able to get
away with unsigned char sieve[INT_MAX / 8 + 1];, however.) In this
case, you'll need a mutex per element, not per entry (which would be a
single bit in the element). Given that each mutex eats up some
resources as well, you probably want to divide the table into discrete
blocks, with a mutex per block, and use a nested loop:
int j = 0;
for ( int i = 0; i < numberOfBlocks; ++ i ) {
    std::lock_guard<std::mutex> lock( mutex_table[i] );
    while ( j < (i + 1) * bitsInBlock ) {
        // ...
        j += step;
    }
}
Once this is working, a bit of tuning will be necessary to determine the
optimal block size (but I would guess fairly big).

Your teacher won't like this, but I guess there is a brutal approach which can be worth considering.
Just let every thread repeat
find the next prime in the sieve
mark all multiples of this prime
independently of all the others and without any synchronization. Every thread stops when it finds no more primes.
This is brutal because several threads may work on the same prime by accident, but the final sieve will be correct (all composites detected), and what you lose in duplicate work could be regained by the absence of synchronization.
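A minimal Java sketch of this brutal approach, under the assumption that the table fits in memory (names are mine). The race is harmless here because every write stores the same value, true: threads may redo each other's work, but after the final join() the table is correct.

```java
import java.util.ArrayList;
import java.util.List;

public class RacySieve {
    public static boolean[] sieve(int n, int nThreads) throws InterruptedException {
        boolean[] composite = new boolean[n + 1];
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < nThreads; t++) {
            Thread th = new Thread(() -> {
                for (int p = 2; (long) p * p <= n; p++) {
                    if (composite[p]) continue;   // may be a stale read - costs only extra work
                    for (long m = (long) p * p; m <= n; m += p) {
                        composite[(int) m] = true;
                    }
                }
            });
            th.start();
            threads.add(th);
        }
        for (Thread th : threads) th.join();      // after this, all marks are visible
        return composite;
    }
}
```

In the worst case every thread sieves the whole table by itself, so a speedup is not guaranteed - which is exactly the trade-off described above.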

Related

How to save CPU cycle searching a sorted array for an exact match? Linear search is too slow?

The aim of this exercise is to check the presence of a number in an array.
Specifications: The items are integers arranged in ascending order. The array can contain up to 1 million items. The array is never null. Implement the method boolean Answer.Exists(int[] ints, int k) so that it returns true if k belongs to ints, otherwise the method should return false.
Important note: Try to save CPU cycles if possible.
this is my code
import java.util.*;

class Main {
    static boolean exists(int[] ints, int k) {
        boolean flag = false;
        for (int i = 0; i < ints.length; i++) {
            if (ints[i] == k) {
                flag = true;
                return flag;
            }
        }
        return flag;
    }

    public static void main(String[] args) {
        int[] ints = {-9, 14, 37, 102};
        System.out.println(Main.exists(ints, 9));   // false
        System.out.println(Main.exists(ints, 102)); // true
    }
}
but after submitting the code, this is the result:
The solution works with a 'small' array
The solution works with an empty array
The solution works if k is the first element in the array
The solution doesn't work in a reasonable time with one million items
The solution doesn't use the J2SE API to perform the binary search
So why is it not working? Can anyone clarify?
Your code uses an inefficient algorithm.
Modern CPUs and OSes are incredibly complicated and do vast amounts of bizarre optimizations. Thus, 'save some CPU cycles' is not something you can meaningfully reason about anymore. So let's objectify that statement into something that is useful:
The exercise wants you to find the algorithmically least complex algorithm for the task.
"Algorithmic complexity" is best thought of as follows: Define some variables. Let's say: The size of the input array. Let's call it N.
Now pick a bunch of numbers for N, run your algorithm a few times averaging the runtime. Then chart this. On the x-axis is N. On the y-axis is how long it took.
In other words, make some lists of size 10 and run the algorithm a bunch of times. Then go for size 20, size 30, size 40, and so on. A chart rolls out. At the beginning, for low N, it'll be all over the place, wildly different numbers. Your CPU is busy doing other things and who knows what - all sorts of esoteric factors (literally what song is playing on your music player, that kind of irrelevant stuff) control how long things take. But eventually, for a large enough N, you'll see a pattern - the line starts coalescing, the algorithmic complexity takes over.
At that point, the line (from there on out - so, looking to the 'right', to larger N) looks like a known graph. It may look like y = x - i.e. a straight line at an angle. Or like y = x^2. Or something complicated like y = x^3 + x!.
That is called the big-O number of your algorithm.
Your algorithm has O(n) performance. In other words, an angled line: if 10,000 items take 5 milliseconds to process, then 20,000 items will take 10, and 40,000 items will take 20.
There is an O(log n) algorithm available instead. In other words, a line that becomes nearly entirely horizontal over time. If 10,000 items take 5 milliseconds, then 100,000 takes 10 milliseconds, and a million takes 20. Make N large enough and the algorithmically simpler one will trounce the other one, regardless of how many optimizations the algorithmically more complex one has. Because math says it has to be that way.
Hence trivially no amount of OS, JVM, and hardware optimizations could ever make an O(n^2) algorithm be faster than an O(n) one for a large enough N.
So let's figure out this O(log n) algorithm.
Imagine a phone book. I ask you to look up Mr. Smith.
You could open to page 1 and start reading. That's what your algorithm does. On average you have to read through half of the entire phonebook.
Here's another algorithm: Instead, turn to the middle of the book. Check the name. Is that name 'lower' or 'higher' than Smith? If it's lower, tear the top half of the phone book and toss it in the garbage. If it's higher, tear the bottom half off and get rid of that.
Then repeat: Pick the middle of the new (half-sized) phonebook. Keep tearing out half of that book until only one name remains. That's your match.
This is algorithmically less complicated: with a single lookup you eliminate half of the phonebook. Got a million entries in that phonebook? One lookup eliminates HALF of all the names it could be.
In 20 lookups, you can get an answer even if the phonebook has 2^20 ≈ 1 million items. Got 21 lookups? Then you can deal with 2 million items.
The crucial piece of information is that your input is sorted. It's like the phone book! You can apply the same algorithm! Pick a start and end, then look at the middle. Is it your answer? great, return true;. Is it not? If lower, then start now becomes the middle, and start the algorithm again. Is it higher? Then end now becomes the middle. Is start and end identical? Then return false.
That's the algorithm they want you to write. It's called "binary search". Wikipedia has a page on it, lots of web tutorials cover the notion. But it's good to know why binary search is faster.
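A sketch of both variants in Java - a hand-rolled binary search and the J2SE API call the grader hinted at (the class name is mine, the signature follows the exercise):

```java
import java.util.Arrays;

class Answer {
    // O(log n): each comparison discards half of the remaining range.
    static boolean exists(int[] ints, int k) {
        int lo = 0, hi = ints.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;     // written this way to avoid int overflow
            if (ints[mid] == k) return true;
            if (ints[mid] < k) lo = mid + 1;  // k can only be in the upper half
            else hi = mid - 1;                // k can only be in the lower half
        }
        return false;
    }

    // The J2SE API version: a non-negative result means the key was found.
    static boolean existsViaApi(int[] ints, int k) {
        return Arrays.binarySearch(ints, k) >= 0;
    }
}
```

Both handle the empty-array case (the loop never runs, binarySearch returns a negative insertion point), which was one of the grader's checks.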
NB: Moore's Law, an observation that computers get faster over time, more or less limits technical improvement at O(n^2). In other words, any algorithm whose algorithmic complexity is more complicated than O(n^2) is utterly safe - no amount of technological advancement will ever mean that an algorithm that cannot reasonably be run today can run in a flash on the hardware of tomorrow. Anything that is less complicated can just wait for technology to become faster. In that sense, anything more complex than O(n^2) will take literally quadrillions of years for large enough N, and always will. That's why we say 'this crypto algorithm is safe' - because it's algorithmically significantly more complicated than O(n^2) so the Apple M99 chip released in 2086 still can't crack anything unless you give it quadrillions of years. Perhaps one day quantum tech leapfrogs this entire notion, but that's a bit too out there for an SO answer :)

How can I get the most frequent 100 numbers out of 4,000,000,000 numbers?

Yesterday in a coding interview I was asked how to get the most frequent 100 numbers out of 4,000,000,000 integers (may contain duplicates), for example:
813972066
908187460
365175040
120428932
908187460
504108776
The first approach that came to my mind was using HashMap:
static void printMostFrequent100Numbers() throws FileNotFoundException {
// Group unique numbers, key=number, value=frequency
Map<String, Integer> unsorted = new HashMap<>();
try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
while (scanner.hasNextLine()) {
String number = scanner.nextLine();
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
}
}
// Sort by frequency in descending order
List<Map.Entry<String, Integer>> sorted = new LinkedList<>(unsorted.entrySet());
sorted.sort((o1, o2) -> o2.getValue().compareTo(o1.getValue()));
// Print first 100 numbers
int count = 0;
for (Map.Entry<String, Integer> entry : sorted) {
System.out.println(entry.getKey());
if (++count == 100) {
return;
}
}
}
But it probably would throw an OutOfMemory exception for the data set of 4,000,000,000 numbers. Moreover, since 4,000,000,000 exceeds the maximum length of a Java array, let's say numbers are in a text file and they are not sorted. I assume multithreading or Map Reduce would be more appropriate for big data set?
How can the top 100 values be calculated when the data does not fit into the available memory?
If the data is sorted, you can collect the top 100 in O(n) where n is the data's size. Because the data is sorted, the distinct values are contiguous. Counting them while traversing the data once gives you the global frequency, which is not available to you when the data is not sorted.
See the sample code below on how this can be done. There is also an implementation (in Kotlin) of the entire approach on GitHub
Note: Sorting is not required. What is required is that distinct values are contiguous and so there is no need for ordering to be defined - we get this from sorting but perhaps there is a way of doing this more efficiently.
You can sort the data file using (external) merge sort in roughly O(n log n) by splitting the input data file into smaller files that fit into your memory, sorting and writing them out into sorted files then merging them.
About this code sample:
Sorted data is represented by a long[]. Because the logic reads values one by one, it's an OK approximation of reading the data from a sorted file.
The OP didn't specify how multiple values with equal frequency should be treated; consequently, the code doesn't do anything beyond ensuring that the result is top N values in no particular order and not implying that there aren't other values with the same frequency.
import java.util.*;
import java.util.Map.Entry;
class TopN {
private final int maxSize;
private Map<Long, Long> countMap;
public TopN(int maxSize) {
this.maxSize = maxSize;
this.countMap = new HashMap<>(maxSize);
}
private void addOrReplace(long value, long count) {
if (countMap.size() < maxSize) {
countMap.put(value, count);
} else {
Optional<Entry<Long, Long>> opt = countMap.entrySet().stream().min(Entry.comparingByValue());
Entry<Long, Long> minEntry = opt.get();
if (minEntry.getValue() < count) {
countMap.remove(minEntry.getKey());
countMap.put(value, count);
}
}
}
public Set<Long> get() {
return countMap.keySet();
}
public void process(long[] data) {
long value = data[0];
long count = 0;
for (long current : data) {
if (current == value) {
++count;
} else {
addOrReplace(value, count);
value = current;
count = 1;
}
}
addOrReplace(value, count);
}
public static void main(String[] args) {
long[] data = {0, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7};
TopN topMap = new TopN(2);
topMap.process(data);
System.out.println(topMap.get()); // [5, 6]
}
}
Integers are signed 32 bits, so if only non-negative integers occur, we look at 2^31 max different entries. An array of 2^31 bytes should stay under the max array size.
But that can't hold frequencies higher than 255, you would say? Yes, you're right.
So we add a hashmap for all entries that exceed the max value possible in your array (255 - if it's signed, just start counting at -128). There are at most 16 million entries in this hash map (4 billion divided by 255), which should be possible.
We have two data structures:
a large array, indexed by the number read (0..2^31) of bytes.
a hashmap of (number read, frequency)
Algorithm:
while reading next number 'x'
{
    if (hashmap.contains(x))
    {
        hashmap[x]++;
    }
    else
    {
        bigarray[x]++;
        if (bigarray[x] > 250)
        {
            hashmap[x] = bigarray[x];
        }
    }
}
// when done:
// Look up top-100 in hashmap
// if not 100 yet, add more from bigarray, skipping those already taken from the hashmap
I'm not fluent in Java, so can't give a better code example.
Note that this algorithm is single-pass, works on unsorted input, and doesn't use external pre-processing steps.
All it does is assuming a maximum to the number read. It should work if the input are non-negative Integers, which have a maximum of 2^31. The sample input satisfies that constraint.
The algorithm above should satisfy most interviewers that ask this question. Whether you can code in Java should be established by a different question. This question is about designing data structures and efficient algorithms.
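Since the answer above stops short of Java, here is one possible sketch of the byte-array-plus-hashmap counter. The array size is a constructor parameter so the demo stays small; for the real problem you'd size it near Integer.MAX_VALUE. The 250 spill threshold and the structure follow the pseudocode; the names are mine.

```java
import java.util.HashMap;
import java.util.Map;

public class HybridCounter {
    private final byte[] small;                       // one unsigned byte of count per value
    private final Map<Integer, Long> overflow = new HashMap<>();

    public HybridCounter(int maxValueExclusive) {
        small = new byte[maxValueExclusive];
    }

    public void add(int x) {
        Long big = overflow.get(x);
        if (big != null) {                            // already spilled to the hashmap
            overflow.put(x, big + 1);
            return;
        }
        int c = small[x] & 0xFF;                      // read the byte as unsigned
        if (c >= 250) {
            overflow.put(x, (long) c + 1);            // spill before the byte can wrap
        } else {
            small[x] = (byte) (c + 1);
        }
    }

    public long count(int x) {
        Long big = overflow.get(x);
        return big != null ? big : (small[x] & 0xFF);
    }
}
```

The final top-100 pass would scan the overflow map first and fall back to the big array, as the pseudocode's closing comments describe.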
In pseudocode:
Perform an external sort
Do a pass to collect the top 100 frequencies (not which values have them)
Do another pass to collect the values that have those frequencies
Assumption: There are clear winners - no ties (outside the top 100).
Time complexity: O(n log n) (approx) due to sort.
Space complexity: Available memory, again due to sort.
Steps 2 and 3 are both O(n) time and O(1) space.
If there are no ties (outside the top 100), steps 2 and 3 can be combined into one pass, which wouldn’t improve the time complexity, but would improve the run time slightly.
If there are ties that would make the quantity of winners large, you couldn’t discover that and take special action (e.g., throw error or discard all ties) without two passes. You could however find the smallest 100 values from the ties with one pass.
But it probably would throw an OutOfMemory exception for the data set of 4000000000 numbers. Moreover, since 4000000000 exceeds max length of Java array, let's say numbers are in a text file and they are not sorted.
That depends on the value distribution. If you have 4E9 numbers, but the numbers are integers 1-1000, then you will end up with a map of 1000 entries. If the numbers are doubles or the value space is unrestricted, then you may have an issue.
As in the previous answer, there's a bug:
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
I personally would use an AtomicLong for the value; it allows you to increment the count without replacing the HashMap entries.
I assume multithreading or Map Reduce would be more appropriate for big data set?
What would be the most efficient solution for this problem?
This is a typical map-reduce exercise example, so in theory you could use a multi-threaded or M-R approach. Maybe that's the goal of your exercise, and you're supposed to implement the multithreaded map-reduce tasks regardless of whether it's the most efficient way or not.
In reality you should calculate if it is worth the effort. If you're reading the input serially (as it's in your code using the Scanner), then definitely not. If you can split the input files and read multiple parts in parallel, considering the I/O throughput, it may be the case.
Or maybe if the value space is too large to fit into memory and you will need to downscale the dataset, you may consider different approach.
One option is a type of binary search. Consider a binary tree where each split corresponds to a bit in a 32-bit integer. So conceptually we have a binary tree of depth 32. At each node, we can compute the count of numbers in the set that start with the bit sequence for that node. This count is an O(n) operation, so the total cost of finding our most common sequence is going to be O(n * f(n)) where the function depends on how many nodes we need to enumerate.
Let's start by considering a depth-first search. This provides a reasonable upper bound to the stack size during enumeration. A brute force search of all nodes is obviously terrible (in that case, you can ignore the tree concept entirely and just enumerate over all the integers), but we have two things that can prevent us from needing to search all nodes:
If we ever reach a branch where there are 0 numbers in the set starting with that bit sequence, we can prune that branch and stop enumerating.
Once we hit a terminal node, we know how many occurrences of that specific number there are. We add this to our 'top 100' list, removing the lowest if necessary. Once this list fills up, we can start pruning any branches whose total count is lower than the lowest of the 'top 100' counts.
I'm not sure what the average and worst-case performance for this would be. It would tend to perform better for sets with fewer distinct numbers and probably performs worst for sets that approach uniformly distributed, since that implies more nodes will need to be searched.
A few observations:
There are at most N terminal nodes with non-zero counts, but since N is close to 2^32 in this specific case, that doesn't matter.
The total number of nodes for M leaf nodes (M = 2^32) is 2M-1. This is still linear in M, so worst case running time is bounded above at O(N*M).
This will perform worse than just searching all integers for some cases, but only by a linear scalar factor. Whether this performs better on average depends on the expected data. For uniformly random data sets, my intuitive guess is that you'd be able to prune enough branches once the top-100 list fills up that you would tend to require fewer than M counts, but that would need to be evaluated empirically or proven.
As a practical matter, the fact that this algorithm just requires read-only access to the data set (it only ever performs a count of numbers starting with a certain bit pattern) means it is amenable to parallelization by storing the data across multiple arrays, counting the subsets in parallel, then adding the counts together. This could be a pretty substantial speedup in a practical implementation that's harder to do with an approach that requires sorting.
A concrete example of how this might execute, for a simpler set of 3-bit numbers and only finding the single most frequent. Let's say the set is '000, 001, 100, 001, 100, 010'.
Count all numbers that start with '0'. This count is 4.
Go deeper, count all numbers that start with '00'. This count is 3.
Count all numbers that are '000'. This count is 1. This is our new most frequent.
Count all numbers that are '001'. This count is 2. This is our new most frequent.
Take next deep branch and count all numbers that start with '01'. This count is 1, which is less than our most frequent, so we can stop enumerating this branch.
Count all numbers that start with '1'. This count is 2, which does not beat our current most frequent (also 2), so we can stop enumerating this branch.
We're out of branches, so we're done and '001' is the most frequent.
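For the simpler single-most-frequent case walked through above, the count-and-prune depth-first search might look like this in Java, assuming non-negative values and a configurable bit width (all names are mine; a top-100 version would prune against the 100th-best count instead of the single best):

```java
public class PrefixSearch {
    private final int[] data;
    private final int bits;        // bit width of the values (3 in the example above)
    private long bestValue = -1;
    private long bestCount = 0;

    public PrefixSearch(int[] data, int bits) {
        this.data = data;
        this.bits = bits;
    }

    // O(n) count of values whose top prefixLen bits equal `prefix`.
    private long countWithPrefix(long prefix, int prefixLen) {
        if (prefixLen == 0) return data.length;
        long count = 0;
        for (int v : data)
            if ((v >>> (bits - prefixLen)) == prefix) count++;
        return count;
    }

    private void dfs(long prefix, int prefixLen) {
        long count = countWithPrefix(prefix, prefixLen);
        if (count <= bestCount) return;          // prune: this branch cannot win
        if (prefixLen == bits) {                 // terminal node: one concrete value
            bestCount = count;
            bestValue = prefix;
            return;
        }
        dfs(prefix << 1, prefixLen + 1);         // branch: next bit is 0
        dfs((prefix << 1) | 1, prefixLen + 1);   // branch: next bit is 1
    }

    public long mostFrequent() {
        dfs(0, 0);
        return bestValue;
    }
}
```

On the example set 000, 001, 100, 001, 100, 010 (with bits = 3), the '01' and '1' branches get pruned once the best count reaches 2, so 001 wins without enumerating every value.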
Since the data set is presumably too big for memory, I'd do a hexadecimal radix sort. So the data set would get split between 16 files in each pass with as many passes as needed to get to the largest integer.
The second part would be to combine the files into one large data set.
The third part would be to read the file number by number and count the occurrence of each number. Save the number and number of occurrences into a two-dimensional array (the list) which is sorted by size. If the next number from the file has more occurrences than the number in the list with the lowest occurrences then replace that number.
Linux tools
That's simply done in a shell script on Linux/Mac:
sort inputfile | uniq -c | sort -nr | head -n 100
If the data is already sorted, you just use
uniq -c inputfile | sort -nr | head -n 100
File system
Another idea is to use the number as the filename and increase the file size for each hit
while read number; do
    echo -n "." >> "$number"
done < inputfile
File system constraints could cause trouble with that many files, so you can create a directory tree with the first digits and store the files there.
When finished, you traverse through the tree and remember the 100 highest seen values for file size.
Database
You can use the same approach with a database, so you don't need to actually store the GB of data there (works too), just the counters (needs less space).
Interview
An interesting question would be how you handle edge cases, so what should happen if the 100th, 101st, ... number have the same frequency. Are the integers only positive?
What kind of output do they need, just the numbers or also the frequencies? Just think it through like a real task at work and ask everything you need to know to solve it. It's more about how you think and analyze a problem.
I have noticed there is a bug in this line.
unsorted.put(number, unsorted.getOrDefault(number, 1) + 1);
You should make the default value as 0 as you are then adding 1 to it. If not when you only have 1 occurrence of a value, it is recorded as the frequency of 2.
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
One downside that I see is that it unnecessarily keeps all 4 billion frequencies when you are sorting.
You can use a PriorityQueue to hold only 100 values.
Map<String, Integer> unsorted = new HashMap<>();
// min-heap: the entry with the LOWEST frequency sits at the head,
// so it is the one evicted when a more frequent entry shows up
PriorityQueue<Map.Entry<String, Integer>> highestFrequentValues = new PriorityQueue<>(100,
        (o1, o2) -> o1.getValue().compareTo(o2.getValue()));
// O(n)
try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
    while (scanner.hasNextLine()) {
        String number = scanner.nextLine();
        unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
    }
}
// O(n log 100) = O(n)
for (Map.Entry<String, Integer> entry : unsorted.entrySet()) {
    if (highestFrequentValues.size() < 100) {
        highestFrequentValues.add(entry);
    } else if (highestFrequentValues.peek().getValue() < entry.getValue()) {
        highestFrequentValues.poll();   // evict the current minimum
        highestFrequentValues.add(entry);
    }
}
// O(100)
for (Map.Entry<String, Integer> frequentValue : highestFrequentValues) {
    System.out.println(frequentValue.getKey());
}
OK, I know that the question is about Java and algorithms and solving this problem otherwise is not the point, but I still think this solution must be posted for completeness.
Solution in sh:
sort FILE | uniq -c | sort -nr | head -n 100
Explanation: sort | uniq -c lists only unique entries and counts the number of their occurrences in the input; sort -nr sorts the output numerically in reverse order (the lines with more occurrences on the top); head -n 100 keeps 100 top lines only. A file with 4,000,000,000 numbers up to 999999999 (as per OP) will take about ~40GB, so fits well on a disk of a single machine, so it is technically possible to use this solution.
Pro: simple, has constant and limited memory usage. Cons: sub-optimal (because of sort), consumes lots of the temporary disk space for the operation, and overall there is no doubt that a solution specifically designed for this problem will have a much better performance. The question remains (in all seriousness): in a general case, will writing (and then debugging and executing) an optimized solution take more or less time than using a sub-optimal one (as above) but available immediately? I ran the solution on a sample file with 400,000,000 lines (10x smaller) and it took about 7 minutes on my computer.
P.S. On a side note, OP mentions that this question was asked during a programming interview. This is interesting because I think this a kind of a solution worth mentioning in this context before starting to code another program from scratch. When people say "experienced engineers are 10x faster...", I personally don't think that this is because experienced engineers code faster or produce optimized algorithms off the top of the head, but because they explore the alternatives that can save time. In the context of an interview it is an important skill to demonstrate among others.
I suppose that 4 billion was chosen to be sure the problem is too large to fit in memory on current desktop machines. So rent a large VM from Amazon or Microsoft for the purpose? That's an answer most people don't think of, yet it is valid for real-world solutions.
The way I'd approach it is to start by binning. The range of numbers is presumably all 32-bit unsigned integers (or whatever they said). How large an array fits in RAM? Divide the range into that many equal bins and pass through the data once. Look over the distribution: is it fairly uniform, or spiky, or a curve of some kind? If the first/last ranges of bins are zeros, that gives you the true range of input values, and you can adjust the program to just bin over that range and repeat, to get better accuracy.
Then depending on the distribution, decide how to proceed. In general, only the top 100 bins can possibly contain the top 100 values, so you can reconfigure with those ranges and the largest bins you can handle within that excerpted range. If the distribution is too uniform, you might get many many bins with all the same count, so drop the smaller bins even though you have many more than 100 bins remaining -- you still cut it down some.
Worst case is that all the bins come out the same and you can't cut it down this way! Someone prepared some pathological data assuming this kind of approach. So rearrange the way you do the binning: rather than simply chopping the range into contiguous bins of equal size, use a 1:1 mapping to shuffle values between bins. However, for large bins, this might preserve the property of being fairly uniform, so you don't want a conventional "good" hashing function.
Another approach
If binning works, and rapidly cuts down the problem, it's easy. But the data could be such that it's actually very difficult. So what's a way that always works, regardless of the data? Well, I can assume that the result exists: some 100 values will have more occurrences.
Instead of bins, pick N specific values (however many you can fit in memory). Either choose random numbers, or use the first N distinct values from your input. Count those, and copy the others to another file. That is, the values you don't have room to count get copied to a (smaller than the original) file.
Now you'll at least have a useful pivot value: the exact count of each of the distinct values you did track, and in particular of the current top 100 among them. Well, the values you picked might still all end up with the same count! So in the worst case you only have 1 distinct count; but you know such a value is not a "top" value, since there are far more than 100 of them.
Run again on your new (smaller) file, and discard counts that are smaller than the top 100 you already know. Repeat.
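A sketch of that count-or-spill pass, with a `Counter` capped at the values we have room for and a list standing in for the spill file (the name `count_or_spill` is illustrative):

```python
from collections import Counter

def count_or_spill(stream, capacity):
    """Count exactly the first `capacity` distinct values seen; values
    we have no room to count are appended to a spill list (which would
    be a file in the out-of-core setting)."""
    counts = Counter()
    spill = []
    for x in stream:
        if x in counts or len(counts) < capacity:
            counts[x] += 1
        else:
            spill.append(x)
    return counts, spill
```

Each repetition runs on the (smaller) spill, discarding anything whose count cannot beat the 100th-best count found so far.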
This reminds me of something that I might have read in Knuth's TAOCP, but scaled up for modern machine sizes.
I would just drop all the numbers in a database (SQLite would be my first choice) with a table like
CREATE TABLE tbl (
    number INTEGER PRIMARY KEY,
    counter INTEGER
)
Then for every number received, just do a
INSERT INTO tbl (number,counter) VALUES (:number,1) ON DUPLICATE KEY UPDATE counter=counter+1;
or with SQLite syntax
INSERT INTO tbl (number,counter) VALUES (:number,1) ON CONFLICT(number) DO UPDATE SET counter=counter+1;
Then when all the numbers are accounted for,
SELECT number, counter FROM tbl ORDER BY counter DESC LIMIT 100
... then I would end up with the 100 most common numbers, and how often they occurred. This scheme will only break when you run out of disk space... (or when you reach ~20,000,000,000,000 (20 trillion) unique numbers at some ~281 terabytes of disk space...)
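A sketch of this scheme using Python's built-in sqlite3 module (an in-memory database here for brevity; use a file path for real out-of-core counting, and note the upsert syntax requires SQLite 3.24 or later):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # a file path would spill to disk
con.execute("CREATE TABLE tbl (number INTEGER PRIMARY KEY, counter INTEGER)")

def add(number):
    # Insert the number with count 1, or bump its counter if already present.
    con.execute(
        "INSERT INTO tbl (number, counter) VALUES (?, 1) "
        "ON CONFLICT(number) DO UPDATE SET counter = counter + 1",
        (number,),
    )

for n in [7, 7, 3, 7, 3, 9]:
    add(n)

top = con.execute(
    "SELECT number, counter FROM tbl ORDER BY counter DESC LIMIT 100"
).fetchall()
```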
Divide your numbers into two buckets
Find top 100 in each bucket
Merge those top 100 lists.
To divide, do median of medians (which can be modified to make medians of the top/bottom as well).
Each bucket has a distinct range of numbers in it. The initial median split makes 2 buckets, each with half (about) as many elements as the entire list in it.
To find the top 100, first check whether the bucket is narrow (minimum and maximum close together, checkable in O(1)) or small (few numbers in it; O(n) time, O(n * bucket count) memory). If either is true, a simple counting pass (possibly covering more than one bucket at once) solves it; you will probably have to do this more than once, given the memory limits.
If neither is true, recurse and divide that bucket into two.
There are going to be fiddly bits with how you recurse without wasting too much time.
But the idea is that each bucket exponentially gets narrower or smaller. Narrow buckets have a minimum and maximum that is close, and small buckets have few elements.
You merge buckets so that you have enough storage to count the elements in the bucket (either width based, or volume based). Then you do a pass that counts that bucket and finds the top 100, and repeat. Each time you merge the top 100 from the scan into the previous top 100.
In-place, no sorting of the entire list needed, and devolves to simpler and more optimal strategies when the initial "bucket" is narrow or small.
I assume that the point of the challenge is to process this large amount of data without consuming too much memory, and avoid parsing the input too many times.
Here's an algorithm that requires two not-too-large arrays. I don't know about Java, but I am confident that this can be made to run very fast in C:
Create a Count array of size 2^n to count the number of input numbers based on their n most significant bits. That will require a first scan over the input data but is really straightforward to do. I would first try with n=20 (about one million buckets).
Obviously, we won't process the data one bucket at a time, as that would require reading the input a million times, instead we choose our optimal batch size B and allocate a Batch array of size B. B could be like 40M, so that we aim at reading the input about 100 times. (It all depends on available memory).
Then we iterate over the count array to group the first range of buckets so that the sum is close to, but doesn't exceed B.
For each such range, we parse the input data, look for numbers in range and copy those numbers to the batch array. Since we already know the size of each bucket, we can immediately copy them grouped per bucket, so that we only have to sort them bucket by bucket (you can repurpose the count array to store the indices for where to write the next entry). Next we count the identical items in the sorted batch array and keep track of the top 100 so far.
Proceed the next range of buckets for which the sum of counts is under size B, etc...
Optimizations:
Once we start having a decent top 100, we can skip entire buckets whose size is below our 100th entry. For this we can use a special value (such as -1) in the count array to indicate there is no index. Depending on the data, this can drastically reduce the number of passes required.
When counting identical items in the sorted Batch array, we can make jumps of the size of our 100th entry (and then take a few steps backwards); I can share pseudo-code if needed.
Potential issues with this approach:
The input numbers could be concentrated in a small range, then you might get one or more single buckets that are larger than B. Possible solutions:
You could try another selection of n bits instead (e.g. the n least significant bits). Note that this still won't help if the same number appears a billion times.
If the input is 32-bit integers, then the range of possible values is limited, and there can only be a few thousand different numbers in each bucket. So if one bucket is really large, we can process that bucket differently: just keep a counter for each unique value in its range. We can repurpose the Batch array for that.
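The two passes above can be sketched as follows (`bucket_counts` is the Count-array pass over the n most significant bits, `bucket_ranges` groups consecutive buckets under the batch size B; both names are illustrative):

```python
def bucket_counts(stream, n=20, value_bits=32):
    """First pass: count inputs per bucket, keyed by their n most
    significant bits (assuming unsigned value_bits-bit inputs)."""
    shift = value_bits - n
    counts = [0] * (1 << n)
    for x in stream:
        counts[x >> shift] += 1
    return counts

def bucket_ranges(counts, batch_size):
    """Group consecutive buckets into ranges whose total count does not
    exceed batch_size; each range costs one extra pass over the input."""
    ranges, start, total = [], 0, 0
    for i, c in enumerate(counts):
        if total + c > batch_size and total > 0:
            ranges.append((start, i))
            start, total = i, 0
        total += c
    ranges.append((start, len(counts)))
    return ranges
```

A single bucket larger than `batch_size` ends up alone in its range, which is exactly the problem case discussed above.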

Which threshold value will perform better in Java 7 ForkJoinTask?

I was trying out the Java ForkJoin framework and wrote a program to process a large data list.
A threshold field is typically set in a ForkJoinTask to give the minimum number of elements below which a partition of the data list is processed directly rather than split further.
The question is: how big or small a threshold gives better performance? Or is it flexible, tied only to the number of CPU cores or supported threads?
Is there a best practice for choosing the threshold in a parallel-computing framework such as ForkJoinTask?
There is no set rule for the threshold. A good number depends on the number of elements in the array (N) and the type of processing for each element (Q): a simple compare of two numbers is a low Q, an intricate calculation is a high Q.
I use a general formula that works fairly well most of the time when I don't know Q: generate about 8 times as many tasks as threads, with a minimum threshold of 32768 (depending on N, of course):
int temp = count / (threads << 3);
threshold = (temp < 32768) ? 32768 : temp;
Where count is N and threads is the number of threads.

Obtain primes with threads. How to divide intervals?

I want to obtain all the prime numbers in the interval [min, max]. I want to do the calculation on n servers. What is the best way to divide my initial interval into n intervals, so that all n servers get approximately the same load?
[Optional]
I had an idea, but it didn't work as I expected. I assumed that all numbers are prime, so that a number i costs i instructions to verify as prime.
If we keep in mind this method:
Then, the number of instructions to get primes in interval [1,100] is 1+2+..+99+100 = 100(1+100)/2 = 5050.
Now, if I want to do this calculation on 2 servers (n = 2), I have to divide this load between them (2525 instructions each). The first interval's upper bound x is given by 2525 = x(1+x)/2 -> x = 71.
In general terms, the formula would be Load = (Interval(x) - Interval(x-1) + 1) * (Interval(x-1) + Interval(x)) / 2, where Load = (max - min + 1) * (min + max) / (2 * n).
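Under that cost model, the k-th boundary x solves x(x+1)/2 = k*Total/n, which the quadratic formula gives directly. A sketch (it reproduces the x = 71 split for [1, 100] and n = 2; `split_interval` is an illustrative name):

```python
import math

def split_interval(n_max, servers):
    """Boundaries x_1..x_servers with (roughly) equal triangular load,
    assuming checking number i costs i instructions and min = 1."""
    total = n_max * (n_max + 1) // 2
    bounds = []
    for k in range(1, servers + 1):
        target = k * total / servers
        # solve x(x+1)/2 = target  =>  x = (-1 + sqrt(1 + 8*target)) / 2
        bounds.append(round((-1 + math.sqrt(1 + 8 * target)) / 2))
    bounds[-1] = n_max  # the last server always ends at n_max
    return bounds
```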
Doing this with x and y = [1:9999999] and n = 16, I got these results (screenshot of the per-server measurements omitted): I don't get the same time and instruction counts on all servers, which means this is not the right way to divide the intervals.
Any idea?
I think you're looking for a parallel approach.
This is what the work-stealing algorithm was designed for, aka Fork Join Pool. In fact, prime number calculation is a classic use case for work stealing, because telling whether n is prime requires iterating up to sqrt(n), so the bigger n is, the longer it takes. Distributing the numbers evenly among your workers and waiting for every worker to finish its job is therefore unfair: the first core will quickly determine whether its (small) numbers are prime and sit idle, while the other cores stay busy calculating bigger numbers. With work stealing, the idle processor steals work from its neighbours' queues.
This implementation might be useful.
I solved this problem by not doing a complete up-front division of my interval with one part assigned to each server. Instead, I divided the interval into very small parts ([min, max]/length^2, for example), and each calculation server takes one of these parts. When a server finishes, it takes another one, until there are no more small intervals left to calculate.
Why did I do this? Because I can't ensure that the servers I'm working with all compute at the same speed.
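A minimal sketch of that scheme, using Python threads as stand-ins for the calculation servers (the trial-division test, the chunk size, and the names `primes_in` and `find_primes` are illustrative):

```python
import queue
import threading

def primes_in(lo, hi):
    """Trial division up to sqrt(n); the cost grows with n, which is why
    static splits are unfair and small shared chunks balance better."""
    out = []
    for n in range(max(lo, 2), hi + 1):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            out.append(n)
    return out

def find_primes(lo, hi, workers=4, chunk=1000):
    tasks = queue.Queue()
    for start in range(lo, hi + 1, chunk):
        tasks.put((start, min(start + chunk - 1, hi)))
    results, lock = [], threading.Lock()

    def worker():
        # Each worker repeatedly grabs the next small interval until
        # the shared queue is empty.
        while True:
            try:
                a, b = tasks.get_nowait()
            except queue.Empty:
                return
            found = primes_in(a, b)
            with lock:
                results.extend(found)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```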

External shuffle: shuffling large amount of data out of memory

I am looking for a way to shuffle a large amount of data which does not fit into memory (approx. 40GB).
I have around 30 millions entries, of variable length, stored in one large file. I know the starting and ending positions of each entry in that file. I need to shuffle this data which does not fit in the RAM.
The only solution I thought of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm and then copy the entries in a new file, according to this order. Unfortunately, this solution involves a lot of seek operations, and thus, would be very slow.
Is there a better solution to shuffle large amount of data with uniform distribution?
First get the shuffle issue out of the way: invent a hash algorithm for your entries that produces random-like results, then do a normal external sort on the hash.
Now that you have transformed your shuffle into a sort, the problem turns into finding an efficient external sort algorithm that fits your pocket and memory limits. That should now be as easy as a Google search.
A simple approach is to pick a K such that 1/K of the data fits comfortably in memory. Perhaps K=4 for your data, assuming you've got 16GB RAM. I'll assume your random number function has the form rnd(n) which generates a uniform random number from 0 to n-1.
Then:
for i = 0 .. K-1
    Initialize your random number generator to a known state.
    Read through the input data, generating a random number rnd(K) for each item as you go.
    Retain items in memory whenever rnd(K) == i.
    After you've read the input file, shuffle the retained data in memory.
    Write the shuffled retained items to the output file.
This is very easy to implement, will avoid a lot of seeking, and is clearly correct.
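A sketch of the K-pass scheme, with callables standing in for the input and output files (re-seeding the generator at the start of each pass plays the role of "initialize to a known state", so every pass assigns the same draw to the same item):

```python
import random

def external_shuffle(read_input, write_output, K, seed=12345):
    """K passes over the input; pass i retains the items whose
    reproducible random draw equals i, shuffles them in memory, and
    appends them to the output."""
    for i in range(K):
        rng = random.Random(seed)  # same seed => same draws each pass
        retained = [item for item in read_input() if rng.randrange(K) == i]
        random.shuffle(retained)
        write_output(retained)
```

With files, `read_input` would reopen and stream the input and `write_output` would append to the output.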
An alternative is to partition the input data into K files based on the random numbers, and then go through each, shuffling in memory and writing to disk. This reduces disk IO (each item is read twice and written twice, compared to the first approach where each item is read K times and written once), but you need to be careful to buffer the IO to avoid a lot of seeking, it uses more intermediate disk, and is somewhat more difficult to implement. If you've got only 40GB of data (so K is small), then the simple approach of multiple iterations through the input data is probably best.
If you use 20ms as the time for reading or writing 1MB of data (and assuming the in-memory shuffling cost is insignificant), the simple approach will take 40*1024*(K+1)*20ms, which is about 1 hour 8 minutes (assuming K=4). The intermediate-file approach will take 40*1024*4*20ms, which is around 55 minutes, assuming you can minimize seeking. Note that SSD is approximately 20 times faster for reads and writes (even ignoring seeking), so you should expect to perform this task in well under 10 minutes using an SSD. Numbers from Latency Numbers Every Programmer Should Know.
I suggest keeping your general approach, but inverting the map before doing the actual copy. That way, you read sequentially and do scattered writes rather than the other way round.
A read has to be done when requested before the program can continue. A write can be left in a buffer, increasing the probability of accumulating more than one write to the same disk block before actually doing the write.
Premise
From what I understand, using the Fisher-Yates algorithm and the data you have about the positions of the entries, you should be able to obtain (and compute) a list of:
struct Entry {
    long long sourceStartIndex;
    long long sourceEndIndex;
    long long destinationStartIndex;
    long long destinationEndIndex;
};
Problem
From this point onward, the naive solution is to seek each entry in the source file, read it, then seek to the new position of the entry in the destination file and write it.
The problem with this approach is that it uses way too many seeks.
Solution
A better way to do it is to reduce the number of seeks by using two large buffers, one for each file.
I recommend a small buffer for the source file (say 64MB) and a big one for the destination file (as big as the user can afford - say 2GB).
Initially, the destination buffer will be mapped to the first 2GB of the destination file. At this point, read the whole source file, in chunks of 64MB, in the source buffer. As you read it, copy the proper entries to the destination buffer. When you reach the end of the file, the output buffer should contain all the proper data. Write it to the destination file.
Next, map the output buffer to the next 2GB of the destination file and repeat the procedure. Continue until you have written the whole output file.
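A sketch of the windowed copy, with in-memory bytes standing in for the files and, for brevity, ignoring the caution about entries spanning buffer boundaries (every entry is assumed to fit entirely inside one destination window; `scatter_copy` is an illustrative name):

```python
def scatter_copy(entries, source, window):
    """entries: list of (src_start, src_end, dst_start) byte ranges.
    For each destination window, scan the whole source once and copy
    the entries whose destination falls inside the current window."""
    dst = bytearray(len(source))
    for win_lo in range(0, len(source), window):
        win_hi = win_lo + window
        for src_start, src_end, dst_start in entries:
            # Simplification: each entry fits entirely in one window.
            if win_lo <= dst_start < win_hi:
                dst[dst_start:dst_start + (src_end - src_start)] = \
                    source[src_start:src_end]
    return bytes(dst)
```

Real code must additionally split the entries that straddle a window boundary, as noted below.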
Caution
Since the entries have arbitrary sizes, it's very likely that at the beginning and ending of the buffers you will have suffixes and prefixes of entries, so you need to make sure you copy the data properly!
Estimated time costs
The execution time depends, essentially, on the size of the source file, the available RAM for the application and the reading speed of the HDD. Assuming a 40GB file, a 2GB RAM and a 200MB/s HDD read speed, the program will need to read 800GB of data (40GB * (40GB / 2GB)). Assuming the HDD is not highly fragmented, the time spent on seeks will be negligible. This means the reads will take up one hour! But if, luckily, the user has 8GB of RAM available for your application, the time may decrease to only 15 to 20 minutes.
I hope this will be enough for you, as I don't see any other faster way.
Although you can use external sort on a random key, as proposed by OldCurmudgeon, the random key is not necessary. You can shuffle blocks of data in memory, and then join them with a "random merge," as suggested by aldel.
It's worth specifying what "random merge" means more clearly. Given two shuffled sequences of equal size, a random merge behaves exactly as in merge sort, with the exception that the next item to be added to the merged list is chosen using a boolean value from a shuffled sequence of zeros and ones, with exactly as many zeros as ones. (In merge sort, the choice would be made using a comparison.)
Proving it
My assertion that this works isn't enough. How do we know this process gives a shuffled sequence, such that every ordering is equally possible? It's possible to give a proof sketch with a diagram and a few calculations.
First, definitions. Suppose we have N unique items, where N is an even number, and M = N / 2. The N items are given to us in two M-item sequences labeled 0 and 1 that are guaranteed to be in a random order. The process of merging them produces a sequence of N items, such that each item comes from sequence 0 or sequence 1, and the same number of items come from each sequence. It will look something like this:
0: a b c d
1: w x y z
N: a w x b y c d z
Note that although the items in 0 and 1 appear to be in order, they are just labels here, and the order doesn't mean anything. It just serves to connect the order of 0 and 1 to the order of N.
Since we can tell from the labels which sequence each item came from, we can create a "source" sequence of zeros and ones. Call that c.
c: 0 1 1 0 1 0 0 1
By the definitions above, there will always be exactly as many zeros as ones in c.
Now observe that for any given ordering of labels in N, we can reproduce a c sequence directly, because the labels preserve information about the sequence they came from. And given N and c, we can reproduce the 0 and 1 sequences. So we know there's always one path back from a sequence N to one triple (0, 1, c). In other words, we have a reverse function r defined from the set of all orderings of N labels to triples (0, 1, c) -- r(N) = (0, 1, c).
We also have a forward function f on any triple (0, 1, c) that simply re-merges 0 and 1 according to the values of c. Together, these two functions show that there is a one-to-one correspondence between outputs of r(N) and orderings of N.
But what we really want to prove is that this one-to-one correspondence is exhaustive -- that is, we want to prove that there aren't extra orderings of N that don't correspond to any triple, and that there aren't extra triples that don't correspond to any ordering of N. If we can prove that, then we can choose orderings of N in a uniformly random way by choosing triples (0, 1, c) in a uniformly random way.
We can complete this last part of the proof by counting bins. Suppose every possible triple gets a bin. Then we drop every ordering of N in the bin for the triple that r(N) gives us. If there are exactly as many bins as orderings, then we have an exhaustive one-to-one correspondence.
From combinatorics, we know that the number of orderings of N unique labels is N!. We also know that the numbers of orderings of 0 and 1 are both M!. And we know that the number of possible sequences c is N choose M, which is the same as N! / (M! * (N - M)!).
This means there are a total of
M! * M! * N! / (M! * (N - M)!)
triples. But N = 2 * M, so N - M = M, and the above reduces to
M! * M! * N! / (M! * M!)
That's just N!. QED.
Implementation
To pick triples in a uniformly random way, we must pick each element of the triple in a uniformly random way. For 0 and 1, we accomplish that using a straightforward Fisher-Yates shuffle in memory. The only remaining obstacle is generating a proper sequence of zeros and ones.
It's important -- important! -- to generate only sequences with equal numbers of zeros and ones. Otherwise, you haven't chosen from among Choose(N, M) sequences with uniform probability, and your shuffle may be biased. The really obvious way to do this is to shuffle a sequence containing an equal number of zeros and ones... but the whole premise of the question is that we can't fit that many zeros and ones in memory! So we need a way to generate random sequences of zeros and ones that are constrained such that there are exactly as many zeros as ones.
To do this in a way that is probabilistically coherent, we can simulate drawing balls labeled zero or one from an urn, without replacement. Suppose we start with fifty 0 balls and fifty 1 balls. If we keep count of the number of each kind of ball in the urn, we can maintain a running probability of choosing one or the other, so that the final result isn't biased. The (suspiciously Python-like) pseudocode would be something like this:
def generate_choices(N, M):
    n0 = M
    n1 = N - M
    while n0 + n1 > 0:
        if randrange(0, n0 + n1) < n0:
            yield 0
            n0 -= 1
        else:
            yield 1
            n1 -= 1
Since randrange operates on integers, there are no floating point errors to worry about here; the sampling is exact.
This last part of the algorithm is crucial. Going through the above proof exhaustively makes it clear that other ways of generating ones and zeros won't give us a proper shuffle.
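Put together, one random merge can be sketched as follows (self-contained; the draw is the same urn-without-replacement scheme as generate_choices above, and the relative order within each input is preserved, exactly as in merge sort):

```python
import random

def random_merge(seq0, seq1):
    """Merge two already-shuffled sequences, choosing the source of each
    next item by drawing without replacement from an urn holding
    len(seq0) zeros and len(seq1) ones."""
    n0, n1 = len(seq0), len(seq1)
    i0 = i1 = 0
    merged = []
    while n0 + n1 > 0:
        if random.randrange(n0 + n1) < n0:
            merged.append(seq0[i0])
            i0 += 1
            n0 -= 1
        else:
            merged.append(seq1[i1])
            i1 += 1
            n1 -= 1
    return merged
```

For the external case, seq0 and seq1 would be streamed from disk rather than indexed in memory.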
Performing multiple merges in real data
There remain a few practical issues. The above argument assumes a perfectly balanced merge, and it also assumes you have only twice as much data as you have memory. Neither assumption is likely to hold.
The first turns out not to be a big problem, because the above argument doesn't actually require equally sized lists. It's just that if the list sizes are different, the calculations are a little more complex. If you go through the above replacing the M for list 1 with N - M throughout, the details all line up the same way. (The pseudocode is also written in a way that works for any M greater than zero and less than N. There will then be exactly M zeros and N - M ones.)
The second means that in practice, there might be many, many chunks to merge this way. The process inherits several properties of merge sort — in particular, it requires that for K chunks, you'll have to perform roughly K / 2 merges, and then K / 4 merges, and so on, until all the data has been merged. Each batch of merges will loop over the entire dataset, and there will be roughly log2(K) batches, for a run time of O(N * log(K)). An ordinary Fisher-Yates shuffle would be strictly linear in N, and so in theory would be faster for very large K. But until K gets very, very large, the penalty may be much smaller than the disk seeking penalties.
The benefit of this approach, then, comes from smart IO management. And with SSDs it might not even be worth it — the seek penalties might not be large enough to justify the overhead of multiple merges. Paul Hankin's answer has some practical tips for thinking through the practical issues raised.
Merging all data at once
An alternative to doing multiple binary merges would be to merge all the chunks at once -- which is theoretically possible, and might lead to an O(N) algorithm. The random number generation algorithm for values in c would need to generate labels from 0 to K - 1, such that the final outputs have exactly the right number of labels for each category. (In other words, if you're merging three chunks with 10, 12, and 13 items, then the final value of c would need to have 0 ten times, 1 twelve times, and 2 thirteen times.)
I think there is probably an O(N) time, O(1) space algorithm that will do that, and if I can find one or work one out, I'll post it here. The result would be a truly O(N) shuffle, much like the one Paul Hankin describes towards the end of his answer.
Logically partition your database entries (e.g. alphabetically)
Create indexes based on your created partitions
Build a DAO to select based on the index
