Which number of threshold will be better in Java7 ForkJoinTask

Which number of threshold will be better in Java7 ForkJoinTask - java

I was trying out the Java ForkJoin framework and wrote a program that to process a large data list.
It is well known that the field threshold is always set in ForkJoinTask to point out the minimum number for the partition of data list.
The question is, how big or small of threshold will make better performance, or is flexible and only associated with the core number of CPU or threads support?
Is there a best practice for threshold in parallel compute framework such as Forkjointask?

There is no set rule for threshold. A good number depends on the number of elements in the array (N), the type of processing for each element (Q) (doing a simple compare of two number is a low Q, doing an intricate calculation is a high Q.)
I use a general formula that works fairly well most of the time when I don't always know Q: I want to generate about 8 times as many tasks as threads or a minimum threshold of 32k (depending on N of course.)
int temp = count / (threads << 3);
threshold = (temp < 32768) ? 32768 : temp;
Where count is N and threads is the number of threads.

Related

OHC Caching in Java

I am learning about Java off-heap cache and I use OHC cache. I found out about the source code of OHC and it contains methods that I don't know what they are used for. Hope someone can explain it to me, thanks.
int cpus = Runtime.getRuntime().availableProcessors(); // my CPU = 4
segmentCount = roundUpToPowerOf2(cpus * 2, 1 << 30);
capacity = Math.min(cpus * 16, 64) * 1024 * 1024;
static int roundUpToPowerOf2(int number, int max) {
return number >= max ? max : (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
}

To minimize lock contention, OHC splits the entire cache into 'segments' so that only if two entries hash to the same segment must one operation wait for the other. This is something like table-level locking in a relational database.
The meaning of cpus is clear enough.
Default segmentCount is the smallest power of 2 larger than twice the CPU count. For the logic of doubling the CPU count for throughput optimization, see for example https://stackoverflow.com/a/4771384.
capacity is the total storable data size in the cache. The default is 16MB per core. This is probably designed to correspond with CPU cache sizes, though presumably a user would have an idea of their application's actual capacity needs and would very likely be configuring this value and not using the default anyway.
The actual roundUpToPowerOf2 can be explained as follows:
Do not go above max, nor below 1. (I suppose it's up to the caller to assure that max is itelf a power of two or, it's ok for it not to be in this case.) In betweeen: To get a power of two, we want an int comprised of a single one bit (or all zeros). Integer#highestOneBit gives such a number with the turned on bit being the leftmost turned on bit in its argument. So we need to provide it with a number with whose leftmost turned on bit is:
the same as number if it is already a power of two, or
one position to the left of number's leftmost one bit.
Calculating number - 1 before left shifting is for the first case. If number is a power of two, then left shifting as-is would give us the next power of two, which isn't what we want. For the second case, the number (or it's value subtracted by 1) left shifted turns on the next larger bit, then Integer#highestOneBit effectively blanks out all of the bits to the right.

Obtain primes with threads. How to divide intervals?

I want to obtain all the prime numbers in the interval [min, max]. I want to do the calculus in n servers. Which is the best way to divide my initial interval in another n intervals, to manage approximately the same load in all the n servers?
[Optional]
I had an idea, but it didnt work as I expected. I assumed that all numbers are primes, so that a number i costs i instructions to verify is a prime.
If we keep in mind this method:
Then, the number of instructions to get primes in interval [1,100] is 1+2+..+99+100 = 100(1+100)/2 = 5050.
Now, if I want to do this calculus in 2 servers (n=2), I have to divide this load to each one (2525 instructions each one). The interval I want is defined by 2525 = x(1+x)/2 -> x=71.
In general terms, the general formula would be Load = (Interval(x) - Interval(x-1) + 1) * (Interval(x-1) + Interval(x)) / 2, being Load = (max - min + 1) * (min + max) / (2 * n).
Doing this, with x and y = [1:9999999] and n = 16, I have got this results:
(source: subirimagenes.com)
I dont get the same time and instructions in all servers, what means this is not the way to divide the intervals.
Any idea?

I think you looking for a parallel approach.
This is what the work stealing algorithm was designed for, aka Fork Join Pool. In fact, prime number calculation is a classic use case for work stealing because telling whether n is prime requires iterating till sqrt(n) so the bigger is n the longer it takes. So distributing them among your workers evenly and waiting for every worker to finish its job is unfair, the first core will quickly determine whether n is prime or not and sit idle and the other core will stay busy calculating bigger numbers. With work stealing the idle processor will steal work from its neighbours queues.
This implementation might be useful.

I solved this problem. I didn't do a first complete division of my interval and assign each part to a different server. I made the decision to divide the interval into very small parts ([min, max]/length^2 for example), and each calculus server got one of these parts. When they finish, they get another one until there is no more small intervals to calculate.
Why did I do this? The reason it's because I can't ensure that the servers I'm working with have the same calculus speed.

What is the time complexity of Arrays.parallelSetAll()?

I just read from : Everything about java 8
that, java 8 adds Arrays.parallelSetAll()
int[] array = new int[8];
AtomicInteger i= new AtomicInteger();
Arrays.parallelSetAll(array, operand -> i.incrementAndGet());
[Edited] Is it O(1) or a constant time complexity on the same machine for same no.of elements in array ? What sort of performance improvement is indicated by the method name?

To start off, it can never be O(1), more clarification following:
I am using that n = array.length, which in your case is 8, however that does not matter as it could also be a very big number.
Now observe that normally you would do:
for (int i = 0; i < n; i++) {
array[i] = i.incrementAndGet();
}
This is with Java 8 much easier:
Arrays.setAll(array, v -> i.incrementAndGet());
Observe that they both take O(n) time.
Now take into account that you execute the code parallel, but there are no guarantees as to how it executes it, you do not know the number of parallellizations it does under the hood, if any at all for such a low number.
Therefore it still takes O(n) time, because you cannot prove that it will parallellize over n threads.
Edit, as an extra, I have observed that you seem to think that parallellizing an action means that any O(k) will converge to O(1), where k = n or k = n^2, etc.
This is not the case in practice as you can prove that you never have k processor cores available.
An intuitive argument is your own computer, if you are lucky it may have 8 cores, therefore the maximum time you could get under perfect parallellization conditions is O(n / 8).
I can already hear the people from the future laughing at that we only had 8 CPU cores...

It is O(N). Calling Arrays.parallelSetAll(...) involves assignments to set a total of array.length array elements. Even if those assignments are spread across P processors, the total number of assignments is linearly proportional to the length of the array. Take N as the length of the array, and math is obvious.
The thing to realize is that P ... the number of available processors ... is going to be a constant for any given execution of a program on a single computer. (Or if it is not a constant, there will be a constant upper bound.) And a computation whose sole purpose is to assign values to an array only makes sense when executed on a single computer.

External shuffle: shuffling large amount of data out of memory

I am looking for a way to shuffle a large amount of data which does not fit into memory (approx. 40GB).
I have around 30 millions entries, of variable length, stored in one large file. I know the starting and ending positions of each entry in that file. I need to shuffle this data which does not fit in the RAM.
The only solution I thought of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm and then copy the entries in a new file, according to this order. Unfortunately, this solution involves a lot of seek operations, and thus, would be very slow.
Is there a better solution to shuffle large amount of data with uniform distribution?

First get the shuffle issue out of your face. Do this by inventing a hash algorithm for your entries that produces random-like results, then do a normal external sort on the hash.
Now you have transformed your shuffle into a sort your problems turn into finding an efficient external sort algorithm that fits your pocket and memory limits. That should now be as easy as google.

A simple approach is to pick a K such that 1/K of the data fits comfortably in memory. Perhaps K=4 for your data, assuming you've got 16GB RAM. I'll assume your random number function has the form rnd(n) which generates a uniform random number from 0 to n-1.
Then:
for i = 0 .. K-1
Initialize your random number generator to a known state.
Read through the input data, generating a random number rnd(K) for each item as you go.
Retain items in memory whenever rnd(K) == i.
After you've read the input file, shuffle the retained data in memory.
Write the shuffled retained items to the output file.
This is very easy to implement, will avoid a lot of seeking, and is clearly correct.
An alternative is to partition the input data into K files based on the random numbers, and then go through each, shuffling in memory and writing to disk. This reduces disk IO (each item is read twice and written twice, compared to the first approach where each item is read K times and written once), but you need to be careful to buffer the IO to avoid a lot of seeking, it uses more intermediate disk, and is somewhat more difficult to implement. If you've got only 40GB of data (so K is small), then the simple approach of multiple iterations through the input data is probably best.
If you use 20ms as the time for reading or writing 1MB of data (and assuming the in-memory shuffling cost is insignificant), the simple approach will take 40*1024*(K+1)*20ms, which is 1 minute 8 seconds (assuming K=4). The intermediate-file approach will take 40*1024*4*20ms, which is around 55 seconds, assuming you can minimize seeking. Note that SSD is approximately 20 times faster for reads and writes (even ignoring seeking), so you should expect to perform this task in well under 10s using an SSD. Numbers from Latency Numbers every Programmer should know

I suggest keeping your general approach, but inverting the map before doing the actual copy. That way, you read sequentially and do scattered writes rather than the other way round.
A read has to be done when requested before the program can continue. A write can be left in a buffer, increasing the probability of accumulating more than one write to the same disk block before actually doing the write.

Premise
From what I understand, using the Fisher-Yates algorithm and the data you have about the positions of the entries, you should be able to obtain (and compute) a list of:
struct Entry {
long long sourceStartIndex;
long long sourceEndIndex;
long long destinationStartIndex;
long long destinationEndIndex;
}
Problem
From this point onward, the naive solution is to seek each entry in the source file, read it, then seek to the new position of the entry in the destination file and write it.
The problem with this approach is that it uses way too many seeks.
Solution
A better way to do it, is to reduce the number of seeks, using two huge buffers, for each of the files.
I recommend a small buffer for the source file (say 64MB) and a big one for the destination file (as big as the user can afford - say 2GB).
Initially, the destination buffer will be mapped to the first 2GB of the destination file. At this point, read the whole source file, in chunks of 64MB, in the source buffer. As you read it, copy the proper entries to the destination buffer. When you reach the end of the file, the output buffer should contain all the proper data. Write it to the destination file.
Next, map the output buffer to the next 2GB of the destination file and repeat the procedure. Continue until you have wrote the whole output file.
Caution
Since the entries have arbitrary sizes, it's very likely that at the beginning and ending of the buffers you will have suffixes and prefixes of entries, so you need to make sure you copy the data properly!
Estimated time costs
The execution time depends, essentially, on the size of the source file, the available RAM for the application and the reading speed of the HDD. Assuming a 40GB file, a 2GB RAM and a 200MB/s HDD read speed, the program will need to read 800GB of data (40GB * (40GB / 2GB)). Assuming the HDD is not highly fragmented, the time spent on seeks will be negligible. This means the reads will take up one hour! But if, luckily, the user has 8GB of RAM available for your application, the time may decrease to only 15 to 20 minutes.
I hope this will be enough for you, as I don't see any other faster way.

Although you can use external sort on a random key, as proposed by OldCurmudgeon, the random key is not necessary. You can shuffle blocks of data in memory, and then join them with a "random merge," as suggested by aldel.
It's worth specifying what "random merge" means more clearly. Given two shuffled sequences of equal size, a random merge behaves exactly as in merge sort, with the exception that the next item to be added to the merged list is chosen using a boolean value from a shuffled sequence of zeros and ones, with exactly as many zeros as ones. (In merge sort, the choice would be made using a comparison.)
Proving it
My assertion that this works isn't enough. How do we know this process gives a shuffled sequence, such that every ordering is equally possible? It's possible to give a proof sketch with a diagram and a few calculations.
First, definitions. Suppose we have N unique items, where N is an even number, and M = N / 2. The N items are given to us in two M-item sequences labeled 0 and 1 that are guaranteed to be in a random order. The process of merging them produces a sequence of N items, such that each item comes from sequence 0 or sequence 1, and the same number of items come from each sequence. It will look something like this:
0: a b c d
1: w x y z
N: a w x b y c d z
Note that although the items in 0 and 1 appear to be in order, they are just labels here, and the order doesn't mean anything. It just serves to connect the order of 0 and 1 to the order of N.
Since we can tell from the labels which sequence each item came from, we can create a "source" sequence of zeros and ones. Call that c.
c: 0 1 1 0 1 0 0 1
By the definitions above, there will always be exactly as many zeros as ones in c.
Now observe that for any given ordering of labels in N, we can reproduce a c sequence directly, because the labels preserve information about the sequence they came from. And given N and c, we can reproduce the 0 and 1 sequences. So we know there's always one path back from a sequence N to one triple (0, 1, c). In other words, we have a reverse function r defined from the set of all orderings of N labels to triples (0, 1, c) -- r(N) = (0, 1, c).
We also have a forward function f from any triple r(n) that simply re-merges 0 and 1 according to the value of c. Together, these two functions show that there is a one-to-one correspondence between outputs of r(N) and orderings of N.
But what we really want to prove is that this one-to-one correspondence is exhaustive -- that is, we want to prove that there aren't extra orderings of N that don't correspond to any triple, and that there aren't extra triples that don't correspond to any ordering of N. If we can prove that, then we can choose orderings of N in a uniformly random way by choosing triples (0, 1, c) in a uniformly random way.
We can complete this last part of the proof by counting bins. Suppose every possible triple gets a bin. Then we drop every ordering of N in the bin for the triple that r(N) gives us. If there are exactly as many bins as orderings, then we have an exhaustive one-to-one correspondence.
From combinatorics, we know that number of orderings of N unique labels is N!. We also know that the number of orderings of 0 and 1 are both M!. And we know that the number of possible sequences c is N choose M, which is the same as N! / (M! * (N - M)!).
This means there are a total of
M! * M! * N! / (M! * (N - M)!)
triples. But N = 2 * M, so N - M = M, and the above reduces to
M! * M! * N! / (M! * M!)
That's just N!. QED.
Implementation
To pick triples in a uniformly random way, we must pick each element of the triple in a uniformly random way. For 0 and 1, we accomplish that using a straightforward Fisher-Yates shuffle in memory. The only remaining obstacle is generating a proper sequence of zeros and ones.
It's important -- important! -- to generate only sequences with equal numbers of zeros and ones. Otherwise, you haven't chosen from among Choose(N, M) sequences with uniform probability, and your shuffle may be biased. The really obvious way to do this is to shuffle a sequence containing an equal number of zeros and ones... but the whole premise of the question is that we can't fit that many zeros and ones in memory! So we need a way to generate random sequences of zeros and ones that are constrained such that there are exactly as many zeros as ones.
To do this in a way that is probabilistically coherent, we can simulate drawing balls labeled zero or one from an urn, without replacement. Suppose we start with fifty 0 balls and fifty 1 balls. If we keep count of the number of each kind of ball in the urn, we can maintain a running probability of choosing one or the other, so that the final result isn't biased. The (suspiciously Python-like) pseudocode would be something like this:
def generate_choices(N, M):
n0 = M
n1 = N - M
while n0 + n1 > 0:
if randrange(0, n0 + n1) < n0:
yield 0
n0 -= 1
else:
yield 1
n1 -= 1
This might not be perfect because of floating point errors, but it will be pretty close to perfect.
This last part of the algorithm is crucial. Going through the above proof exhaustively makes it clear that other ways of generating ones and zeros won't give us a proper shuffle.
Performing multiple merges in real data
There remain a few practical issues. The above argument assumes a perfectly balanced merge, and it also assumes you have only twice as much data as you have memory. Neither assumption is likely to hold.
The fist turns out not to be a big problem because the above argument doesn't actually require equally sized lists. It's just that if the list sizes are different, the calculations are a little more complex. If you go through the above replacing the M for list 1 with N - M throughout, the details all line up the same way. (The pseudocode is also written in a way that works for any M greater than zero and less than N. There will then be exactly M zeros and M - N ones.)
The second means that in practice, there might be many, many chunks to merge this way. The process inherits several properties of merge sort — in particular, it requires that for K chunks, you'll have to perform roughly K / 2 merges, and then K / 4 merges, and so on, until all the data has been merged. Each batch of merges will loop over the entire dataset, and there will be roughly log2(K) batches, for a run time of O(N * log(K)). An ordinary Fisher-Yates shuffle would be strictly linear in N, and so in theory would be faster for very large K. But until K gets very, very large, the penalty may be much smaller than the disk seeking penalties.
The benefit of this approach, then, comes from smart IO management. And with SSDs it might not even be worth it — the seek penalties might not be large enough to justify the overhead of multiple merges. Paul Hankin's answer has some practical tips for thinking through the practical issues raised.
Merging all data at once
An alternative to doing multiple binary merges would be to merge all the chunks at once -- which is theoretically possible, and might lead to an O(N) algorithm. The random number generation algorithm for values in c would need to generate labels from 0 to K - 1, such that the final outputs have exactly the right number of labels for each category. (In other words, if you're merging three chunks with 10, 12, and 13 items, then the final value of c would need to have 0 ten times, 1 twelve times, and 2 thirteen times.)
I think there is probably an O(N) time, O(1) space algorithm that will do that, and if I can find one or work one out, I'll post it here. The result would be a truly O(N) shuffle, much like the one Paul Hankin describes towards the end of his answer.

Logically partition your database entries (for e.g Alphabetically)
Create indexes based on your created partitions
build DAO to sensitize based on index

Parallelizing Sieve of Eratosthenes Algorithm for finding Prime Number

Parallelize Sieve of Eratosthenes method in two ways
using Java and
using C/C++ with Pthreads
Find the best values for THRESHOLD for 2 and 4 Core CPUs.
can any one help me how to do it. i am learning threads of java & C/C++.what things will i need to parallelize this algorithm

Note that using the Sieve of Eratostheens method to find the prime numbers table, once you find a prime number i - you set i*n as non-prime for each n.
Note that for 2 elements that you know they are prime numbers - i,j you can do it in parallel, i does not require any information from j and vise versa.
The same of course holds for each k prime numbers.
Note however that finding if a number is prime - depends on last calculations! Thus, a barrier is needed between marking numbers as non-primes and finding next prime number to work with.
A good place to start could be:
repeat until finished filling the table:
1. Find k prime numbers [serially]
2. Fill the table for these k numbers [parallely]
3. after all threads finished step 2 [barrier] - return to 1
Since it seems homework, I'll only give these hints, and let you do the rest of the work. If you later have any specific problem - ask a new question and show us what you already did.
EDIT: one more important hint, note that if i is prime, the calculation of all the non prime numbers that derive from i will not affect the fact that j is prime for each i < j < 2i. Use this fact to find the k prime numbers, you don't want for example to take 2,3,4 as prime numbers on your first iteration of the algorithm.

Prime sieves use a lot of space.
In case of 1 bit per number, you'd need 1 giga-bit (2ˇ30) to store sieve for N = ~10ˇ9. And 1 terabit (2ˇ40) to store sieve for N = ~10ˇ12, which is 2ˇ37 bytes = 2ˇ7 GB = 128GB.
If only storing odd numbers, 16 numbers/B, 64GB is needed for N=10ˇ9. With your laptops 16GB you'll be limited to N < 2ˇ35. Good to start cracking RSA 70-bit.
If making algorithm much more complex and not storing multiples of 3 5 or 7 in the seive, 1 byte would hold sieve-bits for 30 numbers, which would let ~2ˇ36 = 2ˇ(30+6) = ~64 x 10ˇ9, dangerous for RSA72 :lol: To crack RSA1024, youd need (1024/2 = 512, 512-36 = 476)
2ˇ476 (~10ˇ159) times more memory.
So memory usage is the main problem.
After all, even fast RAM is hundreds of times slower than CPU's L1 cache, so you want to fit the data to 32KB or upto 256KB instead of zillion TBs.
As the sieve is accessesd pretty randomly, it is not automagically loaded to cache and you'd get contineous cache-misses. You need to go through the range of numbers finishing the whole job chunk by chunk. While chunk-size is your fastest memory (L1 L2 of CPU or GPU).
Let's skip the complex part...
You'll have a list of primes that are needed to unmark composites in current junk. If already sieving big numbers, lets say 10ˇ30 upto 10ˇ30+10ˇ8, the range of 100M numbers has around 1M primes waiting in the list just to unmark 1 of their composites, each prime of that size would need (i guess) around 128B to be stored, so that you'd need 128MB for it. Luckyly this list is accessed sequentially and it may be already in cache when needed. But anyway you'll need zillion bytes to store the prime-lists for the next sieves.
About multithreading, scalability to GPUs, thousands of threads e.g.
Each prime in the list to unmark 1 bit in sieve in the fastest memory, accesses it basically randomly. GPu doesn't write bit by bit, instead it writes a la 128B per operation, it would be 1024bits. While 1 thread tries to unmark 1 bit, the other thread unmarks the otherbit while the bit of the 1st thread will have value 1 again. To fence/barrier/lock the memory would stop all the threads and no speed increase althoough lots of threads running.
So threads should not share the sieve. Would need to share sieving-prime-lists instead, so that each thread would have its own chunk of a chunk of sieve, and would use the same primes for unmarking. But after unmarking, they need to "schedule" that prime to the shared primes-list for its future sieve, which will be executed after millions of sievings, at the same time while other threads are changing the same list. Pretyy much stuck again.
It is very easy to make it parallel while slowing it down. Rather often it would be faster to recalculate something in each thread than to ask it over a slow bus like RAM, PCIe, SSD, GBit-net... but it is possible and not very complex if you have only some threads.

An approach is to let a single thread find the next prime number in the sieve, and then let all threads mark the multiples concurently. Every thread will be assigned a different section of the array to avoid memory sharing as much as possible. So every thread needs to determine what range of multiples it will handle.

First, should the parallelization affect the creation of the table, or
simply using it to determine whether a number is prime or not. In a
real application, I'd do the first off line, so the table would appear
in the final C++ code as a statically initialized C style array. So the
question of parallelization would be irrelevant. And since nothing
should be modified in the second, you can access from as many threads as
you want, without concern.
I suspect, however, that the purpose of the exercise is for you to use
multiple threads to construct the table. (Otherwise, it doesn't make
sense.) In this case: the table is constructed by a series of loops,
with steps 2, 3, 5... Each of these loops can be executed in a separate
thread, but... some sort of synchronization will be needed for
concurrent accesses. If you treat the table as a single object, with
just one lock, you'll end up either running the loops sequencially
(because you're acquiring the lock outside of the loop), or spending
more time acquiring and releasing the lock than doing any real work.
(Acquiring and releasing an uncontested lock can be very fast. But not
as fast as just setting a bool. And in this case, the lock is going
to be very, very contested, since all of the threads want it most of the
time.) If you create a lock per bool, that's going to be an awful lot
of locks—it will probably take less time to construct the table in
a single thread than to create all of the mutexes.
Of course, in C++ (and perhaps in Java as well), you'll want to use a
bitmap, rather than one bool per entry; the larger the table, the
larger the maximum number you can handle. (Something like bool
sieve[INT_MAX]; is almost certain to fail; you might be able to get
away with unsigned char sieve[INT_MAX / 8 + 1];, however.) In this
case, you'll need a mutex per element, not per entry (which would be a
single bit in the element). Given that each mutex eats up some
resources as well, you probably want to divide the table into discrete
blocks, with a mutex per block, and use a nested loop:
int j = 0;
for ( int i = 0; i < numberOfBlocks; ++ i ) {
scoped_lock( mutex_table[i] );
while ( j < (i + 1) * bitsInBlock ) {
// ...
j += step;
}
Once this is working, a bit of tuning will be necessary to determine the
optimal block size (but I would guess fairly big).

Your teacher won't like this, but I guess there is a brutal approach which can be worth considering.
Just let every thread repeat
find the next prime in the sieve
mark all multiples of this prime
independently of all others and without any synchronization. Every thread stops when there it finds no more prime.
This is brutal because several threads may work on the same prime by accident, but the final sieve will be correct (all composites detected), and what you loose in terms of duplicate work could be regained by the absence of synchronization.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.