I was trying out the Java ForkJoin framework and wrote a program to process a large data list.
It is common practice to define a threshold field in a ForkJoinTask to indicate the minimum size below which a partition of the data list is processed directly rather than split further.
The question is: how big or small should the threshold be for the best performance? Or is it flexible, tied only to the number of CPU cores or supported threads?
Is there a best practice for choosing the threshold in a parallel compute framework such as ForkJoinTask?
There is no set rule for the threshold. A good number depends on the number of elements in the array (N) and the type of processing for each element (Q): a simple compare of two numbers is a low Q, an intricate calculation is a high Q.
I use a general formula that works fairly well most of the time when I don't know Q: I aim to generate about 8 times as many tasks as threads, with a minimum threshold of 32k elements (depending on N, of course):
int temp = count / (threads << 3);
threshold = (temp < 32768) ? 32768 : temp;
Where count is N and threads is the number of threads.
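As a sketch of how that formula might plug into a task (SumTask and the summing work here are placeholders of my own, not part of the question):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    private final long[] data;
    private final int from, to;        // half-open range [from, to)
    private final int threshold;

    SumTask(long[] data, int from, int to, int threshold) {
        this.data = data; this.from = from; this.to = to; this.threshold = threshold;
    }

    @Override
    protected Long compute() {
        if (to - from <= threshold) {              // small enough: process directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;               // otherwise split in two and fork
        SumTask left = new SumTask(data, from, mid, threshold);
        SumTask right = new SumTask(data, mid, to, threshold);
        left.fork();
        return right.compute() + left.join();
    }

    public static void main(String[] args) {
        long[] data = new long[10_000_000];
        int threads = Runtime.getRuntime().availableProcessors();
        int temp = data.length / (threads << 3);        // aim for ~8 tasks per thread
        int threshold = (temp < 32768) ? 32768 : temp;  // but never below 32k elements
        long sum = new ForkJoinPool(threads).invoke(new SumTask(data, 0, data.length, threshold));
        System.out.println(sum);
    }
}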
This is the code I don't understand:
// Divide n by two until small enough for nextInt. On each
// iteration (at most 31 of them but usually much less),
What? With my trivial simulation for a randomly chosen n I get up to 32 iterations and 31 is the average.
// randomly choose both whether to include high bit in result
// (offset) and whether to continue with the lower vs upper
// half (which makes a difference only if odd).
This makes sense.
long offset = 0;
while (n >= Integer.MAX_VALUE) {
int bits = next(2);
Get two bits for the two decisions, makes sense.
long half = n >>> 1;
long nextn = ((bits & 2) == 0) ? half : n - half;
Here, n is n/2.0 rounded up or down, nice.
if ((bits & 1) == 0) offset += n - nextn;
n = nextn;
I'm lost.
}
return offset + nextInt((int) n);
I can see that it generates a properly sized number, but it looks pretty complicated and rather slow, and I definitely can't see why the result should be uniformly distributed.[1]
[1] It can't be truly uniformly distributed, as the state is only 48 bits, so it can generate no more than 2^48 different numbers. The vast majority of longs can't be generated, but an experiment showing it would probably take years.
I think you're somewhat mistaken... Let me try to describe the algorithm the way I see it:
First, assume that nextInt(big) (nextInt, not nextLong) is correctly able to generate a well distributed range of values between 0 (inclusive) and big (exclusive).
This nextInt(int) function is used as part of nextLong(long).
So, the algorithm works by looping until the value is less than Integer.MAX_VALUE, at which point it uses nextInt(int). The more interesting thing is what it does before that...
Mathematically, if we keep taking half of a number, then half of that half, and so on, the halves tend toward zero. Similarly, if we take half of a number and add to it half of that half, then half of that, and so on, the sum tends toward the original number.
What the algorithm does here is it takes half of the number. By doing integer division, if the number is odd, then there's a 'big' half and a 'small' half. The algorithm 'randomly' chooses one of those halves (either big or small).
Then, it randomly chooses to add that half, or not add that half to the output.
It keeps halving the number and (possibly) adding the half until the half is less than Integer.MAX_VALUE. At that point it simply computes the nextInt(half) value and adds that to the result.
Assuming the initial long limit was much greater than Integer.MAX_VALUE then the final result will get all the benefit of nextInt(int) for a large int value which is at least 32 bits of state, as well as 2 bits of state for all the higher bits above Integer.MAX_VALUE.
The larger the original limit is (the closer it is to Long.MAX_VALUE), the more times the loop will iterate. In the worst case it will go 32 times, but for smaller limits it will go through fewer times. In the worst case, for very large limits, you will get 64 bits of randomness used for the loops, and then whatever is needed for the nextInt(half) too.
EDIT: Walkthrough added. Working out the number of outcomes is harder, but all values of long from 0 to Long.MAX_VALUE - 1 are possible outcomes. As 'proof', nextLong(0x4000000000000000) is a neat example to start from because all the halving processes will be even and it has bit 63 set.
Because bit 63 is set (the most significant bit available, because bit 64 would make the number negative, which is illegal) it means that we will halve the value 32 times before the value is <= Integer.MAX_VALUE (which is 0x000000007fffffff - and our half will be 0x0000000040000000 when we get there....). Because halving and bit-shifting are the same process, it holds that there are as many halves to do as the difference between the highest bit set and bit 31. 63 - 31 is 32, so we halve things 32 times, and thus we do 32 loops in the while loop. The initial start value of 0x4000000000000000 means that as we halve the value, there will only be one bit set in the half, and it will 'walk' down the value - shifting 1 to the right each time through the loop.
Because I chose the initial value carefully, it is apparent that in the while loop the logic is essentially deciding whether to set each bit or not. It takes half of the input value (which is 0x2000000000000000) and decides whether to add that to the result or not. Let's assume for the sake of argument that all our loops decide to add the half to the offset, in which case, we start with an offset of 0x0000000000000000, and then each time through the loop we add a half, which means each time we add:
0x2000000000000000
0x1000000000000000
0x0800000000000000
0x0400000000000000
.......
0x0000000100000000
0x0000000080000000
0x0000000040000000 (this is added because it is the last 'half' calculated)
At this point our loop has run 32 times, it has 'chosen' to add the value 32 times, and thus has at least 32 bits of state in the value (64 if you count the big/little half decision). The actual offset is now 0x3fffffffc0000000 (all bits from 62 down to 31 are set).
Then, we call nextInt(0x40000000) which, as luck would have it, produces the result 0x3fffffff, using 31 bits of state to get there. We add this value to our offset and get the result:
0x3fffffffffffffff
With a 'perfect' distribution of nextInt(0x40000000) results we would have had perfect coverage of the values 0x3fffffffc0000000 through 0x3fffffffffffffff with no gaps. With perfect randomness in our while loop, our high bits would have been a perfect distribution of 0x0000000000000000 through 0x3fffffffc0000000. Combined, there is full coverage from 0 through to our limit (exclusive) of 0x4000000000000000.
With the 32 bits of state from the high bits, and the (assumed) minimum 31 bits of state from nextInt(0x40000000), we have 63 bits of state (more if you count the big/little half decision), and full coverage.
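For reference, here is the loop from the question assembled into a self-contained method. This is a sketch only: the protected next(2) call from the original is replaced by rng.nextInt(4), which likewise supplies two random bits.

import java.util.Random;

public class BoundedNextLong {
    // Returns a value in [0, n), following the loop discussed above.
    static long nextLong(Random rng, long n) {
        if (n <= 0) throw new IllegalArgumentException("n must be positive");
        long offset = 0;
        while (n >= Integer.MAX_VALUE) {
            int bits = rng.nextInt(4);                // two random bits, standing in for next(2)
            long half = n >>> 1;                      // the 'small' half
            long nextn = ((bits & 2) == 0) ? half : n - half;   // keep small or big half
            if ((bits & 1) == 0) offset += n - nextn; // maybe push the discarded half into the offset
            n = nextn;
        }
        return offset + rng.nextInt((int) n);         // finish with an int-sized bound
    }

    public static void main(String[] args) {
        Random rng = new Random();
        System.out.println(nextLong(rng, 0x4000000000000000L));  // the walkthrough's example bound
    }
}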
I am looking for a way to shuffle a large amount of data which does not fit into memory (approx. 40GB).
I have around 30 millions entries, of variable length, stored in one large file. I know the starting and ending positions of each entry in that file. I need to shuffle this data which does not fit in the RAM.
The only solution I thought of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm and then copy the entries in a new file, according to this order. Unfortunately, this solution involves a lot of seek operations, and thus, would be very slow.
Is there a better solution to shuffle large amount of data with uniform distribution?
First, get the shuffle issue out of the way. Do this by inventing a hash algorithm for your entries that produces random-like results, then do a normal external sort on the hash.
Now that you have transformed your shuffle into a sort, your problem turns into finding an efficient external sort algorithm that fits your budget and memory limits. That should now be as easy as a Google search.
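A minimal sketch of the tagging idea, using a random key in place of a hand-rolled random-like hash, and assuming line-oriented records for simplicity (the idea carries over to the offset-based records in the question). The file names are placeholders, and the external sort itself - on the fixed-width key prefix - is whatever external sort tool or library you prefer:

import java.io.*;
import java.util.Random;

public class TagWithRandomKeys {
    public static void main(String[] args) throws IOException {
        Random rng = new Random();
        try (BufferedReader in = new BufferedReader(new FileReader("entries.txt"));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("tagged.txt")))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Fixed-width hex key, so a plain lexicographic external sort of the
                // tagged file is a shuffle of the original records.
                out.printf("%016x\t%s%n", rng.nextLong(), line);
            }
        }
        // Then: externally sort tagged.txt on the key prefix, and strip the
        // 17-character "key + tab" prefix from each line to recover the shuffled data.
    }
}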
A simple approach is to pick a K such that 1/K of the data fits comfortably in memory. Perhaps K=4 for your data, assuming you've got 16GB RAM. I'll assume your random number function has the form rnd(n) which generates a uniform random number from 0 to n-1.
Then:
for i = 0 .. K-1
Initialize your random number generator to a known state.
Read through the input data, generating a random number rnd(K) for each item as you go.
Retain items in memory whenever rnd(K) == i.
After you've read the input file, shuffle the retained data in memory.
Write the shuffled retained items to the output file.
This is very easy to implement, will avoid a lot of seeking, and is clearly correct.
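A sketch of those steps in Java, assuming line-oriented records; SEED, K and the file names are placeholders of mine. Re-seeding the generator with the same value on every pass is what makes the K passes agree on which item belongs to which slice:

import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Random;

public class KPassShuffle {
    static final long SEED = 42L;   // any fixed seed; must be identical across passes
    static final int K = 4;         // pick K so that 1/K of the data fits in memory

    public static void main(String[] args) throws IOException {
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("shuffled.txt")))) {
            for (int i = 0; i < K; i++) {
                Random rng = new Random(SEED);        // known state, same on every pass
                ArrayList<String> retained = new ArrayList<>();
                try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (rng.nextInt(K) == i) retained.add(line);   // keep 1/K of the items
                    }
                }
                Collections.shuffle(retained);        // in-memory shuffle of this slice
                for (String line : retained) out.println(line);
            }
        }
    }
}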
An alternative is to partition the input data into K files based on the random numbers, and then go through each, shuffling in memory and writing to disk. This reduces disk IO (each item is read twice and written twice, compared to the first approach where each item is read K times and written once), but you need to be careful to buffer the IO to avoid a lot of seeking, it uses more intermediate disk, and is somewhat more difficult to implement. If you've got only 40GB of data (so K is small), then the simple approach of multiple iterations through the input data is probably best.
If you use 20ms as the time for reading or writing 1MB of data (and assuming the in-memory shuffling cost is insignificant), the simple approach will take 40*1024*(K+1)*20ms, which is 1 hour 8 minutes (assuming K=4). The intermediate-file approach will take 40*1024*4*20ms, which is around 55 minutes, assuming you can minimize seeking. Note that SSD is approximately 20 times faster for reads and writes (even ignoring seeking), so you should expect to perform this task in well under 10 minutes using an SSD. Numbers are from Latency Numbers Every Programmer Should Know.
I suggest keeping your general approach, but inverting the map before doing the actual copy. That way, you read sequentially and do scattered writes rather than the other way round.
A read has to be done when requested before the program can continue. A write can be left in a buffer, increasing the probability of accumulating more than one write to the same disk block before actually doing the write.
Premise
From what I understand, using the Fisher-Yates algorithm and the data you have about the positions of the entries, you should be able to obtain (and compute) a list of:
struct Entry {
    long long sourceStartIndex;
    long long sourceEndIndex;
    long long destinationStartIndex;
    long long destinationEndIndex;
};
Problem
From this point onward, the naive solution is to seek each entry in the source file, read it, then seek to the new position of the entry in the destination file and write it.
The problem with this approach is that it uses way too many seeks.
Solution
A better way to do it is to reduce the number of seeks by using two huge buffers, one for each file.
I recommend a small buffer for the source file (say 64MB) and a big one for the destination file (as big as the user can afford - say 2GB).
Initially, the destination buffer will be mapped to the first 2GB of the destination file. At this point, read the whole source file, in chunks of 64MB, into the source buffer. As you read it, copy the proper entries to the destination buffer. When you reach the end of the file, the output buffer should contain all the proper data. Write it to the destination file.
Next, map the output buffer to the next 2GB of the destination file and repeat the procedure. Continue until you have written the whole output file.
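A sketch of that procedure, under a few assumptions of my own: Entry, the entries array (in source order, contiguous in the source file) and the window size are placeholders; destStart is the target offset assigned to the entry by the Fisher-Yates permutation:

import java.io.*;

class Entry { long srcStart; long destStart; int length; }

public class WindowedScatterCopy {
    static final int WINDOW = 1 << 30;   // 1GB destination window; use whatever RAM allows

    static void copy(Entry[] entries, String srcFile, RandomAccessFile dest) throws IOException {
        long destSize = 0;
        for (Entry e : entries) destSize = Math.max(destSize, e.destStart + e.length);

        byte[] window = new byte[WINDOW];
        for (long winStart = 0; winStart < destSize; winStart += WINDOW) {
            long winEnd = Math.min(winStart + WINDOW, destSize);
            // One sequential pass over the source per window; e.srcStart is implied
            // because the entries are read back in source order.
            try (DataInputStream in = new DataInputStream(
                     new BufferedInputStream(new FileInputStream(srcFile)))) {
                for (Entry e : entries) {
                    byte[] bytes = new byte[e.length];
                    in.readFully(bytes);                      // sequential read of this entry
                    long from = Math.max(e.destStart, winStart);
                    long to = Math.min(e.destStart + e.length, winEnd);
                    if (from < to) {                          // entry (partially) lands in this window
                        System.arraycopy(bytes, (int) (from - e.destStart),
                                         window, (int) (from - winStart), (int) (to - from));
                    }
                }
            }
            dest.seek(winStart);                              // one big sequential write per window
            dest.write(window, 0, (int) (winEnd - winStart));
        }
    }
}

The overlap arithmetic (from/to) is what takes care of the prefix/suffix issue mentioned in the caution below.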
Caution
Since the entries have arbitrary sizes, it's very likely that at the beginning and ending of the buffers you will have suffixes and prefixes of entries, so you need to make sure you copy the data properly!
Estimated time costs
The execution time depends, essentially, on the size of the source file, the available RAM for the application and the reading speed of the HDD. Assuming a 40GB file, a 2GB RAM and a 200MB/s HDD read speed, the program will need to read 800GB of data (40GB * (40GB / 2GB)). Assuming the HDD is not highly fragmented, the time spent on seeks will be negligible. This means the reads will take up one hour! But if, luckily, the user has 8GB of RAM available for your application, the time may decrease to only 15 to 20 minutes.
I hope this will be enough for you, as I don't see any other faster way.
Although you can use external sort on a random key, as proposed by OldCurmudgeon, the random key is not necessary. You can shuffle blocks of data in memory, and then join them with a "random merge," as suggested by aldel.
It's worth specifying what "random merge" means more clearly. Given two shuffled sequences of equal size, a random merge behaves exactly as in merge sort, with the exception that the next item to be added to the merged list is chosen using a boolean value from a shuffled sequence of zeros and ones, with exactly as many zeros as ones. (In merge sort, the choice would be made using a comparison.)
Proving it
My assertion that this works isn't enough. How do we know this process gives a shuffled sequence, such that every ordering is equally possible? It's possible to give a proof sketch with a diagram and a few calculations.
First, definitions. Suppose we have N unique items, where N is an even number, and M = N / 2. The N items are given to us in two M-item sequences labeled 0 and 1 that are guaranteed to be in a random order. The process of merging them produces a sequence of N items, such that each item comes from sequence 0 or sequence 1, and the same number of items come from each sequence. It will look something like this:
0: a b c d
1: w x y z
N: a w x b y c d z
Note that although the items in 0 and 1 appear to be in order, they are just labels here, and the order doesn't mean anything. It just serves to connect the order of 0 and 1 to the order of N.
Since we can tell from the labels which sequence each item came from, we can create a "source" sequence of zeros and ones. Call that c.
c: 0 1 1 0 1 0 0 1
By the definitions above, there will always be exactly as many zeros as ones in c.
Now observe that for any given ordering of labels in N, we can reproduce a c sequence directly, because the labels preserve information about the sequence they came from. And given N and c, we can reproduce the 0 and 1 sequences. So we know there's always one path back from a sequence N to one triple (0, 1, c). In other words, we have a reverse function r defined from the set of all orderings of N labels to triples (0, 1, c) -- r(N) = (0, 1, c).
We also have a forward function f from any triple (0, 1, c) that simply re-merges 0 and 1 according to the value of c. Together, these two functions show that there is a one-to-one correspondence between outputs of r(N) and orderings of N.
But what we really want to prove is that this one-to-one correspondence is exhaustive -- that is, we want to prove that there aren't extra orderings of N that don't correspond to any triple, and that there aren't extra triples that don't correspond to any ordering of N. If we can prove that, then we can choose orderings of N in a uniformly random way by choosing triples (0, 1, c) in a uniformly random way.
We can complete this last part of the proof by counting bins. Suppose every possible triple gets a bin. Then we drop every ordering of N in the bin for the triple that r(N) gives us. If there are exactly as many bins as orderings, then we have an exhaustive one-to-one correspondence.
From combinatorics, we know that the number of orderings of N unique labels is N!. We also know that the number of orderings of 0 and 1 are both M!. And we know that the number of possible sequences c is N choose M, which is the same as N! / (M! * (N - M)!).
This means there are a total of
M! * M! * N! / (M! * (N - M)!)
triples. But N = 2 * M, so N - M = M, and the above reduces to
M! * M! * N! / (M! * M!)
That's just N!. QED.
Implementation
To pick triples in a uniformly random way, we must pick each element of the triple in a uniformly random way. For 0 and 1, we accomplish that using a straightforward Fisher-Yates shuffle in memory. The only remaining obstacle is generating a proper sequence of zeros and ones.
It's important -- important! -- to generate only sequences with equal numbers of zeros and ones. Otherwise, you haven't chosen from among Choose(N, M) sequences with uniform probability, and your shuffle may be biased. The really obvious way to do this is to shuffle a sequence containing an equal number of zeros and ones... but the whole premise of the question is that we can't fit that many zeros and ones in memory! So we need a way to generate random sequences of zeros and ones that are constrained such that there are exactly as many zeros as ones.
To do this in a way that is probabilistically coherent, we can simulate drawing balls labeled zero or one from an urn, without replacement. Suppose we start with fifty 0 balls and fifty 1 balls. If we keep count of the number of each kind of ball in the urn, we can maintain a running probability of choosing one or the other, so that the final result isn't biased. The (suspiciously Python-like) pseudocode would be something like this:
def generate_choices(N, M):
    n0 = M
    n1 = N - M
    while n0 + n1 > 0:
        if randrange(0, n0 + n1) < n0:
            yield 0
            n0 -= 1
        else:
            yield 1
            n1 -= 1
This might not be perfect because of floating point errors, but it will be pretty close to perfect.
This last part of the algorithm is crucial. Going through the above proof exhaustively makes it clear that other ways of generating ones and zeros won't give us a proper shuffle.
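For concreteness, here is one random-merge pass in Java, assuming two already-shuffled, line-oriented chunks; the names are placeholders. Drawing from chunk 0 with probability n0 / (n0 + n1) is exactly the urn-without-replacement scheme from the pseudocode above, so exactly n0 lines come from chunk 0 and n1 from chunk 1:

import java.io.*;
import java.util.Random;

public class RandomMerge {
    // n0 and n1 are the number of lines remaining in each chunk.
    static void merge(BufferedReader chunk0, int n0,
                      BufferedReader chunk1, int n1,
                      PrintWriter out, Random rng) throws IOException {
        while (n0 + n1 > 0) {
            if (rng.nextInt(n0 + n1) < n0) {   // pick chunk 0 with probability n0/(n0+n1)
                out.println(chunk0.readLine());
                n0--;
            } else {
                out.println(chunk1.readLine());
                n1--;
            }
        }
    }
}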
Performing multiple merges in real data
There remain a few practical issues. The above argument assumes a perfectly balanced merge, and it also assumes you have only twice as much data as you have memory. Neither assumption is likely to hold.
The first turns out not to be a big problem because the above argument doesn't actually require equally sized lists. It's just that if the list sizes are different, the calculations are a little more complex. If you go through the above replacing the M for list 1 with N - M throughout, the details all line up the same way. (The pseudocode is also written in a way that works for any M greater than zero and less than N. There will then be exactly M zeros and N - M ones.)
The second means that in practice, there might be many, many chunks to merge this way. The process inherits several properties of merge sort — in particular, it requires that for K chunks, you'll have to perform roughly K / 2 merges, and then K / 4 merges, and so on, until all the data has been merged. Each batch of merges will loop over the entire dataset, and there will be roughly log2(K) batches, for a run time of O(N * log(K)). An ordinary Fisher-Yates shuffle would be strictly linear in N, and so in theory would be faster for very large K. But until K gets very, very large, the penalty may be much smaller than the disk seeking penalties.
The benefit of this approach, then, comes from smart IO management. And with SSDs it might not even be worth it — the seek penalties might not be large enough to justify the overhead of multiple merges. Paul Hankin's answer has some practical tips for thinking through the practical issues raised.
Merging all data at once
An alternative to doing multiple binary merges would be to merge all the chunks at once -- which is theoretically possible, and might lead to an O(N) algorithm. The random number generation algorithm for values in c would need to generate labels from 0 to K - 1, such that the final outputs have exactly the right number of labels for each category. (In other words, if you're merging three chunks with 10, 12, and 13 items, then the final value of c would need to have 0 ten times, 1 twelve times, and 2 thirteen times.)
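For the label-generation part specifically, the two-category urn above generalizes directly to K categories, at the cost of O(K) counters rather than strictly O(1) space; a sketch (remaining[i] starts at the size of chunk i, and the caller stops once all counts reach zero):

import java.util.Random;

public class KWayChooser {
    // Returns the chunk that supplies the next item, so that each chunk ends up
    // contributing exactly its own remaining count of items.
    static int nextLabel(int[] remaining, Random rng) {
        int total = 0;
        for (int n : remaining) total += n;    // O(K) per draw; fine when K is small
        int r = rng.nextInt(total);            // caller must not call this once total == 0
        for (int label = 0; label < remaining.length; label++) {
            if (r < remaining[label]) {
                remaining[label]--;
                return label;
            }
            r -= remaining[label];
        }
        throw new IllegalStateException("unreachable while total > 0");
    }
}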
I think there is probably an O(N) time, O(1) space algorithm that will do that, and if I can find one or work one out, I'll post it here. The result would be a truly O(N) shuffle, much like the one Paul Hankin describes towards the end of his answer.
Logically partition your database entries (e.g., alphabetically).
Create indexes based on the partitions you created.
Build a DAO to access the data based on those indexes.
As per the Sun Java implementation, during expansion, ArrayList grows to 3/2 of its current capacity, whereas for HashMap the expansion rate is double. What is the reason behind this?
As per the implementation, for HashMap, the capacity should always be a power of two. That may be a reason for HashMap's behavior. But in that case the question is: for HashMap, why should the capacity always be a power of two?
The expensive part of increasing the capacity of an ArrayList is copying the content of the backing array to a new (larger) one.
For the HashMap, it is creating a new backing array and putting all map entries into the new array. And the higher the capacity, the lower the risk of collisions. This is more expensive and explains why the expansion factor is higher. The reason for 1.5 vs. 2.0? I consider this a "best practice" or "good tradeoff".
for HashMap why the capacity should always be in power of two?
I can think of two reasons.
You can quickly determine the bucket a hashcode goes in to. You only need a bitwise AND and no expensive modulo. int bucket = hashcode & (size-1);
Let's say we have a growth factor of 1.7. If we start with a size of 11, the next size would be 18, then 31. No problem. Right? But the hashcodes of Strings in Java are calculated with a prime factor of 31. The bucket a string goes into, hashcode % 31, is then determined only by the last character of the String. Bye bye O(1) if you store folders that all end in /. If you use a size of, for example, 3^n, the distribution will not get worse if you increase n. Going from size 3 to 9, every element in bucket 2 will now go to bucket 2, 5 or 8, depending on the higher digits. It's like splitting each bucket in three pieces. So an integer growth factor would be preferred. (Of course this all depends on how you calculate hashcodes, but an arbitrary growth factor doesn't feel 'stable'.)
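A tiny demo of the String point (the strings below are arbitrary examples of my own): every term of String.hashCode() except the last character's contribution is a multiple of 31, so with 31 buckets all strings ending in '/' collide in the same bucket:

public class Mod31Demo {
    public static void main(String[] args) {
        String[] paths = {"home/", "var/log/", "usr/share/doc/"};
        for (String s : paths) {
            // floorMod handles possibly negative hash codes; all three print bucket 16.
            System.out.println(s + " -> bucket " + Math.floorMod(s.hashCode(), 31));
        }
    }
}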
The way HashMap is designed/implemented its underlying number of buckets must be a power of 2 (even if you give it a different size, it makes it a power of 2), thus it grows by a factor of two each time. An ArrayList can be any size and it can be more conservative in how it grows.
The accepted answer does not actually give an exact response to the question, but the comment from #user837703 on that answer clearly explains why HashMap grows by a power of two.
I found this article, which explains it in detail: http://coding-geek.com/how-does-a-hashmap-work-in-java/
Let me post fragment of it, which gives detailed answer to the question:
// the function that returns the index of the bucket from the rehashed hash
static int indexFor(int h, int length) {
    return h & (length-1);
}
In order to work efficiently, the size of the inner array needs to be a power of 2, let’s see why.
Imagine the array size is 17, the mask value is going to be 16 (size -1). The binary representation of 16 is 0…010000, so for any hash value H the index generated with the bitwise formula “H AND 16” is going to be either 16 or 0. This means that the array of size 17 will only be used for 2 buckets: the one at index 0 and the one at index 16, not very efficient…
But, if you now take a size that is a power of 2 like 16, the bitwise index formula is “H AND 15”. The binary representation of 15 is 0…001111 so the index formula can output values from 0 to 15 and the array of size 16 is fully used. For example:
if H = 952 , its binary representation is 0..01110111000, the associated index is 0…01000 = 8
if H = 1576 its binary representation is 0..011000101000, the associated index is 0…01000 = 8
if H = 12356146, its binary representation is 0..0101111001000101000110010, the associated index is 0…00010 = 2
if H = 59843, its binary representation is 0..01110100111000011, the associated index is 0…00011 = 3
This is why the array size is a power of two. This mechanism is transparent for the developer: if he chooses a HashMap with a size of 37, the Map will automatically choose the next power of 2 after 37 (64) for the size of its inner array.
Hashing takes advantage of distributing data evenly into buckets. The algorithm tries to prevent multiple entries in the buckets ("hash collisions"), as they will decrease performance.
Now when the capacity of a HashMap is reached, the size is extended and the existing data is re-distributed into the new buckets. If the size increase were too small, this re-allocation of space and re-distribution would happen too often.
A general rule to avoid collisions on Maps is to keep the load factor at a maximum of around 0.75.
To decrease the possibility of collisions and avoid the expensive copying process, HashMap grows at a larger rate.
Also as #Peter says, it must be a power of 2.
I can't give you a reason why this is so (you'd have to ask Sun developers), but to see how this happens take a look at source:
HashMap: Take a look at how HashMap resizes to a new size (source, line 799):
resize(2 * table.length);
ArrayList: source, line 183:
int newCapacity = (oldCapacity * 3)/2 + 1;
Update: I mistakenly linked to sources of Apache Harmony JDK - changed it to Sun's JDK.
Ok, here's my situation:
I have an Array of States, which may contain duplicates. To get rid of the duplicates, I can add them all to a Set.
However when I create the Set, it wants the initial capacity and load factor to be defined, but what should they be set to?
From googling, I have come up with:
String[] allStates = getAllStates();
Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75f);
The problem with this is that allStates can contain anywhere between 1 and 5000 states. So the Set will have a capacity of over 5000, but contain at most 50.
So alternatively, the initial capacity could be set to the maximum number of states, and the load factor to 1.
I guess my questions really are:
What should you set the initial capacity to be when you don't know how many items are to be in the Set?
Does it really matter what it gets set to when the most it could contain is 50?
Should I even be worrying about it?
Assuming that you know there won't be more than 50 states (do you mean US States?), the
Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75f);
quoted is definitely wrong. I'd suggest you go for an initial capacity of 50 / 0.75 = 67, or perhaps 68 to be on the safe side.
I also feel the need to point out that you're probably overthinking this intensely. Resizing the set's backing array twice, from 16 up to 64, isn't going to give you a noticeable performance hit unless this is right in the most performance-critical part of the program.
So the best answer is probably to use:
new HashSet<String>();
That way, you won't come back a year later and puzzle over why you chose such strange constructor arguments.
Use the constructor where you don't need to specify these values, then reasonable defaults are chosen.
First, I'm going to say that in your case you're definitely overthinking it. However, there are probably situations where one would want to get it right. So here's what I understand:
1) The number of items you can hold in your HashSet = initial capacity x load factor. So if you want to be able to hold n items, you need to do what Zarkonnen did and divide n by the load factor.
2) Under the covers, the initial capacity is rounded up to a power of two per Oracle tutorial.
3) Load factor should be no more than .80 to prevent excessive collisions, as noted by Tom Hawtin - tackline.
If you just accept the default values (initial capacity = 16, load factor = .75), you'll end up doubling your set in size 3 times. (Initial max size = 12, first increase makes capacity 32 and max size 24 (32 * .75), second increase makes capacity 64 and max size 48 (64 * .75), third increase makes capacity 128 and max size 96 (128 * .75).)
To get your max size closer to 50, but keep the set as small as possible, consider an initial capacity of 64 (a power of two) and a load factor of .79 or more. 64 * .79 = 50.56, so you can get all 50 states in there. Specifying 32 < initial capacity < 64 will result in initial capacity being rounded up to 64, so that's the same as specifying 64 up front. Specifying initial capacity <= 32 will result in a size increase. Using a load factor < .79 will also result in a size increase unless your initial capacity > 64.
So my recommendation is to specify initial capacity = 64 and load factor = .79.
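In code, following the style of the snippet in the question (note the f suffix: HashSet's constructor takes a float load factor, and getAllStates() is the method from the question):

Set<String> uniqueStates = new HashSet<String>(64, 0.79f);
Collections.addAll(uniqueStates, getAllStates());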
The safe bet is go for a size that is too small.
Because resizing is ameliorated by an exponential growth algorithm (see the stackoverflow podcast from a few weeks back), going small will never cost you that much. If you have lots of sets (lucky you), then it will matter to performance if they are oversize.
Load factor is a tricky one. I suggest leaving it at the default. My understanding: below about 0.70f you are making the array too large and therefore slower. Above 0.80f and you'll start getting too many key clashes. Presumably probing algorithms will require lower load factors than bucket algorithms.
Also note that the "initial capacity" means something slightly different than most people appear to think. It refers to the number of slots in the backing array, not the number of elements the set can hold. To get the exact capacity for a number of elements, divide by the desired load factor (and round appropriately).
Make a good guess. There is no hard rule. If you know there are likely to be, say, 10-20 states, I'd start off with that number (20).
I second Zarkonnen. Your last question is the most important one. If this happens to occur in a hotspot of your application it might be worth the effort to look at it and try to optimise, otherwise CPU cycles are cheaper than burning up your own neurons.
If you were to optimize this -- and it may be appropriate to do that -- some of your decision will depend on how many duplicates you expect the array to have.
If there are very many duplicates, you will want a smaller initial capacity. Large, sparse hash tables are bad when iterating.
If there are not expected to be very many duplicates, you will want an initial capacity such that the entire array could fit without resizing.
My guess is that you want the latter, but this is something worth considering if you pursue this.