The official Java documentation says "Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75". I want to know how we get the value 0.5, and how to prove it mathematically.
Here's a discussion of where the 0.5 probably came from.
The context is in the Java HashMap implementation, regarding the point at which a bucket should be converted ("treeified") from a linked list to a red-black tree. This should occur extremely rarely under typical operation. What's a good value to choose that gives "extremely rarely"? An implementation comment from the HashMap implementation bases its justification on a Poisson distribution:
Ideally, under random hashCodes, the frequency of
nodes in bins follows a Poisson distribution
(http://en.wikipedia.org/wiki/Poisson_distribution) with a
parameter of about 0.5 on average for the default resizing
threshold of 0.75, although with a large variance because of
resizing granularity. Ignoring variance, the expected
occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
factorial(k)).
https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/java/util/HashMap.java#L181
It concludes that a bucket with 8 or more collisions occurs with a probability of less than one in ten million, so 8 was chosen as the "treeify threshold".
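For the curious, here is a small sketch (my own, not from the JDK) that evaluates the quoted Poisson terms exp(-0.5) * pow(0.5, k) / factorial(k) and the tail probability of a bucket holding 8 or more entries:

public class PoissonTreeify {
    // exp(-lambda) * lambda^k / k!
    static double poisson(double lambda, int k) {
        double factorial = 1.0;
        for (int i = 2; i <= k; i++) factorial *= i;
        return Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
    }

    public static void main(String[] args) {
        double lambda = 0.5;                 // assumed average bucket load
        double upTo7 = 0.0;
        for (int k = 0; k <= 8; k++) {
            System.out.printf("k=%d  P(k)=%.9f%n", k, poisson(lambda, k));
            if (k <= 7) upTo7 += poisson(lambda, k);
        }
        // Tail probability of 8 or more entries in one bucket: roughly 6e-8,
        // i.e. less than one in ten million, matching the HashMap comment.
        System.out.printf("P(k >= 8) ~ %.1e%n", 1.0 - upTo7);
    }
}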
Where did the parameter of 0.5 come from? In this case, the parameter means the average load (or fullness) of a HashMap, that is, the number of mappings divided by the "capacity" (table length, or number of buckets). The question can thus be restated as: What is the average load of a HashMap? This can't be derived mathematically, because it depends on the way a program uses a HashMap, the program's input, and so forth. But we can make some reasonable assumptions and arrive at some conclusions which probably apply to typical cases.
First, assume most HashMaps use a load factor of 0.75. Anecdotally I believe this to be true, as I've rarely seen cases where code uses an explicit load factor.
Second, assume that the number of mappings in a HashMap is uniformly distributed. I'm almost positive this is not true, but let's start with this as a working assumption and revisit it later.
Third, let's set aside cases where a HashMap contains over a billion or so elements, where things like Java's array size limitations come into play.
Fourth, for simplicity, let's consider only HashMaps that are created without a preallocated capacity and are populated with some unknown number of mappings.
Recall the definition of load factor: if a HashMap's load exceeds the load factor, the HashMap will be resized larger to keep the load below the load factor.
Clearly, a HashMap in normal operation can't have a load of more than 0.75. A HashMap also won't have a load of less than 0.375, because it won't be resized until its load reaches 0.75, and resizing halves the load. So maybe the average load is midway between 0.375 and 0.75, which is 0.5625.
One could do some math to figure this out, but it was easier for me to write a program to populate HashMaps of sizes between 1 and N and compute the average load over all those cases. If N is about 0.75 of a power of two (say, 768) then the average load is indeed quite close to 0.5625. However, this varies a lot depending on one's choice of N. If a chosen N is midway between 0.75 of successive powers of two (say, 576) then the average load is only 0.507. So the average load of HashMaps with up to N mappings varies between about 0.507 and 0.5625, depending on one's choice of N. (This is the "resizing granularity" variance mentioned in the comment.)
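For what it's worth, here is a rough sketch of the kind of program described above. It models the default resizing policy (initial capacity 16, doubling once the size exceeds capacity * 0.75) rather than inspecting real HashMap internals, and averages size/capacity over maps holding 1..N mappings:

public class AverageLoad {
    static double loadFor(int n) {
        int capacity = 16;                         // default initial capacity
        while (n > capacity * 0.75) capacity *= 2; // resize until load <= 0.75
        return (double) n / capacity;
    }

    public static void main(String[] args) {
        // 768 = 0.75 * 1024; 576 is midway between 384 and 768.
        for (int N : new int[] {768, 576}) {
            double sum = 0;
            for (int n = 1; n <= N; n++) sum += loadFor(n);
            System.out.printf("N=%d  average load=%.4f%n", N, sum / N);
        }
    }
}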
Recall that we assumed that HashMap sizes are uniformly distributed. It's probably not true, but what is the actual distribution? I don't know, but I'd guess that there is exponential falloff as the sizes get larger. If so, this would skew the distribution of sizes towards smaller numbers, reducing the average load to be closer to 0.507 than 0.5625. I'd also guess that a lot of people create HashMaps with the default capacity (16) and populate them with six or fewer mappings. That would pull the average down farther.
In summary, our model is telling us that the average load is somewhere between 0.507 and 0.5625. Some educated guesses tell us that sizes are skewed smaller, so the average load is also smaller. The choice of 0.5 therefore seems reasonable.
But we really don't need a precise answer for the analysis of when to treeify HashMap buckets. The point of the analysis is to find a threshold such that conversion is extremely unlikely to occur during normal operation. If 0.4 or 0.7 were used instead of 0.5, the Poisson analysis would still indicate that the chance of having 8 collisions in a single bucket is less than one in ten million.
HashMap has two important properties: size and load factor. I went through the Java documentation and it says 0.75f is the initial load factor. But I can't find the actual use of it.
Can someone describe what are the different scenarios where we need to set load factor and what are some sample ideal values for different cases?
The documentation explains it pretty well:
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
As with all performance optimizations, it is a good idea to avoid optimizing things prematurely (i.e. without hard data on where the bottlenecks are).
The default initial capacity of the HashMap is 16 and the load factor is 0.75f (i.e. 75% of the current capacity). The load factor represents at what level the HashMap capacity should be doubled.
For example, the product of capacity and load factor is 16 * 0.75 = 12. This means that after storing the 12th key-value pair into the HashMap, its capacity becomes 32.
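As a quick sanity check (a sketch, not the HashMap source), the resize thresholds can be tabulated by multiplying each power-of-two capacity by the 0.75 load factor:

public class ThresholdTable {
    public static void main(String[] args) {
        float loadFactor = 0.75f;
        for (int capacity = 16; capacity <= 256; capacity *= 2) {
            int threshold = (int) (capacity * loadFactor);   // resize triggers above this
            System.out.printf("capacity=%d  resize after %d entries%n", capacity, threshold);
        }
        // e.g. capacity 16 is doubled to 32 once a 13th entry is added.
    }
}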
Actually, from my calculations, the "perfect" load factor is closer to log 2 (~0.693), although any load factor less than this will yield better performance. I think that .75 was probably pulled out of a hat.
Proof:
Chaining can be avoided and branch prediction exploited by predicting whether a bucket is empty or not. A bucket is probably empty if the probability of it being empty exceeds .5.
Let s represent the table size and n the number of keys added. Using the binomial distribution, the probability of a bucket being empty is:
P(0) = C(n, 0) * (1/s)^0 * (1 - 1/s)^(n - 0) = (1 - 1/s)^n
Requiring (1 - 1/s)^n > .5 and solving for n, a bucket is probably empty if there are fewer than
log(2)/log(s/(s - 1)) keys
As s approaches infinity, if the number of keys added is such that P(0) = .5, then n/s approaches log(2) rapidly:
lim (log(2)/log(s/(s - 1)))/s as s -> infinity = log(2) ~ 0.693...
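A quick numeric check of that limit (my own sketch, not part of the original derivation): for growing table sizes s, compute the load n/s at which a bucket is still "probably empty", i.e. n = log(2)/log(s/(s - 1)).

public class LogTwoLimit {
    public static void main(String[] args) {
        for (int s = 2; s <= 1 << 20; s *= 32) {
            double n = Math.log(2) / Math.log((double) s / (s - 1));
            System.out.printf("s=%8d  n/s=%.6f%n", s, n / s);  // converges towards ln(2)
        }
        System.out.printf("ln(2) = %.6f%n", Math.log(2));
    }
}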
What is load factor ?
The fraction of the capacity that may be exhausted before the HashMap increases its capacity.
Why load factor ?
The load factor is by default 0.75, so with the default initial capacity of 16 the map is resized once 12 entries have been added; keeping roughly 25% of the buckets free before each increase in capacity means that, just after the number of buckets increases, there are many new buckets for new hashcodes to point to.
Why should you keep many free buckets & what is the impact of keeping free buckets on the performance ?
If you set the load factor to, say, 1.0 then something very interesting might happen.
Say you are adding an object x to your hashmap whose hashCode is 888, and in your hashmap the bucket representing that hashcode is free, so the object x gets added to the bucket. Now say you add another object y whose hashCode is also 888; your object y will get added for sure, BUT at the end of the bucket (because the buckets are nothing but a linkedList implementation storing key, value & next). Now this has a performance impact! Since your object y is no longer present at the head of the bucket, if you perform a lookup the time taken is not going to be O(1); this time it depends on how many items are in the same bucket. This is called a hash collision, by the way, and it even happens when your load factor is less than 1.
Correlation between performance, hash collision & loading factor
Lower load factor = more free buckets = lower chance of collision = higher performance = higher space requirement.
Higher load factor = fewer free buckets = higher chance of collision = lower performance = lower space requirement.
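To see that effect, here is a hedged sketch (the key type and the 10,000-entry count are made up) where every key reports hashCode 888, so all entries collide into the same bucket and a lookup can no longer go straight to the entry:

import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // A deliberately bad key: every instance reports the same hash code.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public int hashCode() { return 888; }   // every key collides
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) map.put(new BadKey(i), i);

        long start = System.nanoTime();
        map.get(new BadKey(9_999));   // must be found by searching inside the single bucket
        System.out.println("lookup took " + (System.nanoTime() - start) + " ns");
    }
}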
From the documentation:
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased
It really depends on your particular requirements, there's no "rule of thumb" for specifying an initial load factor.
For HashMap, DEFAULT_INITIAL_CAPACITY = 16 and DEFAULT_LOAD_FACTOR = 0.75f.
This means that the maximum number of entries in the HashMap before a resize is 16 * 0.75 = 12. When the thirteenth element is added, the capacity (array size) of the HashMap is doubled!
A perfect illustration answers this question; the image (not reproduced here) is taken from:
https://javabypatel.blogspot.com/2015/10/what-is-load-factor-and-rehashing-in-hashmap.html
If the buckets get too full, then we have to look through a very long linked list, and that's kind of defeating the point. So here's an example where I have four buckets, and I have elephant and badger in my HashSet so far. This is a pretty good situation, right? Each bucket has zero or one elements. Now we put two more elements into our HashSet.
buckets elements
------- -------
0 elephant
1 otter
2 badger
3 cat
This isn't too bad either; every bucket only has one element. So if I want to know, does this contain panda? I can very quickly look at bucket number 1, it's not there, and I know it's not in our collection. If I want to know if it contains cat, I look at bucket number 3, I find cat, and I very quickly know it's in our collection. What if I add koala? Well, that's not so bad.
buckets elements
------- -------
0 elephant
1 otter -> koala
2 badger
3 cat
Maybe now, instead of looking at only one element in bucket number 1, I need to look at two. But at least I don't have to look at elephant, badger and cat. If I'm again looking for panda, it can only be in bucket number 1, and I don't have to look at anything other than otter and koala. But now I put alligator in bucket number 1, and you can maybe see where this is going: if bucket number 1 keeps getting bigger and bigger and bigger, then I'm basically having to look through all of those elements to find something that should be in bucket number 1.
buckets elements
------- -------
0 elephant
1 otter -> koala -> alligator
2 badger
3 cat
If I start adding strings to other buckets, right, the problem just gets bigger and bigger in every single bucket. How do we stop our buckets from getting too full? The solution here is that "the HashSet can automatically resize the number of buckets." The HashSet realizes that the buckets are getting too full and it's losing the advantage of the O(1) lookup for elements. It'll just create more buckets (generally twice as many as before) and then place the elements into the correct buckets.

So here's our basic HashSet implementation with separate chaining. Now I'm going to create a "self-resizing HashSet". This HashSet is going to realize that the buckets are getting too full and it needs more buckets. loadFactor is another field in our HashSet class; it represents the average number of elements per bucket above which we want to resize. loadFactor is a balance between space and time: if the buckets get too full then we'll resize, which takes time, of course, but it may save us time down the road if the buckets are a little more empty.

Let's see an example. Here's a HashSet; we've added four elements so far: elephant, dog, cat and fish.
buckets elements
------- -------
0
1
2 elephant
3 cat -> dog
4 fish
5
At this point, I've decided that the loadFactor, the threshold, the average number of elements per bucket that I'm okay with, is 0.75. The number of buckets is buckets.length, which is 6, and at this point our HashSet has four elements, so the current size is 4. We'll resize our HashSet, that is, we'll add more buckets, when the average number of elements per bucket exceeds the loadFactor; that is, when currentSize divided by buckets.length is greater than loadFactor. At this point, the average number of elements per bucket is 4 divided by 6: 4 elements, 6 buckets, that's 0.67. That's less than the threshold I set of 0.75, so we're okay; we don't need to resize. But now let's say we add woodchuck.
buckets elements
------- -------
0
1
2 elephant
3 woodchuck -> cat -> dog
4 fish
5
Woodchuck would end up in bucket number 3. At this point, the currentSize is 5, and now the average number of elements per bucket is the currentSize divided by buckets.length: 5 elements divided by 6 buckets is 0.83. And this exceeds the loadFactor, which was 0.75.

In order to address this problem, in order to make the buckets perhaps a little more empty so that operations like determining whether a bucket contains an element will be a little less complex, I want to resize my HashSet. Resizing the HashSet takes two steps. First I'll double the number of buckets: I had 6 buckets, now I'm going to have 12 buckets. Note here that the loadFactor, which I set to 0.75, stays the same. But the number of buckets has changed to 12, and the number of elements stayed the same at 5. 5 divided by 12 is around 0.42; that's well under our loadFactor, so we're okay now.

But we're not done, because some of these elements are in the wrong bucket now. For instance, elephant. Elephant was in bucket number 2 because the number of characters in elephant was 8. We have 6 buckets; 8 minus 6 is 2, and that's why it ended up in bucket number 2. But now that we have 12 buckets, 8 mod 12 is 8, so elephant does not belong in bucket number 2 anymore; elephant belongs in bucket number 8. What about woodchuck? Woodchuck was the one that started this whole problem. Woodchuck ended up in bucket number 3 because 9 mod 6 is 3. But now we do 9 mod 12: 9 mod 12 is 9, so woodchuck goes to bucket number 9. And you see the advantage of all this: now bucket number 3 only has two elements, whereas before it had 3.
So here's our code, where we had our HashSet with separate chaining that didn't do any resizing. Now, here's a new implementation where we use resizing. Most of this code is the same: we're still going to determine whether it contains the value already. If it doesn't, then we'll figure out which bucket it should go into and add it to that bucket, add it to that LinkedList. But now we increment the currentSize field. currentSize was the field that kept track of the number of elements in our HashSet. We're going to increment it and then we're going to look at the average load, the average number of elements per bucket. We'll do that division down here; we have to do a little bit of casting to make sure that we get a double. And then we'll compare that average load to the field that I set to 0.75 when I created this HashSet, which was the loadFactor. If the average load is greater than the loadFactor, that means there are too many elements per bucket on average, and I need to reinsert.

So here's our implementation of the method to reinsert all the elements. First, I'll create a local variable called oldBuckets, which refers to the buckets as they currently stand before I start resizing everything. Note I'm not creating a new array of linked lists just yet; I'm just renaming buckets as oldBuckets. Now remember buckets was a field in our class; I'm going to now create a new array of linked lists, but this will have twice as many elements as it did the first time. Now I need to actually do the reinserting: I'm going to iterate through all of the old buckets. Each element in oldBuckets is a LinkedList of strings, that is, a bucket. I'll go through that bucket and get each element in that bucket, and now I'm going to reinsert it into the newBuckets. I will get its hashCode, I will figure out which index it is, and now I get the new bucket, the new LinkedList of strings, and I'll add it to that new bucket.

So to recap: HashSets, as we've seen, are arrays of linked lists, or buckets. A self-resizing HashSet can realize, using some ratio (the loadFactor), that the buckets are getting too full, and then create more buckets and reinsert the elements.
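The lecture's code isn't reproduced in the transcript above, so here is a rough sketch, with my own class and field names, of what a self-resizing HashSet of strings with separate chaining along those lines might look like:

import java.util.LinkedList;

public class ResizingHashSet {
    private LinkedList<String>[] buckets;      // each bucket is a linked list of strings
    private int currentSize = 0;               // number of elements in the set
    private final double loadFactor = 0.75;    // average elements per bucket before resizing

    @SuppressWarnings("unchecked")
    public ResizingHashSet(int initialBuckets) {
        buckets = new LinkedList[initialBuckets];
        for (int i = 0; i < buckets.length; i++) buckets[i] = new LinkedList<>();
    }

    private int indexFor(String value, int numBuckets) {
        return Math.floorMod(value.hashCode(), numBuckets);
    }

    public boolean contains(String value) {
        return buckets[indexFor(value, buckets.length)].contains(value);
    }

    public void add(String value) {
        if (contains(value)) return;                        // already present
        buckets[indexFor(value, buckets.length)].add(value);
        currentSize++;
        double averageLoad = (double) currentSize / buckets.length;
        if (averageLoad > loadFactor) reinsertAll();        // too full: resize
    }

    @SuppressWarnings("unchecked")
    private void reinsertAll() {
        LinkedList<String>[] oldBuckets = buckets;          // keep the old table
        buckets = new LinkedList[oldBuckets.length * 2];    // twice as many buckets
        for (int i = 0; i < buckets.length; i++) buckets[i] = new LinkedList<>();
        for (LinkedList<String> bucket : oldBuckets) {
            for (String value : bucket) {
                buckets[indexFor(value, buckets.length)].add(value);  // re-place each element
            }
        }
    }
}

With new ResizingHashSet(6), adding a fifth element pushes the average load above 0.75 and triggers the doubling to 12 buckets, matching the example above.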
I would pick a table size of n * 1.5, or n + (n >> 1); this would give a load factor of .66666~ without division, which is slow on most systems, especially on portable systems where there is no division in the hardware.
I've seen some interesting claims on SO re Java hashmaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was brought up on, there must always exist a dataset that contains collisions.
In which case, the lookup would be O(n) rather than O(1).
Can someone explain whether they are O(1) and, if so, how they achieve this?
A particular feature of a HashMap is that unlike, say, balanced trees, its behavior is probabilistic. In these cases it's usually most helpful to talk about complexity in terms of the probability of a worst-case event occurring. For a hash map, that of course is the case of a collision with respect to how full the map happens to be. A collision is pretty easy to estimate.
p_collision = n / capacity
So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling. Observe that for any arbitrary, fixed constant k:
O(n) = O(k * n)
We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions.
p_collision x 2 = (n / capacity)^2
This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalize this to
p_collision x k = (n / capacity)^k
And now we can disregard some arbitrary number of collisions and end up with vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.
We talk about this by saying that the hash map has O(1) access with high probability.
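As a concrete illustration of how quickly that tail shrinks (a sketch with arbitrary example numbers for n and capacity):

public class CollisionTail {
    public static void main(String[] args) {
        double n = 100, capacity = 128;   // arbitrary example values
        double p = n / capacity;          // rough per-lookup collision probability
        for (int k = 1; k <= 8; k++) {
            // Probability of a run of k collisions falls off geometrically with k.
            System.out.printf("p^%d = %.6f%n", k, Math.pow(p, k));
        }
    }
}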
You seem to mix up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using a perfect hashing) but this is rarely relevant in practice.
Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.
In Java, how does a HashMap work?
Using hashCode to locate the corresponding bucket [inside buckets container model].
Each bucket is a LinkedList (or a Balanced Red-Black Binary Tree under some conditions starting from Java 8) of items residing in that bucket.
The items are scanned one by one, using equals for comparison.
When adding more items, the HashMap is resized (doubling the size) once a certain load percentage is reached.
So, sometimes it will have to compare against a few items, but generally, it's much closer to O(1) than O(n) / O(log n).
For practical purposes, that's all you should need to know.
Remember that O(1) does not mean that each lookup only examines a single item; it means that the average number of items checked remains constant w.r.t. the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10,000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).
So collisions don't prevent the container from having O(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
I know this is an old question, but there's actually a new answer to it.
You're right that a hash map isn't really O(1), strictly speaking, because as the number of elements gets arbitrarily large, eventually you will not be able to search in constant time (and O-notation is defined in terms of numbers that can get arbitrarily large).
But it doesn't follow that the real time complexity is O(n)--because there's no rule that says that the buckets have to be implemented as a linear list.
In fact, Java 8 implements the buckets as red-black trees once they exceed a threshold, which makes the actual time O(log n).
O(1 + n/k), where k is the number of buckets.
If the implementation sets k = n/alpha, then it is O(1 + alpha) = O(1), since alpha is a constant.
If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).
As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).
The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.
Elements inside the HashMap are stored as an array of linked lists (nodes); each linked list in the array represents a bucket for the unique hash value of one or more keys.
While adding an entry in the HashMap, the hashcode of the key is used to determine the location of the bucket in the array, something like:
location = (arraylength - 1) & keyhashcode
Here the & represents bitwise AND operator.
For example: 100 & "ABC".hashCode() = 64 (location of the bucket for the key "ABC")
During the get operation it uses the same way to determine the location of the bucket for the key. In the best case each key has a unique hashcode and results in a unique bucket for each key; in this case the get method spends time only to determine the bucket location and retrieve the value, which is constant: O(1).
In the worst case, all the keys have the same hashcode and are stored in the same bucket; this results in traversing through the entire list, which leads to O(n).
In the case of Java 8, the linked-list bucket is replaced with a red-black tree if its size grows to more than 8; this improves the worst-case search to O(log n).
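For reference, here is a small sketch of that index computation; the spread step mirrors the hash-spreading that Java 8's HashMap performs (XOR the high half of the hash code into the low half) before masking with (table length - 1). The class name is mine:

public class BucketIndex {
    // Spread the high bits of the hash code into the low bits,
    // in the spirit of Java 8's HashMap.hash(Object).
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int tableLength = 16;                          // must be a power of two
        String key = "ABC";
        int index = (tableLength - 1) & spread(key);   // mask selects the bucket
        System.out.println("bucket index for \"ABC\" with 16 buckets: " + index);
    }
}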
We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's hashmap) this is technically O(1+α) with a good hash function, where α is the table's load factor. Still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.
It's also been explained that strictly speaking it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different from the average search time. Using chaining this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α = 1.
If you're interested in theoretical ways to achieve constant time expected worst-case lookups, you can read about dynamic perfect hashing which resolves collisions recursively with another hash table!
It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.
Whether you need to grow the table when you add items or not is not relevant to the question because it is about lookup time.
This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.
If there are no collisions present in the table, you only have to do a single look-up, therefore the running time is O(1). If there are collisions present, you have to do more than one look-up, which drives down the performance towards O(n).
It depends on the algorithm you choose to avoid collisions. If your implementation uses separate chaining then the worst case scenario happens where every data element is hashed to the same value (poor choice of the hash function for example). In that case, data lookup is no different from a linear search on a linked list i.e. O(n). However, the probability of that happening is negligible and lookups best and average cases remain constant i.e. O(1).
Only in the theoretical case, when hashcodes are always different and the bucket for every hash code is also different, will O(1) hold exactly. Otherwise, it is still of constant order, i.e. as the hashmap grows, its order of search remains constant.
Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise.)
Of course the performance of the hashmap will depend based on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have a very good performance (this is not strictly O(1) in every possible case but it is in most cases).
For example, the default implementation in the Oracle JRE is to use a random number (which is stored in the object instance so that it doesn't change, though it also disables biased locking, but that's another discussion), so the chance of collisions is very low.
I am looking for a way to shuffle a large amount of data which does not fit into memory (approx. 40GB).
I have around 30 million entries, of variable length, stored in one large file. I know the starting and ending positions of each entry in that file. I need to shuffle this data, which does not fit in the RAM.
The only solution I thought of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm and then copy the entries in a new file, according to this order. Unfortunately, this solution involves a lot of seek operations, and thus, would be very slow.
Is there a better solution to shuffle a large amount of data with uniform distribution?
First get the shuffle issue out of the way. Do this by inventing a hash algorithm for your entries that produces random-like results, then do a normal external sort on the hash.
Now that you have transformed your shuffle into a sort, your problem turns into finding an efficient external sort algorithm that fits your budget and memory limits. That should now be as easy as a Google search.
A simple approach is to pick a K such that 1/K of the data fits comfortably in memory. Perhaps K=4 for your data, assuming you've got 16GB RAM. I'll assume your random number function has the form rnd(n) which generates a uniform random number from 0 to n-1.
Then:
for i = 0 .. K-1
Initialize your random number generator to a known state.
Read through the input data, generating a random number rnd(K) for each item as you go.
Retain items in memory whenever rnd(K) == i.
After you've read the input file, shuffle the retained data in memory.
Write the shuffled retained items to the output file.
This is very easy to implement, will avoid a lot of seeking, and is clearly correct.
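To make those steps concrete, here is a hedged Java sketch of that simple approach. It assumes, unlike the original question, that entries are newline-delimited; K, the seed, and the file names are placeholders:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ExternalShuffle {
    public static void main(String[] args) throws IOException {
        int K = 4;                    // chosen so that roughly 1/K of the data fits in memory
        long seed = 123456789L;       // fixed seed: every pass assigns items identically
        Path input = Paths.get("input.txt");      // hypothetical input, one entry per line
        Path output = Paths.get("shuffled.txt");

        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (int i = 0; i < K; i++) {
                Random rnd = new Random(seed);    // "initialize the RNG to a known state"
                List<String> retained = new ArrayList<>();
                try (BufferedReader in = Files.newBufferedReader(input)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (rnd.nextInt(K) == i) retained.add(line);  // retain 1/K of the items
                    }
                }
                Collections.shuffle(retained);    // shuffle this slice in memory
                for (String line : retained) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}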
An alternative is to partition the input data into K files based on the random numbers, and then go through each, shuffling in memory and writing to disk. This reduces disk IO (each item is read twice and written twice, compared to the first approach where each item is read K times and written once), but you need to be careful to buffer the IO to avoid a lot of seeking, it uses more intermediate disk, and is somewhat more difficult to implement. If you've got only 40GB of data (so K is small), then the simple approach of multiple iterations through the input data is probably best.
If you use 20ms as the time for reading or writing 1MB of data (and assuming the in-memory shuffling cost is insignificant), the simple approach will take 40*1024*(K+1)*20ms, which is about 1 hour 8 minutes (assuming K=4). The intermediate-file approach will take 40*1024*4*20ms, which is around 55 minutes, assuming you can minimize seeking. Note that SSD is approximately 20 times faster for reads and writes (even ignoring seeking), so you should expect to perform this task in well under 10 minutes using an SSD. Numbers from Latency Numbers Every Programmer Should Know.
I suggest keeping your general approach, but inverting the map before doing the actual copy. That way, you read sequentially and do scattered writes rather than the other way round.
A read has to be done when requested before the program can continue. A write can be left in a buffer, increasing the probability of accumulating more than one write to the same disk block before actually doing the write.
Premise
From what I understand, using the Fisher-Yates algorithm and the data you have about the positions of the entries, you should be able to obtain (and compute) a list of:
struct Entry {
    long long sourceStartIndex;
    long long sourceEndIndex;
    long long destinationStartIndex;
    long long destinationEndIndex;
};
Problem
From this point onward, the naive solution is to seek each entry in the source file, read it, then seek to the new position of the entry in the destination file and write it.
The problem with this approach is that it uses way too many seeks.
Solution
A better way to do it, is to reduce the number of seeks, using two huge buffers, for each of the files.
I recommend a small buffer for the source file (say 64MB) and a big one for the destination file (as big as the user can afford - say 2GB).
Initially, the destination buffer will be mapped to the first 2GB of the destination file. At this point, read the whole source file, in chunks of 64MB, in the source buffer. As you read it, copy the proper entries to the destination buffer. When you reach the end of the file, the output buffer should contain all the proper data. Write it to the destination file.
Next, map the output buffer to the next 2GB of the destination file and repeat the procedure. Continue until you have written the whole output file.
Caution
Since the entries have arbitrary sizes, it's very likely that at the beginning and ending of the buffers you will have suffixes and prefixes of entries, so you need to make sure you copy the data properly!
Estimated time costs
The execution time depends, essentially, on the size of the source file, the available RAM for the application and the reading speed of the HDD. Assuming a 40GB file, 2GB of RAM and a 200MB/s HDD read speed, the program will need to read 800GB of data (40GB * (40GB / 2GB)). Assuming the HDD is not highly fragmented, the time spent on seeks will be negligible. This means the reads alone will take about an hour! But if, luckily, the user has 8GB of RAM available for your application, the time may decrease to only 15 to 20 minutes.
I hope this will be enough for you, as I don't see any other faster way.
Although you can use external sort on a random key, as proposed by OldCurmudgeon, the random key is not necessary. You can shuffle blocks of data in memory, and then join them with a "random merge," as suggested by aldel.
It's worth specifying what "random merge" means more clearly. Given two shuffled sequences of equal size, a random merge behaves exactly as in merge sort, with the exception that the next item to be added to the merged list is chosen using a boolean value from a shuffled sequence of zeros and ones, with exactly as many zeros as ones. (In merge sort, the choice would be made using a comparison.)
Proving it
My assertion that this works isn't enough. How do we know this process gives a shuffled sequence, such that every ordering is equally possible? It's possible to give a proof sketch with a diagram and a few calculations.
First, definitions. Suppose we have N unique items, where N is an even number, and M = N / 2. The N items are given to us in two M-item sequences labeled 0 and 1 that are guaranteed to be in a random order. The process of merging them produces a sequence of N items, such that each item comes from sequence 0 or sequence 1, and the same number of items come from each sequence. It will look something like this:
0: a b c d
1: w x y z
N: a w x b y c d z
Note that although the items in 0 and 1 appear to be in order, they are just labels here, and the order doesn't mean anything. It just serves to connect the order of 0 and 1 to the order of N.
Since we can tell from the labels which sequence each item came from, we can create a "source" sequence of zeros and ones. Call that c.
c: 0 1 1 0 1 0 0 1
By the definitions above, there will always be exactly as many zeros as ones in c.
Now observe that for any given ordering of labels in N, we can reproduce a c sequence directly, because the labels preserve information about the sequence they came from. And given N and c, we can reproduce the 0 and 1 sequences. So we know there's always one path back from a sequence N to one triple (0, 1, c). In other words, we have a reverse function r defined from the set of all orderings of N labels to triples (0, 1, c) -- r(N) = (0, 1, c).
We also have a forward function f from any triple (0, 1, c) that simply re-merges 0 and 1 according to the value of c. Together, these two functions show that there is a one-to-one correspondence between outputs of r(N) and orderings of N.
But what we really want to prove is that this one-to-one correspondence is exhaustive -- that is, we want to prove that there aren't extra orderings of N that don't correspond to any triple, and that there aren't extra triples that don't correspond to any ordering of N. If we can prove that, then we can choose orderings of N in a uniformly random way by choosing triples (0, 1, c) in a uniformly random way.
We can complete this last part of the proof by counting bins. Suppose every possible triple gets a bin. Then we drop every ordering of N in the bin for the triple that r(N) gives us. If there are exactly as many bins as orderings, then we have an exhaustive one-to-one correspondence.
From combinatorics, we know that number of orderings of N unique labels is N!. We also know that the number of orderings of 0 and 1 are both M!. And we know that the number of possible sequences c is N choose M, which is the same as N! / (M! * (N - M)!).
This means there are a total of
M! * M! * N! / (M! * (N - M)!)
triples. But N = 2 * M, so N - M = M, and the above reduces to
M! * M! * N! / (M! * M!)
That's just N!. QED.
Implementation
To pick triples in a uniformly random way, we must pick each element of the triple in a uniformly random way. For 0 and 1, we accomplish that using a straightforward Fisher-Yates shuffle in memory. The only remaining obstacle is generating a proper sequence of zeros and ones.
It's important -- important! -- to generate only sequences with equal numbers of zeros and ones. Otherwise, you haven't chosen from among Choose(N, M) sequences with uniform probability, and your shuffle may be biased. The really obvious way to do this is to shuffle a sequence containing an equal number of zeros and ones... but the whole premise of the question is that we can't fit that many zeros and ones in memory! So we need a way to generate random sequences of zeros and ones that are constrained such that there are exactly as many zeros as ones.
To do this in a way that is probabilistically coherent, we can simulate drawing balls labeled zero or one from an urn, without replacement. Suppose we start with fifty 0 balls and fifty 1 balls. If we keep count of the number of each kind of ball in the urn, we can maintain a running probability of choosing one or the other, so that the final result isn't biased. The (suspiciously Python-like) pseudocode would be something like this:
from random import randrange

def generate_choices(N, M):
    n0 = M
    n1 = N - M
    while n0 + n1 > 0:
        if randrange(0, n0 + n1) < n0:
            yield 0
            n0 -= 1
        else:
            yield 1
            n1 -= 1
This might not be perfect because of floating point errors, but it will be pretty close to perfect.
This last part of the algorithm is crucial. Going through the above proof exhaustively makes it clear that other ways of generating ones and zeros won't give us a proper shuffle.
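Here is a hedged Java sketch of that random merge for two already-shuffled in-memory chunks. In the real setting the chunks would be streamed from disk; this only shows the urn-based choice logic, which is equivalent to generate_choices above:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomMerge {
    // Merge two shuffled lists by drawing, without replacement, from an "urn"
    // containing one ball per remaining element of each list.
    static <T> List<T> randomMerge(List<T> a, List<T> b, Random rnd) {
        int remainingA = a.size(), remainingB = b.size();
        int ia = 0, ib = 0;
        List<T> merged = new ArrayList<>(remainingA + remainingB);
        while (remainingA + remainingB > 0) {
            // Probability of taking from a is remainingA / (remainingA + remainingB).
            if (rnd.nextInt(remainingA + remainingB) < remainingA) {
                merged.add(a.get(ia++));
                remainingA--;
            } else {
                merged.add(b.get(ib++));
                remainingB--;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        List<Integer> a = new ArrayList<>(List.of(1, 2, 3, 4));
        List<Integer> b = new ArrayList<>(List.of(5, 6, 7, 8));
        Collections.shuffle(a, rnd);   // each chunk is shuffled in memory first
        Collections.shuffle(b, rnd);
        System.out.println(randomMerge(a, b, rnd));
    }
}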
Performing multiple merges in real data
There remain a few practical issues. The above argument assumes a perfectly balanced merge, and it also assumes you have only twice as much data as you have memory. Neither assumption is likely to hold.
The first turns out not to be a big problem because the above argument doesn't actually require equally sized lists. It's just that if the list sizes are different, the calculations are a little more complex. If you go through the above replacing the M for list 1 with N - M throughout, the details all line up the same way. (The pseudocode is also written in a way that works for any M greater than zero and less than N. There will then be exactly M zeros and N - M ones.)
The second means that in practice, there might be many, many chunks to merge this way. The process inherits several properties of merge sort — in particular, it requires that for K chunks, you'll have to perform roughly K / 2 merges, and then K / 4 merges, and so on, until all the data has been merged. Each batch of merges will loop over the entire dataset, and there will be roughly log2(K) batches, for a run time of O(N * log(K)). An ordinary Fisher-Yates shuffle would be strictly linear in N, and so in theory would be faster for very large K. But until K gets very, very large, the penalty may be much smaller than the disk seeking penalties.
The benefit of this approach, then, comes from smart IO management. And with SSDs it might not even be worth it — the seek penalties might not be large enough to justify the overhead of multiple merges. Paul Hankin's answer has some practical tips for thinking through the practical issues raised.
Merging all data at once
An alternative to doing multiple binary merges would be to merge all the chunks at once -- which is theoretically possible, and might lead to an O(N) algorithm. The random number generation algorithm for values in c would need to generate labels from 0 to K - 1, such that the final outputs have exactly the right number of labels for each category. (In other words, if you're merging three chunks with 10, 12, and 13 items, then the final value of c would need to have 0 ten times, 1 twelve times, and 2 thirteen times.)
I think there is probably an O(N) time, O(1) space algorithm that will do that, and if I can find one or work one out, I'll post it here. The result would be a truly O(N) shuffle, much like the one Paul Hankin describes towards the end of his answer.
Logically partition your database entries (e.g. alphabetically).
Create indexes based on your created partitions.
Build a DAO to sensitize based on the index.
Here is my situation. I am using two java.util.HashMap to store some frequently used data in a Java web app running on Tomcat. I know the exact number of entries into each Hashmap. The keys will be strings, and ints respectively.
My question is, what is the best way to set the initial capacity and loadfactor?
Should I set the capacity equal to the number of elements it will have and the load factor to 1.0? I would like the absolute best performance without using too much memory. I am afraid, however, that the table would not fill optimally. With a table of the exact size needed, won't there be key collisions, causing a (usually short) scan to find the correct element?
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys, wouldn't that mean that keys 5, 10, 15 would hit the same bucket and then cause a seek to fill the buckets next to them? Would a larger initial capacity increase performance?
Also, if there is a better datastructure than a hashmap for this, I am completely open to that as well.
In the absence of a perfect hashing function for your data, and assuming that this is really not a micro-optimization of something that really doesn't matter, I would try the following:
Assume the default load factor (.75) used by HashMap is a good value in most situations. That being the case, you can use it, and set the initial capacity of your HashMap based on your own knowledge of how many items it will hold: set it so that initial capacity x .75 ≥ number of items (i.e. initial capacity = number of items / .75, rounded up).
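A small sketch of that sizing rule (the numbers are only illustrative; HashMap will still round the capacity up to a power of two internally):

import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        int expectedItems = 1000;                                   // known in advance
        // Size so that expectedItems <= initialCapacity * 0.75, avoiding rehashes.
        int initialCapacity = (int) Math.ceil(expectedItems / 0.75);
        Map<String, Integer> map = new HashMap<>(initialCapacity);  // default load factor 0.75
        System.out.println("initial capacity passed to HashMap: " + initialCapacity);
    }
}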
If it were a larger map, in a situation where high-speed lookup was really critical, I would suggest using some sort of trie rather than a hash map. For long strings, in large maps, you can save space, and some time, by using a more string-oriented data structure, such as a trie.
Assuming that your hash function is "good", the best thing to do is to set the initial size to the expected number of elements, assuming that you can get a good estimate cheaply. It is a good idea to do this because when a HashMap resizes, every entry in the table has to be redistributed (rehashed) into the new table.
Leave the load factor at 0.75. The value of 0.75 has been chosen empirically as a good compromise between hash lookup performance and space usage for the primary hash array. As you push the load factor up, the average lookup time will increase significantly.
If you want to dig into the mathematics of hash table behaviour: Donald Knuth (1998). The Art of Computer Programming'. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0.
I find it best not to fiddle around with default settings unless I really really need to.
Hotspot does a great job of doing optimizations for you.
In any case; I would use a profiler (Say Netbeans Profiler) to measure the problem first.
We routinely store maps with tens of thousands of elements, and if you have a good equals and hashCode implementation (and String and Integer do!) this will be better than any load changes you may make.
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys
It's not. From HashMap.java:
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
I'm not even going to pretend I understand that, but it looks like that's designed to handle just that situation.
Note also that the number of buckets is also always a power of 2, no matter what size you ask for.
Entries are allocated to buckets in a random-like way. So even if you have as many buckets as entries, some of the buckets will have collisions.
If you have more buckets, you'll have fewer collisions. However, more buckets means the table is spread out more in memory and is therefore slower. Generally a load factor in the range 0.7-0.8 is roughly optimal, so it is probably not worth changing.
As ever, it's probably worth profiling before you get hung up on microtuning these things.