I need a hashtable that doesn't change its size, because I know at the start that its size should be N, and that shouldn't change while the program runs. So should I set the load factor to 1, meaning "don't increase the size until it grows to N+1", which I know will never occur?
To be more specific, I want this: when it reaches N, it shouldn't grow, but if an (N+1)-th entry ever arrived, then increase the size. Is this the right way to set it?
You probably want to use java.util.HashMap instead of Hashtable, unless you need the synchronized access. Either provides a constructor for you to set an initial capacity.
The load factor is the multiplier applied to the capacity to get the threshold number of items at which the table is rehashed.
The simplest answer is yes. However, to explain a little bit:
The threshold for rehashing is calculated like this
threshold = (int)(initialCapacity * loadFactor);
And in the put method, a rehash is triggered by the following condition:
if (count >= threshold)
This is more or less true for HashMap as well, should you decide to use it.
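For example, a minimal sketch of that setup (n is an example value; the exact rehash trigger differs slightly between JDK versions, so treat this as an illustration rather than a guarantee):

import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

public class FixedSizeTable {
    public static void main(String[] args) {
        int n = 1000; // example: the known, fixed number of entries

        // threshold = (int)(initialCapacity * loadFactor) = n, so a rehash only
        // happens once a put sees count >= n, i.e. when entry n+1 arrives.
        Hashtable<String, String> table = new Hashtable<>(n, 1.0f);

        // The unsynchronized alternative with the same constructor signature:
        Map<String, String> map = new HashMap<>(n, 1.0f);
    }
}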
I am learning about Java off-heap caches and I use the OHC cache. I was reading the OHC source code and it contains some code whose purpose I don't understand. I hope someone can explain it to me, thanks.
int cpus = Runtime.getRuntime().availableProcessors(); // my CPU = 4
segmentCount = roundUpToPowerOf2(cpus * 2, 1 << 30);
capacity = Math.min(cpus * 16, 64) * 1024 * 1024;
static int roundUpToPowerOf2(int number, int max) {
return number >= max ? max : (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
}
To minimize lock contention, OHC splits the entire cache into 'segments' so that only if two entries hash to the same segment must one operation wait for the other. This is something like table-level locking in a relational database.
The meaning of cpus is clear enough.
The default segmentCount is the smallest power of 2 that is at least twice the CPU count. For the logic of doubling the CPU count for throughput optimization, see for example https://stackoverflow.com/a/4771384.
capacity is the total storable data size in the cache. The default is 16MB per core. This is probably designed to correspond with CPU cache sizes, though presumably a user would have an idea of their application's actual capacity needs and would very likely be configuring this value and not using the default anyway.
The actual roundUpToPowerOf2 can be explained as follows:
Do not go above max, nor below 1. (I suppose it's up to the caller to ensure that max is itself a power of two, or it's acceptable here if it isn't.) In between: to get a power of two, we want an int consisting of a single one bit (or all zeros). Integer#highestOneBit returns such a number, whose set bit is the leftmost set bit of its argument. So we need to pass it a number whose leftmost set bit is:
the same as number's, if number is already a power of two, or
one position to the left of number's leftmost one bit.
Calculating number - 1 before left shifting handles the first case: if number is already a power of two, left shifting it as-is would give the next power of two, which isn't what we want. For the second case, the number (or its value minus 1) left shifted turns on the next higher bit, and Integer#highestOneBit then effectively blanks out all of the bits to its right.
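If it helps to see it concretely, here is a small throwaway demo that copies the method above and prints a few inputs (the expected outputs are in the comments):

public class RoundUpDemo {
    // Copied from the OHC snippet quoted in the question.
    static int roundUpToPowerOf2(int number, int max) {
        return number >= max ? max : (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
    }

    public static void main(String[] args) {
        int max = 1 << 30;
        System.out.println(roundUpToPowerOf2(1, max));   // 1  (lower bound)
        System.out.println(roundUpToPowerOf2(7, max));   // 8  (rounded up to the next power of two)
        System.out.println(roundUpToPowerOf2(8, max));   // 8  (already a power of two, kept as-is)
        System.out.println(roundUpToPowerOf2(9, max));   // 16
        // With cpus = 4 as in the question: segmentCount = roundUpToPowerOf2(4 * 2, 1 << 30) = 8.
        System.out.println(roundUpToPowerOf2(4 * 2, max)); // 8
        // Anything at or above max is clamped to max.
        System.out.println(roundUpToPowerOf2(Integer.MAX_VALUE, max)); // 1073741824
    }
}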
In my program, key-value pairs are frequently added to a Map until 1G of pairs have been added. Map resizing slows down the process. How can I set the minimum Map size to, for example, 1000000007 (which is a prime)?
The constructor of a HashMap takes the initial capacity of the map (and the load factor, if desired).
Map<K,V> map = new HashMap<>(1_000_000_007);
How can I set minimum Map size to, for example 1000000007 (which is a prime)?
Using the HashMap(int) or HashMap(int, float) constructor. The int parameter is the capacity.
HashMap should have a size that is prime to minimize clustering.
Past and current implementations of the HashMap constructor all choose a capacity that is the smallest power of 2 (up to 2^30) greater than or equal to the supplied capacity. So using a prime number has no effect.
Will the constructor prevent map from resizing down?
HashMaps don't resize down.
(Note that size and capacity are different things. The size() method returns the number of entries currently in the Map. You can't "set" the size.)
A couple of things you should note. The number of buckets in a HashMap is a power of 2 (though that might not be the case in future versions), and the largest possible capacity is 2^30. The load factor determines the size at which the Map must grow. Typically this is 0.75.
If you set the capacity to the expected size, it will:
round up to the next power of 2;
possibly still resize when capacity * 0.75 is reached;
be limited to 2^30 anyway, as that is the largest power of 2 you can have for the size of an array.
Will the constructor prevent map from resizing down?
The only way to do this is to copy all the elements into a new Map. This is not done automatically.
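A sketch of that manual shrink, with made-up names and values:

import java.util.HashMap;
import java.util.Map;

public class ShrinkByCopy {
    public static void main(String[] args) {
        // Hypothetical: a map that was sized for far more entries than it now holds.
        Map<String, Integer> bigMap = new HashMap<>(1 << 20);
        bigMap.put("survivor", 42); // pretend millions of entries were added and later removed

        // HashMap never shrinks its bucket array, so "resizing down" means copying
        // the remaining entries into a fresh map sized for the current contents.
        // Dividing by the default load factor (0.75) keeps the copy from resizing again.
        int targetCapacity = Math.max((int) Math.ceil(bigMap.size() / 0.75), 16);
        Map<String, Integer> compact = new HashMap<>(targetCapacity);
        compact.putAll(bigMap);
        bigMap = compact; // the oversized backing array is now garbage
    }
}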
As per the Sun Java implementation, during expansion an ArrayList grows to 3/2 of its current capacity, whereas for HashMap the expansion rate is double. What is the reason behind this?
As per the implementation, for HashMap the capacity should always be a power of two. That may be a reason for HashMap's behaviour. But in that case the question is: why should a HashMap's capacity always be a power of two?
The expensive part of increasing the capacity of an ArrayList is copying the content of the backing array to a new (larger) one.
For the HashMap, it is creating a new backing array and putting all map entries into the new array. And the higher the capacity, the lower the risk of collisions. This is more expensive and explains why the expansion factor is higher. The reason for 1.5 vs. 2.0? I consider this "best practice" or a "good tradeoff".
for HashMap why the capacity should always be in power of two?
I can think of two reasons.
You can quickly determine the bucket a hashcode goes in to. You only need a bitwise AND and no expensive modulo. int bucket = hashcode & (size-1);
Let's say we have a growth factor of 1.7. If we start with a size of 11, the next size would be 18, then 31. No problem, right? But the hash codes of Strings in Java are calculated with a prime factor of 31. The bucket a string goes into, hashcode % 31, is then determined only by the last character of the String. Bye-bye O(1) if you store folders that all end in /. If you use a size of, for example, 3^n, the distribution will not get worse when you increase n. Going from size 3 to 9, every element in bucket 2 will now go to bucket 2, 5 or 8, depending on the higher base-3 digit. It's like splitting each bucket in three pieces. So an integer growth factor would be preferred. (Of course this all depends on how you calculate hash codes, but an arbitrary growth factor doesn't feel 'stable'.)
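Here is a toy illustration of that String point, assuming a hypothetical table that used hashCode() % 31 directly as the bucket index (a real HashMap masks with a power of two instead):

public class Mod31Demo {
    public static void main(String[] args) {
        // String.hashCode() is s[0]*31^(n-1) + ... + s[n-1], so modulo 31 every term
        // except the last character vanishes (for short strings that don't overflow int):
        // the "bucket" depends only on the final character.
        String[] paths = { "a/", "bc/", "tmp/", "xyz/" };
        for (String p : paths) {
            System.out.println(p + " -> bucket " + Math.floorMod(p.hashCode(), 31));
        }
        // All four land in bucket 16 ('/' is 47, and 47 % 31 = 16),
        // which is exactly the clustering described above.
    }
}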
The way HashMap is designed/implemented, its underlying number of buckets must be a power of 2 (even if you give it a different capacity, it rounds it up to a power of 2), so it grows by a factor of two each time. An ArrayList can be any size and can be more conservative in how it grows.
The accepted answer does not actually give an exact response to the question, but the comment from @user837703 on that answer clearly explains why HashMap grows by powers of two.
I found this article, which explains it in detail http://coding-geek.com/how-does-a-hashmap-work-in-java/
Let me post fragment of it, which gives detailed answer to the question:
// the function that returns the index of the bucket from the rehashed hash
static int indexFor(int h, int length) {
return h & (length-1);
}
In order to work efficiently, the size of the inner array needs to be a power of 2, let’s see why.
Imagine the array size is 17, the mask value is going to be 16 (size -1). The binary representation of 16 is 0…010000, so for any hash value H the index generated with the bitwise formula “H AND 16” is going to be either 16 or 0. This means that the array of size 17 will only be used for 2 buckets: the one at index 0 and the one at index 16, not very efficient…
But, if you now take a size that is a power of 2 like 16, the bitwise index formula is “H AND 15”. The binary representation of 15 is 0…001111 so the index formula can output values from 0 to 15 and the array of size 16 is fully used. For example:
if H = 952 , its binary representation is 0..01110111000, the associated index is 0…01000 = 8
if H = 1576 its binary representation is 0..011000101000, the associated index is 0…01000 = 8
if H = 12356146, its binary representation is 0..0101111001000101000110010, the associated index is 0…00010 = 2
if H = 59843, its binary representation is 0..01110100111000011, the associated index is 0…00011 = 3
This is why the array size is a power of two. This mechanism is transparent for the developer: if he chooses a HashMap with a size of 37, the Map will automatically choose the next power of 2 after 37 (64) for the size of its inner array.
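To sanity-check those examples, here is a small throwaway test of the quoted masking formula (reimplemented locally; it is not part of HashMap's public API):

public class IndexForDemo {
    // Same masking formula as the quoted HashMap source.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16; // a power-of-two table size; the mask is 15 = 0b1111
        System.out.println(indexFor(952, length));      // 8
        System.out.println(indexFor(1576, length));     // 8
        System.out.println(indexFor(12356146, length)); // 2
        System.out.println(indexFor(59843, length));    // 3

        // With a non-power-of-two size such as 17, the mask is 16 = 0b10000,
        // so only buckets 0 and 16 are ever reachable:
        System.out.println(indexFor(952, 17));  // 16
        System.out.println(indexFor(1576, 17)); // 0
    }
}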
Hashing takes advantage of distributing data evenly into buckets. The algorithm tries to prevent multiple entries in the same bucket ("hash collisions"), as they decrease performance.
Now when the capacity of a HashMap is reached, the table is extended and the existing data is re-distributed into the new buckets. If the size increase were too small, this re-allocation of space and re-distribution would happen too often.
A general rule to avoid collisions in Maps is to keep the load factor at a maximum of around 0.75.
To decrease the possibility of collisions and avoid the expensive copying process, HashMap grows at a larger rate.
Also, as @Peter says, it must be a power of 2.
I can't give you a reason why this is so (you'd have to ask Sun developers), but to see how this happens take a look at source:
HashMap: take a look at how HashMap resizes to its new size (source, line 799):
resize(2 * table.length);
ArrayList: source, line 183:
int newCapacity = (oldCapacity * 3)/2 + 1;
Update: I mistakenly linked to sources of Apache Harmony JDK - changed it to Sun's JDK.
Here is my situation. I am using two java.util.HashMaps to store some frequently used data in a Java web app running on Tomcat. I know the exact number of entries that go into each HashMap. The keys will be Strings and ints, respectively.
My question is, what is the best way to set the initial capacity and load factor?
Should I set the capacity equal to the number of elements it will have and the load factor to 1.0? I would like the absolute best performance without using too much memory. I am afraid, however, that the table would not fill optimally. With a table of exactly the size needed, won't there be key collisions, causing a (usually short) scan to find the correct element?
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys, wouldn't that mean that keys 5, 10, 15 would hit the same bucket and then cause a seek to fill the buckets next to them? Would a larger initial capacity increase performance?
Also, if there is a better datastructure than a hashmap for this, I am completely open to that as well.
In the absence of a perfect hashing function for your data, and assuming that this is really not a micro-optimization of something that really doesn't matter, I would try the following:
Assume the default load factor (.75) used by HashMap is a good value in most situations. That being the case, you can use it and set the initial capacity of your HashMap based on your own knowledge of how many items it will hold: set it so that initial capacity × 0.75 = number of items (rounding up).
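As a sketch, with expectedItems standing in for the entry count you already know (the names here are placeholders):

import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        int expectedItems = 10_000; // hypothetical: the count you know in advance

        // Choose initialCapacity so that initialCapacity * 0.75 >= expectedItems,
        // i.e. the map should not need to resize while being filled. The JDK still
        // rounds the capacity up to a power of two internally.
        int initialCapacity = (int) Math.ceil(expectedItems / 0.75);
        Map<String, String> lookup = new HashMap<>(initialCapacity);
        System.out.println("requested initial capacity: " + initialCapacity);
    }
}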
If it were a larger map, in a situation where high-speed lookup was really critical, I would suggest using some sort of trie rather than a hash map. For long strings, in large maps, you can save space, and some time, by using a more string-oriented data structure, such as a trie.
Assuming that your hash function is "good", the best thing to do is to set the initial size to the expected number of elements, assuming that you can get a good estimate cheaply. It is a good idea to do this because when a HashMap resizes it has to recalculate the hash values for every key in the table.
Leave the load factor at 0.75. The value of 0.75 has been chosen empirically as a good compromise between hash lookup performance and space usage for the primary hash array. As you push the load factor up, the average lookup time will increase significantly.
If you want to dig into the mathematics of hash table behaviour: Donald Knuth (1998). The Art of Computer Programming, Vol. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0.
I find it best not to fiddle around with default settings unless I really really need to.
Hotspot does a great job of doing optimizations for you.
In any case, I would use a profiler (say, the NetBeans Profiler) to measure the problem first.
We routinely store maps with tens of thousands of elements, and if you have a good equals and hashCode implementation (and Strings and Integers do!) this will be better than any load-factor changes you may make.
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys
It's not. From HashMap.java:
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
I'm not even going to pretend I understand that, but it looks like that's designed to handle just that situation.
Note also that the number of buckets is also always a power of 2, no matter what size you ask for.
Entries are allocated to buckets in a random-like way. So even if you have as many buckets as entries, some of the buckets will have collisions.
If you have more buckets, you'll have fewer collisions. However, more buckets means spreading out in memory and therefore being slower. Generally a load factor in the range 0.7-0.8 is roughly optimal, so it is probably not worth changing.
As ever, it's probably worth profiling before you get hung up on microtuning these things.
Ok, here's my situation:
I have an Array of States, which may contain duplicates. To get rid of the duplicates, I can add them all to a Set.
However when I create the Set, it wants the initial capacity and load factor to be defined, but what should they be set to?
From googling, I have come up with:
String[] allStates = getAllStates();
Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75);
The problem with this is that allStates can contain anywhere between 1 and 5000 states. So the Set will have a capacity of over 5000, but contain at most 50.
So alternatively, the initial capacity of the Set could be set to the maximum number of states, and the load factor to 1.
I guess my questions really are:
What should you set the initial capacity to be when you don't know how many items are to be in the Set?
Does it really matter what it gets set to when the most it could contain is 50?
Should I even be worrying about it?
Assuming that you know there won't be more than 50 states (do you mean US States?), the
Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75);
quoted is definitely wrong. I'd suggest you go for an initial capacity of 50 / 0.75 = 67, or perhaps 68 to be on the safe side.
I also feel the need to point out that you're probably overthinking this intensely. Resizing the set twice, from 16 up to 64, isn't going to give you a noticeable performance hit unless this is right in the most performance-critical part of the program.
So the best answer is probably to use:
new HashSet<String>();
That way, you won't come back a year later and puzzle over why you chose such strange constructor arguments.
Use the constructor where you don't need to specify these values, then reasonable defaults are chosen.
First, I'm going to say that in your case you're definitely overthinking it. However, there are probably situations where one would want to get it right. So here's what I understand:
1) The number of items you can hold in your HashSet = initial capacity x load factor. So if you want to be able to hold n items, you need to do what Zarkonnen did and divide n by the load factor.
2) Under the covers, the initial capacity is rounded up to a power of two per Oracle tutorial.
3) Load factor should be no more than .80 to prevent excessive collisions, as noted by Tom Hawtin - tackline.
If you just accept the default values (initial capacity = 16, load factor = .75), you'll end up doubling your set in size 3 times. (Initial max size = 12, first increase makes capacity 32 and max size 24 (32 * .75), second increase makes capacity 64 and max size 48 (64 * .75), third increase makes capacity 128 and max size 96 (128 * .75).)
To get your max size closer to 50, but keep the set as small as possible, consider an initial capacity of 64 (a power of two) and a load factor of .79 or more. 64 * .79 = 50.56, so you can get all 50 states in there. Specifying 32 < initial capacity < 64 will result in initial capacity being rounded up to 64, so that's the same as specifying 64 up front. Specifying initial capacity <= 32 will result in a size increase. Using a load factor < .79 will also result in a size increase unless your initial capacity > 64.
So my recommendation is to specify initial capacity = 64 and load factor = .79.
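In code, that recommendation looks roughly like this (state names trimmed for brevity):

import java.util.HashSet;
import java.util.Set;

public class StatesSet {
    public static void main(String[] args) {
        // 64 is already a power of two, so it is used as-is, and
        // 64 * 0.79 = 50.56, so 50 states fit without triggering a resize.
        Set<String> uniqueStates = new HashSet<>(64, 0.79f);
        uniqueStates.add("Alabama");
        uniqueStates.add("Alaska");
        // ... remaining states ...
        System.out.println(uniqueStates.size());
    }
}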
The safe bet is to go for a size that is too small.
Because resizing is ameliorated by an exponential growth algorithm (see the Stack Overflow podcast from a few weeks back), going small will never cost you that much. If you have lots of sets (lucky you), then it will matter to performance if they are oversized.
Load factor is a tricky one. I suggest leaving it at the default. My understanding: below about 0.70f you are making the array too large and therefore slower. Above 0.80f and you'll start getting too many key clashes. Presumably probing algorithms will require lower load factors than bucket algorithms.
Also note that "initial capacity" means something slightly different than most people appear to think. It refers to the number of slots in the backing array, not the number of elements the set can hold before resizing. To get the capacity needed for a given number of elements, divide by the desired load factor (and round up appropriately).
Make a good guess. There is no hard rule. If you know there are likely to be, say, 10-20 states, I'd start off with that number (20).
I second Zarkonnen. Your last question is the most important one. If this happens to occur in a hotspot of your application it might be worth the effort to look at it and try to optimise, otherwise CPU cycles are cheaper than burning up your own neurons.
If you were to optimize this -- and it may be appropriate to do that -- some of your decision will depend on how many duplicates you expect the array to have.
If there are very many duplicates, you will want a smaller initial capacity. Large, sparse hash tables are bad when iterating.
If there are not expected to be very many duplicates, you will want an initial capacity such that the entire array could fit without resizing.
My guess is that you want the latter, but this is something worth considering if you pursue this.
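A rough sketch of the two choices, with made-up example data:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DedupStates {
    public static void main(String[] args) {
        String[] allStates = { "NY", "NY", "CA", "TX", "CA" }; // example input

        // Few duplicates expected: size the set so the whole array fits without resizing.
        Set<String> roomy = new HashSet<>((int) Math.ceil(allStates.length / 0.75));
        roomy.addAll(Arrays.asList(allStates));

        // Many duplicates expected: the default capacity (16) is fine; the set only
        // grows toward the (much smaller) number of unique values.
        Set<String> compact = new HashSet<>();
        compact.addAll(Arrays.asList(allStates));

        System.out.println(roomy.size() + " unique states");
    }
}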