Lower / upper load factor in hash tables - java

I am to write a chained hash set class in java.
I understand the load factor is M/capacity where M is the number of elements currently in the table and capacity is the size of the table.
But how does the load factor help me determine if I should resize the table and rehash or not?
Also I couldn't find anywhere how to calculate the lower / upper load factors. Are they even needed?
I hope this is enough information,
Thank you!!

A single loadFactor used to configure standard Java hashes (and in a number of hash APIs in other languages) is a simplification.
Conceptually, it is reasonable to distinguish:
Target load: the default memory footprint vs. performance tradeoff configuration. When you build a hash of known size, you choose the capacity so that size/capacity is as close to the target load as possible.
Max load: you want the hash to never exceed this load; you trigger a resize when the hash reaches it (see the sketch after this answer).
Grow factor: the default configuration of how much to enlarge the hash on resize. If the capacity is a power of 2, the grow factor can only be 2 or 4.
Min load: you want the hash load to never drop below this, except perhaps when you remove elements or clear the hash. If the capacity is a power of 2, the min load can't be greater than 0.5. Additionally, the max load / min load ratio should be greater than or equal to the grow factor.
All of the above concerns chained hashing; for open-addressing hashes with tombstones, things get even more complicated.
In java.util.HashMap, loadFactor plays the roles of target load and max load simultaneously. The grow factor is 2 and the min load is 0.0.
For a chained hash, a non-power-of-2 capacity is overkill unless you need extremely precise control over memory usage (you don't, trust me) or a capacity between 2^30 and 2^31 - 1 (because you can't create an array of size 2^31 in Java; that is Integer.MAX_VALUE + 1).
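To make the resize decision concrete, here is a minimal sketch (my own illustration, not code from any standard library) of a chained hash set that resizes once a max load is exceeded; the names MAX_LOAD, GROW_FACTOR and indexFor are made up for this example:

import java.util.LinkedList;

public class ChainedHashSet<E> {
    private static final double MAX_LOAD = 0.75; // resize when size/capacity would exceed this
    private static final int GROW_FACTOR = 2;    // capacity stays a power of 2

    private LinkedList<E>[] buckets;
    private int size;

    @SuppressWarnings("unchecked")
    public ChainedHashSet(int capacity) {
        buckets = (LinkedList<E>[]) new LinkedList[capacity];
    }

    public boolean add(E e) {
        if ((double) (size + 1) / buckets.length > MAX_LOAD) {
            resize(buckets.length * GROW_FACTOR);        // grow and rehash before inserting
        }
        int i = indexFor(e, buckets.length);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<>();
        }
        if (buckets[i].contains(e)) {
            return false;                                // already present: set semantics
        }
        buckets[i].add(e);
        size++;
        return true;
    }

    @SuppressWarnings("unchecked")
    private void resize(int newCapacity) {
        LinkedList<E>[] old = buckets;
        buckets = (LinkedList<E>[]) new LinkedList[newCapacity];
        for (LinkedList<E> chain : old) {                // rehash every element into the new table
            if (chain == null) continue;
            for (E e : chain) {
                int i = indexFor(e, newCapacity);
                if (buckets[i] == null) buckets[i] = new LinkedList<>();
                buckets[i].add(e);
            }
        }
    }

    private int indexFor(E e, int capacity) {
        return (e.hashCode() & 0x7fffffff) % capacity;   // non-negative bucket index
    }
}

A lower-bound (min load) check for shrinking would mirror the add() check, comparing size/capacity against a MIN_LOAD constant after removals; as noted above, it is optional.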

It works the other way around: it's not that the load factor helps you; you explicitly decide the load factor based on your performance tests, to avoid wasting time rehashing while still getting acceptable retrieval and iteration performance.

Related

What Happens In a HashMap If we give 2^30 initial size and load factor= 1?

I have searched a lot but couldn't find an answer to the following:
How does the HashMap resize work if it reaches the maximum capacity in this case?
Does it throw some exception because it can't increase its size beyond the maximum size (1 << 30, i.e. 2^30)?
Looking at the source code, it just stops resizing itself beyond that maximum size.
Meaning that (if you can even reach that point without running out of memory first) you don't get any additional buckets. You couldn't get many more anyway, as hashCode returns an int, and with 2^30 buckets you have already pretty much maxed out that range, too. Arrays in Java cannot grow beyond being int-indexed, either (and the HashMap buckets are stored in an array).
Assuming again you have enough memory, you can continue adding more elements, they will just "collide" into the same buckets.
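As a hedged illustration of that behaviour (the constant name MAXIMUM_CAPACITY mirrors the one in OpenJDK's HashMap, but the code below is a paraphrase, not the JDK source), the capping logic boils down to pushing the resize threshold to Integer.MAX_VALUE once the table can't grow any further:

public class ResizeCapDemo {
    // Mirrors OpenJDK's HashMap.MAXIMUM_CAPACITY; the rest is illustrative.
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Returns the next resize threshold for a table of oldCap buckets.
    static int nextThreshold(int oldCap, float loadFactor) {
        if (oldCap >= MAXIMUM_CAPACITY) {
            return Integer.MAX_VALUE;      // no further resize will ever trigger
        }
        int newCap = oldCap << 1;          // otherwise, double the table
        return (int) (newCap * loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(nextThreshold(1 << 29, 1.0f)); // 1073741824 (still grows)
        System.out.println(nextThreshold(1 << 30, 1.0f)); // 2147483647 (capped)
    }
}

Once that threshold is Integer.MAX_VALUE, further puts simply chain (or treeify) within the existing 2^30 buckets.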

Why would a higher load factor in HashMap decrease the space overhead

From the java documentation: "As a general rule, the default load factor (.75) offers a good trade off between time and space costs. Higher values decrease the space overhead but increase the lookup cost"
Why does it decrease the space overhead? Aren't the extra nodes in the buckets equivalent in size to the extra array space?
(In the end the number of entries will be the same!)
The load factor controls how full the map can become before its capacity is doubled.
Let's assume you have an ideal hash function with proper dispersal among buckets. Say you have a map with capacity 100 and a load factor of 0.75. This means that once you have added 75 entries, the allocated capacity is doubled, i.e. becomes 200. So for 75 entries you have allocated 200 buckets; the overhead, i.e. wasted space, is 125.
Now, say we have another map with capacity 100 and a load factor of 0.5. This means that by the time the map holds 50 entries, its capacity is doubled. So for 50 entries the capacity is now 200; the overhead is 150.
With a higher load factor the opposite happens, i.e. you have less wasted space.
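If it helps, the arithmetic above can be played with directly. This is purely the simplified doubling model from this answer (real HashMap capacities are powers of 2, so the actual numbers differ), with made-up class and method names:

public class LoadFactorOverhead {
    // Prints the wasted buckets right after the doubling that loadFactor triggers.
    static void overheadAtResize(int capacity, double loadFactor) {
        int entriesAtResize = (int) (capacity * loadFactor); // entries that trigger the doubling
        int newCapacity = capacity * 2;
        System.out.printf("loadFactor=%.2f: %d entries -> capacity %d, overhead %d%n",
                loadFactor, entriesAtResize, newCapacity, newCapacity - entriesAtResize);
    }

    public static void main(String[] args) {
        overheadAtResize(100, 0.75); // 75 entries -> capacity 200, overhead 125
        overheadAtResize(100, 0.50); // 50 entries -> capacity 200, overhead 150
    }
}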

Insert/Delete in O(1) time in HashMaps with millions of objects (with distinct keys)?

I know that insert/delete works in O(1) time with Java HashMaps.
But is it still the fastest data structure if I have over a million objects (with distinct keys - i.e. each object has a unique key) in my HashMap?
TL;DR - profile your code!
The average performance of HashMap insertion and deletion scales as O(1) (assuming you have a sound hashCode() method on the keys [1]) until you start running into 2nd-order memory effects:
The larger the heap is, the longer it takes to garbage collect. Generally, the factors that impact most are the number and size of non-garbage objects. A big enough HashMap will do that ...
Your hardware has a limited amount of physical memory. If your JVM's memory demand grows beyond that, the host OS will "swap" memory pages between RAM and disk. A big enough HashMap will do that ... if your heap size is bigger than the amount of physical RAM available to the JVM process.
There are memory effects due to the sizes of your processors' memory caches and TLBs. Basically, if the processors' demand for reading and writing memory is too great, the memory system becomes the bottleneck. These effects can be exacerbated by a large heap and highly non-localized access patterns. (And running the GC!)
There is also a limit of about 2^31 on the size of a HashMap's primary hash array. So if you have more than about 2^31 / 0.75 entries, the performance of the current HashMap implementation is theoretically O(N). However, we are talking billions of entries, and the 2nd-order memory effects will be impacting performance well before then.
[1] If your keys have a poor hashCode() function, then you may find that a significant proportion of the keys hash to the same code. If that happens, lookup, insert and delete performance for those keys will be either O(logN) or O(N), depending on the key's type and your Java version. In this case, N is the number of keys in the table with the same hashcode as the one you are looking up, etc.
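To see that degradation for yourself, here is a small, deliberately contrived demo (the BadKey class is invented for illustration); because BadKey is Comparable, Java 8+ can treeify the single overfull bin, giving O(logN) rather than O(N) lookups:

import java.util.HashMap;
import java.util.Map;

public class BadHashDemo {
    static final class BadKey implements Comparable<BadKey> {
        final int id;
        BadKey(int id) { this.id = id; }

        @Override public int hashCode() { return 42; }   // every key collides
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int compareTo(BadKey other) {
            return Integer.compare(id, other.id);
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        long start = System.nanoTime();
        for (int i = 0; i < 100_000; i++) {
            map.put(new BadKey(i), i);                    // every insert probes the same bin
        }
        System.out.printf("100k colliding inserts took %d ms%n",
                (System.nanoTime() - start) / 1_000_000);
    }
}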
Is HashMap the fastest data structure for your use-case?
It is hard to say without more details of your use-case.
It is hard to say without understanding how much time and effort you are prepared to put into the problem. (If you put in enough coding effort, you could almost certainly trim a few percent off. Maybe a lot more. HashMap is general purpose.)
It is hard to say without you (first!) doing a proper performance analysis.
For example, you first need to be sure that the HashMap really is the cause of your performance problems. Sure, you >>think<< it is, but have you actually profiled your code to find out? Until you do this, you risk wasting your time on optimizing something that isn't the bottleneck.
So HashMaps will have O(1) insert/delete even for a huge number of objects. The problem for a huge amount of data is the space. For a million entries you may be fine keeping it in memory.
Java has a default load factor of 0.75 for a HashMap, meaning that a HashMap would need about 1.33 million slots to support this map. If you can support this in memory, it's all fine. Even if you can't hold this all in memory, you'd probably still want to use HashMaps, perhaps a distributed HashMap.
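If you know the size up front, you can pre-size the map so those slots are allocated once and no intermediate rehashing happens. A minimal sketch of that calculation (the numbers are just the ones from this discussion):

import java.util.HashMap;
import java.util.Map;

public class PreSizedMap {
    public static void main(String[] args) {
        int expectedEntries = 1_000_000;
        float loadFactor = 0.75f;
        // ~1.34 million slots, so expectedEntries stays below capacity * loadFactor
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
        Map<Long, String> map = new HashMap<>(initialCapacity, loadFactor);
        for (long i = 0; i < expectedEntries; i++) {
            map.put(i, "value-" + i);   // no resize is triggered during this loop
        }
        System.out.println(map.size());
    }
}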
As far as Big-O time goes, remember that it describes asymptotic behaviour, and the O(1) figure for HashMap is an average (expected) cost. Big-O analysis only really becomes useful as data sizes get larger and larger; on a really small data set, an algorithm taking 5n + 10 steps does not behave like one taking n steps, even though both are O(n). The reason constant time (O(1)) is so valuable is that it means the time doesn't depend on the size of the data set. Therefore, for a large data set like the one you're describing, a HashMap is an excellent option due to its constant-time insert/delete.

OpenHFT ChronicleMap memory allocation and limits

This post would likely be a good candidate for frequently asked questions at OpenHFT.
I am playing with ChronicleMap, considering it for an idea, but I have lots of questions. I am sure most junior programmers who are looking into this product have similar considerations.
Would you explain how memory is managed in this API?
ChronicleMap proclaims some remarkable TBs of off-heap memory available for processing its data, and I would like to get a clear picture of that.
Let's get down to a programmer with a laptop with a 500GB HDD and 4GB of RAM. In this case pure math says the total 'swappable' memory resource available is 504GB. Let's give the OS and other programs half, and we are left with 250GB of HDD and 2GB of RAM. Can you elaborate on the actual memory ChronicleMap can allocate, in numbers, relative to the available resources?
Next related questions are relative to the implementation of ChronicleMap.
My understanding is that each ChronicleMap allocates the chunk of memory it works with, and that optimal performance/memory usage is achieved when we can accurately predict the amount of data passed through. However, this is a dynamic world.
Let's set an (exaggerated but possible) example:
Suppose a map of K (key) 'cities' and their V (value) 'description' (of the cities), allowing users large limits on the description length.
The first user enters: K = "Amsterdam", V = "City of bicycles", and this entry is used to declare the map
- it sets the precedent for a pair like this:
ChronicleMap<CharSequence, CharSequence> cityPostalCodes = ChronicleMap
        .of(CharSequence.class, CharSequence.class)
        .averageKey("Amsterdam")
        .averageValue("City of bicycles")
        .entries(5_000)
        .createOrRecoverPersistedTo(citiesAndDescriptions); // citiesAndDescriptions is a java.io.File
Now, the next user gets carried away and writes an essay about Prague.
He passes: K = "Prague", V = "City of 100 towers is located in the heart of Europe ... blah, blah... million words ..."
Now, the programmer had expected at most 5_000 entries, but it gets out of hand and there are many thousands of entries.
Does ChronicleMap allocate memory automatically for such cases? If yes, is there some better approach to declaring ChronicleMaps for this dynamic solution? If no, would you recommend an approach (best with a code example) for handling such scenarios?
How does this work with persistence to file?
Can ChronicleMaps deplete my RAM and/or disk space? Best practice to avoid that?
In other words, please explain how memory is managed in case of under-estimation and over-estimation of the value (and/or key) lengths and number of entries.
Which of these are applicable in ChronicleMap?
If I allocate a big chunk (.entries(1_000_000), .averageValueSize(1_000_000)) and the actual usage is Entries = 100 and Average Value Size = 100.
What happens?:
1.1. - all works fine, but there will be a large wasted chunk - unused?
1.2. - all works fine, the unused memory is available to:
1.2.1 - ChronicleMap
1.2.2 - given thread using ChronicleMap
1.2.3 - given process
1.2.4 - given JVM
1.2.5 - the OS
1.3. - please explain if something else happens with the unused memory
1.4. - what does the oversized declaration do to my persistence file?
The opposite of case 1 - I allocate a small chunk (.entries(10), .averageValueSize(10)) and the actual usage is 1_000_000s of entries and an Average Value Size of 1_000s of bytes.
What happens?:
Let's get down to a programmer with a laptop with a 500GB HDD and 4GB of RAM. In this case pure math says the total 'swappable' memory resource available is 504GB. Let's give the OS and other programs half, and we are left with 250GB of HDD and 2GB of RAM. Can you elaborate on the actual memory ChronicleMap can allocate, in numbers, relative to the available resources?
Under such conditions Chronicle Map will be very slow, with on average 2 random disk reads and writes (4 random disk operations in total) on each operation with Chronicle Map. Traditional disk-based db engines, like RocksDB or LevelDB, should work better when the database size is much bigger than memory.
Now, the programmer had expected at most 5_000 entries, but it gets out of hand and there are many thousands of entries.
Does ChronicleMap allocate memory automatically for such cases? If yes, is there some better approach to declaring ChronicleMaps for this dynamic solution? If no, would you recommend an approach (best with a code example) for handling such scenarios?
Chronicle Map will allocate memory as long as the actual number of entries inserted, divided by the number configured through ChronicleMapBuilder.entries(), does not exceed the configured ChronicleMapBuilder.maxBloatFactor(). E.g. if you create a map as
ChronicleMap<CharSequence, CharSequence> cityPostalCodes = ChronicleMap
        .of(CharSequence.class, CharSequence.class)
        .averageKey("Amsterdam")
        .averageValue("City of bicycles")
        .entries(5_000)
        .maxBloatFactor(5.0)
        .createOrRecoverPersistedTo(citiesAndDescriptions);
it will start throwing IllegalStateException on attempts to insert new entries when the size reaches approximately 25,000.
However, Chronicle Map works progressively slower as the actual size grows far beyond the configured size, so the maximum possible maxBloatFactor() is artificially limited to 1000.
The solution right now is to configure the future size of the Chronicle Map via entries() (and averageKey() and averageValue()) at least approximately correctly.
The requirement to configure a plausible Chronicle Map size in advance is acknowledged to be a usability problem. There is a way to fix this, and it's on the project roadmap.
In other words, please explain how memory is managed in case of under-estimation and over-estimation of the value (and/or key) lengths and number of entries.
Key/value size underestimation: space is wasted in the hash lookup area, ~8 bytes * underestimation factor, per entry. So it could be pretty bad if the actual average entry size (key + value) is small: e.g. if it is 50 bytes and you have configured it as 20 bytes, you will waste ~8 * 50 / 20 = 20 bytes, or 40%. The bigger the average entry size, the smaller the waste.
Key/value size overestimation: if you configure just key and value average size, but not actualChunkSize() directly, the actual chunk size is automatically chosen between 1/8th and 1/4th of the average entry size (key + value). The actual chunk size is the allocation unit in Chronicle Map. So if you configured average entry size as ~ 1000 bytes, the actual chunk size will be chosen between 125 and 250 bytes. If the actual average entry size is just 100 bytes, you will lose a lot of space. If the overestimation is small, the expected space losses are limited to about 20% of the data size.
So if you are afraid you may overestimate the average key/value size, configure actualChunkSize() explicitly.
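For example, a hedged sketch of what that might look like, using only the builder calls named in this answer (entries(), averageKey(), averageValue(), actualChunkSize(), createOrRecoverPersistedTo()); the chunk size of 64 bytes and the file name are illustrative assumptions, and exact builder constraints may differ between Chronicle Map versions:

import java.io.File;
import java.io.IOException;

import net.openhft.chronicle.map.ChronicleMap;

public class CityDescriptionsExample {
    public static void main(String[] args) throws IOException {
        File citiesAndDescriptions = new File("cities-and-descriptions.dat"); // illustrative path
        ChronicleMap<CharSequence, CharSequence> cityDescriptions = ChronicleMap
                .of(CharSequence.class, CharSequence.class)
                .averageKey("Amsterdam")
                .averageValue("City of bicycles")
                .entries(5_000)
                .actualChunkSize(64)   // pin the allocation unit so an overestimated
                                       // average value size doesn't inflate every entry
                .createOrRecoverPersistedTo(citiesAndDescriptions);
        cityDescriptions.put("Amsterdam", "City of bicycles");
        cityDescriptions.close();
    }
}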
Number of entries underestimation: discussed above. No particular space waste, but Chronicle Map works slower, the worse the underestimation.
Number of entries overestimation: memory is wasted in the hash lookup area, ~8 bytes * overestimation factor, per entry. See the key/value size underestimation section above for how good or bad this could be, depending on the actual average entry data size.

Maximal number of ChronicleMap entries

How many entries can ChronicleMap theoretically contain at maximum? What is the maximum number of entries one can put into a ChronicleMap?
The theoretical maximum is limited by the virtual memory you can have. This can be as low as 128 TB depending on your OS. By comparison Chronicle Queue doesn't have this limit, as it swaps in memory mappings, but would be much slower for random access as a result.
In practice, limiting the size of your map to 2x - 40x main memory size seems to be a realistic upper bound.
In short, the smaller the entries, the more you can have.
