For some time I have been working on creating an index for a very large data set (around 190 million records). I have a BTree that can insert records (typically an object) and search by key. While looking for a way to persist the data to files on disk, I came across this amazing article (http://www.javaworld.com/article/2076333/java-web-development/use-a-randomaccessfile-to-build-a-low-level-database.html#resources). It pretty much gives me a starting point.
The article indexes a String key to a binary object (blob). Its file format is divided into 3 regions: a header (stores the start point of the indexes), an index (stores each key and its corresponding location) and a data region (stores the data). It uses RandomAccessFile to get at the data.
How do I define a similar file format for a B-tree? All I know is that every read from disk has to fetch one node (typically one block of 512 bytes). There are many similar questions on how to persist a B-tree, but it is a little difficult to understand the big picture of why a particular design was chosen, as in this question (Persisting B-Tree nodes to RandomAccessFile -[SOLVED]). Please share your thoughts.
Here is an alternative take on the question, based on problem specifics that have become known in the meantime. This post is based on the following assumptions:
- record count about 190 million, fixed
- keys are 64-byte hashes, like SHA-256
- values are filenames: variable length, but sensible (average length < 64 bytes, max < page)
- page size 4 KiByte
Efficient representation of filenames in a database is a different topic that cannot be addressed here. Should the filenames be awkward - longish on average and/or Unicode - then the hashing solution will punish you with increased disk read counts (more overflows, more chaining) or reduced average occupancy (more wasted space). A B-tree solution reacts somewhat more benignly, though, since an optimum tree can be constructed in any case.
The most efficient solution in this situation - and the simplest to implement by a wide margin - is hashing, since your keys are perfect hashes already. Take the first 23 bits of the hash as the page number, and lay out the pages like this:
page header
    uint32_t next_page
    uint16_t key_count
key/offset vector
    uint16_t value_offset;
    byte key[64];
... unallocated space ...
last arrived filename
...
2nd arrived filename
1st arrived filename
Values (filenames) are stored from the end of the page downwards, prefixed with their 16-bit length, and the key/offset vector grows upwards. That way neither low/high key counts nor short/long values can cause unnecessary waste of space, as would be the case with fixed-size structures. Nor do you have to parse variable-length structures during key searches. Apart from that I've aimed for the greatest possible simplicity - no premature optimisation. The bottom of the heap can be stored in the page header, in KO.[PH.key_count].value_offset (my preference), or computed as KO.Take(PH.key_count).Select(r => r.value_offset).Min(), whatever pleases you most.
The key/offset vector needs to be kept sorted on the keys so that you can use binary search but the values can be written as they arrive, they do not need to be in any particular order. If the page overflows, allocate a new one just like it at the current end of the file (growing the file by one page) and stash its page number in the appropriate header slot. This means that you can binary search within a page but all chained pages need to be read and searched one by one. Also, you do not need any kind of file header, since the file size is otherwise available and that's the only piece of global management information that needs to be maintained.
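The page layout described above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the class name `BucketPage`, its method names, big-endian field encoding and UTF-8 filenames are all made up for the example, and the heap bottom is recomputed as the minimum value offset rather than stored in the page.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of one 4 KiB bucket page: sorted key/offset vector grows up, values grow down. */
final class BucketPage {
    static final int PAGE_SIZE = 4096, HEADER = 6, KEY_LEN = 64, ENTRY = 2 + KEY_LEN;

    final byte[] bytes = new byte[PAGE_SIZE];          // an all-zero page is a valid empty page
    private final ByteBuffer buf = ByteBuffer.wrap(bytes);

    int keyCount() { return buf.getShort(4) & 0xFFFF; } // header: uint32 next_page @0, uint16 key_count @4

    private int entryPos(int i) { return HEADER + i * ENTRY; }

    private int valueOffset(int i) { return buf.getShort(entryPos(i)) & 0xFFFF; }

    /** Bottom of the value heap, computed as the minimum value offset (PAGE_SIZE if empty). */
    private int heapBottom() {
        int bottom = PAGE_SIZE;
        for (int i = 0; i < keyCount(); i++) bottom = Math.min(bottom, valueOffset(i));
        return bottom;
    }

    /** Unsigned lexicographic comparison of stored key i with the probe key. */
    private int compareKey(int i, byte[] key) {
        int base = entryPos(i) + 2;
        for (int j = 0; j < KEY_LEN; j++) {
            int d = (bytes[base + j] & 0xFF) - (key[j] & 0xFF);
            if (d != 0) return d;
        }
        return 0;
    }

    /** Binary search; returns the index if found, else -(insertionPoint) - 1. */
    private int search(byte[] key) {
        int lo = 0, hi = keyCount() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1, c = compareKey(mid, key);
            if (c == 0) return mid;
            if (c < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -lo - 1;
    }

    /** Returns false if the key already exists or the record does not fit (caller would chain a new page). */
    boolean insert(byte[] key, String filename) {
        byte[] value = filename.getBytes(StandardCharsets.UTF_8);
        int n = keyCount(), pos = search(key);
        int newOffset = heapBottom() - 2 - value.length;
        if (pos >= 0 || entryPos(n + 1) > newOffset) return false;
        pos = -pos - 1;
        // store the length-prefixed value at the new bottom of the heap
        buf.putShort(newOffset, (short) value.length);
        System.arraycopy(value, 0, bytes, newOffset + 2, value.length);
        // open a gap in the sorted vector and write the new entry
        System.arraycopy(bytes, entryPos(pos), bytes, entryPos(pos + 1), (n - pos) * ENTRY);
        buf.putShort(entryPos(pos), (short) newOffset);
        System.arraycopy(key, 0, bytes, entryPos(pos) + 2, KEY_LEN);
        buf.putShort(4, (short) (n + 1));
        return true;
    }

    /** Returns the filename stored under key, or null if this page does not contain it. */
    String find(byte[] key) {
        int i = search(key);
        if (i < 0) return null;
        int off = valueOffset(i), len = buf.getShort(off) & 0xFFFF;
        return new String(bytes, off + 2, len, StandardCharsets.UTF_8);
    }
}
```

When `insert` returns false because the page is full, the caller would allocate an overflow page at the end of the file and record its number in `next_page`; that chaining step is left out of the sketch.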
Create the file as a sparse file with the number of pages as indicated by your chosen number of hash key bits (e.g. 8388608 pages for 23 bits). Empty pages in a sparse file don't take up any disk space and read as all 0s, which works perfectly fine with our page layout/semantics. Extend the file by one page whenever you need to allocate an overflow page. Note: the 'sparse file' thing isn't very important here since almost all pages will have been written to when you're done building the file.
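A minimal sketch of the addressing step, assuming the hash is available as a byte array and that "first bits" means the most significant bits of the first bytes. The names `PageAddress`, `pageNumber`, `fileOffset` and `createFile` are hypothetical. Whether `setLength` actually yields a sparse file depends on the file system, so treat that part as an assumption too.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class PageAddress {
    static final int PAGE_SIZE = 4096;

    /** Page number = the first `bits` bits of the hash (bits in 1..31, hash at least 4 bytes long). */
    static int pageNumber(byte[] hash, int bits) {
        int first32 = ((hash[0] & 0xFF) << 24) | ((hash[1] & 0xFF) << 16)
                    | ((hash[2] & 0xFF) << 8)  |  (hash[3] & 0xFF);
        return first32 >>> (32 - bits);
    }

    /** Byte offset of a page within the file. */
    static long fileOffset(int pageNumber) {
        return (long) pageNumber * PAGE_SIZE;
    }

    /** Creates a file of 2^bits empty pages; on many file systems the unwritten pages take no disk space. */
    static void createFile(String path, int bits) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            file.setLength((1L << bits) * PAGE_SIZE);
        }
    }
}
```

With 23 bits this addresses 8388608 primary pages; overflow pages would simply get page numbers beyond that range.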
For maximum efficiency you need to run some analyses on your data. In my simulation - with random numbers as stand-ins for the hashes, and on the assumption that average filename size is 62 bytes or less - the optimum turned out to be making 2^23 = 8388608 buckets/pages. This means that you take the first 23 bits of the hash as the page number to load. Here are the details:
# bucket statistics for K = 23 and N = 190000000 ... 7336,5 ms
average occupancy 22,6 records
0 empty buckets (min: 3 records)
310101/8388608 buckets with 32+ records (3,7%)
That keeps the chaining to a minimum, on average you need to read just 1.04 pages per search. Increasing the hash key size by one single bit to 24 reduces the expected number of overflowing pages to 3 but doubles the file size and reduces average occupancy to 11.3 records per page/bucket. Reducing the key to 22 bits means that almost all pages (98.4%) can be expected to overflow - meaning the file is virtually the same size as that for 23 bits but you have to do twice as many disk reads per search.
Hence you see how important it is to run a detailed analysis on the data to decide on the proper number of bits to use for hash addressing. You should run an analysis that uses the actual filename sizes and tracks the per-page overhead, to see what the actual picture looks like for 22 bits to 24 bits. It'll take a while to run but that's still way faster than building a multi-gigabyte file blindly and then finding that you have wasted 70% of space or that searches take significantly more than 1.05 page reads on average.
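As a quick cross-check before running the full analysis, the bucket statistics can be approximated analytically. The sketch below assumes uniformly distributed hashes, so that bucket sizes are roughly Poisson distributed with mean N / 2^K, and it reuses the 32-record overflow threshold from the simulation output above. The class and method names are hypothetical, and the approximation ignores the actual filename lengths, so it complements the analysis on real data rather than replacing it.

```java
final class BucketStats {
    /** Average number of records per bucket for `records` keys spread over 2^bits buckets. */
    static double averageOccupancy(long records, int bits) {
        return (double) records / (1L << bits);
    }

    /** P(X >= threshold) for X ~ Poisson(lambda): the estimated fraction of overflowing buckets. */
    static double tailProbability(double lambda, int threshold) {
        double pmf = Math.exp(-lambda);   // P(X = 0)
        double cdf = 0.0;
        for (int k = 0; k < threshold; k++) {
            cdf += pmf;                   // add P(X = k)
            pmf *= lambda / (k + 1);      // step to P(X = k + 1)
        }
        return Math.max(0.0, 1.0 - cdf);
    }

    public static void main(String[] args) {
        long n = 190_000_000L;
        for (int bits = 22; bits <= 24; bits++) {
            double lambda = averageOccupancy(n, bits);
            System.out.printf("K = %d: average occupancy %.1f, overflow fraction %.6f%n",
                    bits, lambda, tailProbability(lambda, 32));
        }
    }
}
```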
Any B-tree based solution would be much more involved (read: complicated) but could not reduce the page read count per search below 1.000, for obvious reasons, and even that only on the assumption that a sufficient number of internal nodes can be kept cached in memory. If your system has such humongous amounts of RAM that data pages can be cached to a significant degree then the hashing solution will benefit just as much as one that is based on some kind of B-tree.
As much as I would like an excuse for building a screamingly fast hybrid radix/B+tree, the hashing solution delivers essentially the same performance for a tiny fraction of the effort. The only thing where B-treeish solutions can outdo hashing here is space efficiency, since it is trivial to construct an optimum tree for existing pre-sorted data.
There are plenty of Open Source key/value stores and full database engines - take a week off and start Googling. Even if you end up using none of them, you still need to study a representative cross section (architecture, design histories, key implementation details) to get enough of an overview of the subject matter so that you can make informed decisions and ask intelligent questions. For a brief overview, try to Google details on index file formats, both historic ones like IDX or NTX, and current ones used in various database engines.
If you want to roll your own then you might consider hitching yourself to the bandwagon of an existing format, like the dBASE variants Clipper and Visual FoxPro (my favourite). This gives you the ability to work your data with existing tools, including Total Commander plugins and whatnot. You don't need to support the full formats, just the single binary instance of the format that you choose for your project. Great for debugging, reindexing, ad hoc queries and so on. The format itself is dead simple and easy to generate even if you don't use any of the existing libraries. The index file formats aren't quite as trivial but still manageable.
If you want to roll your own from scratch then you've got quite a road ahead of you, since the basics of intra-node (intra-page) design and practice are poorly represented on the Internet and in literature. For example, some old DDJ issues contained articles about efficient key matching in connection with prefix truncation (a.k.a. 'prefix compression') and so on but I found nothing comparable out there on the 'net at the moment, except buried deeply in some research papers or source code repositories.
The single most important item here is the algorithm for searching prefix-truncated keys efficiently. Once you've got that, the rest more or less falls into place. I have found only one resource on the 'net, which is this DDJ (Dr Dobb's Journal) article:
Supercharging Sequential Searches by Walter Williams
A lot of tricks can also be gleaned from papers like
Efficient index compression in DB2 LUW
For more details and pretty much everything else you could do a lot worse than reading the following two books cover to cover (both of them!):
Goetz Graefe: Modern B-Tree Techniques (ISBN 1601984820)
Jim Gray: Transaction Processing. Concepts and Techniques (ISBN 1558601902)
An alternative to the latter might be
Philip E. Bernstein: Principles of Transaction Processing (ISBN 1558606238)
It covers a similar spectrum and it seems to be a bit more hands-on, but it does not seem to have quite the same depth. I cannot say for certain, though (I've ordered a copy but haven't got it yet).
These books give you a complete overview over all that's involved, and they are virtually free of fat - i.e. you need to know almost everything that's in there. They will answer gazillions of questions that you didn't know you had, or that you should have asked yourself. And they cover the whole ground - from B-tree (and B+tree) basics to detailed implementation issues like concurrency, locking, page replacement strategies and so forth. And they enable you to utilise the information that is scattered over the 'net, like articles, papers, implementation notes and source code.
Having said that, I'd recommend matching the node size to the architecture's RAM page size (4 KB or 8 KB), because then you can utilise the paging infrastructure of your OS instead of running afoul of it. And you're probably better off keeping index and blob data in separate files. Otherwise you couldn't put them on different volumes and the data would b0rken the caching of the index pages in subsystems that are not part of your program (hardware, OS and so forth).
I'd definitely go with a B+tree structure instead of watering down the index pages with data as in a normal B-tree. I'd also recommend using an indirection vector (Graefe has some interesting details there) in connection with length-prefixed keys. Treat the keys as raw bytes and keep all the collation/normalisation/upper-lower nonsense out of your core engine. Users can feed you UTF8 if they want - you don't want to have to care about that, trust me.
There is something to be said for using only suffix truncation in internal nodes (i.e. for distinguishing between 'John Smith' and 'Lucky Luke', 'K' or 'L' work just as well as the given keys) and only prefix truncation in leaves (i.e. instead of 'John Smith' and 'John Smythe' you store 'John Smith' and 7+'ythe').
It simplifies the implementation, and gives you most of the bang that could be got. I.e. shared prefixes tend to be very common at the leaf level (between neighbouring records in index order) but not so much in internal nodes, i.e. at higher index levels. Conversely, the leaves need to store the full keys anyway and so there's nothing to truncate and throw away there, but internal nodes only need to route traffic and you can fit a lot more truncated keys in a page than non-truncated ones.
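The two truncation schemes can be illustrated with a few lines of Java. This is only a sketch with hypothetical names (`KeyTruncation`, `leafSuffix`, `shortestSeparator`); it treats keys as raw bytes and assumes that keys arrive in ascending unsigned byte order.

```java
import java.util.Arrays;

final class KeyTruncation {
    /** Number of leading bytes that a and b have in common. */
    static int commonPrefixLength(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length), i = 0;
        while (i < n && a[i] == b[i]) i++;
        return i;
    }

    /** Leaf level (prefix truncation): the bytes of `key` left after dropping the prefix shared with `previous`. */
    static byte[] leafSuffix(byte[] previous, byte[] key) {
        return Arrays.copyOfRange(key, commonPrefixLength(previous, key), key.length);
    }

    /**
     * Internal nodes (suffix truncation): a shortest separator s with left < s <= right.
     * Assumes left < right, so right has at least one byte beyond the common prefix.
     */
    static byte[] shortestSeparator(byte[] left, byte[] right) {
        return Arrays.copyOf(right, commonPrefixLength(left, right) + 1);
    }
}
```

For 'John Smith' followed by 'John Smythe' the leaf would store the skip count 7 plus 'ythe', and an internal node separating 'John Smith' from 'Lucky Luke' only needs 'L'.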
Key matching against a page full of prefix-truncated keys is extremely efficient - on average you compare a lot less than one character per key - but it's still a linear scan, even with all the hopping forward based on skip counts. This limits effective page sizes somewhat, since binary search is more complicated in the face of truncated keys. Graefe has a lot of details on that. One workaround for enabling bigger node sizes (many thousands of keys instead of hundreds) is to lay out the node like a mini B-tree with two or three levels. It can make things lightning-fast (especially if you respect magic thresholds like 64-byte cache line size), but it also makes the code hugely more complicated.
I'd go with a simple lean and mean design (similar in scope to IDA's key/value store), or use an existing product/library, unless you are in search of a new hobby...
Related
I am fairly new to DS and Algorithms and recently at a job interview I was asked a question on performance tuning along with code. We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure. So which Java feature/library can we use to do the searching in the quickest time possible ?
On the spot I could not think of exact answer so I wrote that:
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
How can I understand the exact answer to this question and what can be the optimal solution(s) ?
After reading the question and getting clarification in the comments, I think what has become apparent to me is that: you needed to ask follow-up questions.
I'll try to break it down and provide comments that I hope will be helpful, because I also know what it's like to be "in the moment" and how nerves can stab you in the back when you least need them to.
We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure.
I think a good follow-up question here would've been:
Q: What specific data structure is being used to contain all this data?
I would press until they give me an actual name and explain why it is not possible to name a Java algorithm/library. For all you know, the data structure could've been String[], a Set<String>, or even a fancy name for a file on disk (if they're trying to throw you off). They could've also clarified and said the DS was not relevant and that you could pick whichever DS you thought was best.
The wording also implies that they implemented the structure and that it's already populated in a system with, presumably, enough memory to hold all of it. Asking to confirm that this is really the case could've given you helpful information.
For example: "Based on the wording, it seems this mystery data structure is already implemented and fully populated in memory in a system with enough memory to hold it. Can you confirm my understanding here is correct? If not, could you clarify further?"
Given the suggested wording, and the fact that we don't have additional clarifications to go from, I will assume, for the purposes of this answer, that my suppositions are indeed correct.
Note that if you had been asked to design the data structure to hold all of this info, you would've had to ask very different questions, take memory constraints into account, and perhaps even ask about character sets/encodings (e.g. ASCII vs multi-byte Unicode).
Also, if you had been asked to design the search algorithm, then knowing the DS is a pre-requisite, and not knowing this could've made the task impossible. For example, the binary search algorithm implementation will look very different if you're working on an array vs a binary search tree, even though both would offer O(lg n) time complexity.
So which java feature/library can we use to do the searching in the quickest time possible?
Consistent with the 1st part, this question only asks what pre-existing/built-in Java code you would choose to perform the search for you. The "quickest time possible" here should make you think about solutions that are in O(1), i.e. are constant time. However, the data structure may open/close doors for you.
Some search algorithms in Java work on generics and others work on other types like arrays. Some algorithms work on Maps while others work on Lists, Sets, and so on. The follow-up question from the first part could've helped in answering this question.
That said, even if you knew the DS but couldn't think of a specific method name at the time, I think it should be considered reasonable to mention the interface, or at least a relevant package, and to say that further details can be checked in the Java documentation if you're pressed for more specificity, given that's what it's there for in the first place.
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
Given the wording, my interpretation of their question was not "which data structure would you use?", but rather, "which pre-existing search algorithm would you choose?". It seems to me like it was them who needed to answer the question regarding DS.
That said, if you had indeed been asked "which data structure would you use?", then a Map would've still worked against you, since you didn't really need to map a key to a value. You only needed to store a value (i.e. the words). Therefore, a Set, specifically a HashSet, would've been a better candidate, since it also avoids duplicates and expresses the intent more directly. Don't count on it consuming less memory, though: the JDK's HashSet is backed by a HashMap internally, so each element still costs a full map entry.
Of course, that's still under the assumption(s) I made earlier. If memory constraints are said to be an issue, then scaling horizontally to multiple servers and so on would've likely been necessary.
How can I understand the exact answer to this question and what can be the optimal solution(s)?
It is probably the case that they wanted to see if you would follow up with questions, given the lack of information they gave you.
There are a couple data structures that allow for efficient searching, assuming that memory requirements aren't an issue and the data structure is already populated.
Regarding time complexity, Set#contains and Map#containsKey are both O(1) on average for the hash-based implementations (HashSet, HashMap), assuming that the hash function isn't expensive and that there aren't many collisions.
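A minimal sketch of the hash-based lookup, assuming the words fit in memory. The wrapper class `WordLookup` is hypothetical; only `HashSet` and `Set#contains` are JDK API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WordLookup {
    private final Set<String> words;

    WordLookup(List<String> entries) {
        this.words = new HashSet<>(entries);   // duplicates collapse into one element
    }

    /** Expected constant-time membership test. */
    boolean contains(String word) {
        return words.contains(word);
    }

    int distinctCount() {
        return words.size();
    }
}
```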
Because the data structure stores words (assuming you're referring to Strings), it could also be relatively efficient to use a trie (prefix tree) or its compressed variant, the radix tree, which lets you search character by character. A lookup then costs O(k) for a word of length k, independent of the number of entries. If the hash function is expensive or there are many collisions, this could be a good alternative!
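For illustration, a bare-bones trie might look like the sketch below. It is an uncompressed prefix tree with one `HashMap` per node, chosen for brevity rather than memory efficiency, and the class is made up for this example. A lookup visits at most one node per character of the query word.

```java
import java.util.HashMap;
import java.util.Map;

final class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    /** Adds a word, creating one node per character as needed. */
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    /** True only if exactly this word was inserted (a mere prefix does not count). */
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length() && node != null; i++) {
            node = node.children.get(word.charAt(i));
        }
        return node != null && node.isWord;
    }
}
```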
The answer that you gave to the interviewer should suffice since hashing is an effective searching method, even for billions of entries.
You did not mention whether the entries are words or documents (multiple words). In both cases a search index could be suitable.
Search indexes extract words from the billion document entries and manage a map of these words to the documents they are used in. Frameworks like Lucene (e.g. as part of SOLR or ElasticSearch) manage memory and persistence for you.
If it were only a few thousand entries, a simple HashMap would be sufficient because there is no need for memory management then. If all of the billion entries are single words, a database could be a slightly better choice.
The hashmap solution is reasonable as stated by others but there are doubts with respect to scalability.
Here is a possible solution for the problem as discussed in the below post
1. Sub-string match: your entry blob is a single string or word (without any white space) and you need to search for an arbitrary sub-string within it. In such cases you need to parse every entry to find the best-matching entries. One uses algorithms like the Boyer-Moore algorithm. See this and this for details. This is also equivalent to grep, because grep uses similar techniques internally.
2. Indexed search: here you are assuming that an entry contains a set of words and that the search is limited to fixed word lengths. In this case, entries are indexed over all the possible occurrences of words. This is often called "full-text search". There are a number of algorithms to do this and a number of open source projects that can be used directly. Many of them also support wildcard search, approximate search etc., for example:
a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
b. OpenFTS : http://openfts.sourceforge.net/
c. Sphinx http://sphinxsearch.com/
Most likely, if you need "fixed words" as queries, approach 2 will be very fast and effective.
Reference - https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa
Multi-billion entries lie at the edge of what might conceivably be stored in main memory (for instance, storing 10 billion entries at 100 bytes per entry will take 1000 GB main memory).
While storing the data in main memory offers a very high throughput (thousands to millions of requests per second), you'd likely need special hardware (typical blade servers only offer 16 GB, but there are commodity servers that permit installation of up to 3000 GB of main memory). Also, keeping this much data in the Java heap will likely cause garbage collector pauses of seconds or minutes unless special care is taken.
Therefore, unless the structure of your data admits a very compact representation in main memory (say, you only need membership checking among ints, which is possible with a 512 MB Bitset), you'll not want to store it in main memory, but on disk.
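To make the bitset example concrete: one bit per possible int value needs 2^32 bits, i.e. 2^26 longs or 512 MB. `java.util.BitSet` only accepts non-negative int indices, so the sketch below uses a plain `long[]` with an offset instead. The class `IntMembership` is hypothetical and takes the covered value range as a parameter so that it can be tried on a small range first.

```java
final class IntMembership {
    private final long[] words;
    private final long min;       // smallest value that can be stored

    IntMembership(int min, int max) {
        this.min = min;
        this.words = new long[wordsNeeded(min, max)];
    }

    /** Number of 64-bit words needed for one bit per value in [min, max]. */
    static int wordsNeeded(int min, int max) {
        return (int) ((((long) max - min) >>> 6) + 1);
    }

    void add(int value) {
        long bit = value - min;                        // long arithmetic, no overflow
        words[(int) (bit >>> 6)] |= 1L << (int) (bit & 63);
    }

    boolean contains(int value) {
        long bit = value - min;
        return (words[(int) (bit >>> 6)] & (1L << (int) (bit & 63))) != 0;
    }
}
```

Covering the full range from `Integer.MIN_VALUE` to `Integer.MAX_VALUE` allocates the 512 MB array, so the JVM heap has to be sized accordingly.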
Therefore, you'll need persistence. Any relational or NoSQL database permits efficient searching by key and can handle such amounts of data with ease. To talk to a relational database, use JPA or JDBC. To talk to a non-relational database, you can use their proprietary Java API or an abstraction layer such as Spring Data.
You could also implement persistence from scratch if you wanted to (i.e. the interviewer asks for that). A data structure optimized for efficient lookup in external memory is the B-Tree, that's what many databases use internally :-)
I'm working on a project that requires that I store (potentially) millions of key-value mapping, and make (potentially) the 100s of queries a second. There are some checks I can do around the data I'm working with, but it will only reduce the load by a bit. In addition, I will be making (potentially) 100s of put/removes a second, so my question is: Is there a map sufficient for this task? Is there any way I might optimize the map? Is there something faster that would work for storing key-value mappings?
Some additional information;
- The key will be a point in 3d spaces, I feel like this means I could use arrays, but the arrays would have to be massive
- The value must be an object
Any help would be greatly appreciated!
Back-of-envelope estimates help in coming to terms with this sort of thing. If you have millions of entries in a map, let's say 32M, and a key is a 3d point (so 3 ints -> 3*4B -> 12 bytes), then 12B * 32M = 384MB. You didn't mention the size of the value, but assuming you have a similarly sized value let's double that figure. This is Java, so assuming a 64-bit platform with Compressed OOPs, which is the default and what most people are on, you pay an extra 12B of object header per object. So: 32M * 2 * 24B = 1536MB.
Now if you use a HashMap each entry requires an extra HashMap.Node, in Java8 on the platform above you are looking at 32B per Node (use OpenJDK JOL to find out object sizes). Which brings us to 2560MB. Also throw in the cost of the HashMap array, with 32M entries you are looking at a table with 64M entries (because the array size is a power of 2 and you need some slack beyond your entries), so that's an extra 256MB. All together lets round it up to 3GB?
Most servers these days have quite large amounts of memory (10s to 100s of GB) and adding an extra 3GB to the JVM live set should not scare you. You might consider it disappointing that the overhead exceeds the data in your case, but this is not your emotional well being, it's a question of will it work ;-)
Now that you've loaded up the data, you are mutating it at a rate of 100s of inserts/deletes per second, let's say 1024. Reusing the above quantities we can sum it up with: 1024 * (24*2 + 32) = 80KB. Churning 80KB of garbage per second is small change for many applications, and not something you necessarily need to sweat about. To put it in context, a JVM will contend with collecting many 100s of MB of Young Generation in a matter of 10s of milliseconds these days.
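The estimates above can be recomputed with a few lines of code, which makes it easy to plug in other entry counts or object sizes. The sketch reads "32M" as 32 * 2^20 entries and uses the sizes quoted in the text (12-byte payload, 12-byte header, 32-byte HashMap.Node, 4-byte compressed reference). The class and constant names are made up for the example, and actual sizes should be measured with a tool such as JOL.

```java
final class MapFootprint {
    static final long MB = 1L << 20;            // "M" and "MB" are both read as 2^20 here
    static final long ENTRIES = 32 * MB;
    static final int PAYLOAD = 12;              // three ints per key (value assumed the same size)
    static final int HEADER = 12;               // object header with Compressed OOPs
    static final int NODE = 32;                 // HashMap.Node as quoted in the text
    static final int REF = 4;                   // one compressed reference per table slot

    static long rawKeysMb()  { return ENTRIES * PAYLOAD / MB; }
    static long objectsMb()  { return ENTRIES * 2 * (PAYLOAD + HEADER) / MB; }
    static long nodesMb()    { return ENTRIES * NODE / MB; }
    static long tableMb()    { return 2 * ENTRIES * REF / MB; }   // table with twice as many slots as entries
    static long totalMb()    { return objectsMb() + nodesMb() + tableMb(); }

    /** Garbage produced per second if every mutation discards one key, one value and one node. */
    static long churnBytesPerSecond(int mutationsPerSecond) {
        return (long) mutationsPerSecond * (2 * (PAYLOAD + HEADER) + NODE);
    }
}
```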
So, in summary, if all you need is to load the data and query/mutate it along the lines you describe you might just find that a modern server can easily contend with a vanilla solution. I'd recommend you give that a go, maybe prototype with some representative data set, and see how it works out. If you have an issue you can always find more exotic/efficient solutions.
To make it simple, my question is: how to hash a String (about 200 characters) as quickly as possible. Security is not important, but collisions ARE a big deal.
Note: After a quick investigation, it seems like MurmurHash3 might be the best choice. I am open to any comment saying otherwise tho'
First, I know that there are plenty of other similar question, but I couldn't find a convincing answer yet.
I have a list of objects, each containing a list of about 3k paragraphs which is saved to a database. Every X hours, those paragraph are regenerated and I need to find if any paragraphs has changed, and if so, push only those new paragraphs.
The quickest way I found to find the differences (knowing that most of the time the content will be identical) is to create a MerkleTree, save it to the DB, and iterate over the MerkleTree to find the differences, instead of comparing the paragraphs themselves.
This implies, in my case, that I will be creating tens of thousands of hashes per second to compare with what is in the DB. Therefore, I need a very efficient way to create those hashes. I don't care about security, I only need to ensure that the number of collisions remains very, very low.
What would be the best algorithm available in Java for that?
In my case, the main object is composed of Sections, which is composed of Languages, which is composed of Paragraph. The comparison strategy is:
1) If the object hash is identical, stop, otherwise go to 2)
2) Loop on all Section, keep only the Section with a different hash
3) Loop on all Languages of those Sections, keep only the language with a different hash
4) Loop on all the Paragraph of all those Languages, if the hash is different, then push the new content.
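The comparison strategy above might be sketched like this. It is a simplified illustration, not the asker's code: `HashTree` and its methods are hypothetical, both trees are assumed to have the same shape with children in the same order, and the hash is a placeholder 64-bit FNV-1a style mix that could be swapped for any other function.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class HashTree {
    final String name;
    final long hash;
    final List<HashTree> children;

    private HashTree(String name, long hash, List<HashTree> children) {
        this.name = name;
        this.hash = hash;
        this.children = children;
    }

    /** A paragraph: hash of its text. */
    static HashTree leaf(String name, String text) {
        long h = 0xcbf29ce484222325L;
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) { h ^= (b & 0xFF); h *= 0x100000001b3L; }
        return new HashTree(name, h, List.of());
    }

    /** An object, section or language: hash derived from the child hashes, in order. */
    static HashTree node(String name, HashTree... kids) {
        long h = 0xcbf29ce484222325L;
        for (HashTree kid : kids) { h ^= kid.hash; h *= 0x100000001b3L; }
        return new HashTree(name, h, List.of(kids));
    }

    /** Paths of all leaves whose hash differs between the stored and the freshly generated tree. */
    static List<String> changedLeaves(HashTree stored, HashTree fresh) {
        List<String> out = new ArrayList<>();
        collect(stored, fresh, fresh.name, out);
        return out;
    }

    private static void collect(HashTree stored, HashTree fresh, String path, List<String> out) {
        if (stored.hash == fresh.hash) return;            // identical subtree: stop descending
        if (fresh.children.isEmpty()) { out.add(path); return; }
        for (int i = 0; i < fresh.children.size(); i++) {
            HashTree child = fresh.children.get(i);
            collect(stored.children.get(i), child, path + "/" + child.name, out);
        }
    }
}
```

Only the subtrees whose hashes differ are descended into, which is what keeps the common case (nothing changed) down to a single comparison per object.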
This amazing answer on Programmers Stack Exchange tells you all you need to know.
The short version is, use FNV-1a, aka the Fowler–Noll–Vo hash function, it has excellent performance, high randomness and low collisions.
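For reference, 64-bit FNV-1a is only a few lines of Java. The constants below are the standard 64-bit offset basis and prime. The class name `Fnv1a` and the choice to hash the UTF-8 bytes of the string are assumptions of this sketch.

```java
import java.nio.charset.StandardCharsets;

final class Fnv1a {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long PRIME = 0x100000001b3L;

    /** 64-bit FNV-1a: XOR in each byte, then multiply by the FNV prime (wrapping modulo 2^64). */
    static long hash64(byte[] data) {
        long h = OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= PRIME;
        }
        return h;
    }

    static long hash64(String text) {
        return hash64(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

A 64-bit result keeps accidental collisions far rarer than the 32-bit variant would, which matters when collisions are the main concern; it is still worth benchmarking against the alternatives on representative data.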
Any further explanation I might offer on this question would just be a copy and paste from that Programmers.SE answer, which incidentally is the second highest voted answer on the entire site.
Some other thoughts:
Ultimately, you have a pretty niche use case. Most people aren't dealing with 1 billion entry datasets regularly. As such, you may have to do your own benchmarking.
That said, having a high randomness suggests that the algorithm is likely to scale well for hashing English text.
You haven't really talked about other issues; are you able to keep the entire data set in memory? What are your footprint requirements?
See also: Fastest Hash Algorithm for Text Data
In some previous posts I asked questions about coding a custom hash map/table in Java. Since I could not solve the problem, and maybe did not state properly what I really want, I am summarizing everything here to make it clear and precise.
What I am going to do:
I am writing code for our server in which I have to find a user's access type by URL.
We have approximately 1110 million URLs.
So, what we did:
1) Divided the database into 10 parts of 110 million URLs each.
2) Built a HashMap using parallel arrays, whose keys are one part of the URL (represented as LONG) and whose values are the other part of the URL (represented as INT) - a key can have multiple values.
3) Each day, when the system starts, we search the HashMap for some other URLs (the millions of URLs saved during one day).
What I have tried:
1) I have tried many NoSQL databases, but we found them not so good for our purpose.
2) I have built our custom hashmap (using two parallel arrays) for that purpose.
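For readers unfamiliar with the parallel-array approach, here is a minimal sketch of such a table: open addressing with linear probing, a `long[]` for keys and an `int[]` for values, and duplicate keys allowed so that one key can carry several values. It is not the asker's code. The class name, the multiplicative hash and the convention that key 0 marks an empty slot are assumptions of the example, and deletions are not supported.

```java
import java.util.ArrayList;
import java.util.List;

final class LongIntMultiTable {
    private final long[] keys;     // 0 marks an empty slot (assumes 0 is never a real key)
    private final int[] values;
    private final int mask;
    private int size;

    /** Capacity is 2^log2Capacity slots; keep the load factor well below 1 (e.g. 0.5). */
    LongIntMultiTable(int log2Capacity) {
        keys = new long[1 << log2Capacity];
        values = new int[1 << log2Capacity];
        mask = keys.length - 1;
    }

    private int slot(long key) {
        return (int) ((key * 0x9E3779B97F4A7C15L) >>> 32) & mask;   // multiplicative hash
    }

    /** Appends a (key, value) pair; the same key may be stored several times. */
    void put(long key, int value) {
        if (key == 0 || size == keys.length - 1) throw new IllegalStateException("key 0 or table full");
        int i = slot(key);
        while (keys[i] != 0) i = (i + 1) & mask;   // linear probing to the next free slot
        keys[i] = key;
        values[i] = value;
        size++;
    }

    /** All values stored under the key, in insertion order. */
    List<Integer> get(long key) {
        List<Integer> result = new ArrayList<>();
        for (int i = slot(key); keys[i] != 0; i = (i + 1) & mask) {
            if (keys[i] == key) result.add(values[i]);
        }
        return result;
    }
}
```

Because the two arrays are plain primitives, they can be written to and read from disk in bulk, which is what makes the load time largely a matter of raw I/O throughput.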
So, what the issue is:
When the system starts we have to load the hashtable of each database and search it for millions of URLs.
Now, the issue is:
1) Though the HashTable performance is quite nice, the code takes a long time to load the HashTable (we are using FileChannel and a memory-mapped buffer, which takes 20 seconds to load a HashTable with 220 million entries; the load factor is 0.5, which we found to be fastest).
So we are spending (HashTable load + HashTable search) * number of DBs = (20 + 5) * 10 = 250 seconds. That is quite expensive for us, and most of the time (200 out of 250 seconds) goes into loading the hashtables.
Have I thought of any other way?
One way could be:
Not worrying about loading and storing, and leaving caching to the operating system by using a memory-mapped buffer. But as I have to search for millions of keys, this performs worse than the approach above.
As we found that HashTable performance is nice but loading time is high, we thought of cutting it down another way:
1) Create an array of linked lists of size Integer.MAX_VALUE (my own custom linked list).
2) Insert the values (ints) into the linked list whose index is the key (we reduce the key size to INT).
3) That way we only have to store the linked lists on disk.
Now, the issue is that creating that many linked lists takes a lot of time, and creating so many of them is pointless if the data is not well distributed.
So, what are my requirements?
Simply:
1) Insertion and lookup of keys with multiple values, with good search performance.
2) A fast way to load the data into memory (especially this).
(Keys are 64-bit INTs and values are 32-bit INTs; one key can have at most 2-3 values. We could also make the keys 32-bit, which gives more collisions, but that is acceptable for us if it improves things.)
Can anyone help me solve this, or comment on how to approach the issue?
Thanks.
NB:
1) As per previous suggestions on Stack Overflow: pre-reading data to warm the disk cache is not possible, because our application starts working as soon as the system starts, and again on the next day when the system starts.
2) We have not found NoSQL DBs to scale well, even though our requirements are simple (just insert hashtable key/value pairs, then load and search, i.e. retrieve values).
3) As our application is part of a small project and is to be deployed on a small campus, I don't think anybody will buy me an SSD for it. That is my limitation.
4) We also tried Guava/Trove, but they are not able to store such a large amount of data even in 16 GB (we are using a 32 GB Ubuntu server).
If you need quick access to 1110 million data items then hashing is the way to go. But don't reinvent the wheel, use something like:
memcacheDB: http://memcachedb.org
MongoDB: http://www.mongodb.org
Cassandra: http://cassandra.apache.org
It seems to me (if I understand your problem correctly) that you are trying to approach the problem in a convoluted manner.
I mean the data you are trying to pre-load is huge to begin with (let's say 220 million slots * 12 bytes for a 64-bit key plus a 32-bit value ~ 2.6 GB per part, or roughly 26 GB for all ten parts). And you are trying to memory-map etc. for this.
I think this is a typical problem that is solved by distributing the load across different machines. I.e. instead of trying to locate the linked list index, you should figure out the index of the machine onto which the relevant part of the map has been loaded, and get the value from that machine (each machine loads one part of the database map, and each lookup goes to the machine that holds the appropriate part).
Maybe I am way off here but I also suspect you are using a 32bit machine.
So if you have to stay using a one machine architecture and it is not economically possible to improve your hardware (64-bit machine and more RAM or SSD as you point out) I don't think that you can make any dramatic improvement.
I don't really understand in what form you are storing the data on disk. If what you are storing consists of URLs and some numbers, you might be able to speed up loading from disk quite a bit by compressing the data (unless you are already doing that).
Creating a multithreaded loader that decompresses while loading might give you quite a big boost.
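A rough sketch of such a loader in Java follows. It assumes the data is a gzip file with one "key<TAB>number" pair per line; that file format, the class name CompressedLoader and the queue size are all assumptions made for illustration. One thread decompresses and splits lines while the calling thread parses them and fills the map.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.zip.GZIPInputStream;

final class CompressedLoader {
    // Unique marker object, compared by identity, so no real line can be mistaken for it.
    private static final String END_OF_INPUT = new String("END");

    static Map<String, Long> load(Path gzFile) throws IOException, InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
        AtomicReference<Exception> failure = new AtomicReference<>();

        // Reader thread: decompresses the file and hands complete lines to the consumer.
        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
                for (String line = in.readLine(); line != null; line = in.readLine()) {
                    queue.put(line);
                }
            } catch (IOException | InterruptedException e) {
                failure.set(e);
            } finally {
                try {
                    queue.put(END_OF_INPUT); // always tell the consumer to stop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        reader.setDaemon(true); // do not keep the JVM alive if the consumer fails
        reader.start();

        // Consumer (calling thread): parses lines and inserts them into the map.
        Map<String, Long> map = new HashMap<>();
        for (String line = queue.take(); line != END_OF_INPUT; line = queue.take()) {
            int tab = line.indexOf('\t');
            map.put(line.substring(0, tab), Long.parseLong(line.substring(tab + 1)));
        }
        reader.join();
        if (failure.get() != null) {
            throw new IOException("loading failed", failure.get());
        }
        return map;
    }
}
```

Whether this actually helps depends on where the time goes: it only pays off if decompression and parsing are a noticeable share of the load time compared with raw disk reads.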
I'm looking to implement a B-tree (in Java) for a "one use" index where a few million keys are inserted, and queries are then made a handful of times for each key. The keys are <= 40 byte ascii strings, and the associated data always takes up 6 bytes. The B-tree structure has been chosen because my memory budget does not allow me to keep the entire temporary index in memory.
My issue is about the practical details in choosing a branching factor and storing nodes on disk. It seems to me that there are two approaches:
One node always fits within one block. Achieved by choosing a branching factor k so that even for the worst-case key length the storage requirement for keys, data and control structures is <= the system block size. k is likely to be low, and nodes will in most cases have a lot of empty room.
One node can be stored on multiple blocks. Branching factor is chosen independent of key size. Loading a single node may require that multiple blocks are loaded.
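For the first approach, the worst-case branching factor can be estimated with a little arithmetic. The sketch below assumes a node with k children stores k - 1 key/data entries plus k child pointers after a fixed header; the 16-byte header, the 8-byte pointers and the one length byte per key are assumptions of mine, not part of the problem statement.

```java
final class BranchingFactor {
    /**
     * Largest k such that
     *   headerSize + (k - 1) * (maxKeySize + valueSize) + k * pointerSize <= blockSize.
     */
    static int maxChildren(int blockSize, int headerSize, int maxKeySize,
                           int valueSize, int pointerSize) {
        int entrySize = maxKeySize + valueSize;
        // Solving the inequality for k gives the integer division below.
        return (blockSize - headerSize + entrySize) / (entrySize + pointerSize);
    }
}
```

With keys of at most 40 bytes plus one length byte (41), 6 bytes of data, 8-byte pointers and a 16-byte header, maxChildren(4096, 16, 41, 6, 8) returns 75 and maxChildren(512, 16, 41, 6, 8) returns 9.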
The questions are then:
Is the second approach what is usually used for variable-length keys? or is there some completely different approach I have missed?
Given my use case, would you recommend a different overall solution?
I should in closing mention that I'm aware of the jdbm3 project, and am considering using it. I will attempt to implement my own in any case, both as a learning exercise and to see if case-specific optimization can yield better performance.
Edit: Reading about SB-Trees at the moment:
S(b)-Trees
Algorithms and Data Structures for External Memory
I'm missing option C here:
At least two tuples always fit into one block, and the block size is chosen accordingly. Blocks are filled up with as many key/value pairs as possible, which means the branching factor is variable. If the block size is much greater than the average size of a (key, value) tuple, the wasted space will be very low. Since the optimal IO size for disks is usually 4k or greater and you have a maximum tuple size of 46 bytes, this is automatically true in your case.
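Option C can be sketched as a page-filling routine. This is only an illustration under assumed conventions: a 4096-byte page, a 2-byte entry count at the start of the page, and each entry stored as one length byte, the ASCII key, then the 6 data bytes. The class name PagePacker is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

final class PagePacker {
    static final int PAGE_SIZE = 4096;
    static final int VALUE_SIZE = 6;

    /**
     * Fills the page with entries starting at index 'from' until the next one
     * would not fit, and returns how many entries were written.
     */
    static int fillPage(ByteBuffer page, List<Map.Entry<String, byte[]>> entries, int from) {
        page.clear();
        page.putShort((short) 0); // placeholder for the entry count
        int i = from;
        while (i < entries.size()) {
            byte[] key = entries.get(i).getKey().getBytes(StandardCharsets.US_ASCII);
            int needed = 1 + key.length + VALUE_SIZE;
            if (page.remaining() < needed) {
                break; // page is full, the caller continues with a new page
            }
            page.put((byte) key.length).put(key).put(entries.get(i).getValue(), 0, VALUE_SIZE);
            i++;
        }
        page.putShort(0, (short) (i - from)); // patch in the real count
        return i - from;
    }
}
```

The number of entries per page then depends on the actual key lengths: short keys give a higher branching factor, and even worst-case 40-byte keys leave only a few unused bytes per page.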
And for all options you have some variants: B* or B+ Trees (see Wikipedia).
JDBM BTree is already self-balancing. It also has defragmentation, which is very fast and solves all the problems described above.
One node can be stored on multiple blocks. Branching factor is chosen independent of key size. Loading a single node may require that multiple blocks are loaded.
Not necessarily. JDBM3 uses mapped memory, so it never reads a full block from disk into memory. It creates 'a view' on top of the block and only reads partial data as actually needed. So instead of reading a full 4KB block, it may read just 2x128 bytes. This depends on the underlying OS block size.
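The general mechanism, independent of JDBM3's actual code, looks roughly like this with plain java.nio. The class and method names are hypothetical; the sketch maps one block and copies out only a small slice, leaving it to the OS to decide how much is physically read.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedView {
    /** Copies 'length' bytes at 'offset' inside block 'blockIndex' via a memory-mapped view. */
    static byte[] readSlice(Path file, int blockSize, long blockIndex, int offset, int length)
            throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping is a view on the file; nothing is copied until bytes are accessed.
            MappedByteBuffer block = channel.map(
                    FileChannel.MapMode.READ_ONLY, blockIndex * blockSize, blockSize);
            byte[] slice = new byte[length];
            block.position(offset);
            block.get(slice);
            return slice;
        }
    }
}
```

A real implementation would keep the mapping open across lookups instead of creating one per call.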
Is the second approach what is usually used for variable-length keys? or is there some completely different approach I have missed?
I think you missed the point that increasing the size on disk decreases performance, as more data has to be read. A single tree can also combine both approaches (the first for newly inserted nodes, the second after defragmentation).
Anyway, a flat file with a memory-mapped buffer is probably best for your problem, since you have a fixed record size and just a few million records.
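A minimal sketch of the flat-file idea, assuming the file holds records sorted by key, each record being the key padded with zero bytes to 40 bytes followed by the 6 data bytes. The layout, the class name FlatIndex and its methods are assumptions for illustration; building and sorting the file is not shown.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class FlatIndex {
    static final int KEY_SIZE = 40;
    static final int VALUE_SIZE = 6;
    static final int RECORD_SIZE = KEY_SIZE + VALUE_SIZE;

    /** Maps the whole file read-only; the mapping stays valid after the channel is closed. */
    static MappedByteBuffer open(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Pads (or truncates) an ASCII key to exactly KEY_SIZE bytes. */
    static byte[] padKey(String key) {
        return Arrays.copyOf(key.getBytes(StandardCharsets.US_ASCII), KEY_SIZE);
    }

    /** Binary search over the sorted records; returns the 6 data bytes, or null if absent. */
    static byte[] lookup(ByteBuffer records, String key) {
        byte[] wanted = padKey(key);
        byte[] current = new byte[KEY_SIZE];
        int lo = 0;
        int hi = records.capacity() / RECORD_SIZE - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            records.position(mid * RECORD_SIZE);
            records.get(current);
            int cmp = compareUnsigned(current, wanted);
            if (cmp == 0) {
                byte[] value = new byte[VALUE_SIZE];
                records.get(value); // the data bytes directly follow the key
                return value;
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null;
    }

    private static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < KEY_SIZE; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }
}
```

Since lookup moves the buffer's position, concurrent readers should each work on their own duplicate() of the buffer.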
Also have a look at leveldb. It has a new Java port which almost beats JDBM:
https://github.com/dain/leveldb
http://code.google.com/p/leveldb/
You could avoid this hassle if you use an embedded database. Those have already solved these problems, and some more, for you.
You also write: "a few million keys" ... "[max] 40 byte ascii strings" and "6 bytes [associated data]". This does not add up. At 46 bytes per entry, one gig of RAM holds roughly 23 million entries of raw data (before any per-object overhead), which is more than "a few million".