I'm working on a project that requires storing (potentially) millions of key-value mappings and making (potentially) 100s of queries a second. There are some checks I can do around the data I'm working with, but they will only reduce the load by a bit. In addition, I will be making (potentially) 100s of puts/removes a second, so my question is: Is there a map sufficient for this task? Is there any way I might optimize the map? Is there something faster that would work for storing key-value mappings?
Some additional information:
- The key will be a point in 3D space. I feel like this means I could use arrays, but the arrays would have to be massive
- The value must be an object
Any help would be greatly appreciated!
Back-of-envelope estimates help in coming to terms with this sort of thing. If you have millions of entries in a map, let's say 32M, and a key is a 3D point (so 3 ints -> 3 * 4B -> 12 bytes), then 12B * 32M = 384MB. You didn't mention the size of the value, but assuming you have a similarly sized value, let's double that figure. This is Java, so assuming a 64-bit platform with compressed OOPs, which is the default and what most people are on, you pay an extra 12B of object header per object. So: 32M * 2 * 24B = 1536MB.
Now if you use a HashMap, each entry requires an extra HashMap.Node; in Java 8 on the platform above you are looking at 32B per Node (use OpenJDK JOL to find out object sizes). That brings us to 2560MB. Also throw in the cost of the HashMap array: with 32M entries you are looking at a table with 64M slots (because the array size is a power of 2 and you need some slack beyond your entries), so that's an extra 256MB. All together, let's round it up to 3GB?
Most servers these days have quite large amounts of memory (10s to 100s of GB) and adding an extra 3GB to the JVM live set should not scare you. You might consider it disappointing that the overhead exceeds the data in your case, but this is not your emotional well being, it's a question of will it work ;-)
Now that you've loaded up the data, you are mutating it at a rate of 100s of inserts/deletes per second, let's say 1024. Reusing the quantities above, we can sum it up as: 1024 * (24*2 + 32) = 80KB. Churning 80KB of garbage per second is small change for many applications, and not something you necessarily need to sweat about. To put it in context, a JVM will contend with collecting many 100s of MB of Young Generation in a matter of 10s of milliseconds these days.
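To make the arithmetic above easy to re-run with your own numbers, here is a minimal sketch. The class name and constants are hypothetical; the sizes are the assumptions from this answer (12B header, 12B payload, 32B per HashMap.Node, 4B compressed references), not measurements.

```java
// Back-of-envelope figures only; every size here is an assumption for a
// 64-bit JVM with compressed OOPs, not something measured with JOL.
final class MapFootprint {
    static final long MB = 1L << 20;
    static final long ENTRIES = 32L << 20;    // "32M" entries
    static final int OBJECT_BYTES = 12 + 12;  // 12B header + 12B payload (3 ints)
    static final int NODE_BYTES = 32;         // one HashMap.Node per entry
    static final int REF_BYTES = 4;           // compressed reference

    /** Keys + values + nodes + hash table array, in MB. */
    static long liveSetMb() {
        long keysAndValues = ENTRIES * 2 * OBJECT_BYTES;
        long nodes = ENTRIES * NODE_BYTES;
        long table = 2 * ENTRIES * REF_BYTES; // table with 64M slots
        return (keysAndValues + nodes + table) / MB;
    }

    /** Bytes of garbage per second for the given number of puts/removes. */
    static long churnBytesPerSecond(int opsPerSecond) {
        return (long) opsPerSecond * (2 * OBJECT_BYTES + NODE_BYTES);
    }

    public static void main(String[] args) {
        System.out.println(liveSetMb() + " MB live set");
        System.out.println(churnBytesPerSecond(1024) / 1024 + " KB/s churn");
    }
}
```

Swap in the object sizes JOL reports for your actual key and value classes to get a figure that applies to your data.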
So, in summary, if all you need is to load the data and query/mutate it along the lines you describe you might just find that a modern server can easily contend with a vanilla solution. I'd recommend you give that a go, maybe prototype with some representative data set, and see how it works out. If you have an issue you can always find more exotic/efficient solutions.
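A prototype of the vanilla approach could look roughly like this. `Point3` and `VanillaPrototype` are made-up names, the value is just a placeholder string, and the hash function is a plain 31-multiplier mix that you may want to tune for your coordinate distribution.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable 3D key; equals/hashCode are what HashMap relies on.
final class Point3 {
    final int x, y, z;

    Point3(int x, int y, int z) { this.x = x; this.y = y; this.z = z; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Point3)) return false;
        Point3 p = (Point3) o;
        return x == p.x && y == p.y && z == p.z;
    }

    @Override public int hashCode() {
        return 31 * (31 * x + y) + z;
    }
}

final class VanillaPrototype {
    /** Inserts n points, removes every other one, returns the remaining size. */
    static int run(int n) {
        Map<Point3, Object> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new Point3(i, i * 7, -i), "value" + i);
        }
        for (int i = 0; i < n; i += 2) {
            map.remove(new Point3(i, i * 7, -i));
        }
        return map.size();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int left = run(1_000_000);
        System.out.println(left + " entries left after "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```

Timing a single run like this says little about steady-state behaviour; feed it a representative data set and watch GC logs before drawing conclusions.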
For some time I have been working on creating an index for very large data sets (around 190 million records). I have a BTree which can insert data sets (typically an object) and search for a key, and while I was searching for how to persist the data into files on disk, I came across this amazing article (http://www.javaworld.com/article/2076333/java-web-development/use-a-randomaccessfile-to-build-a-low-level-database.html#resources). This pretty much gives me the starting point.
Here they are indexing String key to binary object (blob). They have the file format where they have divided it into 3 regions, header(stores start point of indexes), index(stores index and its corresponding location) and data region (stores data). They are using RandomAccessFile to get the data.
How do I define a similar file format for a B-tree? All I know is that for every read made to disk, I have to get one node (typically one block of 512 bytes). There are many similar questions on how to persist a B-tree, but it is a little difficult to understand the big picture of why a particular design was chosen, as in this question (Persisting B-Tree nodes to RandomAccessFile -[SOLVED]). Please share your thoughts.
Here is an alternative take on the question, based on problem specifics that have become known in the meantime. This post is based on the following assumptions:
record count about 190 million, fixed
keys are 64-byte hashes, like SHA-512
values are filenames: variable length, but sensible (average length < 64 bytes, max < page)
page size 4 KiByte
Efficient representation of filenames in a database is a different topic that cannot be addressed here. Should the filenames be awkward - longish on average and/or Unicode - then the hashing solution will punish you with increased disk read counts (more overflows, more chaining) or reduced average occupancy (more wasted space). A B-tree solution reacts somewhat more benignly, though, since an optimum tree can be constructed in any case.
The most efficient solution in this situation - and the simplest to implement by a wide margin - is hashing, since your keys are perfect hashes already. Take the first 23 bits of the hash as the page number, and lay out the pages like this:
page header
    uint32_t next_page
    uint16_t key_count
key/offset vector
    uint16_t value_offset;
    byte key[64];
... unallocated space ...
last arrived filename
...
2nd arrived filename
1st arrived filename
Values (filenames) are stored from the end of the page downwards, prefixed with their 16-bit length, and the key/offset vector grows upwards. That way neither low/high key counts nor short/long values can cause unnecessary waste of space, as would be the case with fixed-size structures. Nor do you have to parse variable-length structures during key searches. Apart from that I've aimed for the greatest possible simplicity - no premature optimisation. The bottom of the heap can be stored in the page header, in KO.[PH.key_count].value_offset (my preference), or computed as KO.Take(PH.key_count).Select(r => r.value_offset).Min(), whatever pleases you most.
The key/offset vector needs to be kept sorted on the keys so that you can use binary search but the values can be written as they arrive, they do not need to be in any particular order. If the page overflows, allocate a new one just like it at the current end of the file (growing the file by one page) and stash its page number in the appropriate header slot. This means that you can binary search within a page but all chained pages need to be read and searched one by one. Also, you do not need any kind of file header, since the file size is otherwise available and that's the only piece of global management information that needs to be maintained.
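The in-page handling described above could be sketched like this. It is only an illustration under several assumptions of mine: big-endian fields (the `ByteBuffer` default), a 6-byte header, UTF-8 filenames, and the heap bottom computed as the minimum offset rather than stored. `HashPage` and its method names are hypothetical, and following `next_page` to chained pages is left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class HashPage {
    static final int PAGE_SIZE = 4096;
    static final int HEADER = 6;            // uint32 next_page + uint16 key_count
    static final int KEY_LEN = 64;
    static final int ENTRY = 2 + KEY_LEN;   // uint16 value_offset + key bytes

    static int keyCount(ByteBuffer p) { return p.getShort(4) & 0xFFFF; }

    /** Unsigned byte-wise comparison of the stored key at index with key. */
    static int compareKeyAt(ByteBuffer p, int index, byte[] key) {
        int base = HEADER + index * ENTRY + 2;
        for (int i = 0; i < KEY_LEN; i++) {
            int diff = (p.get(base + i) & 0xFF) - (key[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return 0;
    }

    /** Binary search in the sorted vector: index, or -(insertionPoint) - 1. */
    static int find(ByteBuffer p, byte[] key) {
        int lo = 0, hi = keyCount(p) - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareKeyAt(p, mid, key);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -(lo + 1);
    }

    /** Returns the filename for key, or null if it is not in this page. */
    static String lookup(ByteBuffer p, byte[] key) {
        int idx = find(p, key);
        if (idx < 0) return null;
        int off = p.getShort(HEADER + idx * ENTRY) & 0xFFFF;
        byte[] value = new byte[p.getShort(off) & 0xFFFF]; // 16-bit length prefix
        for (int i = 0; i < value.length; i++) value[i] = p.get(off + 2 + i);
        return new String(value, StandardCharsets.UTF_8);
    }

    /** Inserts a record; false if the key exists or the page would overflow. */
    static boolean insert(ByteBuffer p, byte[] key, String filename) {
        byte[] value = filename.getBytes(StandardCharsets.UTF_8);
        int count = keyCount(p);
        int bottom = PAGE_SIZE;                            // bottom of the value heap
        for (int i = 0; i < count; i++) {
            bottom = Math.min(bottom, p.getShort(HEADER + i * ENTRY) & 0xFFFF);
        }
        int newBottom = bottom - 2 - value.length;
        int idx = find(p, key);
        if (idx >= 0 || HEADER + (count + 1) * ENTRY > newBottom) return false;
        idx = -(idx + 1);
        // Shift the tail of the vector up by one entry, back to front.
        for (int i = HEADER + count * ENTRY - 1; i >= HEADER + idx * ENTRY; i--) {
            p.put(i + ENTRY, p.get(i));
        }
        int entry = HEADER + idx * ENTRY;
        p.putShort(entry, (short) newBottom);
        for (int i = 0; i < KEY_LEN; i++) p.put(entry + 2 + i, key[i]);
        p.putShort(newBottom, (short) value.length);
        for (int i = 0; i < value.length; i++) p.put(newBottom + 2 + i, value[i]);
        p.putShort(4, (short) (count + 1));
        return true;
    }
}
```

When `insert` returns false because the page is full, that is the point where you would allocate an overflow page at the end of the file and store its number in `next_page`.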
Create the file as a sparse file with the number of pages as indicated by your chosen number of hash key bits (e.g. 8388608 pages for 23 bits). Empty pages in a sparse file don't take up any disk space and read as all 0s, which works perfectly fine with our page layout/semantics. Extend the file by one page whenever you need to allocate an overflow page. Note: the 'sparse file' thing isn't very important here since almost all pages will have been written to when you're done building the file.
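Mapping a hash to its primary page is a few lines. A minimal sketch, assuming the leading hash bytes are read in big-endian order and the page number is simply multiplied by the page size to get the file offset; `PageAddress` is a hypothetical name.

```java
final class PageAddress {
    static final int PAGE_SIZE = 4096;

    /** Takes the first `bits` bits (1..32) of the hash as the page number. */
    static long pageNumber(byte[] hash, int bits) {
        long first32 = ((hash[0] & 0xFFL) << 24) | ((hash[1] & 0xFFL) << 16)
                     | ((hash[2] & 0xFFL) << 8)  |  (hash[3] & 0xFFL);
        return first32 >>> (32 - bits);
    }

    /** Byte offset of the primary page for this hash. */
    static long fileOffset(byte[] hash, int bits) {
        return pageNumber(hash, bits) * PAGE_SIZE;
    }

    public static void main(String[] args) {
        byte[] hash = new byte[64];
        hash[0] = (byte) 0x80;
        System.out.println("page " + pageNumber(hash, 23)
                + " at offset " + fileOffset(hash, 23));
    }
}
```

With 23 bits the page numbers run from 0 to 8388607, so the primary area of the file is 8388608 * 4096 bytes before any overflow pages are appended.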
For maximum efficiency you need to run some analyses on your data. In my simulation - with random numbers as stand-ins for the hashes, and on the assumption that average filename size is 62 bytes or less - the optimum turned out to be making 2^23 = 8388608 buckets/pages. This means that you take the first 23 bits of the hash as the page number to load. Here are the details:
# bucket statistics for K = 23 and N = 190000000 ... 7336,5 ms
average occupancy 22,6 records
0 empty buckets (min: 3 records)
310101/8388608 buckets with 32+ records (3,7%)
That keeps the chaining to a minimum, on average you need to read just 1.04 pages per search. Increasing the hash key size by one single bit to 24 reduces the expected number of overflowing pages to 3 but doubles the file size and reduces average occupancy to 11.3 records per page/bucket. Reducing the key to 22 bits means that almost all pages (98.4%) can be expected to overflow - meaning the file is virtually the same size as that for 23 bits but you have to do twice as many disk reads per search.
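If you don't want to simulate 190 million random keys just to get a first impression, a Poisson approximation gets you in the right neighbourhood. This is my own shortcut, not the simulation used for the statistics above: it assumes uniformly distributed hashes and treats a bucket with 32 or more records as overflowing (the threshold used in the statistics). `BucketOverflow` is a hypothetical name.

```java
final class BucketOverflow {
    /** P(X >= threshold) for X ~ Poisson(lambda). */
    static double tail(double lambda, int threshold) {
        double pmf = Math.exp(-lambda);   // P(X = 0)
        double cdf = 0.0;
        for (int k = 0; k < threshold; k++) {
            cdf += pmf;                   // add P(X = k)
            pmf *= lambda / (k + 1);      // advance to P(X = k + 1)
        }
        return 1.0 - cdf;
    }

    /** Expected fraction of buckets holding at least `threshold` records. */
    static double overflowFraction(long records, int hashBits, int threshold) {
        double lambda = (double) records / (1L << hashBits);
        return tail(lambda, threshold);
    }

    public static void main(String[] args) {
        for (int bits = 22; bits <= 24; bits++) {
            System.out.printf("K = %d: average occupancy %.1f, overflow fraction %.6f%n",
                    bits, 190_000_000.0 / (1L << bits),
                    overflowFraction(190_000_000L, bits, 32));
        }
    }
}
```

For 23 bits this should land close to the 3.7% from the simulation. It ignores real filename lengths entirely, so it is no substitute for the analysis on actual data.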
Hence you see how important it is to run a detailed analysis on the data to decide on the proper number of bits to use for hash addressing. You should run an analysis that uses the actual filename sizes and tracks the per-page overhead, to see what the actual picture looks like for 22 bits to 24 bits. It'll take a while to run but that's still way faster than building a multi-gigabyte file blindly and then finding that you have wasted 70% of space or that searches take significantly more than 1.05 page reads on average.
Any B-tree based solution would be much more involved (read: complicated) but could not reduce the page read count per search below 1.000, for obvious reasons, and even that only on the assumption that a sufficient number of internal nodes can be kept cached in memory. If your system has such humongous amounts of RAM that data pages can be cached to a significant degree then the hashing solution will benefit just as much as one that is based on some kind of B-tree.
As much as I would like an excuse for building a screamingly fast hybrid radix/B+tree, the hashing solution delivers essentially the same performance for a tiny fraction of the effort. The only thing where B-treeish solutions can outdo hashing here is space efficiency, since it is trivial to construct an optimum tree for existing pre-sorted data.
There are plenty of open-source key/value stores and full database engines - take a week off and start Googling. Even if you end up using none of them, you still need to study a representative cross section (architecture, design histories, key implementation details) to get enough of an overview of the subject matter so that you can make informed decisions and ask intelligent questions. For a brief overview, try to Google details on index file formats, both historic ones like IDX or NTX, and current ones used in various database engines.
If you want to roll your own then you might consider hitching yourself to the bandwagon of an existing format, like the dBASE variants Clipper and Visual FoxPro (my favourite). This gives you the ability to work your data with existing tools, including Total Commander plugins and whatnot. You don't need to support the full formats, just the single binary instance of the format that you choose for your project. Great for debugging, reindexing, ad hoc queries and so on. The format itself is dead simple and easy to generate even if you don't use any of the existing libraries. The index file formats aren't quite as trivial but still manageable.
If you want to roll your own from scratch then you've got quite a road ahead of you, since the basics of intra-node (intra-page) design and practice are poorly represented on the Internet and in literature. For example, some old DDJ issues contained articles about efficient key matching in connection with prefix truncation (a.k.a. 'prefix compression') and so on but I found nothing comparable out there on the 'net at the moment, except buried deeply in some research papers or source code repositories.
The single most important item here is the algorithm for searching prefix-truncated keys efficiently. Once you've got that, the rest more or less falls into place. I have found only one resource on the 'net, which is this DDJ (Dr Dobb's Journal) article:
Supercharging Sequential Searches by Walter Williams
A lot of tricks can also be gleaned from papers like
Efficient index compression in DB2 LUW
For more details and pretty much everything else you could do a lot worse than reading the following two books cover to cover (both of them!):
Goetz Graefe: Modern B-Tree Techniques (ISBN 1601984820)
Jim Gray: Transaction Processing. Concepts and Techniques (ISBN 1558601902)
An alternative to the latter might be
Philip E. Bernstein: Principles of Transaction Processing (ISBN 1558606238)
It covers a similar spectrum and it seems to be a bit more hands-on, but it does not seem to have quite the same depth. I cannot say for certain, though (I've ordered a copy but haven't got it yet).
These books give you a complete overview over all that's involved, and they are virtually free of fat - i.e. you need to know almost everything that's in there. They will answer gazillions of questions that you didn't know you had, or that you should have asked yourself. And they cover the whole ground - from B-tree (and B+tree) basics to detailed implementation issues like concurrency, locking, page replacement strategies and so forth. And they enable you to utilise the information that is scattered over the 'net, like articles, papers, implementation notes and source code.
Having said that, I'd recommend matching the node size to the architecture's RAM page size (4 KB or 8 KB), because then you can utilise the paging infrastructure of your OS instead of running afoul of it. And you're probably better off keeping index and blob data in separate files. Otherwise you couldn't put them on different volumes and the data would b0rken the caching of the index pages in subsystems that are not part of your program (hardware, OS and so forth).
I'd definitely go with a B+tree structure instead of watering down the index pages with data as in a normal B-tree. I'd also recommend using an indirection vector (Graefe has some interesting details there) in connection with length-prefixed keys. Treat the keys as raw bytes and keep all the collation/normalisation/upper-lower nonsense out of your core engine. Users can feed you UTF8 if they want - you don't want to have to care about that, trust me.
There is something to be said for using only suffix truncation in internal nodes (i.e. for distinguishing between 'John Smith' and 'Lucky Luke', 'K' or 'L' work just as well as the given keys) and only prefix truncation in leaves (i.e. instead of 'John Smith' and 'John Smythe' you store 'John Smith' and 7+'ythe').
It simplifies the implementation, and gives you most of the bang that could be got. I.e. shared prefixes tend to be very common at the leaf level (between neighbouring records in index order) but not so much in internal nodes, i.e. at higher index levels. Conversely, the leaves need to store the full keys anyway and so there's nothing to truncate and throw away there, but internal nodes only need to route traffic and you can fit a lot more truncated keys in a page than non-truncated ones.
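To make the two kinds of truncation concrete, here is a small sketch using strings instead of raw bytes for readability. The names are hypothetical, and the "7+ythe" text form is only for display; a real page would store the shared length as a binary field.

```java
import java.util.ArrayList;
import java.util.List;

final class KeyTruncation {
    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    /** Leaf-style prefix truncation relative to the preceding key. */
    static List<String> prefixTruncate(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int shared = commonPrefixLength(prev, key);
            out.add(shared == 0 ? key : shared + "+" + key.substring(shared));
            prev = key;
        }
        return out;
    }

    /**
     * Internal-node suffix truncation: a shortest separator s with
     * left < s <= right. Requires left < right.
     */
    static String separator(String left, String right) {
        int shared = commonPrefixLength(left, right);
        return right.substring(0, Math.min(right.length(), shared + 1));
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add("John Smith");
        keys.add("John Smythe");
        System.out.println(prefixTruncate(keys));
        System.out.println(separator("John Smith", "Lucky Luke"));
    }
}
```

`separator` always picks a prefix of the right-hand key, which gives 'L' for the example above; 'K' would route just as well but needs a little more logic to derive.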
Key matching against a page full of prefix-truncated keys is extremely efficient - on average you compare a lot less than one character per key - but it's still a linear scan, even with all the hopping forward based on skip counts. This limits effective page sizes somewhat, since binary search is more complicated in the face of truncated keys. Graefe has a lot of details on that. One workaround for enabling bigger node sizes (many thousands of keys instead of hundreds) is to lay out the node like a mini B-tree with two or three levels. It can make things lightning-fast (especially if you respect magic thresholds like 64-byte cache line size), but it also makes the code hugely more complicated.
I'd go with a simple lean and mean design (similar in scope to IDA's key/value store), or use an existing product/library, unless you are in search of a new hobby...
I am looking around for the best algorithms for the bitset operations like intersection and union, and found a lot of links and similar questions also.
Eg: Similar Question on Stack-Overflow
One thing I am trying to understand, however, is where bit sets fit into this. E.g., Lucene uses BitSet operations to get high-performance set operations, especially because it can work at a lower level.
However, it looks to me as if a bit set will get slower and slower as the maximum number of elements increases while the set stays sparse, say a set with ~10 elements where the maximum number of elements can be 2 billion, because that will call for unnecessary matching. What do you suggest?
Bit Sets indeed make sense for dense sets, i.e. covering a significant fraction of the domain, as they represent every possible element. The space and running time requirements are O(D) [D = domain size = 2 billion !].
Sorted Set operations represent only the elements in the given set and will have an O(E) behavior [E = number of elements = 10], much more appropriate.
Bit Sets are fast, they are not efficient. I mean their hidden constant is smaller. They are blazingly fast for small sets (say D <= 1024) as they can process 32/64 elements in a single CPU instruction.
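The two approaches side by side, as a minimal sketch (`SetOps` is a hypothetical name). The sorted-array merge only ever looks at the elements that are present, whereas `java.util.BitSet.and` walks the underlying words up to the highest set bit.

```java
import java.util.Arrays;
import java.util.BitSet;

final class SetOps {
    /** Intersection of two sorted, duplicate-free arrays in O(a.length + b.length). */
    static int[] intersectSorted(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, n);
    }

    /** The same operation on bit sets; the inputs are left untouched. */
    static BitSet intersectBits(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        int[] a = {1, 5, 9, 2_000_000_000};
        int[] b = {5, 9, 11};
        System.out.println(Arrays.toString(intersectSorted(a, b)));
    }
}
```

Note that putting the element 2000000000 into a `BitSet` would allocate roughly 250 MB for a single bit, which is exactly the sparse case the question worries about.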
For sparse bitsets you can greatly improve performance (and reduce memory usage) using sparse bitmaps where you divide your data into chunks as opposed to storing everything under a single key.
When using bitmaps for analytics, you have a limited number of users active at any given time (e.g. day) and sparse bitmaps use this fact to their advantage.
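The chunking idea is not tied to Redis or Ruby. Here is a rough in-memory sketch in Java, assuming non-negative indexes and an arbitrary chunk size of 65536 bits; `ChunkedBitmap` is a made-up class for illustration, not how redis-bitops is implemented.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

final class ChunkedBitmap {
    static final int CHUNK_BITS = 1 << 16;   // hypothetical chunk size
    private final Map<Long, BitSet> chunks = new HashMap<>();

    /** Sets a bit; only the chunk containing it is materialised. */
    void set(long index) {
        chunks.computeIfAbsent(index / CHUNK_BITS, k -> new BitSet())
              .set((int) (index % CHUNK_BITS));
    }

    boolean get(long index) {
        BitSet chunk = chunks.get(index / CHUNK_BITS);
        return chunk != null && chunk.get((int) (index % CHUNK_BITS));
    }

    /** Intersection that only touches chunks present in both bitmaps. */
    ChunkedBitmap and(ChunkedBitmap other) {
        ChunkedBitmap out = new ChunkedBitmap();
        for (Map.Entry<Long, BitSet> e : chunks.entrySet()) {
            BitSet theirs = other.chunks.get(e.getKey());
            if (theirs == null) continue;
            BitSet both = (BitSet) e.getValue().clone();
            both.and(theirs);
            if (!both.isEmpty()) out.chunks.put(e.getKey(), both);
        }
        return out;
    }

    int chunkCount() { return chunks.size(); }
}
```

A set with 10 elements spread over a 2 billion domain then costs at most 10 small chunks instead of one 250 MB bit array.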
Shameless plug: http://github.com/bilus/redis-bitops (if you're using Ruby but there are also performance notes there).
In some previous posts I asked some questions about coding a custom hash map/table in Java. As I couldn't solve it, and maybe I forgot to properly mention what I really want, I am summarizing all of them here to make it clear and precise.
What I am going to do:
I am trying to code for our server in which I have to find users access type by URL.
Now, I have 1110 millions of URLs (approx).
So, what we did,
1) Divided the database on 10 parts each of 110 millions of Urls.
2) Building a HashMap using parallel array whose key are URL's one part (represented as LONG) and values are URL's other part (represented as INT) - key can have multiple values.
3) Then search the HashMap for some other URLs (millions of URLs saved in one day) per day at the beginning when system starts.
What you have Tried:
1) I have tried many NoSQL databases, however we found not so good for our purpose.
2) I have built our custom hashmap (using two parallel arrays) for that purpose.
So, what the issue is:
When the system starts we have to load our hashtable of each database and perform search for million of url:
Now, issue is,
1) Though the HashTable performance is quite nice, the code takes a long time to load the HashTable (we are using a FileChannel and a memory-mapped buffer to load it, which takes 20 seconds per HashTable - 220 million entries, as the load factor is 0.5; we found that to be the fastest)
So, we are spending: (HashTable Load + HashTable Search) * No. of DB = (20 + 5) * 10 = 250 seconds. That is quite expensive for us, and most of the time (200 out of 250 sec) goes to loading the hashtables.
Have you think any-other way:
One way can be:
Not worrying about loading and storing, and leaving caching to the operating system by using a memory-mapped buffer. But, as I have to search for millions of keys, it gives worse performance than the above.
As we found HashTable performance is nice but loading time is high, we thought to cut it off in another way like:
1) Create an array of Linked Lists of the size Integer_MAX (my own custom linked list).
2) Insert values (int's) to the Linked Lists whose number is key number (we reduce the key size to INT).
3) So, we have to store only the linked lists to the disks.
Now, issue is, it is taking lots of time to create such amount of Linked Lists and creating such large amount of Linked Lists has no meaning if data is not well distributed.
So, What is your requirements:
Simply my requirements:
1) Key with multiple values insertion and searching. Looking for nice searching performance.
2) Fast way to load (specially) into memory.
(keys are 64 bit INT and Values are 32 bit INT, one key can have at most 2-3 values. We can make our key 32 bit also but will give more collisions, but acceptable for us, if we can make it better).
Can anyone help me, how to solve this or any comment how to solve this issue ?
Thanks.
NB:
1) As per previous suggestions of Stack Overflow, Pre-read data for disk caching is not possible because when system starts our application will start working and on next day when system starts.
2) We have not found NoSQL db's are scaling well as our requirements are simple (means just insert hashtable key value and load and search (retrieve values)).
3) As our application is a part of small project and to be applied on a small campus, I don't think anybody will buy me a SSD disk for that. That is my limitation.
4) We use Guava/ Trove also but they are not able to store such large amount of data in 16 GB also (we are using 32 GB ubuntu server.)
If you need quick access to 1110 million data items then hashing is the way to go. But dont reinvent the wheel, use something like:
memcacheDB: http://memcachedb.org
MongoDB: http://www.mongodb.org
Cassandra: http://cassandra.apache.org
It seems to me (if I understand your problem correctly) that you are trying to approach the problem in a convoluted manner.
I mean the data you are trying to pre-load are huge to begin with (let's say 220 Million * 64 ~ 14GB). And you are trying to memory-map etc for this.
I think this is a typical problem that is solved by distributing the load in different machines. I.e. instead of trying to locate the linked list index you should be trying to figure out the index of the appropriate machine that a specific part of the map has been loaded and get the value from that machine from there (each machine has loaded part of this database map and you get the data from the appropriate part of the map i.e. machine each time).
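Picking the machine for a key can be as simple as the following sketch. `Sharding` and `shardFor` are hypothetical names, and a plain modulo scheme like this assumes the number of machines stays fixed.

```java
final class Sharding {
    /** Maps a 64-bit key to a shard index in the range 0..machines-1. */
    static int shardFor(long key, int machines) {
        return Math.floorMod(Long.hashCode(key), machines);
    }

    public static void main(String[] args) {
        System.out.println(shardFor(123456789012345L, 10));
    }
}
```

`Math.floorMod` keeps the result non-negative even when the hash code is negative.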
Maybe I am way off here but I also suspect you are using a 32bit machine.
So if you have to stay using a one machine architecture and it is not economically possible to improve your hardware (64-bit machine and more RAM or SSD as you point out) I don't think that you can make any dramatic improvement.
I don't really understand in what form you are storing the data on disk. If what you are storing consists of urls and some numbers, you might be able to speed up loading from disk quite a bit by compressing the data (unless you are already doing that).
Creating a multithreaded loader that decompresses while loading might be able to give you quite a big boost.
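As a starting point, here is a single-threaded sketch of storing and loading the key/value pairs gzip-compressed. The record format (a count followed by long/int pairs) and the class name are my assumptions, since the question does not say how the data is laid out; byte arrays stand in for the file so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CompressedTable {
    final long[] keys;
    final int[] values;

    CompressedTable(long[] keys, int[] values) {
        this.keys = keys;
        this.values = values;
    }

    /** Serialises the count and the (key, value) pairs through gzip. */
    byte[] compress() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeInt(keys.length);
            for (int i = 0; i < keys.length; i++) {
                out.writeLong(keys[i]);
                out.writeInt(values[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decompresses while reading, filling two parallel arrays. */
    static CompressedTable decompress(byte[] data) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            int n = in.readInt();
            long[] keys = new long[n];
            int[] values = new int[n];
            for (int i = 0; i < n; i++) {
                keys[i] = in.readLong();
                values[i] = in.readInt();
            }
            return new CompressedTable(keys, values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the multithreaded variant you could split the data into independently compressed segments and hand each segment to its own thread. Whether this beats a memory-mapped load depends on how compressible the keys are, so measure before committing to it.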
I'm using SOLR-3.4, spatial filtering with the schema having LatLonType (subType=tdouble). I have an index of about 20M places. My basic problem is that if I do a bbox filter with cache=true, the performance is reasonably good (~40-50 QPS, about 100-150ms latency), but a big downside is crazy fast old gen heap growth, ultimately leading to major collections every 30-40 minutes (on a very large heap, 24GB). And at that point performance is beyond unacceptable. On the other hand, I can turn off caching for bbox filters, but then both my latency and QPS suffer (the latency goes up from 100ms to 500ms). The NumericRangeQuery javadoc talks about the great performance you can get (sub 100 ms), but now I wonder if that was with filterCache enabled, and nobody bothered to look at the heap growth that results. I feel like this is sort of a catch-22 since neither configuration is really acceptable.
I'm open to any ideas. My last idea (untried) is to use geo hash (and pray that it either performs better with cache=false, or has more manageable heap growth if cache=true).
EDIT:
Precision step: default (8 for double I think)
System memory: 32GB (EC2 M2 2XL)
JVM: 24GB
Index size: 11 GB
EDIT2:
A tdouble with a precisionStep of 8 means that your doubles will be split into sequences of 8 bits. If all your latitudes and longitudes only differ in the last sequence of 8 bits, then tdouble would have the same performance as a normal double on a range query. This is why I suggested testing a precisionStep of 4.
Question: what does this actually mean for a double value?
Having a profile of Solr while responding to your spatial queries would be of great help to understand what is slow, see hprof for example.
Still, here are a few ideas on how you could (perhaps) improve latency.
First you could test what happens when decreasing the precisionStep (try 4 for example). If the latitudes and longitudes are too close to each other and the precisionStep is too high, Lucene cannot take advantage of having several indexed values.
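To get a feeling for what the precisionStep does to a double, here is a toy model. It mirrors the idea only and is not Lucene's actual code: the double is mapped to a 64-bit value whose order matches numeric order, and one term is derived per precision level by dropping precisionStep more low-order bits each time. `TrieTerms` and its methods are hypothetical.

```java
final class TrieTerms {
    /** Maps a double to a long whose signed order matches the numeric order. */
    static long sortableBits(double v) {
        long bits = Double.doubleToLongBits(v);
        return bits < 0 ? bits ^ 0x7fffffffffffffffL : bits;
    }

    /** Number of precision levels (terms) per value in this toy model. */
    static int levels(int precisionStep) {
        return (64 + precisionStep - 1) / precisionStep;
    }

    /** Counts the levels, coarsest first, at which a and b share a term. */
    static int sharedLevels(double a, double b, int precisionStep) {
        long x = sortableBits(a);
        long y = sortableBits(b);
        int shared = 0;
        for (int level = levels(precisionStep) - 1; level >= 0; level--) {
            int shift = level * precisionStep;
            if ((x >>> shift) != (y >>> shift)) break;
            shared++;
        }
        return shared;
    }

    public static void main(String[] args) {
        double a = 48.8566, b = 48.8570;   // two nearby latitudes
        for (int step : new int[] {8, 4}) {
            System.out.println("precisionStep " + step + ": " + levels(step)
                    + " levels, " + sharedLevels(a, b, step) + " shared");
        }
    }
}
```

In this model a step of 8 gives 8 levels per value and a step of 4 gives 16, so nearby values are told apart at a finer granularity with the smaller step, at the price of more indexed terms.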
You could also try to give a little bit less memory to the JVM in order to give the OS cache more chances to cache frequently accessed index files.
Then, if it is still not fast enough, you could try to replace TrieDoubleField as a sub field with a field type that uses a frange query for the getRangeQuery method. This would reduce the number of disk accesses while computing the range, at the cost of higher memory usage. (I have never tested it; it might provide horrible performance as well.)
What is the size that a reference in Android's Java VM consumes?
More info:
By that I mean, if we have
String str = "Watever";
I need what str takes, not "Watever". -- "Watever" is what's saved in the location to which the pointer (or the reference) that str is holding, is pointing to.
Also, if we have
String str = null;
how much memory does it consume? Is it the same as the other str?
Now, if we have:
Object obj[] = new Object[2];
how much does obj consume and how much do obj[0] and obj[1] consume?
The reason for the question is the following: (in case someone can recommend something).
I'm working on an app that manages many pictures downloaded from internet.
I started storing those pictures on a "bank" (that consists of a list of pictures).
When displaying those pictures on a gallery, I used to search for the picture in the list (SLOW) and then, if then picture wasn't there, I used to show a temporal downloading image until the picture was downloaded.
Since that happened on the UI Thread, the app became very slow, so I thought about implementing a hash table on the bank instead of the list I had.
As I explained before, this search occurs in the UI Thread (and I can't change that). Because of that, collisions can become a problem if they start slowing the thread.
I have read that "To balance time and space efficiency, the hash table should be around half full", but that makes collisions occur half of the time (Not practical for the UI Thread). That makes me think about having a very long hash table (compared to the amount of pictures saved) and use more RAM (having less free VMHeap).
Before determining the size of the hash table, I wanted to know how much memory it would consume in order not to exaggerate.
I know that the size of the hash table might be very small compared to the memory that the pictures might consume, but I wanted to make sure I wasn't consuming more memory than necessary.
Before asking this question I searched, among other places, in
How big is an object reference in Java and precisely what information does it contain?
reference type size in java
Hashing Tutorial
(Yes, I know two of the places contradict each other, that's part of the reason for the question).
An object or array reference occupies one 32 bit word (4 bytes) on a 32 bit JVM or Dalvik VM. A null takes the same space as a reference. (It has to, because a null has to fit in a reference-typed slot; i.e. instance field, local variable, etc.)
On the other hand, an object occupies a minimum of 2 32 bit words (8 bytes), and an array occupies a minimum of 3 32 bit words (12 bytes). The actual size depends on the number and kinds of fields for an object, and on the number and kind of elements for an array.
For a 64 bit JVM, the size of a reference is 64 bits, unless you have configured the JVM to use compressed pointers:
-XX:+UseCompressedOops Enables the use of compressed pointers (object references represented as 32 bit offsets instead of 64-bit pointers) for optimized 64-bit performance with Java heap sizes less than 32gb.
This is the nub of your question, I think.
Before determining the size of the hash table, I wanted to know how much memory it would consume in order not to exaggerate.
If you allocate a HashMap or Hashtable with a large initial size, the majority of the space will be occupied by the hash array. This is an array of references, so the size will be 3 + initialSize 32 bit words. It is unlikely that this will be significant ... unless you get your size estimate drastically wrong.
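Put into numbers, assuming 4-byte references and the 3-word array header mentioned above (`TableSize` is a hypothetical helper, not a platform API):

```java
final class TableSize {
    static final int WORD_BYTES = 4;   // 32 bit word, also the reference size

    /** Bytes taken by a hash array with the given number of slots. */
    static long hashArrayBytes(int slots) {
        return (long) WORD_BYTES * (3 + slots);
    }

    public static void main(String[] args) {
        for (int slots : new int[] {16, 1024, 65536}) {
            System.out.println(slots + " slots -> " + hashArrayBytes(slots) + " bytes");
        }
    }
}
```

Even 65536 slots come to about 256 KB, which is small next to a handful of decoded pictures.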
However, I think you are probably worrying unnecessarily about performance. If you are storing objects in a default allocated HashMap or Hashtable, the class will automatically resize the hash table as it gets larger. So, provided that your objects have a decent hash function (not too slow, not hashing everything to a small number of values) the hash table should not be a direct CPU performance concern.
References are nearly free. Even more so when compared to images.
Having a few collisions in a Map isn't a real problem. Collisions can be resolved far quicker than a linear search through a list of items. That said, a Binary Search through a sorted list of items would be a good way to keep memory usage down (compared to a Map).
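The binary search alternative needs no extra structure beyond a sorted array of keys. A minimal sketch, assuming the pictures are identified by string keys such as URLs and stored in a second array at the same index; `SortedLookup` is a hypothetical name.

```java
import java.util.Arrays;

final class SortedLookup {
    /** Returns the index of key in the sorted array, or -1 if it is absent. */
    static int indexOf(String[] sortedKeys, String key) {
        int i = Arrays.binarySearch(sortedKeys, key);
        return i >= 0 ? i : -1;
    }

    public static void main(String[] args) {
        String[] keys = {"a.png", "b.png", "c.png"};   // must be kept sorted
        System.out.println(indexOf(keys, "b.png"));
        System.out.println(indexOf(keys, "z.png"));
    }
}
```

Lookups are O(log n), but every insertion has to shift elements to keep the array sorted, so this suits a bank that is read far more often than it changes.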
I can vouch for the effectiveness of having smaller initial sizes for Maps - I recently wrote a program that makes a Trie structure of 170000 English words. When I set the initial size to 26, I would run out of memory by the time I got to words starting with R. Cutting it down to 5, I was able to create the maps without memory issues and can search the tree (with many collisions) in effectively no time.
[Edit] If a reference is 32 bit (4 bytes) and your average image is around 2 megabytes, you could fit 500000 references into the same space that a single image would take. You don't have to worry about the references.