I am currently designing a large in-memory index structure (several gigabytes). The index is actually an R-tree whose leaves are B-trees (don't ask). It supports a special query and pushes it to the logical limit.
Since those nodes are solely search nodes, I ask myself how best to make it parallel.
I know of six solutions so far:
1. Block reads when a write is scheduled. The tree is blocked completely until the last read has finished; then the write is performed, and after the write the tree can again be used for multiple reads. (Reads need no locking.)
2. Clone the nodes to be changed and reuse the existing nodes (including leaves), then switch between the two versions by again briefly stopping reads, switching, and resuming. Since leaf pointers must be altered as well, the leaf pointers might become their own collection, making it possible to switch modifications atomically; changes can then be redone on a second version to avoid copying the pointers on each insert.
3. Use independent copies of the index, like double buffering. Update one copy of the index and switch to it. Once no one reads the old index, alter it in the same way. This way the change can be made without blocking existing reads. If another insert hits the tree within a reasonable amount of time, those changes can be applied as well.
4. Use a serial shared-nothing architecture so that each search thread has its own copy. Since a thread can only alter its tree after a single read has been performed, this would also be lock-free and simple. Because reads are spread evenly across the worker threads (each bound to a certain core), throughput would not be harmed.
5. Use read/write locks for each node about to be written and block only a subtree during a write. This would involve additional operations on the tree, since splitting and merging propagate upwards and therefore require a repass of the insert (expanding locks upwards, towards the parent, would introduce the chance of a deadlock). Since splits and merges are not that frequent with a larger page size, this would also be a good way. My current B-tree implementation uses a similar mechanism: it splits a node and reinserts the value until no split is needed (which is not optimal, but simpler).
6. Use a double buffer for each node, like the shadow cache in databases, where each page is switched between two versions. Every time a node is modified, a copy is modified, and when a read is issued either the old version or the new one is used. Each node gets a version number, and the version closest to the active version (latest change) is chosen. To switch between two versions, only an atomic change to the root information is needed. This way the tree can be altered while in use. The switch can be done at any time, but it must be ensured that no read is still using the old version when it is overwritten with the new one. This method has the potential not to interfere with cache locality when linking leaves and the like. It requires twice the amount of memory, since a back buffer must be present, but it saves allocation time and might be good for a high frequency of changes.
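To make the "atomic switch" idea concrete, here is a minimal sketch, assuming Java and a single writer thread. A sorted array stands in for the tree, and `SnapshotIndex` and its methods are hypothetical names, not part of any existing implementation. Readers never lock; the writer builds a modified copy and publishes it with one atomic reference update, and the garbage collector reclaims the old version once no reader holds it.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the index: each published version is an
// immutable sorted array of keys.
final class SnapshotIndex {
    private final AtomicReference<long[]> current =
            new AtomicReference<>(new long[0]);

    // Readers take no lock: they grab the currently published snapshot and
    // keep using it even if a writer publishes a newer one in the meantime.
    boolean contains(long key) {
        long[] snapshot = current.get();
        return Arrays.binarySearch(snapshot, key) >= 0;
    }

    // Assumption: only one thread ever calls insert().
    // Build a modified copy, then publish it atomically.
    void insert(long key) {
        long[] old = current.get();
        int pos = Arrays.binarySearch(old, key);
        if (pos >= 0) {
            return; // key already present, nothing to publish
        }
        int at = -(pos + 1); // insertion point reported by binarySearch
        long[] next = new long[old.length + 1];
        System.arraycopy(old, 0, next, 0, at);
        next[at] = key;
        System.arraycopy(old, at, next, at + 1, old.length - at);
        current.set(next); // the "switch": one atomic reference update
    }

    int size() {
        return current.get().length;
    }
}
```

Copying the whole structure on every insert is of course far too expensive for a multi-gigabyte index; the point of the sketch is only the publish step, which stays the same when just the modified path of nodes is cloned.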
With all those thoughts, what is best? I know it depends, but what is done in the wild? If there are 10 read threads (or even more) and they are all blocked by a single write operation, I guess this is nothing I really want.
Also, what about the L3, L2 and L1 caches, and scenarios with multiple CPUs? Any issues there? The beauty of double buffering is the chance that reads hitting the old version are still working with the correct cached version.
The variant that creates a fresh copy of a node is not very appealing. So what is found in the wild in today's database landscape?
[update]
On rereading the post, I wonder whether the write locks for split and merge would be better replaced by creating replacement nodes. For a split or a merge I need to copy about half of the elements around anyway, and those operations are very rare, so copying a node completely and replacing it in the parent node, which is a simple and fast operation, would do the trick. This way the actual blocking of reads would be very limited, and since we create copies anyway, blocking only happens while the new nodes are swapped in. That leaves may not be altered during this access is unimportant, since the information density has not changed. But again, every access to a node needs an increment and decrement of a read lock and a check for intended write locks. All of this is overhead, and all of it blocks further reads.
[Update2]
Solution 7. (currently favored)
Currently we favor a double buffer for the internal (non-leaf) nodes and use something similar to row locking.
Our logical tables, which we try to decompose using this index structure (which is all an index does), result in applying set algebra to that information. I noticed that this set algebra is linear (O(m+n) for intersection and union), which gives us the chance to lock each entry that is part of such an operation.
By double buffering the internal nodes (which is neither hard to implement nor costly, at roughly under 1% memory overhead), we avoid that issue without blocking too many read operations.
Since we batch modifications in a certain way, a given column is updated only very rarely, but once it is, it takes more time, since the modifications for that single entry might number in the thousands.
So the goal is to alter the set algebra so that columns currently being modified are simply intersected later on. Since only one column is modified at a time, such an operation would only block once. And the write operation has to wait for everyone currently reading the column. And guess what: once a write operation waits, it usually lets a write operation on another column that is not busy take place. We calculate the probability of such a block to be very, very low, so we don't need to care.
The locking mechanism works as follows: check for write, check for write intention, add read, check for write again, and proceed with the read. So there is no explicit object locking. We access fixed areas of bytes, and once the structure is clear, everything critical is planned to move into a C++ version to make it somewhat faster (2x, we guess, and this should take one person only one or two weeks, especially with a Java-to-C++ translator).
The only effect that might now also be important is caching, since a modification invalidates L1 caches and maybe L2 too. So we plan to collect all modifications to such a table/index and schedule them to run within a time share of one or more minutes, evenly distributed so that the system has no performance hiccups.
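The read protocol described above can be sketched roughly as follows. This is only an illustration under assumptions: it uses `java.util.concurrent.atomic` objects instead of fixed byte areas, `ColumnGuard` and its method names are hypothetical, and the second check looks at the write intention as well as the write flag, since otherwise a writer that has announced its intention but not yet set the write flag could be missed.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-column guard; no object monitor is ever taken.
final class ColumnGuard {
    private final AtomicInteger readers = new AtomicInteger();
    private final AtomicBoolean writeIntent = new AtomicBoolean();
    private final AtomicBoolean writing = new AtomicBoolean();

    // Returns true if the read may proceed; the caller must then call endRead().
    boolean tryBeginRead() {
        if (writing.get() || writeIntent.get()) {
            return false;                 // check write, check write intention
        }
        readers.incrementAndGet();        // add read
        if (writing.get() || writeIntent.get()) {
            readers.decrementAndGet();    // a writer arrived meanwhile: back out
            return false;
        }
        return true;                      // proceed with the read
    }

    void endRead() {
        readers.decrementAndGet();
    }

    // Returns false if the column is busy, so the caller can move on to
    // another column and retry this one later.
    boolean tryBeginWrite() {
        if (!writeIntent.compareAndSet(false, true)) {
            return false;                 // another writer already announced itself
        }
        if (readers.get() != 0) {
            writeIntent.set(false);       // readers still active: the write waits
            return false;
        }
        writing.set(true);
        return true;
    }

    void endWrite() {
        writing.set(false);
        writeIntent.set(false);
    }
}
```

A writer that gets `false` back simply picks another column that is not busy, which matches the behaviour described for waiting writes.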
If you know of anything that helps us please go ahead.
As no one replied, I would like to summarize what we (I) finally did. The structure is now separated. We have an R-tree whose leaves are actually tables. Those tables can even be remote, so we have a way of distributing them that is mostly transparent thanks to RMI and proxies.
The rest was easy. The R-tree can advise a table to split, and the result of the split is again a table. The split is done on a single machine and transferred to another if it has to be remote. Merge is very similar.
This remoting also applies to threads bound to different CPUs, to avoid cache issues.
As for modification in memory, it is as I already suggested: we duplicated the internal nodes, turned the table by 90°, and adapted the set-algebra algorithms to handle locked columns efficiently. The test for a single table is simple and, compared to the thousands of entries per column, not a performance issue after all. Deadlocks are also impossible, since only one column is used at a time, so there is only one lock per thread. We are experimenting with processing columns in parallel, which would increase the response time. We are also thinking about binding columns to a given virtual core, so that again no locking is needed, since the column is isolated and the modifications can be serialized.
This way one can utilize 20 cores and more per CPU and also avoid cache misses.
Related
So I am working on a project that requires a collection of clients to be iterated through for updating, with each client requiring an update packet for every other client within proximity. I want to be able to do this quickly, since updates will happen for a large number of clients at frequent intervals.
My original plan of attack was to create regions based on client locations, updating each client only with the other clients in their region. This would entail a LinkedList<Region>, with the Region having its own list of clients which would update among each other. One problem with this method was that some regions could have 1 client, while others could have 1000. Another level of difficulty arose from the fact that clients will constantly be moving (thus changing location and Region). These problems could be avoided if there was a way to modify the list while iterating through it, possibly splitting elements when a region gets too large.
Next I thought of creating one large List<Client> that held all players and was kept sorted by location. Then, to update the client at index n of the list with the closest 20 clients, I would only iterate from n-10 to n+10 around their current index. I don't really like this method as much, since if there was a 21st client in a nearby area, they could be ignored even though they were at the same distance from the client at n as the one at n+10. It also seemed slow to have to re-sort all the clients every tick.
In terms of speed, which of these methods provides better performance? Additionally, are there any other Java collections I should consider? Thanks!
I strongly prefer the first method. Sorting the entire list every tick is going to end up being a very bad idea time-wise, which rules out the second method.
To solve the concurrency issues, you should make a copy of the LinkedList<Region> before updating it in a thread. That way you will allow Clients to change their Region at the same time as updates are being pushed out to each Region.
Another note is that if you plan on retrieving an arbitrary Region from the LinkedList<Region> (for example, when you move a Client from one Region to another), you should look into some kind of hash set. It will increase performance greatly when retrieving an arbitrary element from the middle of the list, especially if the list is large.
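As a rough illustration of the hash-based lookup, here is a minimal sketch that keys regions by grid cell in a HashMap. The class `RegionGrid`, the fixed cell size, and the use of String client ids are all assumptions made for the example; it is not thread-safe on its own, so the copy-before-update advice above still applies.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical grid of regions: each cell of cellSize x cellSize units is
// one region, found by hashing its cell coordinates instead of walking a list.
final class RegionGrid {
    private final int cellSize;
    private final Map<Long, Set<String>> regions = new HashMap<>();

    RegionGrid(int cellSize) {
        this.cellSize = cellSize;
    }

    // Packs the two cell coordinates into one long map key.
    private long key(int x, int y) {
        long cx = Math.floorDiv(x, cellSize);
        long cy = Math.floorDiv(y, cellSize);
        return (cx << 32) ^ (cy & 0xffffffffL);
    }

    void add(String client, int x, int y) {
        regions.computeIfAbsent(key(x, y), k -> new HashSet<>()).add(client);
    }

    // Moving a client touches only its old and new region.
    void move(String client, int oldX, int oldY, int newX, int newY) {
        long from = key(oldX, oldY);
        long to = key(newX, newY);
        if (from == to) {
            return; // still in the same region
        }
        Set<String> old = regions.get(from);
        if (old != null) {
            old.remove(client);
            if (old.isEmpty()) {
                regions.remove(from); // drop empty regions
            }
        }
        regions.computeIfAbsent(to, k -> new HashSet<>()).add(client);
    }

    // Returns a copy, so the caller can iterate while clients keep moving.
    Set<String> clientsIn(int x, int y) {
        Set<String> found = regions.get(key(x, y));
        return found == null ? new HashSet<>() : new HashSet<>(found);
    }
}
```

With this layout, clients near a cell border would also need the neighbouring cells checked; that part is left out of the sketch.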
I was thinking of how I would go about implementing a thread-safe RingBuffer in Java and Android, as for some reason there is none, even after all these years, not even a circular queue. So, no (Circular/Ring)ByteBuffer, nor (Circular/Ring)(Buffer/Queue).
Even the majority of third-party RingBuffer implementations are said not to be thread-safe, which makes me think it really isn't as simple as I expect it to be. What I was thinking about was doing something like this:
Have an Object (say RingBufferPosition) that encapsulates both the Head and Tail position.
Have the RingBuffer maintain an AtomicReference to the RingBufferPosition
When a thread adds something, it will create a temporary object (unfortunately, I don't know enough Java to be sure of the term, but something like "stack-allocated"), which will be recycled over and over, updated with the new head and tail, until it can CAS successfully.
When a thread removes something, it will do similar to adding something.
Everything is accessed in an array allocated to the max length, hence, the head and tail can access/update the current element in O(1) time.
Would this work, and better yet, would it yield any benefits over simply synchronizing access to the collection?
A small code sample/pseudocode (has not been run yet, and I do not even know how to remotely test an atomic data structure, I plan on using it for buffering/streaming media but I haven't gotten that far yet as I need to create this first) can be found here. I have comments/documentation that details my concerns there.
Lastly, to address a possible "Why" question, as in "Why do you need such performance", I'll be truthful. I have always found data structures, especially atomic/lock-free data structures very interesting, and I found this as a very good exercise to learn, plus I always wanted to create a Ring Buffer. I could have just "synchronized" everything, however I do also value performance.
Multiple reader/multiple writer ring buffers are tricky.
Your way doesn't work, because you can't update that start/end position AND the array contents atomically. Consider adding to the buffer: If you update the end position first, then there is a moment before you update the array when the buffer contains an invalid item. If you update the array first, then there's nothing to stop simultaneous additions from stomping on the same array element.
There are lots of ways to deal with these problems, but the various ways have different trade-offs, and you have better options available if you can get rid of the multiple reader or multiple writer requirement.
If I had to guess at why we don't have a concurrent ring buffer in the standard library, I'd say it's because there is no one best way to implement it that is going to be good for most scenarios. The data structure used for ConcurrentLinkedQueue, in contrast, is simple and elegant and an obvious choice when a concurrent linked list is required.
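To illustrate how much simpler things get when the multiple reader/multiple writer requirement is dropped, here is a minimal sketch of a ring buffer for exactly one producer thread and one consumer thread. `SpscRingBuffer` is a hypothetical name, and this is one of several possible designs rather than a recommended one. The element is written before the tail is advanced, so the consumer never sees an unfilled slot, and only one thread ever writes each index.

```java
import java.util.concurrent.atomic.AtomicLong;

// Assumption: exactly ONE producer thread calls offer() and exactly ONE
// consumer thread calls poll(). With more threads this is not safe.
final class SpscRingBuffer<T> {
    private final Object[] slots;
    private final AtomicLong head = new AtomicLong(); // next index to read, advanced only by the consumer
    private final AtomicLong tail = new AtomicLong(); // next index to write, advanced only by the producer

    SpscRingBuffer(int capacity) {
        slots = new Object[capacity];
    }

    // Producer side: returns false when the buffer is full.
    boolean offer(T item) {
        long t = tail.get();
        if (t - head.get() == slots.length) {
            return false;
        }
        slots[(int) (t % slots.length)] = item; // fill the slot first...
        tail.set(t + 1);                        // ...then publish it
        return true;
    }

    // Consumer side: returns null when the buffer is empty.
    @SuppressWarnings("unchecked")
    T poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null;
        }
        int i = (int) (h % slots.length);
        T item = (T) slots[i];
        slots[i] = null;  // let the element be garbage collected
        head.set(h + 1);  // release the slot to the producer
        return item;
    }
}
```

Whether this beats a synchronized collection for a given workload would have to be measured; the sketch only shows that the correctness argument becomes short once each index has a single writer.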
My application has a number of objects in an internal list, and I need to be able to log them (e.g. once a second) and later recreate the state of the list at any time by querying the log file.
The current implementation logs the entire list every second, which is great for retrieval because I can simply load the log file, scan through it until I reach the desired time, and load the stored list.
However, the majority of my objects (~90%) rarely change, so it is wasteful in terms of disk space to continually log them at a set interval.
I am considering switching to a "delta" based log where only the changed objects are logged every second. Unfortunately this means it becomes hard to find the true state of the list at any one recorded time, without "playing back" the entire file to catch those objects that had not changed for a while before the desired recall time.
An alternative could be to store (every second) both the changed objects and the last-changed time for each unchanged object, so that a log reader would know where to look for them. I'm worried I'm reinventing the wheel here though — this must be a problem that has been encountered before.
Existing comparable techniques, I suppose, are those used in version control systems, but I'd like a native object-aware Java solution if possible — running git commit on a binary file once a second seems like it's abusing the intention of a VCS!
So, is there a standard way of solving this problem that I should be aware of? If not, any pitfalls that I might encounter when developing my own solution?
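For reference, the "playing back" cost of a pure delta log can be seen in a minimal in-memory sketch like the following. `DeltaLog`, the String ids and values, and the long timestamps are assumptions for illustration only; deletions and on-disk storage are not handled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical delta log: only objects that changed at a tick are recorded.
final class DeltaLog {
    // timestamp -> (object id -> new value)
    private final TreeMap<Long, Map<String, String>> deltas = new TreeMap<>();

    void record(long time, String id, String value) {
        deltas.computeIfAbsent(time, t -> new HashMap<>()).put(id, value);
    }

    // Replays every delta up to and including the given time, oldest first,
    // so later changes overwrite earlier ones. The cost grows with the
    // length of the log before that time.
    Map<String, String> stateAt(long time) {
        Map<String, String> state = new HashMap<>();
        for (Map<String, String> delta : deltas.headMap(time, true).values()) {
            state.putAll(delta);
        }
        return state;
    }
}
```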
I often read that the linked list data structure and its variant, the skip list, are cache friendly on parallel hardware. What does this mean? Can someone please explain in an easy-to-understand way?
Edit: The context is in this link.
I often read that linked list data structure and its variant skiplists are cache friendly
Linked lists and similar structures are NOT CPU cache friendly, because each node can be placed anywhere in memory, resulting in many cache misses.
An ArrayList by comparison will have all its references sequentially in memory, so when a cache line is read in (typically 64 bytes long) this can bring in 16 references at once (assuming 4-byte compressed references).
Note: The objects the List refers to can still be arranged randomly in memory, something you have no control over. :|
From the article in the question.
Besides being well suited for concurrent traversal and update, linked lists also are cache-friendly on parallel hardware. When one thread removes a node, for example, the only memory that needs to be transferred to every other core that subsequently reads the list is the memory containing the two adjacent nodes.
What this is saying is that when a linked list is modified by multiple threads at once (something LinkedList in Java doesn't support), only the nodes of the list that are modified need to be made cache consistent. By comparison, if you remove or add an element in the middle or at the start of an ArrayList, you need to update all the references after it. Given this is known to be inefficient, it's best avoided in any case.
The closest example to this in Java is ConcurrentLinkedQueue, which supports concurrent adding and removing. The problem is that any cache benefit you might gain from being able to update the start and the end is lost because this action creates garbage, which is much more significant, though still not very significant.
If you use an ArrayBlockingQueue you get better cache and garbage behaviour, as the references are contiguous in memory, don't require shuffling down as in ArrayList, and adding an entry doesn't create garbage. (Unfortunately take() creates an object :P )
I'm looking to implement a B-tree (in Java) for a "one use" index where a few million keys are inserted, and queries are then made a handful of times for each key. The keys are <= 40 byte ascii strings, and the associated data always takes up 6 bytes. The B-tree structure has been chosen because my memory budget does not allow me to keep the entire temporary index in memory.
My issue is about the practical details in choosing a branching factor and storing nodes on disk. It seems to me that there are two approaches:
One node always fits within one block. This is achieved by choosing a branching factor k so that, even for the worst-case key length, the storage requirement for keys, data and control structures is <= the system block size. k is likely to be low, and nodes will in most cases have a lot of empty room.
One node can be stored on multiple blocks. Branching factor is chosen independent of key size. Loading a single node may require that multiple blocks are loaded.
The questions are then:
Is the second approach what is usually used for variable-length keys? or is there some completely different approach I have missed?
Given my use case, would you recommend a different overall solution?
I should in closing mention that I'm aware of the jdbm3 project, and am considering using it. I will attempt to implement my own in any case, both as a learning exercise and to see if case-specific optimization can yield better performance.
Edit: Reading about SB-Trees at the moment:
S(b)-Trees
Algorithms and Data Structures for External Memory
I'm missing option C here:
At least two tuples always fit into one block, and the block size is chosen accordingly. Blocks are filled with as many key/value pairs as possible, which means the branching factor is variable. If the block size is much greater than the average size of a (key, value) tuple, the wasted space will be very low. Since the optimal I/O size for disks is usually 4k or greater and you have a maximum tuple size of 46 bytes, this is automatically true in your case.
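A minimal sketch of what filling blocks greedily could look like, to show how the branching factor varies with key length. The layout is an assumption made for the example (one length byte, then the ASCII key, then the 6 data bytes, with no node header), and `BlockPacker` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class BlockPacker {
    static final int VALUE_BYTES = 6;   // data size from the question
    static final int LENGTH_PREFIX = 1; // assumption: one byte stores the key length

    // Greedily fills fixed-size blocks in input order and returns how many
    // entries end up in each block.
    static List<Integer> entriesPerBlock(List<String> keys, int blockSize) {
        List<Integer> counts = new ArrayList<>();
        int used = 0;
        int count = 0;
        for (String key : keys) {
            int size = LENGTH_PREFIX
                    + key.getBytes(StandardCharsets.US_ASCII).length
                    + VALUE_BYTES;
            if (size > blockSize) {
                throw new IllegalArgumentException("entry larger than block");
            }
            if (used + size > blockSize) { // block is full: start a new one
                counts.add(count);
                used = 0;
                count = 0;
            }
            used += size;
            count++;
        }
        if (count > 0) {
            counts.add(count); // last, partially filled block
        }
        return counts;
    }
}
```

Under this assumed layout, worst-case 40-byte keys take 47 bytes each, so a 4096-byte block holds 87 of them; shorter keys simply raise the count per block.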
And for all options you have some variants: B* or B+ Trees (see Wikipedia).
The JDBM BTree is already self-balancing. It also has defragmentation, which is very fast and solves all the problems described above.
One node can be stored on multiple blocks. Branching factor is chosen independent of key size. Loading a single node may require that multiple blocks are loaded.
Not necessarily. JDBM3 uses memory-mapped files, so it never reads a full block from disk into memory. It creates a 'view' on top of the block and reads only the partial data actually needed. So instead of reading a full 4 KB block, it may read just 2x128 bytes. This depends on the underlying OS block size.
Is the second approach what is usually used for variable-length keys? or is there some completely different approach I have missed?
I think you missed the point that increasing the on-disk size decreases performance, as more data has to be read. Also, a single tree can use both approaches (the first for newly inserted nodes, the second after defragmentation).
Anyway, a flat file with a memory-mapped buffer is probably best for your problem, since you have a fixed record size and just a few million records.
Also have a look at leveldb. It has a new Java port which almost beats JDBM:
https://github.com/dain/leveldb
http://code.google.com/p/leveldb/
You could avoid this hassle if you use some embedded database. Those have solved these problems and some more for you already.
You also write: "a few million keys" ... "[max] 40 byte ascii strings" and "6 bytes [associated data]". This does not add up. One gig of RAM would allow you more than "a few million" entries.
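As a back-of-the-envelope check, counting only the raw payload (a 40-byte key plus 6 bytes of data) and ignoring any per-object overhead that a Java representation would add:

```latex
\frac{2^{30}\ \text{bytes}}{(40 + 6)\ \text{bytes per entry}}
  = \frac{1\,073\,741\,824}{46}
  \approx 2.3 \times 10^{7}\ \text{entries}
```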