I want to create an adjacency list in Java and since I will get a huge set of nodes later as input, it needs to be really efficient.
What sort of implementation is best for this scenario?
A list of lists or maybe a map? I also need to save the edge weights somewhere. I could not figure out how to do this, since the adjacency list itself apparently just keeps track of the connected nodes, but not the edge weight.
Warning: this route is the most masochistic and hardest to maintain possible, and only recommended when the highest possible performance is required.
Adjacency lists are one of the most awkward classes of data structures to optimize, mainly because they vary in size from one vertex to the next. At some broad conceptual level, if you include the adjacency data as part of the definition of a Vertex or Node, then that makes the size of a Vertex/Node variable. Variable-sized data and the kind of memory contiguity needed to be cache-friendly tend to fight one another in most programming languages.
Most object-oriented languages weren't designed to deal with objects that can actually vary in size. They solve that by making them point to/reference memory elsewhere, but that leads to much higher cache misses.
If you want cutting-edge efficiency and you traverse adjacent vertices/nodes a lot, then you want a vertex and its variable number of references/indices to adjacent neighbors (and their weights in your case) to fit in a single cache line, and possibly with a good likelihood that some of those neighboring vertices also fit in the same cache line (though solving this and reorganizing the data to map a 2D graph to a 1-dimensional memory space is an NP-hard problem, but existing heuristics help a lot).
So it ceases to become a question of what data structures to use so much as what memory layouts to use. Arrays are your friend here, but not arrays of nodes. You want an array of bytes packing node data contiguously. Something like this:
[node1_data num_adj adj1 adj2 adj3 (possibly some padding for alignment and to avoid straddling) node2_data num_adj adj1 adj2 adj3 ...]
Node insertion and removal here starts to resemble the kind of algorithms you find to implement memory allocators. When you connect a new edge, that actually changes the node's size and potentially its position in these giant, contiguous memory blocks. Unlike memory allocators, you're potentially allowed to reshuffle and compact and defrag the data provided that you can update your references/indices to it.
Now this is only if you want the fastest possible solution, and provided your use cases are heavily weighted towards read operations (evaluation, traversal) rather than writes (connecting edges, inserting nodes, removing nodes). It's completely overkill otherwise, and a complete PITA since you'll lose all that nice object-oriented structure that helps keep the code easy to maintain, reuse, etc. This has you obliterating all that structure in favor of dealing with things at the bits and bytes level, and it's only worth doing if your software is in a realm where its quality is somehow very proportional to the efficiency of that graph.
One solution you can think of create a class Node which contains the data and a wt. this weight will be the weight of edge through which it is connected to the Node.
suppose you have a list for Node I which is connected to node A B C with edge weight a b c. And Node J is connected to A B C with x y z weights, so the adj List of I will contains the Node object as
I -> <A, a>,<B b>,<C c>
List of J will contains the Node object as
J -> <A, x>,<B y>,<C z>
Related
I have a large list of regions with 2D coordinates. None of the regions overlap. The regions are not immediately adjacent to one another and do not follow a placement pattern.
Is there an efficient lookup algorithm that can be used to let me know what region a specific point will fall into? This seems like it would be the exact inverse of what a QuadTree is.
The data structure you need is called an R-Tree. Most RTrees permit a "Within" or "Intersection" query, which will return any geographic area containing or overlapping a given region, see, e.g. wikipedia.
There is no reason that you cannot build your own R-Tree, its just a variant on a balanced B-Tree which can hold extended structures and allows some overlap. This implementation is lightweight, and you could use it here by wrapping your regions in rectangles. Each query might return more than one result but then you could check the underlying region. Its probably an easier solution than trying to build a polyline-supporting R-tree version.
What you need, if I understand correctly, is a point location data structure that is, as you put it, somehow the opposite of quad or R-tree. In a point location data structure you have a set of regions stored, and the queries are of the form: given point p give me the region in which it is contained.
Several point location data structures exists, the most famous and the one that achieves the best performance is the Kirkpatrick's one also known as triangulation refinement and achieves O(n) space and O(logn) query time; but is also famous to be hard to implement. On the other hand there are several simpler data structures that achieves O(n) or O(nlogn) space but O(log^2n) query time, which is not that bad and way easier to implement, and for some is possible to reduce the query time to O(logn) using a method called fractional cascading.
I recommend you to take a look into chapter 6 of de Berg, Overmars, et al. Computational Geometry: Algorithms and Applications which explains the subject in a way very easy to grasp, though it doesn't includes Kirkpatrick's method, which you can find it in Preparata's book or read it directly from Kirkpatrick's paper.
BTW, several of this structures assumes that your regions are not overlapping but are expected to be adjacent (regions share edges), and the edges forms a connected graph, some times triangular regions are also assumed. In all cases you can extend your set of regions by adding new edges, but don't you worry for that, since the extra space needed will be still linear, since the final set of regions will induce a planar graph. So you can blindly extend your sets of regions without worrying with too much growth of space.
What do you use when you need a immutable list with the fastest access/update? LinkedList can be slow if you have to access an element from the middle, and it's prohibitive to create and repopulate it. Binary trees? quadtrees?
If updating is very rare (or the collection is small), an array which you don't write to after intialization is worthwhile. The much lower constant factors (both in time and space) outweigh the linear time update in these cases.
Apart from that, there are a number of purely functional data structures which provide better bounds for these cases. 2-3 Finger Trees (the data structure behind Haskell's Data.Sequence) are one example. Another option are Clojure's vectors and related data structures (e.g. Relaxed Radix-Balanced Trees), which use trees with high fan-out (32 or more) to keep reads cheap and structural sharing to avoid too many copies.
All of these are moderately tricky to implement manually though, especially if performance is important, and I'm not aware of existing implementations (I don't think Clojure's vectors are easy or convenient to use from Java).
I'm not sure I understand what you're looking for but I'll try to give a couple of pointers based on some things I've seen in the standard classes:
CopyOnWriteArrayList is a mutable yet threadsafe list because it duplicates the internal array on updates. Perhaps you could adapt some ideas from that, although it's obviously not efficient for large lists.
ConcurrentHashMap implements similar ideas on a much more complicated structure. It divides the internal hash table into separate partitions, so that changes only need to lock access to the relevant partition.
For an immutable list you could do something similar: divide the list's internal array into several partitions and treat them all as immutable. When you need to change the list, you only need to clone one partition and the index of the partitions, which will be cheaper than duplicating the whole list.
AWTEventMulticaster achieves similar goals, but duplicates the absolute minimum. It's a clever binary tree. See the source.
With a smaller size of internal partition or block, you can get faster updates, but slower access in general. With a larger block (e.g., the entire array) you get slower updates but faster access.
If you really need fastest access and update, you have to use a mutable array.
I am generating random edges for a complete graph with 32678 Vertices. So, 500 million + values.
I am using a HashMap to using the edges as key and the random edge weight as the value. I keep encountering:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringBuilder.toString(StringBuilder.java:430) at
pa1.Graph.(Graph.java:60) at pa1.Main.main(Main.java:19)
This graph will then be used to construct a Minimum Spanning Tree.
Any ideas on a better data-structure or approach?
I know there are overrides to allocate more memory, but I would prefer a solution that works as-is.
A HashMap will be very large, cause it will contain Doubles (with a capital D) which are significantly larger than 8 bytes. (Not to mention the Entry) Depends on implementation and the CPU chip, but I think it's at least 16 bytes each, and probably more?
I think you should consider keeping the primary data in a huge double[] (or, if you can spare some accuracy, a float[]). That cuts memory usage by an easy 2x or 4x. (500M floats is a "mere" 2GB) Then use integer indexes into this array to implement your edges and vertices. For example, an edge could be an int[2]. This is far from O-O, and there's some serious hand-waving here. (and I don't understand all the nuances of what you are trying to do)
Very "old fashioned" in style, but requires a lot less memory.
Correction - I think an edge might be int[4], a vertex an int[2]. But you get the idea. Actually, for edges and vertices, you will have a smaller number of Objects and for them you can probably use "real" Objects, Maps, etc...
Since it is a complete graph, there is no doubt on what the edges are. How about storing the labels for those edges in a simple list which is ordered in a certain manner? So e.g. if you have 5 nodes, the weights for the edges which would be ordered as follows: {1,2}, {1,3} {1,4} {1,5} {2,3} {2,4} {2,5} {3,4} {3,5} {4,5}.
However, as pointed out by #BillyO'Neal this might still take up 8 GB of space. You might want to split up this list into multiple files and simultaneously maintain an index of these files suggesting where one set of weights ends in one file and where the next set of weights begin.
Additionally, given that you are finding the MST for the graph, you might want to have a look at the following paper as well: http://cvit.iiit.ac.in/papers/Vibhav09Fast.pdf. The paper seems to based off the Boruvka's Algorithm (http://en.wikipedia.org/wiki/Bor%C5%AFvka's_algorithm; http://iss.ices.utexas.edu/?p=projects/galois/benchmarks/mst).
For ultra-fast code it essential that we keep locality of reference- keep as much of the data which is closely used together, in CPU cache:
http://en.wikipedia.org/wiki/Locality_of_reference
What techniques are to achieve this? Could people give examples?
I interested in Java and C/C++ examples. Interesting to know of ways people use to stop lots of cache swapping.
Greetings
This is probably too generic to have clear answer. The approaches in C or C++ compared to Java will differ quite a bit (the way the language lays out objects differ).
The basic would be, keep data that will be access in close loops together. If your loop operates on type T, and it has members m1...mN, but only m1...m4 are used in the critical path, consider breaking T into T1 that contains m1...m4 and T2 that contains m4...mN. You might want to add to T1 a pointer that refers to T2. Try to avoid objects that are unaligned with respect to cache boundaries (very platform dependent).
Use contiguous containers (plain old array in C, vector in C++) and try to manage the iterations to go up or down, but not randomly jumping all over the container. Linked Lists are killers for locality, two consecutive nodes in a list might be at completely different random locations.
Object containers (and generics) in Java are also a killer, while in a Vector the references are contiguous, the actual objects are not (there is an extra level of indirection). In Java there are a lot of extra variables (if you new two objects one right after the other, the objects will probably end up being in almost contiguous memory locations, even though there will be some extra information (usually two or three pointers) of Object management data in between. GC will move objects around, but hopefully won't make things much worse than it was before it run.
If you are focusing in Java, create compact data structures, if you have an object that has a position, and that is to be accessed in a tight loop, consider holding an x and y primitive types inside your object rather than creating a Point and holding a reference to it. Reference types need to be newed, and that means a different allocation, an extra indirection and less locality.
Two common techniques include:
Minimalism (of data size and/or code size/paths)
Use cache oblivious techniques
Example for minimalism: In ray tracing (a 3d graphics rendering paradigm), it is a common approach to use 8 byte Kd-trees to store static scene data. The traversal algorithm fits in just a few lines of code. Then, the Kd-tree is often compiled in a manner that minimalizes the number of traversal steps by having large, empty nodes at the top of tree ("Surface Area Heuristics" by Havran).
Mispredictions typically have a probability of 50%, but are of minor costs, because really many nodes fit in a cache-line (consider that you get 128 nodes per KiB!), and one of the two child nodes is always a direct neighbour in memory.
Example for cache oblivious techniques: Morton array indexing, also known as Z-order-curve-indexing. This kind of indexing might be preferred if you usually access nearby array elements in unpredictable direction. This might be valuable for large image or voxel data where you might have 32 or even 64 bytes big pixels, and then millions of them (typical compact camera measure is Megapixels, right?) or even thousands of billions for scientific simulations.
However, both techniques have one thing in common: Keep most frequently accessed stuff nearby, the less frequently things can be further away, spanning the whole range of L1 cache over main memory to harddisk, then other computers in the same room, next room, same country, worldwide, other planets.
Some random tricks that come to my mind, and which some of them I used recently:
Rethink your algorithm. For example, you have an image with a shape and the processing algorithm that looks for corners of the shape. Instead of operating on the image data directly, you can preprocess it, save all the shape's pixel coordinates in a list and then operate on the list. You avoid random the jumping around the image
Shrink data types. Regular int will take 4 bytes, and if you manage to use e.g. uint16_t you will cache 2x more stuff
Sometimes you can use bitmaps, I used it for processing a binary image. I stored pixel per bit, so I could fit 8*32 pixels in a single cache line. It really boosted the performance
Form Java, you can use JNI (it's not difficult) and implement your critical code in C to control the memory
In the Java world the JIT is going to be working hard to achieve this, and trying to second guess this is likely to be counterproductive. This SO question addresses Java-specific issues more fully.
This shouldn't be a difficult question, but I'd just like someone to bounce it off of before I continue. I simply need to decide what data structure to use based on these expected activities:
Will need to frequently iterate through in sorted order (starting at the head).
Will need to remove/restore arbitrary elements from the/a sorted view.
Later I'll be frequently resorting the data and working with multiple sorted views.
Also later I'll be frequently changing the position of elements within their sorted views.
This is in Java, by the way.
My best guess is that I'll either be rolling some custom Linked Hash Set (to arrange the links in sorted order) or possibly just using a Tree Set. But I'm still not completely sure yet. Recommendations?
Edit: I guess because of the arbitrary remove/restore, I should probably stick with a Tree Set, right?
Actually, not necessarily. Hmmm...
In theory, I'd say the right data structure is a multiway tree - preferably something like a B+ tree. Traditionally this is a disk-based data structure, but modern main memory has a lot of similar characteristics due to layers of cache and virtual memory.
In-order iteration of a B+ tree is very efficient because (1) you only iterate through the linked-list of leaf nodes - branch nodes aren't needed, and (2) you get extremely good locality.
Finding, removing and inserting arbitrary elements is log(n) as with any balanced tree, though with different constant factors.
Resorting within the tree is mostly a matter of choosing an algorithm that gives good performance when operating on a linked list of blocks (the leaf nodes), minimising the need to use leaf nodes - variants of quicksort or mergesort seem like likely candidates. Once the items are sorted in the branch nodes, just propogate the summary information back through the leaf nodes.
BUT - pragmatically, this is only something you'd do if you're very sure that you need it. Odds are good that you're better off using some standard container. Algorithm/data structure optimisation is the best kind of optimisation, but it can still be premature.
Standard LinkedHashSet or LinkedMultiset from google collections if you want your data structure to store not unique values.