union find disjoint set weighted quick union with path compression algorithm

union find disjoint set weighted quick union with path compression algorithm - java

Concerning this question about union find disjoint set , weighted quick union with path compression algorithm
Weighted Quick-Union with Path Compression algorithm
does the path compression affect the iz[] array (array that contain length of the tree rooted at i) ?

As far as I understood the code, the array iz[] represents the amount of elements in a given disjoint set. When you compress the path, you don't modify that number for each set. Thus, the path compression does not affect the iz[] array.

I would start quoting a few basic points. First of all this is the most optimal algorithm for implementing disjoint set and referred as Union by rank with path compression heuristic. This algorithm requires 2 arrays, first(id[] there) is used as a link to parent and the second(iz[]) gives the number of nodes int that set.
We have 2 operations- Union and connected.
Union is done by 'rank' which leads to lower amortized cost for further operations by making the smaller tree child of the bigger one, by this the length tends to be minimum.
When connected method is called, after we get to know the root of that tree, we use a path compression technique which basically points that particular node to the root of that tree so in future we don't have to traverse the whole branch again. Since iz[] just contains the number of nodes of that set, it does not have any effect by path compression.

Related

How to compare all elements between two arrays?

I'm attempting to write a program which can identify all nodes in a graph that don't share any common neighbors, and in which all vertices are contained within the various subsystems in the graph. Assume all nodes are numerically labeled for simplicity.
For example, in a graph of a cube, the furthest corners share no common nodes and are part of subsystems that together contain all vertices.
I'm looking to write a program that compares each potential subsystem against all other potential subsystems, regardless of the graph, number of nodes or sides, and finds groups of subsystems whose central nodes don't share common neighbors. For simplicity's sake, assume the graphs aren't usually symmetrical, unlike the cube example, as this introduces functionally equivalent systems. However, the number of nodes in a subsystem, or elements in an array, can vary.
The goal for the program is to find a set of central nodes whose neighbors are each unique to the subsystem, that is no neighbor appears in another subsystem. This would also mean that the total number of nodes in all subsystems, central nodes and neighbors together, would equal the total number of vertices in the graph.
My original plan was to use a 2d array, where rows act as stand-ins for the subsystems. It would compare individual elements in an array against all other elements in all other arrays. If two arrays contain no similar elements, then index the compared array and its central node is recorded, otherwise it is discarded for this iteration. After the program has finished iterating through the 2d array against the first row, it adds up the number of elements from all recorded rows to see if all nodes in the graph are represented. So if a graph contains x nodes, and the number of elements in the recorded rows is less than x, then the program iterates down one row to the next subsystem and compares all values in the row against all other values like before.
Eventually, this system should print out which nodes can make up a group of subsystems that encompass all vertices and whose central nodes share no common neighbors.
My limited exposure to CS makes a task like this daunting, as it's just my way of solving puzzles presented by my professor. I'd find the systems by hand through guess-and-check methods, but with a 60+ node array...
Thanks for any help, and simply pointers in the right direction would be very much appreciated.

I don't have a good solution (and maybe there exists none; it sounds very close to vertex cover). So you may need to resort backtracking. The idea is the following:
Keep a list of uncovered vertices and a list of potential central node candidates. Both lists initially contain all vertices. Then start placing a random central node from the candidate list. This will erase the node and its one-ring from the uncovered list and the node plus its one-ring and two-ring from the candidate list. Do this until the uncovered list is empty or you run out of candidates. If you make a mistake, revert the last step (and possibly more).
In pseudo-code, this looks as follows:
findSolution(uncoveredVertices : list, centralNodeCandidates : list, centralNodes : list)
if uncoveredVertices is empty
return centralNodes //we have found a valid partitioning
if centralNodeCandidates is empty
return [failure] //we cannot place more central nodes
for every n in centralNodeCandidates
newUncoveredVertices <- uncoveredVertices \ { n } \ one-ring of n
newCentralNodeCandidates <- centralNodeCandidates \ { n } \ one-ring of n \ two-ring of n
newCentralNodes = centralNodes u { n }
subProblemSolution = findSolution(newUncoveredVertices, newCentralNodeCandidates, newCentralNodes)
if subProblemSolution is not [failure]
return subProblemSolution
next
return [failure] //none of the possible routes to go yielded a valid solution
Here, \ is the set minus operator and u is set union.
There are several possible optimizations:
If you know the maximum number of nodes, you can represent the lists as bitmaps (for a maximum of 64 nodes, this even fits into a 64 bit integer).
You may end up checking the same state multiple times. To avoid this, you may want to cache the states that resulted in failures. This is in the spirit of dynamic programming.

Algorithm for minimal cost + maximum matching in a general graph

I've got a dataset consisting of nodes and edges.
The nodes respresent people and the edges represent their relations, which each has a cost that's been calculated using euclidean distance.
Now I wish to match these nodes together through their respective edges, where there is only one constraint:
Any node can only be matched with a single other node.
Now from this we know that I'm working in a general graph, where every node could theoretically be matched with any node in the dataset, as long as there is an edge between them.
What I wish to do, is find the solution with the maximum matches and the overall minimum cost.
Node A
Node B
Node C
Node D
- Edge 1:
Start: End Cost
Node A Node B 0.5
- Edge 2:
Start: End Cost
Node B Node C 1
- Edge 3:
Start: End Cost
Node C Node D 0.5
- Edge 2:
Start: End Cost
Node D Node A 1
The solution to this problem, would be the following:
Assign Edge 1 and Edge 3, as that is the maximum amount of matches ( in this case, there's obviously only 2 solutions, but there could be tons of branching edges to other nodes)
Edge 1 and Edge 3 is assigned, because it's the solution with maximum amount of matches and the minimum overall cost (1)
I've looked into quite a few algorithms including Hungarian, Blossom, Minimal-cost flow, but I'm uncertain which is the best for this case. Also there seems so be an awful lot of material to solving these kinds of problems in bipartial graph's, which isn't really the case in this matter.
So I ask you:
Which algorithm would be the best in this scenario to return the (a) maximum amount of matches and (b) with the lowest overall cost.
Do you know of any good material (maybe some easy-to-understand pseudocode), for your recomended algorithm? I'm not the strongest in mathematical notation.

For (a), the most suitable algorithm (there are theoretically faster ones, but they're more difficult to understand) would be Edmonds' Blossom algorithm. Unfortunately it is quite complicated, but I'll try to explain the basis as best I can.
The basic idea is to take a matching, and continually improve it (increase the number of matched nodes) by making some local changes. The key concept is an alternating path: a path from an unmatched node to another unmatched node, with the property that the edges alternate between being in the matching, and being outside it.
If you have an alternating path, then you can increase the size of the matching by one by flipping the state (whether or not they are in the matching) of the edges in the alternating path.
If there exists an alternating path, then the matching is not maximum (since the path gives you a way to increase the size of the matching) and conversely, you can show that if there is no alternating path, then the matching is maximum. So, to find a maximum matching, all you need to be able to do is find an alternating path.
In bipartite graphs, this is very easy to do (it can be done with DFS). In general graphs this is more complicated, and this is were Edmonds' Blossom algorithm comes in. Roughly speaking:
Build a new graph, where there is an edge between two vertices if you can get from u to v by first traversing an edge that is in the matching, and then traversing and edge that isn't.
In this graph, try to find a path from an unmatched vertex to a matched vertex that has an unmatched neighbor (that is, a neighbor in the original graph).
Each edge in the path you find corresponds to two edges of the original graph (namely an edge in the matching and one not in the matching), so the path translates to an alternating walk in the new graph, but this is not necessarily an alternating path (the distinction between path and walk is that a path only uses each vertex once, but a walk can use each vertex multiple times).
If the walk is a path, you have an alternating path and are done.
If not, then the walk uses some vertex more than once. You can remove the part of the walk between the two visits to this vertex, and you obtain a new graph (with part of the vertices removed). In this new graph you have to do the whole search again, and if you find an alternating path in the new graph you can "lift" it to an alternating path for the original graph.
Going into the details of this (crucial) last step would be a bit too much for a stackoverflow answer, but you can find more details on Wikipedia and perhaps having this high-level overview helps you understand the more mathematical articles.
Implementing this from scratch will be quite challenging.
For the weighted version (with the Euclidean distance), there is an even more complicated variant of Edmonds' Algorithm that can handle weights. Kolmogorov offers a C++ implementation and accompanying paper. This can also be used for the unweighted case, so using this implementation might be a good idea (even if it is not in java, there should be some way to interface with it).
Since your weights are based on Euclidean distances there might be a specialized algorithm for that case, but the more general version I mentioned above would also work and and implementation is available for it.

Approximate hashtable, does such a data structure exist?

I don't figure out how to implement a special hash table.
The idea would be that the hash table gives an approximate
match. So a perfect hash table (such as found in java.util)
just gives a map, such that:
Hashtable h = new Hashtable();
...
x = h.get(y);
If x is the result of applying the map h to the argument y,
i.e. basically in mathematics it would be a function
namely x = h(y). Now for the approximate match, what about a
data structure that gives me quickly:
x = h(k) where k=max { z<=y | h(z)!=null }
The problem is k can be very far away from the given y. For example
y could be 2000, and the next occupied slot k could be 1000. Some
linear search would be costly, the data structure should do the job
more quickly.
I know how to do it with a tree(*), but something with a hash, can this
also work? Or maybe combine some tree and hash properties in the sought
of data structure? Some data structure that tends toward O(1) access?
Bye
(*) You can use a tree ordered by y, and find something next below or equal y.

This is known as Spatial hashing. Keep in mind it has to be tailored for your specific domain.
It can be used when the hash tells you something about logical arrangement of objects. So when |hash(a)-hash(b)| < |hash(a)-hash(c)| means b is closer/more similar to a than c is.
Then the basic idea is that you divide the space into buckets (e.g. drop the least significant digits of the hash -- the naive approach) and your spatial hash is this bucket ID. You have to take care of the edge cases, when the objects are very near to each other, while being on the boundary of the buckets (e.g. h(1999) = 1 but h(2000)=2). You can solve this by two overlapping hashes and having two separate hash maps for them and querying both of them, or looking to the neighboring buckets etc...
As I sais in the beginning, this has to be thought through very well.
The tree (e.g. KD-tree for higher dimensions) isn't so demanding in the design phase and is generally a more convenient approach to nearest neighbor(s) querying.

The specific formula you give suggests you want a set that can retrieve the greatest item less than a given input.
One simple approach to achieving that would be to keep a sorted list of the items, and perform a binary search to locate the position in the list at which the given element would be inserted, then return the element equal to or less than that element.
As always, any set can be converted into a map by using a pair object to wrap the key-value pair, or by maintaining a parallel data structure for the values.
For an array-based approach, the runtime will be O(log n) for retrieval and O(n) for insertion of a single element. If 'add all' sorts the added elements and then merges them, it can be O(n log n).
It's not possible1 to have a constant-time algorithm that can answer what the first element less than a given element is using a hashing approach; a good hashing algorithm spreads out similar (but non-equal) items, to avoid having many similar items fall into the same bucket and destroy the desired constant-time retrieval behavior, this means the elements of a hash set (or map) are very deliberately not even remotely close to sorted order, they are as close to randomly distributed as we could achieve while using an efficient repeatable hashing algorithm.
1. Of course, proving that it's not possible is difficult, since one can't easily prove that there isn't a simple repeatable constant-time request that will reliably convince an oracle (or God, if God were that easy to manipulate) to give you the answer to the question you want, but it seems unlikely.

Why store the points in a binary tree?

This question covers a software algorithm, from On topic
I am working on an interview question from Amazon Software Question,
specifically "Given a set of points (x,y) and an integer "n", return n number of points which are close to the origin"
Here is the sample high level psuedocode answer to this question, from Sample Answer
Step 1: Design a class called point which has three fields - int x, int y, int distance
Step 2: For all the points given, find the distance between them and origin
Step 3: Store the values in a binary tree
Step 4: Heap sort
Step 5: print the first n values from the binary tree
I agree with steps 1 and 2 because it makes sense in terms of object-oriented design to have one software bundle of data, Point, encapsulate away the fields of x, y and distance.Ensapsulation
Can someone explain the design decisions from 3 to 5?
Here's how I would do steps of 3 to 5
Step 3: Store all the points in an array
Step 4: Sort the array with respect to distance(I use some build in sort here like Arrays.Sort
Step 5: With the array sorted in ascending order, I print off the first n values
Why the author of that response use a more complicated data structure, binary tree and not something simpler like an array that I used? I know what a binary tree is - hierarchical data structure of nodes with two pointers. In his algorithm, would you have to use a BST?

First, I would not say that having Point(x, y, distance) is good design or encapsulation. distance is not really part of a point, it can be computed from x and y. In term of design, I would certainly have a function, i.e. a static method from Point or an helper class Points.
double distance(Point a, Point b)
Then for the specific question, I actually agree with your solution, to put the data in an array, sort this array and then extract the N first.
What the example may be hinted at is that the heapsort actually often uses a binary tree structure inside the array to be sorted as explained here :
The heap is often placed in an array with the layout of a complete binary tree.
Of course, if the distance to the origin is not stored in the Point, for performance reason, it had to be put with the corresponding Point object in the array, or any information that will allow to get the Point object from the sorted distance (reference, index), e.g.
List<Pair<Long, Point>> distancesToOrigin = new ArrayList<>();
to be sorted with a Comparator<Pair<Long, Point>>

It is not necessary to use BST. However, it is a good practice to use BST when needing a structure that is self-sorted. I do not see the need to both use BST and heapsort it (somehow). You could use just BST and retrieve the first n points. You could also use an array, sort it and use the first n points.
If you want to sort an array of type Point, you could implement the interface Comparable (Point would imolement that interface) and overload the default method.
You never have to choose any data structures, but by determining the needs you have, you would also easily determine the optimum structure.

The approach described in this post is more complex than needed for such a question. As you noted, simple sorting by distance will suffice. However, to help explain your confusion about what your sample answer author was trying to get at, maybe consider the k nearest neighbors problem which can be solved with a k-d tree, a structure that applies space partitioning to the k-d dataset. For 2-dimensional space, that is indeed a binary tree. This tree is inherently sorted and doesn't need any "heap sorting."
It should be noted that building the k-d tree will take O(n log n), and is only worth the cost if you need to do repeated nearest neighbor searches on the structure. If you only need to perform one search to find k nearest neighbors from the origin, it can be done with a naive O(n) search.
How to build a k-d tree, straight from Wiki:
One adds a new point to a k-d tree in the same way as one adds an element to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the "left" or "right" side of the splitting plane. Once you get to the node under which the child should be located, add the new point as either the left or right child of the leaf node, again depending on which side of the node's splitting plane contains the new node.
Adding points in this manner can cause the tree to become unbalanced, leading to decreased tree performance. The rate of tree performance degradation is dependent upon the spatial distribution of tree points being added, and the number of points added in relation to the tree size. If a tree becomes too unbalanced, it may need to be re-balanced to restore the performance of queries that rely on the tree balancing, such as nearest neighbour searching.
Once have have built the tree, you can find k nearest neighbors to some point (the origin in your case) in O(k log n) time.
Straight from Wiki:
Searching for a nearest neighbour in a k-d tree proceeds as follows:
Starting with the root node, the algorithm moves down the tree recursively, in the same way that it would if the search point were being inserted (i.e. it goes left or right depending on whether the point is lesser than or greater than the current node in the split dimension).
Once the algorithm reaches a leaf node, it saves that node point as the "current best"
The algorithm unwinds the recursion of the tree, performing the following steps at each node:
If the current node is closer than the current best, then it becomes the current best.
The algorithm checks whether there could be any points on the other side of the splitting plane that are closer to the search point than the current best. In concept, this is done by intersecting the splitting hyperplane with a hypersphere around the search point that has a radius equal to the current nearest distance. Since the hyperplanes are all axis-aligned this is implemented as a simple comparison to see whether the difference between the splitting coordinate of the search point and current node is lesser than the distance (overall coordinates) from the search point to the current best.
If the hypersphere crosses the plane, there could be nearer points on the other side of the plane, so the algorithm must move down the other branch of the tree from the current node looking for closer points, following the same recursive process as the entire search.
If the hypersphere doesn't intersect the splitting plane, then the algorithm continues walking up the tree, and the entire branch on the other side of that node is eliminated.
When the algorithm finishes this process for the root node, then the search is complete.
This is a pretty tricky algorithm that I would hate to need to describe as an interview question! Fortunately the general case here is more complex than is needed, as you pointed out in your post. But I believe this approach may be close to what your (wrong) sample answer was trying to describe.

What's the simplest algorithm/solution for a single pair shortest path through a real-weighted undirected graph?

I need to find a shortest path through an undirected graph whose nodes are real (positive and negative) weighted. These weights are like resources which you can gain or loose by entering the node.
The total cost (resource sum) of the path isn't very important, but it must be more than 0, and length has to be the shortest possible.
For example consider a graph like so:
A-start node; D-end node
A(+10)--B( 0 )--C(-5 )
\ | /
\ | /
D(-5 )--E(-5 )--F(+10)
The shortest path would be A-E-F-E-D
Dijkstra's algorithm alone doesn't do the trick, because it can't handle negative values. So, I thought about a few solutions:
First one uses Dijkstra's algorithm to calculate the length of a shortest path from each node to the exit node, not considering the weights. This can be used like some sort of heuristics value like in A*. I'm not sure if this solution could work, and also it's very costly. I also thought about implement Floyd–Warshall's algorithm, but I'm not sure how.
Another solution was to calculate the shortest path with Dijkstra's algorithm not considering the weights, and if after calculating the path's resource sum it's less than zero, go through each node to find a neighbouring node which could quickly increase the resource sum, and add it to the path(several times if needed). This solution won't work if there is a node that could be enough to increase the resource sum, but farther away from the calculated shortest path.
For example:
A- start node; E- end node
A(+10)--B(-5 )--C(+40)
\
D(-5 )--E(-5 )
Could You help me solve this problem?
EDIT: If when calculating the shortest path, you reach a point where the sum of the resources is equal to zero, that path is not valid, since you can't go on if there's no more petrol.

Edit: I didn't read the question well enough; the problem is more advanced than a regular single-source shortest path problem. I'm leaving this post up for now just to give you another algorithm that you might find useful.
The Bellman-Ford algorithm solves the single-source shortest-path problem, even in the presence of edges with negative weight. However, it does not handle negative cycles (a circular path in the graph whose weight sum is negative). If your graph contains negative cycles, you are probably in trouble, because I believe that that makes the problem NP-complete (because it corresponds to the longest simple path problem).

This doesn't seem like an elegant solution, but given the ability to create cyclic paths I don't see a way around it. But I would just solve it iteratively. Using the second example - Start with a point at A, give it A's value. Move one 'turn' - now I have two points, one at B with a value of 5, and one at D also with a value of 5. Move again - now I have 4 points to track. C: 45, A: 15, A: 15, and E: 0. It might be that the one at E can oscillate and become valid so we can't toss it out yet. Move and accumulate, etc. The first time you reach the end node with a positive value you are done (though there may be additional equivalent paths that come in on the same turn)
Obviously problematic in that the number of points to track will rise pretty quickly, and I assume your actual graph is much more complex than the example.

I would do it similarly to what Mikeb suggested: do a breadth-first search over the graph of possible states, i.e. (position, fuel-left)-pairs.
Using your example graph:
Octagons: Ran out of fuel
Boxes: Child nodes omitted for space reasons
Searching this graph breadth-first is guaranteed to give you the shortest route that actually reaches the goal if such a route exists. If it does not, you will have to give up after a while (after x nodes searched, or maybe when you reach a node with a score greater than the absolute value of all negative scores combined), as the graph can contain infinite loops.
You have to make sure not to abort immediately on finding the goal if you want to find the cheapest path (fuel wise) too, because you might find more than one path of the same length, but with different costs.

Try adding the absolute value of the minimun weight (in this case 5) to all weights. That will avoid negative ciclic paths
Current shortest path algorithms requires calculate shortest path to every node because it goes combinating solutions on some nodes that will help adjusting shortest path in other nodes. No way to make it only for one node.
Good luck

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.