I've got a dataset consisting of nodes and edges.
The nodes represent people and the edges represent their relations, each of which has a cost that's been calculated using Euclidean distance.
Now I wish to match these nodes together through their respective edges, where there is only one constraint:
Any node can only be matched with a single other node.
From this it follows that I'm working with a general graph, where every node could theoretically be matched with any other node in the dataset, as long as there is an edge between them.
What I wish to do is find the solution with the maximum number of matches and, among those, the minimum overall cost.
Nodes: Node A, Node B, Node C, Node D

Edges (Start, End, Cost):
- Edge 1: Node A, Node B, 0.5
- Edge 2: Node B, Node C, 1
- Edge 3: Node C, Node D, 0.5
- Edge 4: Node D, Node A, 1
The solution to this problem would be the following:
Assign Edge 1 and Edge 3, as that gives the maximum number of matches (in this case there are obviously only two possible solutions, but there could be tons of branching edges to other nodes).
Edge 1 and Edge 3 are assigned because that is the solution with the maximum number of matches and the minimum overall cost (1).
I've looked into quite a few algorithms, including Hungarian, Blossom, and minimum-cost flow, but I'm uncertain which is best for this case. Also, there seems to be an awful lot of material on solving these kinds of problems for bipartite graphs, which isn't really the case here.
So I ask you:
Which algorithm would be best in this scenario to return (a) the maximum number of matches and (b) the lowest overall cost?
Do you know of any good material (maybe some easy-to-understand pseudocode) for your recommended algorithm? I'm not the strongest in mathematical notation.
For (a), the most suitable algorithm (there are theoretically faster ones, but they're more difficult to understand) would be Edmonds' Blossom algorithm. Unfortunately it is quite complicated, but I'll try to explain the basics as best I can.
The basic idea is to take a matching and continually improve it (increase the number of matched nodes) by making some local changes. The key concept is an alternating path (usually called an augmenting path): a path from an unmatched node to another unmatched node, with the property that the edges alternate between being in the matching and being outside it.
If you have an alternating path, then you can increase the size of the matching by one by flipping the state (whether or not they are in the matching) of the edges in the alternating path.
If there exists an alternating path, then the matching is not maximum (since the path gives you a way to increase the size of the matching) and conversely, you can show that if there is no alternating path, then the matching is maximum. So, to find a maximum matching, all you need to be able to do is find an alternating path.
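For concreteness, here is a small Java sketch (mine, not part of the original answer) of that flipping step, assuming match[v] holds v's partner, or -1 when v is unmatched, and path lists the vertices of the alternating path in order:

class MatchingAugment {
    // Flip the matching along an alternating path whose endpoints are unmatched.
    // Edges path[0]-path[1], path[2]-path[3], ... are currently outside the
    // matching; putting them in (which implicitly drops the matched edges in
    // between) grows the matching by one.
    static void augment(int[] match, int[] path) {
        for (int i = 0; i + 1 < path.length; i += 2) {
            match[path[i]] = path[i + 1];
            match[path[i + 1]] = path[i];
        }
    }
}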
In bipartite graphs, this is very easy to do (it can be done with DFS). In general graphs this is more complicated, and this is where Edmonds' Blossom algorithm comes in. Roughly speaking:
Build a new graph, where there is an edge between two vertices u and v if you can get from u to v by first traversing an edge that is in the matching, and then traversing an edge that isn't.
In this graph, try to find a path from an unmatched vertex to a matched vertex that has an unmatched neighbor (that is, a neighbor in the original graph).
Each edge in the path you find corresponds to two edges of the original graph (namely an edge in the matching and one not in the matching), so the path translates to an alternating walk in the original graph, but this is not necessarily an alternating path (the distinction between path and walk is that a path only uses each vertex once, but a walk can use each vertex multiple times).
If the walk is a path, you have an alternating path and are done.
If not, then the walk uses some vertex more than once. You can remove the part of the walk between the two visits to this vertex, and you obtain a new graph (with part of the vertices removed). In this new graph you have to do the whole search again, and if you find an alternating path in the new graph you can "lift" it to an alternating path for the original graph.
Going into the details of this (crucial) last step would be a bit too much for a Stack Overflow answer, but you can find more details on Wikipedia, and perhaps having this high-level overview helps you understand the more mathematical articles.
Implementing this from scratch will be quite challenging.
For the weighted version (with the Euclidean distances), there is an even more complicated variant of Edmonds' algorithm that can handle weights. Kolmogorov offers a C++ implementation and an accompanying paper. This can also be used for the unweighted case, so using this implementation might be a good idea (even if it is not in Java, there should be some way to interface with it).
Since your weights are based on Euclidean distances, there might be a specialized algorithm for that case, but the more general version I mentioned above would also work, and an implementation is available for it.
Related
My tree is represented by its edges and the root node. The edge list is undirected.
char[][] edges = new char[][]{
        new char[]{'D', 'B'},
        new char[]{'A', 'C'},
        new char[]{'B', 'A'}
};
char root = 'A';
The tree is
    A
   / \
  B   C
  |
  D
How do I do depth first traversal on this tree? What is the time complexity?
I know time complexity of depth first traversal on linked nodes is O(n). But if the tree is represented by edges, I feel the time complexity is O(n^2). Am I wrong?
Providing code is appreciated, although I know it looks like a homework assignment.
The general template behind DFS looks something like this:
function DFS(node) {
    if (!node.visited) {
        node.visited = true;
        for (each edge {node, v}) {
            DFS(v);
        }
    }
}
If you have your edges represented as a list of all the edges in the graph, then you could implement the for loop by iterating across all the edges in the graph and, every time you find one with the current node as its source, following the edge to its endpoint and running a DFS from there. If you do this, then you'll do O(m) work per node in the graph (here, m is the number of edges), so the runtime will be O(mn), since you'll do this at most once per node in the graph. In a tree, the number of edges is always O(n), so for a tree the runtime is O(n^2).
That said, if you have a tree and there are only n edges, you can speed this up in a bunch of ways. First, you could consider doing an O(n log n) preprocessing step to sort the array of edges. Then, you can find all the edges leaving a given node by doing a binary search to find the first edge leaving the node, then iterating across the edges starting there to find just the edges leaving the node. This improves the runtime quite a bit: you do O(log n) work per node for the binary search, and then every edge gets visited only once. This means that the runtime is O(n log n). Since you've mentioned that the edges are undirected, you'll actually need to create two different copies of the edges array - one that's the original one, and one with the edges reversed - and should sort each one independently. The fact that DFS marks visited nodes along the way means that you don't need to do any extra bookkeeping here to figure out which direction you should go at each step, and this doesn't change the overall time complexity, though it does increase the space usage.
Alternatively, you could use a hashing-based solution. Before doing the DFS, iterate across the edges and convert them into a hash table whose keys are the nodes and whose values are lists of the edges leaving that node. This will take expected time O(n). You can then implement the "for each edge" step quite efficiently by just doing a hash table lookup to find the edges in question. This reduces the time to (expected) O(n), though the space usage goes up to O(n) as well. Since your edges are undirected, as you populate the table, just be sure to insert the edge in each direction.
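As an illustration, here is a hedged Java sketch of that hash-table approach, using the char[][] edge list from the question (class and method names are mine):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class EdgeListDfs {
    static void dfs(char[][] edges, char root) {
        // Build node -> neighbours once: expected O(n) for a tree.
        Map<Character, List<Character>> adj = new HashMap<>();
        for (char[] e : edges) {
            // Undirected: insert each edge in both directions.
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        visit(root, adj, new HashSet<>());
    }

    private static void visit(char node, Map<Character, List<Character>> adj, Set<Character> visited) {
        if (!visited.add(node)) return;   // already seen
        System.out.println(node);         // "process" the node
        for (char next : adj.getOrDefault(node, Collections.emptyList())) {
            visit(next, adj, visited);
        }
    }
}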
This question covers a software algorithm, which is on topic here (see On topic).
I am working on an interview question from Amazon Software Question,
specifically "Given a set of points (x,y) and an integer "n", return n number of points which are close to the origin"
Here is the sample high-level pseudocode answer to this question, from Sample Answer:
Step 1: Design a class called point which has three fields - int x, int y, int distance
Step 2: For all the points given, find the distance between them and origin
Step 3: Store the values in a binary tree
Step 4: Heap sort
Step 5: print the first n values from the binary tree
I agree with steps 1 and 2, because it makes sense in terms of object-oriented design to have one software bundle of data, Point, encapsulate the fields x, y, and distance.
Can someone explain the design decisions from 3 to 5?
Here's how I would do steps 3 to 5:
Step 3: Store all the points in an array
Step 4: Sort the array with respect to distance (I'd use some built-in sort here, like Arrays.sort)
Step 5: With the array sorted in ascending order, I print off the first n values
Why did the author of that response use a more complicated data structure, a binary tree, and not something simpler like the array that I used? I know what a binary tree is: a hierarchical data structure of nodes with two pointers. In his algorithm, would you have to use a BST?
First, I would not say that having Point(x, y, distance) is good design or encapsulation. distance is not really part of a point; it can be computed from x and y. In terms of design, I would certainly have a function, i.e. a static method on Point or in a helper class Points.
double distance(Point a, Point b)
Then for the specific question, I actually agree with your solution, to put the data in an array, sort this array and then extract the N first.
What the example may be hinting at is that heapsort often uses a binary tree structure laid out inside the array being sorted, as explained here:
The heap is often placed in an array with the layout of a complete binary tree.
Of course, if the distance to the origin is not stored in the Point for performance reasons, then it has to be put alongside the corresponding Point object in the array, or at least some information that allows retrieving the Point object from the sorted distance (a reference or an index), e.g.
List<Pair<Long, Point>> distancesToOrigin = new ArrayList<>();
to be sorted with a Comparator<Pair<Long, Point>>
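As a minimal sketch of that approach (assuming a Point class with public int fields x and y; squared distances avoid the square root, and widening to long avoids int overflow), one could also sort the points directly with a Comparator instead of materialising Pair objects:

import java.util.Comparator;
import java.util.List;

class ClosestPoints {
    // Sort by squared distance to the origin and keep the first n points.
    static List<Point> nClosestToOrigin(List<Point> points, int n) {
        points.sort(Comparator.comparingLong(
                (Point p) -> (long) p.x * p.x + (long) p.y * p.y));
        return points.subList(0, Math.min(n, points.size()));
    }
}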
It is not necessary to use a BST. However, it is good practice to use a BST when you need a structure that keeps itself sorted. I do not see the need to both use a BST and heap-sort it (somehow). You could use just a BST and retrieve the first n points. You could also use an array, sort it, and take the first n points.
If you want to sort an array of type Point, you could have Point implement the Comparable interface and override its compareTo method.
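A small sketch of what that could look like (field and method names are illustrative, not from the answer):

class Point implements Comparable<Point> {
    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    long squaredDistanceToOrigin() { return (long) x * x + (long) y * y; }

    // Order points by their distance to the origin, so Arrays.sort puts the
    // closest points first.
    @Override
    public int compareTo(Point other) {
        return Long.compare(squaredDistanceToOrigin(), other.squaredDistanceToOrigin());
    }
}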
You never have to choose any particular data structure, but by determining your needs you can easily determine the optimal structure.
The approach described in this post is more complex than needed for such a question. As you noted, simple sorting by distance will suffice. However, to help explain your confusion about what your sample answer author was trying to get at, maybe consider the k nearest neighbors problem which can be solved with a k-d tree, a structure that applies space partitioning to the k-d dataset. For 2-dimensional space, that is indeed a binary tree. This tree is inherently sorted and doesn't need any "heap sorting."
It should be noted that building the k-d tree will take O(n log n), and is only worth the cost if you need to do repeated nearest neighbor searches on the structure. If you only need to perform one search to find k nearest neighbors from the origin, it can be done with a naive O(n) search.
How to build a k-d tree, straight from Wiki:
One adds a new point to a k-d tree in the same way as one adds an element to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the "left" or "right" side of the splitting plane. Once you get to the node under which the child should be located, add the new point as either the left or right child of the leaf node, again depending on which side of the node's splitting plane contains the new node.
Adding points in this manner can cause the tree to become unbalanced, leading to decreased tree performance. The rate of tree performance degradation is dependent upon the spatial distribution of tree points being added, and the number of points added in relation to the tree size. If a tree becomes too unbalanced, it may need to be re-balanced to restore the performance of queries that rely on the tree balancing, such as nearest neighbour searching.
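To make the insertion step concrete, here is a minimal 2-d tree sketch along the lines of the quoted description (class and field names are mine; duplicate points simply go to the right subtree):

class KdNode {
    double[] point;        // {x, y}
    KdNode left, right;
    KdNode(double[] p) { point = p; }
}

class KdTree {
    KdNode root;

    void insert(double[] p) {
        root = insert(root, p, 0);
    }

    // axis alternates between 0 (x) and 1 (y) from one level to the next,
    // i.e. the splitting plane alternates between vertical and horizontal.
    private KdNode insert(KdNode node, double[] p, int axis) {
        if (node == null) return new KdNode(p);
        if (p[axis] < node.point[axis]) {
            node.left = insert(node.left, p, (axis + 1) % 2);
        } else {
            node.right = insert(node.right, p, (axis + 1) % 2);
        }
        return node;
    }
}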
Once you have built the tree, you can find the k nearest neighbors to some point (the origin in your case) in O(k log n) time.
Straight from Wiki:
Searching for a nearest neighbour in a k-d tree proceeds as follows:
Starting with the root node, the algorithm moves down the tree recursively, in the same way that it would if the search point were being inserted (i.e. it goes left or right depending on whether the point is lesser than or greater than the current node in the split dimension).
Once the algorithm reaches a leaf node, it saves that node point as the "current best"
The algorithm unwinds the recursion of the tree, performing the following steps at each node:
If the current node is closer than the current best, then it becomes the current best.
The algorithm checks whether there could be any points on the other side of the splitting plane that are closer to the search point than the current best. In concept, this is done by intersecting the splitting hyperplane with a hypersphere around the search point that has a radius equal to the current nearest distance. Since the hyperplanes are all axis-aligned this is implemented as a simple comparison to see whether the difference between the splitting coordinate of the search point and current node is lesser than the distance (overall coordinates) from the search point to the current best.
If the hypersphere crosses the plane, there could be nearer points on the other side of the plane, so the algorithm must move down the other branch of the tree from the current node looking for closer points, following the same recursive process as the entire search.
If the hypersphere doesn't intersect the splitting plane, then the algorithm continues walking up the tree, and the entire branch on the other side of that node is eliminated.
When the algorithm finishes this process for the root node, then the search is complete.
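And here is a hedged sketch of the 1-nearest-neighbour search just described, reusing the KdNode class from the insertion sketch above (squared distances throughout, so no square roots are needed):

class KdSearch {
    private double[] best;
    private double bestDist;

    double[] nearest(KdNode root, double[] target) {
        best = null;
        bestDist = Double.POSITIVE_INFINITY;
        search(root, target, 0);
        return best;
    }

    private void search(KdNode node, double[] target, int axis) {
        if (node == null) return;
        double dx = node.point[0] - target[0];
        double dy = node.point[1] - target[1];
        double d = dx * dx + dy * dy;
        if (d < bestDist) {                        // current node closer than current best?
            bestDist = d;
            best = node.point;
        }
        double diff = target[axis] - node.point[axis];
        KdNode near = diff < 0 ? node.left : node.right;
        KdNode far = diff < 0 ? node.right : node.left;
        search(near, target, (axis + 1) % 2);      // descend as if inserting the target
        if (diff * diff < bestDist) {              // hypersphere crosses the splitting plane?
            search(far, target, (axis + 1) % 2);   // then the other side may hold a closer point
        }
    }
}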
This is a pretty tricky algorithm that I would hate to need to describe as an interview question! Fortunately the general case here is more complex than is needed, as you pointed out in your post. But I believe this approach may be close to what your (wrong) sample answer was trying to describe.
I want to be able to generate random, undirected, and connected graphs in Java. In addition, I want to be able to control the maximum number of vertices in the graph. I am not sure what would be the best way to approach this problem, but here are a few I can think of:
(1) Generate a number between 0 and n and let that be the number of vertices. Then, somehow randomly link vertices together (maybe generate a random number per vertex and let that be the number of edges coming out of said vertex). Traverse the graph starting from an arbitrary vertex (say with Breadth-First-Search) and let our random graph G be all the visited nodes (this way, we make sure that G is connected).
(2) Generate a random square matrix (of 0's and 1's) with side length between 0 and n (somehow). This would be the adjacency matrix for our graph (the diagonal of the matrix should then either be all 1's or all 0's). Make a data structure from the graph and traverse the graph from any node to get a connected list of nodes and call that the graph G.
Any other way to generate a sufficiently random graph is welcomed. Note: I do not need a purely random graph, i.e., the graph you generate doesn't have to have any special mathematical properties (like uniformity of some sort). I simply need lots and lots of graphs for testing purposes of something else.
Here is the Java Node class I am using:
public class Node<T> {
    T data;
    ArrayList<Node> children = new ArrayList<Node>();
    ...
}
Here is the Graph class I am using (you can tell why I am only interested in connected graphs at the moment):
public class Graph {
    Node mainNode;
    ArrayList<Node> V = new ArrayList<Node>();

    public Graph(Node node){
        mainNode = node;
    }
    ...
}
As an example, this is how I make graphs for testing purposes right now:
//The following makes a "kite" graph G (with "a" as the main node).
/* a-b
   |/|
   c-d
*/
Node<String> a= new Node("a");
Node<String> b= new Node("b");
Node<String> c= new Node("c");
Node<String> d= new Node("d");
a.addChild(b);
a.addChild(c);
b.addChild(a);
b.addChild(c);
b.addChild(d);
c.addChild(a);
c.addChild(b);
c.addChild(d);
d.addChild(c);
d.addChild(b);
Graph G1= new Graph(a);
Whatever you want to do with your graph, I guess its density is also an important parameter. Otherwise, you'd just generate a set of small cliques (complete graphs) using random sizes, and then connect them randomly.
If I'm correct, I'd advise you to use the Erdős-Rényi model: it's simple, not far from what you originally proposed, and allows you to control the graph density (so, basically: the number of links).
Here's a short description of this model:
Define a probability value p (the higher p, the denser the graph: 0 = no links, 1 = fully connected graph);
Create your n nodes (as objects, as an adjacency matrix, or anything that suits you);
Each pair of nodes is connected with an (independent) probability p. So, you have to decide on the existence of a link between them using this probability p. For example, you could randomly draw a value q between 0 and 1 and create the link iff q < p. Then do the same thing for each possible pair of nodes in the graph.
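As a minimal sketch of that generation step, assuming an adjacency-matrix representation (the connectivity fix-up described below is deliberately left out):

import java.util.Random;

class RandomGraph {
    // Erdos-Renyi G(n, p): each of the n*(n-1)/2 possible undirected edges is
    // created independently with probability p.
    static boolean[][] erdosRenyi(int n, double p, Random rng) {
        boolean[][] adj = new boolean[n][n];
        for (int u = 0; u < n; u++) {
            for (int v = u + 1; v < n; v++) {
                if (rng.nextDouble() < p) {
                    adj[u][v] = true;
                    adj[v][u] = true;
                }
            }
        }
        return adj;
    }
}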
With this model, if your p is large enough, then it's highly probable your graph is connected (cf. the Wikipedia reference for details). In any case, if you have several components, you can also force connectedness by creating links between nodes of distinct components. First, you have to identify each component by performing breadth-first searches (one for each component). Then, you select pairs of nodes in two distinct components, create a link between them, and consider both components as merged. You repeat this process until you've got a single component remaining.
The only tricky part is ensuring that the final graph is connected. To do that, you can use a disjoint-set data structure. Keep track of the number of components, initially n. Repeatedly pick pairs of random vertices u and v, adding the edge (u, v) to the graph and to the disjoint-set structure, and decrementing the component count when that structure tells you u and v belonged to different components. Stop when the component count reaches 1. (Note that using an adjacency matrix simplifies managing the case where the edge (u, v) is already present in the graph: in this case, adj[u][v] will be set to 1 a second time, which as desired has no effect.)
If you find this creates graphs that are too dense (or too sparse), then you can use another random number to add edges only k% of the time when the endpoints are already part of the same component (or when they are part of different components), for some k.
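A hedged sketch of that idea, with a minimal union-find inlined (all names are illustrative):

import java.util.Random;

class ConnectedRandomGraph {
    // Keep adding random edges until the disjoint-set structure reports a
    // single component; the result is a connected undirected graph.
    static boolean[][] generate(int n, Random rng) {
        boolean[][] adj = new boolean[n][n];
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        int components = n;
        while (components > 1) {
            int u = rng.nextInt(n), v = rng.nextInt(n);
            if (u == v) continue;
            adj[u][v] = adj[v][u] = true;      // re-adding an existing edge is harmless
            int ru = find(parent, u), rv = find(parent, v);
            if (ru != rv) {                    // u and v were in different components
                parent[ru] = rv;
                components--;
            }
        }
        return adj;
    }

    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];     // path halving
            x = parent[x];
        }
        return x;
    }
}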
The following paper proposes an algorithm that uniformly samples connected random graphs with prescribed degree sequence, with an efficient implementation. It is available in several libraries, like Networkit or igraph.
Fast generation of random connected graphs with prescribed degrees.
Fabien Viger, Matthieu Latapy
Be careful when you make simulations on random graphs: if they are not sampled uniformly, then they may have hidden properties that impact simulations; alternatively, uniformly sampled graphs may be very different from the ones your code will meet in practice...
Concerning this question about the union-find disjoint set data structure, specifically the weighted quick-union with path compression algorithm:
Weighted Quick-Union with Path Compression algorithm
does the path compression affect the iz[] array (the array that contains the size of the tree rooted at i)?
As far as I understand the code, the array iz[] represents the number of elements in a given disjoint set. When you compress the path, you don't modify that number for each set. Thus, path compression does not affect the iz[] array.
I would start by quoting a few basic points. First of all, this is the most efficient algorithm for implementing disjoint sets, and it is referred to as union by rank with the path compression heuristic. This algorithm requires 2 arrays: the first (id[] there) is used as a link to the parent, and the second (iz[]) gives the number of nodes in that set.
We have 2 operations: union and connected.
Union is done by 'rank', which leads to a lower amortized cost for further operations: by making the smaller tree a child of the bigger one, the tree height tends to stay minimal.
When the connected method is called, after we find the root of that tree, we use the path compression technique, which basically points that particular node to the root of its tree so that in future we don't have to traverse the whole branch again. Since iz[] just contains the number of nodes in the set, it is not affected by path compression.
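To make that concrete, here is a minimal weighted quick-union with path compression sketch in the style the question refers to; note that sz[] (the question's iz[]) is only touched in union, never while compressing paths in find:

class WeightedQuickUnionUF {
    private final int[] parent;
    private final int[] sz;            // number of nodes in the tree rooted at i

    WeightedQuickUnionUF(int n) {
        parent = new int[n];
        sz = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
            sz[i] = 1;
        }
    }

    int find(int x) {
        while (x != parent[x]) {
            parent[x] = parent[parent[x]];   // path compression: sz[] is not modified here
            x = parent[x];
        }
        return x;
    }

    boolean connected(int p, int q) {
        return find(p) == find(q);
    }

    void union(int p, int q) {
        int rp = find(p), rq = find(q);
        if (rp == rq) return;
        if (sz[rp] < sz[rq]) {               // attach the smaller tree under the bigger one
            parent[rp] = rq;
            sz[rq] += sz[rp];
        } else {
            parent[rq] = rp;
            sz[rp] += sz[rq];
        }
    }
}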
I need to find the shortest path through an undirected graph whose nodes have real (positive and negative) weights. These weights are like resources, which you can gain or lose by entering the node.
The total cost (resource sum) of the path isn't very important, but it must be more than 0, and the length has to be the shortest possible.
For example consider a graph like so:
A-start node; D-end node
A(+10)--B( 0 )--C(-5 )
      \   |   /
       \  |  /
D(-5 )--E(-5 )--F(+10)
The shortest path would be A-E-F-E-D
Dijkstra's algorithm alone doesn't do the trick, because it can't handle negative values. So, I thought about a few solutions:
The first one uses Dijkstra's algorithm to calculate the length of the shortest path from each node to the exit node, not considering the weights. This could be used as some sort of heuristic value, like in A*. I'm not sure if this solution could work, and it's also very costly. I also thought about implementing the Floyd–Warshall algorithm, but I'm not sure how.
Another solution was to calculate the shortest path with Dijkstra's algorithm, not considering the weights, and if the resulting path's resource sum is less than zero, go through each node on the path to find a neighbouring node which could quickly increase the resource sum, and add it to the path (several times if needed). This solution won't work if there is a node whose gain would be enough to increase the resource sum, but which is farther away from the calculated shortest path.
For example:
A- start node; E- end node
A(+10)--B(-5 )--C(+40)
 \
  D(-5 )--E(-5 )
Could you help me solve this problem?
EDIT: If, when calculating the shortest path, you reach a point where the sum of the resources is equal to zero, that path is not valid, since you can't go on if there's no more petrol.
Edit: I didn't read the question well enough; the problem is more advanced than a regular single-source shortest path problem. I'm leaving this post up for now just to give you another algorithm that you might find useful.
The Bellman-Ford algorithm solves the single-source shortest-path problem, even in the presence of edges with negative weight. However, it does not handle negative cycles (a circular path in the graph whose weight sum is negative). If your graph contains negative cycles, you are probably in trouble, because I believe that that makes the problem NP-complete (because it corresponds to the longest simple path problem).
This doesn't seem like an elegant solution, but given the ability to create cyclic paths I don't see a way around it; I would just solve it iteratively. Using the second example: start with a point at A and give it A's value. Move one 'turn': now I have two points, one at B with a value of 5, and one at D also with a value of 5. Move again: now I have 4 points to track, C: 45, A: 15, A: 15, and E: 0. It might be that the one at E can oscillate and become valid, so we can't toss it out yet. Move and accumulate, etc. The first time you reach the end node with a positive value you are done (though there may be additional equivalent paths that come in on the same turn).
Obviously problematic in that the number of points to track will rise pretty quickly, and I assume your actual graph is much more complex than the example.
I would do it similarly to what Mikeb suggested: do a breadth-first search over the graph of possible states, i.e. (position, fuel-left)-pairs.
Using your example graph:
[Figure omitted: the search tree over (position, fuel-left) states for the example graph. Octagons mark states that ran out of fuel; boxes mark nodes whose children are omitted for space reasons.]
Searching this graph breadth-first is guaranteed to give you the shortest route that actually reaches the goal if such a route exists. If it does not, you will have to give up after a while (after x nodes searched, or maybe when you reach a node with a score greater than the absolute value of all negative scores combined), as the graph can contain infinite loops.
You have to make sure not to abort immediately on finding the goal if you want to find the cheapest path (fuel wise) too, because you might find more than one path of the same length, but with different costs.
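Here is a hedged Java sketch of that breadth-first search over (position, fuel-left) states; the adjacency-list graph, the node-weights array, and the fuel cap used to bound the otherwise infinite state space are all my assumptions, and it only returns the shortest length, not the cheapest among the shortest:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ResourcePathBfs {
    // weight[v] is the resource gained (positive) or lost (negative) on entering v;
    // adj holds the undirected adjacency lists. Returns the smallest number of
    // moves from start to goal with the fuel staying above 0, or -1 if none found.
    static int shortestValidPath(int[] weight, List<List<Integer>> adj,
                                 int start, int goal, int fuelCap) {
        int startFuel = weight[start];
        if (startFuel <= 0) return -1;                 // cannot even start
        Deque<int[]> queue = new ArrayDeque<>();       // entries: {node, fuel, pathLength}
        Set<Long> seen = new HashSet<>();
        queue.add(new int[]{start, startFuel, 0});
        seen.add(encode(start, startFuel));
        while (!queue.isEmpty()) {
            int[] state = queue.poll();
            if (state[0] == goal) return state[2];     // BFS: first hit is the shortest
            for (int next : adj.get(state[0])) {
                int fuel = state[1] + weight[next];
                if (fuel <= 0 || fuel > fuelCap) continue;   // invalid or capped state
                long key = encode(next, fuel);
                if (seen.add(key)) queue.add(new int[]{next, fuel, state[2] + 1});
            }
        }
        return -1;
    }

    private static long encode(int node, int fuel) {
        return ((long) node << 32) | fuel;
    }
}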
Try adding the absolute value of the minimum weight (in this case 5) to all weights. That will avoid negative cyclic paths.
Standard shortest-path algorithms require calculating the shortest path to every node, because they combine partial solutions at some nodes to help adjust the shortest paths at other nodes. There is no way to run them for only one target node.
Good luck