Traverse a tree represented by its edges

Traverse a tree represented by its edges - java

My tree is represented by its edges and the root node. The edge list is undirected.
char[][] edges =new char[][]{
new char[]{'D','B'},
new char[]{'A','C'},
new char[]{'B','A'}
};
char root='A';
The tree is
A
B C
D
How do I do depth first traversal on this tree? What is the time complexity?
I know time complexity of depth first traversal on linked nodes is O(n). But if the tree is represented by edges, I feel the time complexity is O(n^2). Am I wrong?
Giving code is appreciated, although I know it looks like homework assignment..

The general template behind DFS looks something like this:
function DFS(node) {
if (!node.visited) {
node.visited = true;
for (each edge {node, v}) {
DFS(v);
}
}
}
If you have your edges represented as a list of all the edges in the graph, then you could implement the for loop by iterating across all the edges in the graph and, every time you find one with the current node as its source, following the edge to its endpoint and running a DFS from there. If you do this, then you'll do O(m) work per node in the graph (here, m is the number of edges), so the runtime will be O(mn), since you'll do this at most once per node in the graph. In a tree, the number of edges is always O(n), so for a tree the runtime is O(n2).
That said, if you have a tree and there are only n edges, you can speed this up in a bunch of ways. First, you could consider doing an O(n log n) preprocessing step to sort the array of edges. Then, you can find all the edges leaving a given node by doing a binary search to find the first edge leaving the node, then iterating across the edges starting there to find just the edges leaving the node. This improves the runtime quite a bit: you do O(log n) work per node for the binary search, and then every edge gets visited only once. This means that the runtime is O(n log n). Since you've mentioned that the edges are undirected, you'll actually need to create two different copies of the edges array - one that's the original one, and one with the edges reversed - and should sort each one independently. The fact that DFS marks visited nodes along the way means that you don't need to do any extra bookkeeping here to figure out which direction you should go at each step, and this doesn't change the overall time complexity, though it does increase the space usage.
Alternatively, you could use a hashing-based solution. Before doing the DFS, iterate across the edges and convert them into a hash table whose keys are the nodes and whose values are lists of the edges leaving that node. This will take expected time O(n). You can then implement the "for each edge" step quite efficiently by just doing a hash table lookup to find the edges in question. This reduces the time to (expected) O(n), though the space usage goes up to O(n) as well. Since your edges are undirected, as you populate the table, just be sure to insert the edge in each direction.

Related

Algorithm for minimal cost + maximum matching in a general graph

I've got a dataset consisting of nodes and edges.
The nodes respresent people and the edges represent their relations, which each has a cost that's been calculated using euclidean distance.
Now I wish to match these nodes together through their respective edges, where there is only one constraint:
Any node can only be matched with a single other node.
Now from this we know that I'm working in a general graph, where every node could theoretically be matched with any node in the dataset, as long as there is an edge between them.
What I wish to do, is find the solution with the maximum matches and the overall minimum cost.
Node A
Node B
Node C
Node D
- Edge 1:
Start: End Cost
Node A Node B 0.5
- Edge 2:
Start: End Cost
Node B Node C 1
- Edge 3:
Start: End Cost
Node C Node D 0.5
- Edge 2:
Start: End Cost
Node D Node A 1
The solution to this problem, would be the following:
Assign Edge 1 and Edge 3, as that is the maximum amount of matches ( in this case, there's obviously only 2 solutions, but there could be tons of branching edges to other nodes)
Edge 1 and Edge 3 is assigned, because it's the solution with maximum amount of matches and the minimum overall cost (1)
I've looked into quite a few algorithms including Hungarian, Blossom, Minimal-cost flow, but I'm uncertain which is the best for this case. Also there seems so be an awful lot of material to solving these kinds of problems in bipartial graph's, which isn't really the case in this matter.
So I ask you:
Which algorithm would be the best in this scenario to return the (a) maximum amount of matches and (b) with the lowest overall cost.
Do you know of any good material (maybe some easy-to-understand pseudocode), for your recomended algorithm? I'm not the strongest in mathematical notation.

For (a), the most suitable algorithm (there are theoretically faster ones, but they're more difficult to understand) would be Edmonds' Blossom algorithm. Unfortunately it is quite complicated, but I'll try to explain the basis as best I can.
The basic idea is to take a matching, and continually improve it (increase the number of matched nodes) by making some local changes. The key concept is an alternating path: a path from an unmatched node to another unmatched node, with the property that the edges alternate between being in the matching, and being outside it.
If you have an alternating path, then you can increase the size of the matching by one by flipping the state (whether or not they are in the matching) of the edges in the alternating path.
If there exists an alternating path, then the matching is not maximum (since the path gives you a way to increase the size of the matching) and conversely, you can show that if there is no alternating path, then the matching is maximum. So, to find a maximum matching, all you need to be able to do is find an alternating path.
In bipartite graphs, this is very easy to do (it can be done with DFS). In general graphs this is more complicated, and this is were Edmonds' Blossom algorithm comes in. Roughly speaking:
Build a new graph, where there is an edge between two vertices if you can get from u to v by first traversing an edge that is in the matching, and then traversing and edge that isn't.
In this graph, try to find a path from an unmatched vertex to a matched vertex that has an unmatched neighbor (that is, a neighbor in the original graph).
Each edge in the path you find corresponds to two edges of the original graph (namely an edge in the matching and one not in the matching), so the path translates to an alternating walk in the new graph, but this is not necessarily an alternating path (the distinction between path and walk is that a path only uses each vertex once, but a walk can use each vertex multiple times).
If the walk is a path, you have an alternating path and are done.
If not, then the walk uses some vertex more than once. You can remove the part of the walk between the two visits to this vertex, and you obtain a new graph (with part of the vertices removed). In this new graph you have to do the whole search again, and if you find an alternating path in the new graph you can "lift" it to an alternating path for the original graph.
Going into the details of this (crucial) last step would be a bit too much for a stackoverflow answer, but you can find more details on Wikipedia and perhaps having this high-level overview helps you understand the more mathematical articles.
Implementing this from scratch will be quite challenging.
For the weighted version (with the Euclidean distance), there is an even more complicated variant of Edmonds' Algorithm that can handle weights. Kolmogorov offers a C++ implementation and accompanying paper. This can also be used for the unweighted case, so using this implementation might be a good idea (even if it is not in java, there should be some way to interface with it).
Since your weights are based on Euclidean distances there might be a specialized algorithm for that case, but the more general version I mentioned above would also work and and implementation is available for it.

Why store the points in a binary tree?

This question covers a software algorithm, from On topic
I am working on an interview question from Amazon Software Question,
specifically "Given a set of points (x,y) and an integer "n", return n number of points which are close to the origin"
Here is the sample high level psuedocode answer to this question, from Sample Answer
Step 1: Design a class called point which has three fields - int x, int y, int distance
Step 2: For all the points given, find the distance between them and origin
Step 3: Store the values in a binary tree
Step 4: Heap sort
Step 5: print the first n values from the binary tree
I agree with steps 1 and 2 because it makes sense in terms of object-oriented design to have one software bundle of data, Point, encapsulate away the fields of x, y and distance.Ensapsulation
Can someone explain the design decisions from 3 to 5?
Here's how I would do steps of 3 to 5
Step 3: Store all the points in an array
Step 4: Sort the array with respect to distance(I use some build in sort here like Arrays.Sort
Step 5: With the array sorted in ascending order, I print off the first n values
Why the author of that response use a more complicated data structure, binary tree and not something simpler like an array that I used? I know what a binary tree is - hierarchical data structure of nodes with two pointers. In his algorithm, would you have to use a BST?

First, I would not say that having Point(x, y, distance) is good design or encapsulation. distance is not really part of a point, it can be computed from x and y. In term of design, I would certainly have a function, i.e. a static method from Point or an helper class Points.
double distance(Point a, Point b)
Then for the specific question, I actually agree with your solution, to put the data in an array, sort this array and then extract the N first.
What the example may be hinted at is that the heapsort actually often uses a binary tree structure inside the array to be sorted as explained here :
The heap is often placed in an array with the layout of a complete binary tree.
Of course, if the distance to the origin is not stored in the Point, for performance reason, it had to be put with the corresponding Point object in the array, or any information that will allow to get the Point object from the sorted distance (reference, index), e.g.
List<Pair<Long, Point>> distancesToOrigin = new ArrayList<>();
to be sorted with a Comparator<Pair<Long, Point>>

It is not necessary to use BST. However, it is a good practice to use BST when needing a structure that is self-sorted. I do not see the need to both use BST and heapsort it (somehow). You could use just BST and retrieve the first n points. You could also use an array, sort it and use the first n points.
If you want to sort an array of type Point, you could implement the interface Comparable (Point would imolement that interface) and overload the default method.
You never have to choose any data structures, but by determining the needs you have, you would also easily determine the optimum structure.

The approach described in this post is more complex than needed for such a question. As you noted, simple sorting by distance will suffice. However, to help explain your confusion about what your sample answer author was trying to get at, maybe consider the k nearest neighbors problem which can be solved with a k-d tree, a structure that applies space partitioning to the k-d dataset. For 2-dimensional space, that is indeed a binary tree. This tree is inherently sorted and doesn't need any "heap sorting."
It should be noted that building the k-d tree will take O(n log n), and is only worth the cost if you need to do repeated nearest neighbor searches on the structure. If you only need to perform one search to find k nearest neighbors from the origin, it can be done with a naive O(n) search.
How to build a k-d tree, straight from Wiki:
One adds a new point to a k-d tree in the same way as one adds an element to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the "left" or "right" side of the splitting plane. Once you get to the node under which the child should be located, add the new point as either the left or right child of the leaf node, again depending on which side of the node's splitting plane contains the new node.
Adding points in this manner can cause the tree to become unbalanced, leading to decreased tree performance. The rate of tree performance degradation is dependent upon the spatial distribution of tree points being added, and the number of points added in relation to the tree size. If a tree becomes too unbalanced, it may need to be re-balanced to restore the performance of queries that rely on the tree balancing, such as nearest neighbour searching.
Once have have built the tree, you can find k nearest neighbors to some point (the origin in your case) in O(k log n) time.
Straight from Wiki:
Searching for a nearest neighbour in a k-d tree proceeds as follows:
Starting with the root node, the algorithm moves down the tree recursively, in the same way that it would if the search point were being inserted (i.e. it goes left or right depending on whether the point is lesser than or greater than the current node in the split dimension).
Once the algorithm reaches a leaf node, it saves that node point as the "current best"
The algorithm unwinds the recursion of the tree, performing the following steps at each node:
If the current node is closer than the current best, then it becomes the current best.
The algorithm checks whether there could be any points on the other side of the splitting plane that are closer to the search point than the current best. In concept, this is done by intersecting the splitting hyperplane with a hypersphere around the search point that has a radius equal to the current nearest distance. Since the hyperplanes are all axis-aligned this is implemented as a simple comparison to see whether the difference between the splitting coordinate of the search point and current node is lesser than the distance (overall coordinates) from the search point to the current best.
If the hypersphere crosses the plane, there could be nearer points on the other side of the plane, so the algorithm must move down the other branch of the tree from the current node looking for closer points, following the same recursive process as the entire search.
If the hypersphere doesn't intersect the splitting plane, then the algorithm continues walking up the tree, and the entire branch on the other side of that node is eliminated.
When the algorithm finishes this process for the root node, then the search is complete.
This is a pretty tricky algorithm that I would hate to need to describe as an interview question! Fortunately the general case here is more complex than is needed, as you pointed out in your post. But I believe this approach may be close to what your (wrong) sample answer was trying to describe.

Loop in a binary tree represented in adjacency list form

I attained an interview where I was asked a question as below:
You are given with parent -----> child relationships i.e. N1 --->
N2 where N1 is the parent of N2. This is nothing but representing a
binary tree in adjacency list form. So I had to find whether there is a loop is present or not.
I have been mulling over this and came up with a solution:
Just need to check individual node i.e. N1 and try going deep if you see there is a edge coming back to N1 then print. Else go for next node. But the interviewer told me that it is not very efficient, can somebody help me in finding an efficient solution. Thanks.

You can do in a simple manner:
In a binary tree of N nodes will have N-1 edges. If it has loop then a binary tree having N nodes will have more than N-1 edges.
So you need to calculate the no of nodes and edges (parent->child) if the no_of_nodes == no_of_edges-1 than no loop else has loop.
Hope you understand.

Tree is a graph. So you should apply any algorithm for graph traversing, for instance BFS (breadth-first search) or DFS (depth-first search).
Both algorithms use O(V) of memory to store list of vertices (V - # of vertices) and take O(E) time to complete (E - # of edges, between V and V^2). So basically in both algorithms you need to explore all edges once.
The algorithm to identify loops in your tree is as simple as:
1) Take root node
2) Traverse through graph (breadth-first or depth-first) remembering visited nodes
3) If you visit a node that has already been visited, then you have a loop. Increment your loop counter, backtrack and go to 2

You need to:
check the graph is connected.
check that there's exactly E+1 vertices when you're given E edges.
You can apply Kruskal's algorithm to identify the connected components of the graph. After you've applied it, you can both check that you have exactly one connected component, and that the number of vertices is E+1.

Queue data structure supporting fast k-th largest element finding

I'm faced with a problem which requires a Queue data structure supporting fast k-th largest element finding.
The requirements of this data structure are as follows:
The elements in the queue are not necessarily integers, but they must be comparable to each other, i.e we can tell which one is greater when we compare two elements(they can be equal as well).
The data structure must support enqueue(adds the element at the tail) and dequeue(removes the element at the head).
It can quickly find the k-th largest element in the queue, pls note k is not a constant.
You can assume that operations enqueue , dequeue and k-th largest element finding all occur with the same frequency.
My idea is to use a modified balanced binary search tree. The tree is the same as ordinary balanced binary search tree except that every nodei is augmented with another field ni, ni denotes the number of nodes contained in the subtree with root nodei. The aforementioned operations are supported as follows:
For simplicity assume that all elements are distinct.
Enqueue(x): x is first inserted into the tree, suppose the corresponding node is nodet, we append pair(x,pointer to nodet) to the queue.
Dequeue: suppose (e1, node1) is the element at the head, node1 is the pointer into the tree corresponding to e1. We delete node1 from the tree and remove (e1, node1) from the queue.
K-th largest element finding: suppose root node is noderoot, its two children are nodeleft and noderight(suppose they all exist), we compare K with nroot , three cases may happen:
if K< nleft we find the K-th largest element in the left subtree of nroot;
if K>nroot-nright we find the (K-nroot+nright)-th largest element in the right subtree of nroot;
otherwise nroot is the node we want.
The time complexity of all the three operations are O(logN) , where N is the number of elements currently in the queue.
How can I speed up the operations mentioned above? With what data structures and how?

Note - you cannot achieve better then O(logn) for all, at best you need to "chose" which op you care for the most. (Otherwise, you could sort in O(n) by feeding the array to the DS, and querying 1st, 2nd, 3rd, ... nth elements)
Using a skip list instead of a Balanced BST as the sorted structure
can reduce dequeue complexity to O(1) average case. It does
not affect complexity of any other op.
To remove from a skip list - all you need to do is to get to the element using the pointer from the head of the queue, and follow the links up and remove each. The expected number of nodes needed to be deleted is 1 + 1/2 + 1/4 + ... = 2.
find Kth can be achieved in O(logK) by starting from the leftest node (and not the root) and making your way up until you find you have "more sons then needed", and then treat the just found node as the root just like the algorithm in the question. Though it is better in asymptotic complexity - the constant factor is double.

I found an interesting paper:
Sliding-Window Top-k Queries on Uncertain Streams published in VLDB 2008 and cited by 71.
https://www.cse.ust.hk/~yike/wtopk.pdf
VLDB is the best conference in database research area, and the number of citations proves the data structure actually works.
The paper looks pretty difficult, but if you really need improve your data structure, I suggest you to read this paper or papers in the reference page of this paper.

You can also use a finger tree.
For example, a priority queue can be implemented by labeling the internal nodes by the minimum priority of its children in the tree, or an indexed list/array can be implemented with a labeling of nodes by the count of the leaves in their children. Finger trees can provide amortized O(1) cons, reversing, cdr, O(log n) append and split; and can be adapted to be indexed or ordered sequences.
Also note that being a purely functional structure makes this a good choice for concurrent usage.

k nearest neighbors graph implementation in Java

I am implementing a Flame clustering algorithm as a way of learning a bit more about graphs and graph traversal, and one of the first steps is constructing a K-nearest-neighbors graph, and I'm wondering what the fastest way would be of running through a list of nodes and connecting each one only to say, it's nearest five neighbors. My thought was that I would start at a node, iterate through the list of other nodes and keep the ones that are closest within an array, making sure that everything past the top n are discarded. Now, I could do this by just sorting a list and keeping the top n entries, but I would much rather keep less fewer things in memory and so I was wondering if there was a way to just have the final array and update that array as I iterate through, or if there is a more efficient way of generating a k nearest neighbors graph.
Also, please note, this is NOT a duplicate of K-Nearest Neighbour Implementation in Java. KNNG is distinct from KNN.

Place the first n nodes, sorted in a List. Then iterate through the rest of nodes and if it fits in the current list (i.e. is a top n node), place it in the corresponding position in the list and discard the last top n node. If it doesn't fit in the top n list, discard it.
for each neighborNode
for(int i = 0; i < topNList.size(); i++){
if((dist = distanceMetric(neighborNode,currentNode)) > topNList.get(i).distance){
topNList.remove(topNList.size()-1)
neighborNode.setDistance(dist);
topNList.add(i, neighborNode);
}

I think the most efficient way would be using a bound priority queue, like
https://github.com/tdebatty/java-graphs#bounded-priority-queue

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.