Why store the points in a binary tree?

This question is about a software algorithm, so it should be on topic.
I am working on an interview question from Amazon,
specifically "Given a set of points (x,y) and an integer "n", return n number of points which are close to the origin"
Here is the sample high-level pseudocode answer to this question:
Step 1: Design a class called point which has three fields - int x, int y, int distance
Step 2: For all the points given, find the distance between them and origin
Step 3: Store the values in a binary tree
Step 4: Heap sort
Step 5: print the first n values from the binary tree
I agree with steps 1 and 2 because it makes sense in terms of object-oriented design to have one software bundle of data, Point, encapsulate the fields x, y and distance.
Can someone explain the design decisions in steps 3 to 5?
Here's how I would do steps 3 to 5:
Step 3: Store all the points in an array
Step 4: Sort the array with respect to distance (I would use a built-in sort here, such as Arrays.sort)
Step 5: With the array sorted in ascending order, I print off the first n values
Why did the author of that response use a more complicated data structure, a binary tree, and not something simpler like the array that I used? I know what a binary tree is: a hierarchical data structure of nodes with two pointers. In his algorithm, would you have to use a BST?

First, I would not say that having Point(x, y, distance) is good design or encapsulation. distance is not really part of a point; it can be computed from x and y. In terms of design, I would rather have a function, i.e. a static method of Point or of a helper class Points:
double distance(Point a, Point b)
Then for the specific question, I actually agree with your solution: put the data in an array, sort this array and then extract the first N.
What the example may be hinting at is that heapsort often uses a binary tree structure laid out inside the array being sorted, as explained here:
The heap is often placed in an array with the layout of a complete binary tree.
Of course, if the distance to the origin is not stored in the Point, then for performance reasons it has to be kept alongside the corresponding Point object in the array, together with any information that allows getting back to the Point object from the sorted distance (a reference or an index), e.g.
List<Pair<Long, Point>> distancesToOrigin = new ArrayList<>();
to be sorted with a Comparator<Pair<Long, Point>>
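A minimal sketch of the array-and-sort approach, assuming each point is stored as an int[]{x, y} rather than in a Point class. The names ClosestBySort, squaredDistanceToOrigin and closest are hypothetical, not taken from the sample answer. The squared distance is used as the sort key, since it orders points the same way as the distance itself:

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper for illustration; points are int[]{x, y}.
class ClosestBySort {
    // Squared distance to the origin; sorting by it gives the same order as sorting by distance.
    static long squaredDistanceToOrigin(int[] p) {
        return (long) p[0] * p[0] + (long) p[1] * p[1];
    }

    // Returns the n points closest to the origin (all points if n exceeds the input size).
    static int[][] closest(int[][] points, int n) {
        int[][] sorted = points.clone(); // shallow copy so the caller's array keeps its order
        Arrays.sort(sorted, Comparator.comparingLong(ClosestBySort::squaredDistanceToOrigin));
        return Arrays.copyOfRange(sorted, 0, Math.min(n, sorted.length));
    }
}
```

For example, closest(new int[][] {{3, 4}, {1, 1}, {-2, 0}, {5, 5}}, 2) yields (1, 1) followed by (-2, 0).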

It is not necessary to use a BST. However, it is good practice to use a BST when you need a structure that keeps itself sorted. I do not see the need to both use a BST and heapsort it (somehow). You could use just a BST and retrieve the first n points. You could also use an array, sort it and use the first n points.
If you want to sort an array of type Point, you could have Point implement the Comparable interface and provide its compareTo method.
You are never forced to choose a particular data structure, but by determining your needs you can also easily determine the best structure.

The approach described in this post is more complex than needed for such a question. As you noted, simple sorting by distance will suffice. However, to help clear up your confusion about what the author of your sample answer was getting at, consider the k nearest neighbors problem, which can be solved with a k-d tree, a structure that applies space partitioning to a k-dimensional dataset. For 2-dimensional space, as for any dimension, that is indeed a binary tree. This tree is inherently sorted and doesn't need any "heap sorting."
It should be noted that building the k-d tree will take O(n log n), and is only worth the cost if you need to do repeated nearest neighbor searches on the structure. If you only need to perform one search to find k nearest neighbors from the origin, it can be done with a naive O(n) search.
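One possible shape for such a single pass is a bounded max-heap. This is a swapped-in technique rather than the naive search mentioned above, and it runs in O(n log k) rather than O(n). The class and method names are hypothetical, and points are assumed to be int[]{x, y}:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper for illustration; points are int[]{x, y}.
class ClosestByHeap {
    static long squaredDistance(int[] p) {
        return (long) p[0] * p[0] + (long) p[1] * p[1];
    }

    // Returns the k points closest to the origin, nearest first.
    static List<int[]> closest(int[][] points, int k) {
        List<int[]> result = new ArrayList<>();
        if (k <= 0) return result;
        // Max-heap on squared distance: the head is the farthest of the best k seen so far.
        PriorityQueue<int[]> farthestFirst = new PriorityQueue<>(
                (a, b) -> Long.compare(squaredDistance(b), squaredDistance(a)));
        for (int[] p : points) {
            farthestFirst.offer(p);
            if (farthestFirst.size() > k) farthestFirst.poll(); // drop the current farthest
        }
        result.addAll(farthestFirst);
        result.sort(Comparator.comparingLong(ClosestByHeap::squaredDistance));
        return result;
    }
}
```

The heap never holds more than k + 1 points, so this avoids both sorting the whole input and building a tree.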
How to build a k-d tree, straight from Wiki:
One adds a new point to a k-d tree in the same way as one adds an element to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the "left" or "right" side of the splitting plane. Once you get to the node under which the child should be located, add the new point as either the left or right child of the leaf node, again depending on which side of the node's splitting plane contains the new node.
Adding points in this manner can cause the tree to become unbalanced, leading to decreased tree performance. The rate of tree performance degradation is dependent upon the spatial distribution of tree points being added, and the number of points added in relation to the tree size. If a tree becomes too unbalanced, it may need to be re-balanced to restore the performance of queries that rely on the tree balancing, such as nearest neighbour searching.
Once you have built the tree, you can find the k nearest neighbors to some point (the origin in your case) in O(k log n) time.
Straight from Wiki:
Searching for a nearest neighbour in a k-d tree proceeds as follows:
Starting with the root node, the algorithm moves down the tree recursively, in the same way that it would if the search point were being inserted (i.e. it goes left or right depending on whether the point is lesser than or greater than the current node in the split dimension).
Once the algorithm reaches a leaf node, it saves that node point as the "current best"
The algorithm unwinds the recursion of the tree, performing the following steps at each node:
If the current node is closer than the current best, then it becomes the current best.
The algorithm checks whether there could be any points on the other side of the splitting plane that are closer to the search point than the current best. In concept, this is done by intersecting the splitting hyperplane with a hypersphere around the search point that has a radius equal to the current nearest distance. Since the hyperplanes are all axis-aligned this is implemented as a simple comparison to see whether the difference between the splitting coordinate of the search point and current node is lesser than the distance (overall coordinates) from the search point to the current best.
If the hypersphere crosses the plane, there could be nearer points on the other side of the plane, so the algorithm must move down the other branch of the tree from the current node looking for closer points, following the same recursive process as the entire search.
If the hypersphere doesn't intersect the splitting plane, then the algorithm continues walking up the tree, and the entire branch on the other side of that node is eliminated.
When the algorithm finishes this process for the root node, then the search is complete.
This is a pretty tricky algorithm that I would hate to need to describe as an interview question! Fortunately the general case here is more complex than is needed, as you pointed out in your post. But I believe this approach may be close to what your (wrong) sample answer was trying to describe.
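For the 2-dimensional case, the insertion and nearest-neighbour steps quoted above can be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: the class name KdTree and its methods are hypothetical, the tree is never re-balanced, and each node is compared against the current best on the way down rather than while unwinding, which gives the same result.

```java
// Hypothetical 2-d tree for illustration; it is never re-balanced.
class KdTree {
    private static final class Node {
        final double[] point; // {x, y}
        Node left, right;
        Node(double[] point) { this.point = point; }
    }

    private Node root;

    // Insert as in any binary search tree, alternating the split axis with depth.
    void insert(double x, double y) {
        double[] p = {x, y};
        if (root == null) { root = new Node(p); return; }
        Node cur = root;
        for (int depth = 0; ; depth++) {
            int axis = depth % 2;
            if (p[axis] < cur.point[axis]) {
                if (cur.left == null) { cur.left = new Node(p); return; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = new Node(p); return; }
                cur = cur.right;
            }
        }
    }

    // Returns the stored point nearest to (x, y), or null if the tree is empty.
    double[] nearest(double x, double y) {
        return nearest(root, new double[] {x, y}, 0, null);
    }

    private static double squaredDistance(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    private static double[] nearest(Node node, double[] target, int depth, double[] best) {
        if (node == null) return best;
        if (best == null || squaredDistance(node.point, target) < squaredDistance(best, target)) {
            best = node.point;
        }
        int axis = depth % 2;
        double diff = target[axis] - node.point[axis];
        Node near = diff < 0 ? node.left : node.right;
        Node far = diff < 0 ? node.right : node.left;
        best = nearest(near, target, depth + 1, best);
        // Visit the other side only if the circle around the target crosses the splitting line.
        if (diff * diff < squaredDistance(best, target)) {
            best = nearest(far, target, depth + 1, best);
        }
        return best;
    }
}
```

Finding the k nearest points instead of one would need a bounded collection of candidates in place of the single best value.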

Related

Algorithm for minimal cost + maximum matching in a general graph

I've got a dataset consisting of nodes and edges.
The nodes represent people and the edges represent their relations, each of which has a cost that has been calculated using Euclidean distance.
Now I wish to match these nodes together through their respective edges, where there is only one constraint:
Any node can only be matched with a single other node.
Now from this we know that I'm working in a general graph, where every node could theoretically be matched with any node in the dataset, as long as there is an edge between them.
What I wish to do, is find the solution with the maximum matches and the overall minimum cost.
Nodes: Node A, Node B, Node C, Node D
Edges:
Edge     Start    End      Cost
Edge 1   Node A   Node B   0.5
Edge 2   Node B   Node C   1
Edge 3   Node C   Node D   0.5
Edge 4   Node D   Node A   1
The solution to this problem, would be the following:
Assign Edge 1 and Edge 3, as that gives the maximum number of matches (in this case there are obviously only 2 solutions, but there could be tons of branching edges to other nodes).
Edge 1 and Edge 3 are assigned because that is the solution with the maximum number of matches and the minimum overall cost (1).
I've looked into quite a few algorithms, including Hungarian, Blossom and minimum-cost flow, but I'm uncertain which is best for this case. Also, there seems to be an awful lot of material on solving these kinds of problems in bipartite graphs, which isn't really the case here.
So I ask you:
Which algorithm would be best in this scenario to return (a) the maximum number of matches (b) with the lowest overall cost?
Do you know of any good material (maybe some easy-to-understand pseudocode) for your recommended algorithm? I'm not the strongest in mathematical notation.

For (a), the most suitable algorithm (there are theoretically faster ones, but they're more difficult to understand) would be Edmonds' Blossom algorithm. Unfortunately it is quite complicated, but I'll try to explain the basis as best I can.
The basic idea is to take a matching, and continually improve it (increase the number of matched nodes) by making some local changes. The key concept is an alternating path: a path from an unmatched node to another unmatched node, with the property that the edges alternate between being in the matching, and being outside it.
If you have an alternating path, then you can increase the size of the matching by one by flipping the state (whether or not they are in the matching) of the edges in the alternating path.
If there exists an alternating path, then the matching is not maximum (since the path gives you a way to increase the size of the matching) and conversely, you can show that if there is no alternating path, then the matching is maximum. So, to find a maximum matching, all you need to be able to do is find an alternating path.
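The flipping step on its own is small. The sketch below only applies a path that has already been found. Finding it is the hard part and is not shown. The names AlternatingPath and flip are hypothetical, and the matching is assumed to be stored as a map from each matched vertex to its partner, in both directions:

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; it applies an alternating path that was found elsewhere.
class AlternatingPath {
    // mate maps every matched vertex to its partner (each pair is stored in both directions).
    // path lists the vertices v0, v1, ..., v(2k+1) of an alternating path whose two ends are
    // unmatched, so the edges (v1,v2), (v3,v4), ... are currently in the matching.
    static void flip(Map<String, String> mate, List<String> path) {
        if (path.size() % 2 != 0) {
            throw new IllegalArgumentException("an alternating path between two unmatched vertices has an even number of vertices");
        }
        // Match (v0,v1), (v2,v3), ...; this overwrites the old partners of the inner vertices.
        for (int i = 0; i + 1 < path.size(); i += 2) {
            String u = path.get(i), v = path.get(i + 1);
            mate.put(u, v);
            mate.put(v, u);
        }
    }
}
```

With the matching {B-C} and the path A, B, C, D, the call leaves the matching {A-B, C-D}, which is one edge larger.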
In bipartite graphs, this is very easy to do (it can be done with DFS). In general graphs this is more complicated, and this is where Edmonds' Blossom algorithm comes in. Roughly speaking:
Build a new graph, where there is an edge between two vertices u and v if you can get from u to v by first traversing an edge that is in the matching, and then traversing an edge that isn't.
In this graph, try to find a path from an unmatched vertex to a matched vertex that has an unmatched neighbor (that is, a neighbor in the original graph).
Each edge in the path you find corresponds to two edges of the original graph (namely an edge in the matching and one not in the matching), so the path translates to an alternating walk in the original graph, but this is not necessarily an alternating path (the distinction between path and walk is that a path only uses each vertex once, but a walk can use each vertex multiple times).
If the walk is a path, you have an alternating path and are done.
If not, then the walk uses some vertex more than once. You can remove the part of the walk between the two visits to this vertex, and you obtain a new graph (with part of the vertices removed). In this new graph you have to do the whole search again, and if you find an alternating path in the new graph you can "lift" it to an alternating path for the original graph.
Going into the details of this (crucial) last step would be a bit too much for a Stack Overflow answer, but you can find more details on Wikipedia, and perhaps having this high-level overview helps you understand the more mathematical articles.
Implementing this from scratch will be quite challenging.
For the weighted version (with the Euclidean distance), there is an even more complicated variant of Edmonds' Algorithm that can handle weights. Kolmogorov offers a C++ implementation and accompanying paper. This can also be used for the unweighted case, so using this implementation might be a good idea (even if it is not in java, there should be some way to interface with it).
Since your weights are based on Euclidean distances there might be a specialized algorithm for that case, but the more general version I mentioned above would also work, and an implementation is available for it.

Traverse a tree represented by its edges

My tree is represented by its edges and the root node. The edge list is undirected.
char[][] edges =new char[][]{
new char[]{'D','B'},
new char[]{'A','C'},
new char[]{'B','A'}
};
char root='A';
The tree is
    A
   / \
  B   C
 /
D
How do I do depth first traversal on this tree? What is the time complexity?
I know time complexity of depth first traversal on linked nodes is O(n). But if the tree is represented by edges, I feel the time complexity is O(n^2). Am I wrong?
Code would be appreciated, although I know this looks like a homework assignment.

The general template behind DFS looks something like this:
function DFS(node) {
    if (!node.visited) {
        node.visited = true;
        for (each edge {node, v}) {
            DFS(v);
        }
    }
}
If you have your edges represented as a list of all the edges in the graph, then you could implement the for loop by iterating across all the edges in the graph and, every time you find one with the current node as its source, following the edge to its endpoint and running a DFS from there. If you do this, then you'll do O(m) work per node in the graph (here, m is the number of edges), so the runtime will be O(mn), since you'll do this at most once per node in the graph. In a tree, the number of edges is always O(n), so for a tree the runtime is O(n^2).
That said, if you have a tree and there are only n edges, you can speed this up in a bunch of ways. First, you could consider doing an O(n log n) preprocessing step to sort the array of edges. Then, you can find all the edges leaving a given node by doing a binary search to find the first edge leaving the node, then iterating across the edges starting there to find just the edges leaving the node. This improves the runtime quite a bit: you do O(log n) work per node for the binary search, and then every edge gets visited only once. This means that the runtime is O(n log n). Since you've mentioned that the edges are undirected, you'll actually need to create two different copies of the edges array - one that's the original one, and one with the edges reversed - and should sort each one independently. The fact that DFS marks visited nodes along the way means that you don't need to do any extra bookkeeping here to figure out which direction you should go at each step, and this doesn't change the overall time complexity, though it does increase the space usage.
Alternatively, you could use a hashing-based solution. Before doing the DFS, iterate across the edges and convert them into a hash table whose keys are the nodes and whose values are lists of the edges leaving that node. This will take expected time O(n). You can then implement the "for each edge" step quite efficiently by just doing a hash table lookup to find the edges in question. This reduces the time to (expected) O(n), though the space usage goes up to O(n) as well. Since your edges are undirected, as you populate the table, just be sure to insert the edge in each direction.
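A minimal sketch of the hashing-based variant, using the char[][] edge format from the question. The names EdgeListDfs, dfs and visit are hypothetical. The traversal is recursive, so a very deep tree would need an explicit stack instead:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration.
class EdgeListDfs {
    // Returns the nodes in the order a depth-first traversal from root first visits them.
    static List<Character> dfs(char[][] edges, char root) {
        // Preprocessing, expected O(n): node -> list of neighbours, inserted in both directions.
        Map<Character, List<Character>> adjacency = new HashMap<>();
        for (char[] e : edges) {
            adjacency.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adjacency.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        List<Character> order = new ArrayList<>();
        visit(root, adjacency, new HashSet<>(), order);
        return order;
    }

    private static void visit(char node, Map<Character, List<Character>> adjacency,
                              Set<Character> visited, List<Character> order) {
        if (!visited.add(node)) return; // already visited, e.g. when looking back at the parent
        order.add(node);
        for (char next : adjacency.getOrDefault(node, Collections.<Character>emptyList())) {
            visit(next, adjacency, visited, order);
        }
    }
}
```

For the edges in the question and root 'A', this visits A, C, B, D, because the edge to C is the first one stored for A.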

Binary Search Tree of Strings

I had a question about exactly how a binary search tree of strings works. I know, and have implemented, binary search trees of integers by checking whether new data <= parent data, then branching left if it's less or right if it's greater. However, I am a little confused about how to implement this with nodes of strings.
With integers or characters I can just pass an array into the insert method of the tree I programmed, and it builds the tree nodes correctly. My question is how you would make this work with an array of strings. How would you get the strings to branch off correctly in the tree? For example, if I had an array of questions, how would I be able to branch the BST correctly so that I would eventually get to the correct answer?
For example look at the following trivial tree example.
                                             land animal?
                        have tentacles?------------^-------------indoor animal
               have claws?-----^----jellyfish       live in jungle?----^----does it bark?
   eat plankton?----^----lobster                   bear----^----lion       cat----^----dog
shark----^----whale
How would you populate a tree such as this so that the nodes end up where you want them? I am trying to make a BST for troubleshooting and I am confused about how to populate the nodes of strings so that they appear in the correct positions. Do you need to hard-code the nodes?

Update 2, to build a binary decision tree:
A binary decision tree can be thought of as a bunch of questions that yield boolean responses about facets of leaf nodes: the facet either exists / holds true or it does not. That is, for every descendant of a particular node/edge we must be able to say "this question/answer holds" (answers can be "true" or "false"). For instance, a bark is a facet of a (normal) dog, but tentacles are not a facet of a whale. In the presented tree, the false edge always leads to the left subtree: this is a convention to avoid labeling each edge with true/false or Y/N.
The tree can only be built from existing/external knowledge that allows one to answer each question for every animal.
Here is a rough algorithm that can be used to build such a tree:
Start with a set of possible animals, call this A, and a set of questions, call this Q.
Pick a question, q, from Q for which count(True(q, a in A)) is closest to that of count(False(q, a in A)) - if the resulting tree is a balanced binary tree these counts will always be equal for the best question to ask.
Remove q from Q and use it as the question to ask for the current node. Put all False(q,a) into the set of animals (A') available to the left child node and put all True(q,a) into the set of animals (A'') available to the right child node.
Following each edge/branch (false=left, true=right), pick a suitable question from the remaining Q and repeat (using A' or A'' for A, as appropriate).
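The steps above can be sketched as a small recursive builder. Everything named here is an assumption made for illustration: the class DecisionTreeBuilder, its Node type, and the facts map, which must contain for every animal the set of questions whose answer is "true". A question is only considered if it actually splits the current set of animals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical builder for illustration.
class DecisionTreeBuilder {
    static final class Node {
        final String label;  // a question for inner nodes, an animal for leaves
        final Node no, yes;  // false = left, true = right
        Node(String label, Node no, Node yes) { this.label = label; this.no = no; this.yes = yes; }
    }

    // facts.get(animal) is the set of questions answered "true" for that animal (assumed complete).
    static Node build(List<String> animals, List<String> questions, Map<String, Set<String>> facts) {
        if (animals.isEmpty()) return null;
        if (animals.size() == 1) return new Node(animals.get(0), null, null);

        // Pick the question whose true/false counts are closest to each other.
        String best = null;
        int bestImbalance = Integer.MAX_VALUE;
        for (String q : questions) {
            int yes = 0;
            for (String a : animals) {
                if (facts.get(a).contains(q)) yes++;
            }
            int imbalance = Math.abs(animals.size() - 2 * yes);
            boolean splits = yes > 0 && yes < animals.size();
            if (splits && imbalance < bestImbalance) { best = q; bestImbalance = imbalance; }
        }
        if (best == null) {
            throw new IllegalArgumentException("No remaining question separates " + animals);
        }

        List<String> noSet = new ArrayList<>(), yesSet = new ArrayList<>();
        for (String a : animals) {
            (facts.get(a).contains(best) ? yesSet : noSet).add(a);
        }
        List<String> remaining = new ArrayList<>(questions);
        remaining.remove(best); // each question is asked at most once per branch
        return new Node(best, build(noSet, remaining, facts), build(yesSet, remaining, facts));
    }
}
```

With the animals cat, dog, lion and bear and the questions "indoor animal?", "does it bark?" and "live in jungle?", this reproduces the right-hand subtree of the diagram in the question.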
(Of course, there are many more complete, detailed and accurate resources online, such as course material or whitepapers, not to mention a suitable selection of books at most college campuses.)
Update, for a [binary] decision tree:
In this particular case (which is clear with the added diagram) the graph is based on the "yes" or "no" responses to the questions, which are represented by the edges between the nodes. That is, the tree is not built using an ordering of the string values themselves. In this case it might make sense to always have the left branch be "false" and the right branch be "true", although each node could have more edges/children if non-binary responses are allowed.
The decision tree must be "trained". That is, the graph must be built initially based on the questions/responses, unlike a BST, which is based merely on an ordering between nodes. The initial graph cannot be built from merely an array of questions, as the edges do not follow an intrinsic ordering.
Initial response, for a binary search tree:
The same way it does for integers: the algorithm does not change.
Consider a function compareTo(a, b) that returns -1, 0 or 1 for a < b, a == b and a > b, respectively.
Then consider that the type of a and b does not matter (as long as both have the same type) when implementing a function with this contract, provided the type supports ordering: the comparison will be "raw" for integers and will use the host language's corresponding string comparison for string types.
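In Java, String already implements Comparable, so the integer version carries over almost unchanged. A minimal sketch, with a hypothetical class name and no balancing, that keeps the question's "<= goes left" rule:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical unbalanced BST of strings for illustration.
class StringBst {
    private static final class Node {
        final String value;
        Node left, right;
        Node(String value) { this.value = value; }
    }

    private Node root;

    void insert(String value) {
        root = insert(root, value);
    }

    private static Node insert(Node node, String value) {
        if (node == null) return new Node(value);
        // String.compareTo orders lexicographically; it plays the role of <= for integers.
        if (value.compareTo(node.value) <= 0) node.left = insert(node.left, value);
        else node.right = insert(node.right, value);
        return node;
    }

    boolean contains(String value) {
        Node cur = root;
        while (cur != null) {
            int c = value.compareTo(cur.value);
            if (c == 0) return true;
            cur = c < 0 ? cur.left : cur.right;
        }
        return false;
    }

    // In-order traversal returns the strings in sorted order.
    List<String> inOrder() {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        if (node == null) return;
        collect(node.left, out);
        out.add(node.value);
        collect(node.right, out);
    }
}
```

Inserting "lion", "cat", "whale" and "dog" and then calling inOrder() returns the four strings in alphabetical order.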

Queue data structure supporting fast k-th largest element finding

I'm faced with a problem which requires a Queue data structure supporting fast k-th largest element finding.
The requirements of this data structure are as follows:
The elements in the queue are not necessarily integers, but they must be comparable to each other, i.e we can tell which one is greater when we compare two elements(they can be equal as well).
The data structure must support enqueue(adds the element at the tail) and dequeue(removes the element at the head).
It can quickly find the k-th largest element in the queue, pls note k is not a constant.
You can assume that operations enqueue , dequeue and k-th largest element finding all occur with the same frequency.
My idea is to use a modified balanced binary search tree. The tree is the same as an ordinary balanced binary search tree, except that every node node_i is augmented with another field n_i, which denotes the number of nodes contained in the subtree rooted at node_i. The aforementioned operations are supported as follows:
For simplicity assume that all elements are distinct.
Enqueue(x): x is first inserted into the tree. Suppose the corresponding node is node_t; we append the pair (x, pointer to node_t) to the queue.
Dequeue: suppose (e_1, node_1) is the element at the head, where node_1 is the pointer into the tree corresponding to e_1. We delete node_1 from the tree and remove (e_1, node_1) from the queue.
K-th largest element finding: suppose the root node is node_root and its two children are node_left and node_right (suppose they all exist). We compare K with the subtree sizes, counting ranks in ascending order (swap left and right to count from the largest), and three cases may happen:
if K <= n_left, we find the K-th element in the left subtree of node_root;
if K > n_root - n_right, we find the (K - n_root + n_right)-th element in the right subtree of node_root;
otherwise node_root is the node we want.
The time complexity of all three operations is O(log N), where N is the number of elements currently in the queue.
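The size-augmented selection can be sketched as below. For brevity this hypothetical SizedBst is not balanced, so the O(log N) bound only holds if the insertion order happens to keep it shallow. A real implementation would maintain the size field through the rotations of a balanced tree. Keys are ints and are assumed distinct, and ranks are counted from the largest element, so the search looks at the right subtree first:

```java
// Hypothetical size-augmented BST for illustration; it is not balanced.
class SizedBst {
    private static final class Node {
        final int key;
        Node left, right;
        int size = 1; // number of nodes in the subtree rooted here
        Node(int key) { this.key = key; }
    }

    private Node root;

    private static int size(Node n) {
        return n == null ? 0 : n.size;
    }

    void insert(int key) {
        root = insert(root, key);
    }

    private static Node insert(Node n, int key) {
        if (n == null) return new Node(key);
        if (key < n.key) n.left = insert(n.left, key);
        else n.right = insert(n.right, key);
        n.size = 1 + size(n.left) + size(n.right);
        return n;
    }

    // Returns the k-th largest key, with k = 1 meaning the maximum.
    int kthLargest(int k) {
        if (k < 1 || k > size(root)) throw new IllegalArgumentException("k out of range: " + k);
        Node n = root;
        while (true) {
            int larger = size(n.right);       // keys greater than n.key in this subtree
            if (k <= larger) {
                n = n.right;                  // the answer is among the larger keys
            } else if (k == larger + 1) {
                return n.key;                 // exactly 'larger' keys are above this one
            } else {
                k -= larger + 1;              // skip this node and its right subtree
                n = n.left;
            }
        }
    }
}
```

After inserting 5, 3, 8, 1, 4 and 9, kthLargest(1) is 9 and kthLargest(3) is 5.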
How can I speed up the operations mentioned above? With what data structures and how?

Note: you cannot achieve better than O(log n) for all operations; at best you need to choose which operation you care about the most. (Otherwise, you could sort faster than the O(n log n) lower bound for comparison sorting by feeding the array to the data structure and querying the 1st, 2nd, 3rd, ..., nth elements.)
Using a skip list instead of a balanced BST as the sorted structure can reduce dequeue complexity to O(1) in the average case. It does not affect the complexity of any other operation.
To remove from a skip list, all you need to do is get to the element using the pointer from the head of the queue, follow the links up and remove each node. The expected number of nodes that need to be deleted is 1 + 1/2 + 1/4 + ... = 2.
Finding the K-th element can be achieved in O(log K) by starting from the leftmost node (and not the root) and making your way up until you find that you have more descendants than needed, and then treating the node just found as the root, just like the algorithm in the question. Although this is better in asymptotic complexity, the constant factor is doubled.

I found an interesting paper:
Sliding-Window Top-k Queries on Uncertain Streams published in VLDB 2008 and cited by 71.
https://www.cse.ust.hk/~yike/wtopk.pdf
VLDB is a top conference in the database research area, and the number of citations suggests that the data structure works in practice.
The paper looks pretty difficult, but if you really need to improve your data structure, I suggest reading this paper or the papers it references.

You can also use a finger tree.
For example, a priority queue can be implemented by labeling the internal nodes by the minimum priority of its children in the tree, or an indexed list/array can be implemented with a labeling of nodes by the count of the leaves in their children. Finger trees can provide amortized O(1) cons, reversing, cdr, O(log n) append and split; and can be adapted to be indexed or ordered sequences.
Also note that being a purely functional structure makes this a good choice for concurrent usage.

Closest Point on a Map

I am making a program where you can click on a map to see a "close-up view" of the area around it, such as on Google Maps.
When a user clicks on the map, it gets the X and Y coordinate of where they clicked.
Let's assume that I have an array of booleans of where these close-up view pictures are:
public static boolean[][] view_set=new boolean[Map.width][Map.height];
//The array of where pictures are. The map has a width of 3313, and a height of 3329.
The program searches through a folder in which each image is named after the X and Y coordinates of where it was taken on the map. The folder contains the following images (and more, but I'll only list five):
2377,1881.jpg, 2384,1980.jpg, 2389,1923.jpg, 2425,1860.jpg, 2475,1900.jpg
This means that:
view_set[2377][1881]=true;
view_set[2384][1980]=true;
view_set[2389][1923]=true;
view_set[2425][1860]=true;
view_set[2475][1900]=true;
If a user clicks at the X and Y of, for example, 2377,1882, then I need the program to figure out which image is closest (the answer in this case would be 2377,1881).
Any help would be appreciated,
Thanks.

Your boolean[][] is not a good data structure for this problem, at least unless it is really dense (e.g. a point with a close-up view is normally available within the surrounding 3×3 or maybe 5×5 square).
You want a 2-D-map with nearest-neighbor search. A useful data structure for this goal is the QuadTree. This is a tree of degree 4, used to represent spatial data. (I'm describing here the "Region QuadTree with point data".)
Basically, it divides a rectangle in four about equal size rectangles, and subdivides each of the rectangles further if there is more than one point in it.
So a node in your tree is one of these:
an empty leaf node (corresponding to a rectangle without points in it)
a leaf node containing exactly one point (corresponding to a rectangle with one point in it)
an inner node with four child nodes (corresponding to a rectangle with more than one point in it)
(In implementations, we can replace empty leaf nodes with a null-pointer in its parent.)
To find a point (or "the node a point would be in"), we start at the root node, look if our point is north/south/east/west of the dividing point, and go to the corresponding child node. We continue this until we arrive at some leaf node.
For adding a new point, we either wind up at an empty node, in which case we can put the new point there, or at a node that already holds a point. In the latter case, create four child nodes (by splitting the rectangle) and add both points to the appropriate child nodes. (Both points might land in the same child; then repeat the split recursively.)
For the nearest-neighbor search, we either wind up at an empty node, in which case we back up one level and look at the other child nodes of this parent (comparing each distance), or at a leaf node with one point in it, in which case we measure the distance from our search point to this point. If it is smaller than the distance to the edges of the node, we are done. Otherwise we will have to look at the points in the neighboring nodes, too, and compare the results, taking the minimum. (We will have to look at at most four points, I think.)
For removal, after finding a point, we make its node empty. If the parent node now contains only one point, we replace it by a one-point leaf node.
The search and adding/removing take O(depth) time, where the maximum depth is bounded by log((map length+width)/minimal distance between two points in your structure), and the average depth depends on the distribution of the points (e.g. the average distance to the next point), more or less.
The space needed depends on the number of points and the average depth of the tree.
There are some variants of this data structure (for example splitting a node only when there are more than X points in it, or splitting not necessarily in the middle), to optimize the space usage and avoid too large depths of the tree.
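The find and add operations of the basic variant (one point per leaf, split in the middle) can be sketched as follows. The class name QuadTree and its methods are hypothetical. The sketch assumes every inserted point lies inside the root square and that coordinates are far enough apart (for example integer pixel positions) for the halving to separate them. Nearest-neighbor search and removal are left out:

```java
// Hypothetical region quadtree with point data for illustration.
class QuadTree {
    private final double x, y, size; // lower corner and side length of this square
    private double[] point;          // the single point of a one-point leaf, else null
    private QuadTree[] children;     // four children of an inner node, else null

    QuadTree(double x, double y, double size) {
        this.x = x;
        this.y = y;
        this.size = size;
    }

    // Adds a point; returns false if exactly this point is already stored.
    boolean insert(double px, double py) {
        if (children != null) return childFor(px, py).insert(px, py);
        if (point == null) {                       // empty leaf: store the point here
            point = new double[] {px, py};
            return true;
        }
        if (point[0] == px && point[1] == py) return false;
        double[] old = point;                      // occupied leaf: split and push both points down
        point = null;
        double h = size / 2;
        children = new QuadTree[] {
            new QuadTree(x, y, h), new QuadTree(x + h, y, h),
            new QuadTree(x, y + h, h), new QuadTree(x + h, y + h, h)
        };
        childFor(old[0], old[1]).insert(old[0], old[1]);
        return childFor(px, py).insert(px, py);    // may split again if both land in the same child
    }

    // Walks down to the leaf the point would be in and checks whether it is stored there.
    boolean contains(double px, double py) {
        if (children != null) return childFor(px, py).contains(px, py);
        return point != null && point[0] == px && point[1] == py;
    }

    // Chooses the child by comparing against the dividing point in the middle of the square.
    private QuadTree childFor(double px, double py) {
        double h = size / 2;
        int index = (px >= x + h ? 1 : 0) + (py >= y + h ? 2 : 0);
        return children[index];
    }
}
```

For the map in the question, a root square such as new QuadTree(0, 0, 4096) covers all coordinates.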

Given the location the user clicked, you could search for the nearest image using a Dijkstra search.
Basically you start searching in increasingly larger rectangles around the clicked location for images. Of course you only have to search the boundaries of these rectangles, since you've already searched the body. This algorithm should stop as soon as an image is found.
Pseudo code:
int size = 0
Point result = default
while (result == default)
    result = searchRectangleBoundary(size++, pointClicked)

function Point searchRectangleBoundary(int size, Point centre)
{
    // p is the corner of a square with side length 2 * size centred on centre
    Point p = {centre.X - size, centre.Y - size}
    for i in 0 to and including 2 * size
    {
        if (view_set[p.X + i][p.Y]) return {p.X + i, p.Y}
        if (view_set[p.X][p.Y + i]) return {p.X, p.Y + i}
        if (view_set[p.X + i][p.Y + 2 * size]) return {p.X + i, p.Y + 2 * size}
        if (view_set[p.X + 2 * size][p.Y + i]) return {p.X + 2 * size, p.Y + i}
    }
    return default
}
Do note that I've left out range checking for brevity.
There is a slight catch, although depending on the application it might not be a problem. The search does not use Euclidean distances: expanding squares correspond to the Chebyshev (chessboard) metric rather than the Euclidean one. So it doesn't necessarily find the closest image, but an image at most the square root of 2 times as far.
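A Java sketch of the same expanding-square idea with the range checks included. The names RingSearch and nearest are hypothetical. It assumes the click lies inside a non-empty grid, and it differs slightly from the pseudocode in that it returns the Euclidean-closest hit on the first ring that contains one, instead of the first hit encountered:

```java
// Hypothetical helper for illustration; viewSet[x][y] is true where an image exists.
class RingSearch {
    // Returns {x, y} of a set cell on the nearest square ring around (cx, cy), or null if none is set.
    static int[] nearest(boolean[][] viewSet, int cx, int cy) {
        int width = viewSet.length, height = viewSet[0].length;
        // Largest ring radius that can still touch the grid.
        int maxRadius = Math.max(Math.max(cx, width - 1 - cx), Math.max(cy, height - 1 - cy));
        for (int r = 0; r <= maxRadius; r++) {
            int[] best = null;
            for (int d = -r; d <= r; d++) {
                // One cell on each of the four sides of the ring.
                int[][] ring = {{cx + d, cy - r}, {cx + d, cy + r}, {cx - r, cy + d}, {cx + r, cy + d}};
                for (int[] c : ring) {
                    boolean inside = c[0] >= 0 && c[0] < width && c[1] >= 0 && c[1] < height;
                    if (inside && viewSet[c[0]][c[1]]
                            && (best == null || squared(c, cx, cy) < squared(best, cx, cy))) {
                        best = c;
                    }
                }
            }
            if (best != null) return best;
        }
        return null;
    }

    private static long squared(int[] c, int cx, int cy) {
        long dx = c[0] - cx, dy = c[1] - cy;
        return dx * dx + dy * dy;
    }
}
```

Because the loop is bounded by maxRadius, it terminates with null on a map without any images instead of searching forever.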

Based on
your comment that states you have 350-500 points of interest,
your question that states you have a map width of 3313, and a height of 3329
my calculator which tells me that that represents ~11 million boolean values
...you're going about this the wrong way. @JBSnorro's answer is quite an elegant way of finding the needle (350 points) in the haystack (11 million points), but really, why create the haystack in the first place?
As per my comment on your question, why not just use a Pair<Integer,Integer> class to represent co-ordinates, store them in a set, and scan them? It's simpler, quicker, less memory consuming, and is way more scalable for larger maps (assuming the points of interest are sparse... which it seems is a sensible assumption given that they're points of interest).
..trust me, computing the Euclidean distance ~425 times beats wandering around an 11 million value boolean[][] looking for the 1 value in 25,950 that's of interest (esp. in a worst case analysis).
If you're really not thrilled with the idea of scanning ~425 values each time, then (i) you're more OCD than me (:P); (ii) you should check out nearest neighbour search algorithms.

I do not know if this is what you are asking for. If the user point is P1 {x1, y1} and you want to calculate its distance to P2 {x2, y2}, the distance is calculated using Pythagoras' theorem:
distance^2 = (x2-x1)^2 + (y2-y1)^2
If you only want to know which point is closest, you can avoid calculating the square root (the smaller the distance, the smaller its square, so comparing squares gives the same answer).
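A minimal sketch of that comparison as a linear scan over the image coordinates, assuming they are kept as a list of int[]{x, y} pairs instead of a boolean[][] grid. The names NearestImage and nearest are hypothetical:

```java
import java.util.List;

// Hypothetical helper for illustration; each point is int[]{x, y}.
class NearestImage {
    // Returns the stored point closest to (x, y), or null if the list is empty.
    static int[] nearest(List<int[]> points, int x, int y) {
        int[] best = null;
        long bestSquared = Long.MAX_VALUE;
        for (int[] p : points) {
            long dx = p[0] - x, dy = p[1] - y;
            long squared = dx * dx + dy * dy; // squared distance; no square root needed to compare
            if (squared < bestSquared) {
                bestSquared = squared;
                best = p;
            }
        }
        return best;
    }
}
```

With the five image coordinates from the question and a click at 2377,1882, this returns 2377,1881.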
