Queue data structure supporting fast k-th largest element finding


I'm faced with a problem which requires a Queue data structure supporting fast k-th largest element finding.
The requirements of this data structure are as follows:
The elements in the queue are not necessarily integers, but they must be comparable to each other, i.e. we can tell which one is greater when we compare two elements (they can be equal as well).
The data structure must support enqueue (adds the element at the tail) and dequeue (removes the element at the head).
It can quickly find the k-th largest element in the queue; please note that k is not a constant.
You can assume that the operations enqueue, dequeue and k-th largest element finding all occur with the same frequency.
My idea is to use a modified balanced binary search tree. The tree is the same as an ordinary balanced binary search tree except that every node_i is augmented with another field n_i, which denotes the number of nodes contained in the subtree rooted at node_i. The aforementioned operations are supported as follows:
For simplicity assume that all elements are distinct.
Enqueue(x): x is first inserted into the tree; suppose the corresponding node is node_t. We then append the pair (x, pointer to node_t) to the queue.
Dequeue: suppose (e_1, node_1) is the element at the head, where node_1 is the pointer into the tree corresponding to e_1. We delete node_1 from the tree and remove (e_1, node_1) from the queue.
K-th largest element finding: suppose the root node is node_root and its two children are node_left and node_right (suppose they both exist). We compare K with n_right, and three cases may happen:
if K <= n_right, we find the K-th largest element in the right subtree of node_root;
if K > n_right + 1, we find the (K - n_right - 1)-th largest element in the left subtree of node_root;
otherwise node_root is the node we want.
The time complexity of all three operations is O(log N), where N is the number of elements currently in the queue.
How can I speed up the operations mentioned above? With what data structures and how?
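For reference, the size-augmented tree plus queue design can be sketched as below. This is a minimal sketch under stated assumptions, not a tuned implementation: elements are distinct ints, the tree is a plain (unbalanced) BST so the O(log N) bounds only hold if a balanced variant is substituted, and dequeue deletes by key rather than through a stored node pointer. The class name KthQueue and its method names are hypothetical. Because the right subtree holds the larger elements, the search for the k-th largest compares k with the size of the right subtree first.

```java
import java.util.ArrayDeque;

// Sketch: FIFO queue plus a size-augmented BST (balancing omitted for brevity).
final class KthQueue {
    private static final class Node {
        int key;
        int size = 1;            // number of nodes in the subtree rooted here
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private final ArrayDeque<Integer> fifo = new ArrayDeque<>();
    private Node root;

    private static int size(Node n) { return n == null ? 0 : n.size; }

    private static Node insert(Node n, int key) {
        if (n == null) return new Node(key);
        if (key < n.key) n.left = insert(n.left, key);
        else n.right = insert(n.right, key);
        n.size = 1 + size(n.left) + size(n.right);
        return n;
    }

    private static Node delete(Node n, int key) {
        if (n == null) return null;
        if (key < n.key) n.left = delete(n.left, key);
        else if (key > n.key) n.right = delete(n.right, key);
        else {
            if (n.left == null) return n.right;
            if (n.right == null) return n.left;
            Node min = n.right;                 // in-order successor
            while (min.left != null) min = min.left;
            n.key = min.key;
            n.right = delete(n.right, min.key);
        }
        n.size = 1 + size(n.left) + size(n.right);
        return n;
    }

    void enqueue(int x) {
        fifo.addLast(x);
        root = insert(root, x);
    }

    int dequeue() {
        int x = fifo.removeFirst();
        root = delete(root, x);
        return x;
    }

    int kthLargest(int k) {
        if (k < 1 || k > size(root)) throw new IllegalArgumentException("k out of range");
        Node n = root;
        while (true) {
            int larger = size(n.right);         // elements greater than n.key
            if (k <= larger) n = n.right;
            else if (k == larger + 1) return n.key;
            else { k -= larger + 1; n = n.left; }
        }
    }
}
```

After enqueueing 5, 1, 9, 3 in that order, kthLargest(1) returns 9 and kthLargest(3) returns 3; once 5 has been dequeued, kthLargest(3) returns 1.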

Note - you cannot achieve better than O(log n) for all operations; at best you need to choose which operation you care about the most. (Otherwise, you could sort in O(n) by feeding the array to the data structure and querying the 1st, 2nd, 3rd, ..., nth elements.)
Using a skip list instead of a Balanced BST as the sorted structure
can reduce dequeue complexity to O(1) average case. It does
not affect complexity of any other op.
To remove from a skip list - all you need to do is to get to the element using the pointer from the head of the queue, and follow the links up and remove each. The expected number of nodes needed to be deleted is 1 + 1/2 + 1/4 + ... = 2.
Find-Kth can be achieved in O(log K) by starting from the leftmost node (and not the root) and making your way up until you find you have "more sons than needed", and then treating the just-found node as the root, just like the algorithm in the question. Though it is better in asymptotic complexity, the constant factor is double.

I found an interesting paper:
Sliding-Window Top-k Queries on Uncertain Streams published in VLDB 2008 and cited by 71.
https://www.cse.ust.hk/~yike/wtopk.pdf
VLDB is a leading conference in the database research area, and the number of citations suggests that the data structure is well regarded.
The paper looks pretty difficult, but if you really need to improve your data structure, I suggest you read this paper or the papers in its reference list.

You can also use a finger tree.
For example, a priority queue can be implemented by labeling the internal nodes by the minimum priority of their children in the tree, or an indexed list/array can be implemented with a labeling of nodes by the count of the leaves in their children. Finger trees can provide amortized O(1) cons, reversing, cdr, O(log n) append and split; and can be adapted to be indexed or ordered sequences.
Also note that being a purely functional structure makes this a good choice for concurrent usage.

Related

Traverse a tree represented by its edges

My tree is represented by its edges and the root node. The edge list is undirected.
char[][] edges = new char[][]{
    new char[]{'D','B'},
    new char[]{'A','C'},
    new char[]{'B','A'}
};
char root = 'A';
The tree is
    A
   / \
  B   C
 /
D
How do I do depth first traversal on this tree? What is the time complexity?
I know time complexity of depth first traversal on linked nodes is O(n). But if the tree is represented by edges, I feel the time complexity is O(n^2). Am I wrong?
Code would be appreciated, although I know this looks like a homework assignment.

The general template behind DFS looks something like this:
function DFS(node) {
    if (!node.visited) {
        node.visited = true;
        for (each edge {node, v}) {
            DFS(v);
        }
    }
}
If you have your edges represented as a list of all the edges in the graph, then you could implement the for loop by iterating across all the edges in the graph and, every time you find one with the current node as its source, following the edge to its endpoint and running a DFS from there. If you do this, then you'll do O(m) work per node in the graph (here, m is the number of edges), so the runtime will be O(mn), since you'll do this at most once per node in the graph. In a tree, the number of edges is always O(n), so for a tree the runtime is O(n^2).
That said, if you have a tree and there are only n edges, you can speed this up in a bunch of ways. First, you could consider doing an O(n log n) preprocessing step to sort the array of edges. Then, you can find all the edges leaving a given node by doing a binary search to find the first edge leaving the node, then iterating across the edges starting there to find just the edges leaving the node. This improves the runtime quite a bit: you do O(log n) work per node for the binary search, and then every edge gets visited only once. This means that the runtime is O(n log n). Since you've mentioned that the edges are undirected, you'll actually need to create two different copies of the edges array - one that's the original one, and one with the edges reversed - and should sort each one independently. The fact that DFS marks visited nodes along the way means that you don't need to do any extra bookkeeping here to figure out which direction you should go at each step, and this doesn't change the overall time complexity, though it does increase the space usage.
Alternatively, you could use a hashing-based solution. Before doing the DFS, iterate across the edges and convert them into a hash table whose keys are the nodes and whose values are lists of the edges leaving that node. This will take expected time O(n). You can then implement the "for each edge" step quite efficiently by just doing a hash table lookup to find the edges in question. This reduces the time to (expected) O(n), though the space usage goes up to O(n) as well. Since your edges are undirected, as you populate the table, just be sure to insert the edge in each direction.
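The hashing-based approach can be sketched in Java as follows. This is only an illustration under assumptions: the class name EdgeListDfs and the method dfs are hypothetical, nodes are chars as in the question, and the traversal is recursive (so very deep trees would need an explicit stack). The table is built in one pass with each undirected edge inserted in both directions, after which the DFS itself touches every edge a constant number of times.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class EdgeListDfs {
    /** Returns the nodes in depth-first (preorder) order, starting from root. */
    static List<Character> dfs(char[][] edges, char root) {
        // Expected O(n) preprocessing: node -> list of neighbours.
        Map<Character, List<Character>> adjacent = new HashMap<>();
        for (char[] e : edges) {
            adjacent.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adjacent.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        List<Character> order = new ArrayList<>();
        visit(root, adjacent, new HashSet<>(), order);
        return order;
    }

    private static void visit(char node, Map<Character, List<Character>> adjacent,
                              Set<Character> visited, List<Character> order) {
        if (!visited.add(node)) return;   // already visited, e.g. the parent
        order.add(node);
        List<Character> neighbours =
                adjacent.getOrDefault(node, Collections.<Character>emptyList());
        for (char next : neighbours) {
            visit(next, adjacent, visited, order);
        }
    }
}
```

With the edge list from the question and root 'A', this visits A, C, B, D; the order among siblings simply follows the order in which the edges appear in the input.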

Count number of nodes within given range in AVL tree

Suppose I have an AVL tree of distinct integers. I need to determine the number of nodes which lie in the interval [a, b) where a < b. Note that [a, b) is supplied by the user and hence I do not know beforehand what the value of a and b are. Also, a and b may not be present in the tree at all. For example, if I have a tree containing the integers {1, 2, 4, 5, 6, 7} then the user should expect an answer of 3 if he supplies the interval [3, 7).
A naive implementation would be to traverse every node and increment the count by 1 every time a node is found in the given interval. But this would have a worst case time complexity of O(n), as it is possible for every single integer in the tree to be within the given range. I need a faster algorithm, and after doing some research I found that it requires storing a size statistic in every node so that the rank of any given node can be easily computed.
I would like to do something like rank(b) - rank(a), but the problem is that a and b may not exist in the tree. In the above example, rank(7) would return 6 but rank(3) will not return any meaningful value.
Can anyone offer suggestions as to how I can address this issue? Also, I know that there is another similar question on this website, but that one involves C++ while this one involves Java. Also, I could not find a satisfactory answer there.

I've implemented a stack based tree iterator for an AVL tree some (long) time ago. It should work for your case like this:
Create an array "treestack" which holds structs for traversal info. The struct just needs a bool "visited" and a pointer to your node type. The array can be of static size, e.g. hold 64 info elements (one for each level of your tree; 64 levels is far more than an AVL tree that fits in memory will ever need).
Change the search method of your AVL tree to put the root node at treestack[0] when you begin the search, and push all other nodes on top of the treestack as you follow the left and right child nodes during your search. Edit: Note that an unsuccessful search will leave your treestack ending in a node with the next smaller or next higher value, which is exactly what you want (just skip counting it if it's smaller; we still have a valid iterator start path).
You now have a path in your treestack which you can use for subsequent in-order traversal to find the next higher values. In-order traversal using the stack works like this:
Start at the last element in treestack, and keep a treeindex which is initially the array index of the last item.
When there is a right node and it is not marked visited: follow ONE right link from the current node, then keep following left links for as long as they exist. Wherever you stop (including the case where there are no left nodes but a single right one) is the next higher value. Put all nodes at the end of your treestack as you follow them, and increment your treeindex as you follow paths. Mark the chosen final node with the next higher value as visited, and increase your node counter by 1.
Now, to find subsequent higher values, ascend in the tree by taking treeindex-1 in your treestack as the current node, and repeat the above step to find the next node with a higher value.
If the current node has no right child and the node is not marked visited: mark it as visited, and increment the node counter by 1.
You're done when you either reach the root node with treeindex 0 or the node containing the max value.
I hope it helps.

Instead of
rank(b) - rank(a)
what I would do is
rank(X) - rank(Y)
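X is the very first node having value >= b (b itself is excluded from [a, b)).
Y is the very first node having value >= a.
If no such node exists for a bound, use n + 1 as its rank, where n is the number of nodes in the tree.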
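One way to sidestep missing endpoints is to count the keys strictly less than a bound instead of ranking a node that may not exist. The sketch below illustrates this on a size-augmented BST; it is an assumption-laden illustration, not an AVL implementation: the rotations are omitted (an AVL tree would update the size fields during rebalancing), and the names SizedBst, countLess and countInRange are hypothetical.

```java
// Sketch: size-augmented BST (AVL rebalancing omitted) with range counting.
final class SizedBst {
    private static final class Node {
        final int key;
        int size = 1;            // number of nodes in the subtree rooted here
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;

    private static int size(Node n) { return n == null ? 0 : n.size; }

    void insert(int key) { root = insert(root, key); }

    private static Node insert(Node n, int key) {
        if (n == null) return new Node(key);
        if (key < n.key) n.left = insert(n.left, key);
        else if (key > n.key) n.right = insert(n.right, key);
        // equal keys are ignored: the tree holds distinct integers
        n.size = 1 + size(n.left) + size(n.right);
        return n;
    }

    /** Number of keys strictly less than x; x does not have to be in the tree. */
    int countLess(int x) {
        int count = 0;
        Node n = root;
        while (n != null) {
            if (x <= n.key) {
                n = n.left;
            } else {
                count += size(n.left) + 1;   // n and its whole left subtree are < x
                n = n.right;
            }
        }
        return count;
    }

    /** Number of keys in the half-open interval [a, b). */
    int countInRange(int a, int b) {
        return countLess(b) - countLess(a);
    }
}
```

For the tree {1, 2, 4, 5, 6, 7} and the interval [3, 7), countLess(7) is 5 and countLess(3) is 2, giving 3. Each call walks a single root-to-leaf path, so with a balanced tree the query costs O(log n).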

Why store the points in a binary tree?

This question covers a software algorithm, so it should be on topic.
I am working on an interview question from Amazon Software Question,
specifically "Given a set of points (x,y) and an integer "n", return n number of points which are close to the origin"
Here is the sample high-level pseudocode answer to this question, from Sample Answer:
Step 1: Design a class called point which has three fields - int x, int y, int distance
Step 2: For all the points given, find the distance between them and origin
Step 3: Store the values in a binary tree
Step 4: Heap sort
Step 5: print the first n values from the binary tree
I agree with steps 1 and 2 because it makes sense in terms of object-oriented design to have one software bundle of data, Point, encapsulate the fields x, y and distance.
Can someone explain the design decisions from 3 to 5?
Here's how I would do steps 3 to 5:
Step 3: Store all the points in an array
Step 4: Sort the array with respect to distance (I use some built-in sort here, like Arrays.sort)
Step 5: With the array sorted in ascending order, I print off the first n values
Why did the author of that response use a more complicated data structure, a binary tree, and not something simpler like the array that I used? I know what a binary tree is: a hierarchical data structure of nodes with two pointers. In his algorithm, would you have to use a BST?

First, I would not say that having Point(x, y, distance) is good design or encapsulation. distance is not really part of a point; it can be computed from x and y. In terms of design, I would certainly have a function, i.e. a static method of Point or of a helper class Points:
double distance(Point a, Point b)
Then for the specific question, I actually agree with your solution: put the data in an array, sort this array and then extract the first N.
What the example may be hinting at is that heapsort often uses a binary tree structure inside the array to be sorted, as explained here:
The heap is often placed in an array with the layout of a complete binary tree.
Of course, if the distance to the origin is not stored in the Point, then for performance reasons it has to be put in the array together with the corresponding Point object, or with any information that allows the Point object to be retrieved from the sorted distance (reference, index), e.g.
List<Pair<Long, Point>> distancesToOrigin = new ArrayList<>();
to be sorted with a Comparator<Pair<Long, Point>>
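The array-and-sort approach can be sketched as follows. The names NearestBySort, Point and nearest are hypothetical, and the sketch makes two assumptions beyond the text: it compares squared distances (which gives the same ordering as the true distance and avoids a square root), and it computes the distance on demand instead of storing it in the Point.

```java
import java.util.Arrays;
import java.util.Comparator;

final class NearestBySort {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        // Squared distance orders points the same way as the real distance.
        long squaredDistanceToOrigin() {
            return (long) x * x + (long) y * y;
        }
    }

    /** Returns the n points closest to the origin, nearest first. */
    static Point[] nearest(Point[] points, int n) {
        Point[] sorted = points.clone();   // leave the caller's array untouched
        Arrays.sort(sorted, Comparator.comparingLong(Point::squaredDistanceToOrigin));
        return Arrays.copyOf(sorted, Math.min(n, sorted.length));
    }
}
```

For the points (3,4), (1,1), (0,2), (5,5) and n = 2, the result is (1,1) followed by (0,2). The cost is dominated by the O(m log m) sort, where m is the number of points.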

It is not necessary to use a BST. However, it is good practice to use a BST when you need a structure that keeps itself sorted. I do not see the need to both use a BST and heapsort it (somehow). You could use just a BST and retrieve the first n points. You could also use an array, sort it and use the first n points.
If you want to sort an array of type Point, you could have Point implement the Comparable interface and define its compareTo method.
You never have to choose any particular data structure, but by determining the needs you have, you can easily determine the optimal structure.

The approach described in this post is more complex than needed for such a question. As you noted, simple sorting by distance will suffice. However, to help clear up your confusion about what your sample answer author was trying to get at, maybe consider the k nearest neighbors problem, which can be solved with a k-d tree, a structure that applies space partitioning to a k-dimensional dataset. A k-d tree is indeed a binary tree (each node splits the space in two along one axis). This tree is inherently sorted and doesn't need any "heap sorting."
It should be noted that building the k-d tree will take O(n log n), and is only worth the cost if you need to do repeated nearest neighbor searches on the structure. If you only need to perform one search to find k nearest neighbors from the origin, it can be done with a naive O(n) search.
How to build a k-d tree, straight from Wiki:
One adds a new point to a k-d tree in the same way as one adds an element to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the "left" or "right" side of the splitting plane. Once you get to the node under which the child should be located, add the new point as either the left or right child of the leaf node, again depending on which side of the node's splitting plane contains the new node.
Adding points in this manner can cause the tree to become unbalanced, leading to decreased tree performance. The rate of tree performance degradation is dependent upon the spatial distribution of tree points being added, and the number of points added in relation to the tree size. If a tree becomes too unbalanced, it may need to be re-balanced to restore the performance of queries that rely on the tree balancing, such as nearest neighbour searching.
Once you have built the tree, you can find k nearest neighbors to some point (the origin in your case) in O(k log n) time.
Straight from Wiki:
Searching for a nearest neighbour in a k-d tree proceeds as follows:
Starting with the root node, the algorithm moves down the tree recursively, in the same way that it would if the search point were being inserted (i.e. it goes left or right depending on whether the point is lesser than or greater than the current node in the split dimension).
Once the algorithm reaches a leaf node, it saves that node point as the "current best"
The algorithm unwinds the recursion of the tree, performing the following steps at each node:
If the current node is closer than the current best, then it becomes the current best.
The algorithm checks whether there could be any points on the other side of the splitting plane that are closer to the search point than the current best. In concept, this is done by intersecting the splitting hyperplane with a hypersphere around the search point that has a radius equal to the current nearest distance. Since the hyperplanes are all axis-aligned this is implemented as a simple comparison to see whether the difference between the splitting coordinate of the search point and current node is lesser than the distance (overall coordinates) from the search point to the current best.
If the hypersphere crosses the plane, there could be nearer points on the other side of the plane, so the algorithm must move down the other branch of the tree from the current node looking for closer points, following the same recursive process as the entire search.
If the hypersphere doesn't intersect the splitting plane, then the algorithm continues walking up the tree, and the entire branch on the other side of that node is eliminated.
When the algorithm finishes this process for the root node, then the search is complete.
This is a pretty tricky algorithm that I would hate to need to describe as an interview question! Fortunately the general case here is more complex than is needed, as you pointed out in your post. But I believe this approach may be close to what your (wrong) sample answer was trying to describe.
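For the one-off case, where no k-d tree is worth building, a single pass over the points is enough. The sketch below swaps in a bounded max-heap, which is a different technique from the k-d tree described above and runs in O(n log k) rather than strictly O(n). The names NearestByHeap and nearest are hypothetical, and points are represented as int[]{x, y} purely for brevity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class NearestByHeap {
    static long squaredDistance(int[] p) {
        return (long) p[0] * p[0] + (long) p[1] * p[1];
    }

    /** Returns the k points closest to the origin, nearest first. */
    static List<int[]> nearest(List<int[]> points, int k) {
        Comparator<int[]> byDistance = Comparator.comparingLong(NearestByHeap::squaredDistance);
        // Max-heap on distance: the farthest of the current candidates sits on top.
        PriorityQueue<int[]> heap = new PriorityQueue<>(byDistance.reversed());
        for (int[] p : points) {
            heap.add(p);
            if (heap.size() > k) heap.poll();   // drop the farthest candidate
        }
        List<int[]> result = new ArrayList<>(heap);
        result.sort(byDistance);                // heap iteration order is unspecified
        return result;
    }
}
```

The heap never holds more than k + 1 points, so memory stays at O(k) even when the input is large.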

how to link elements with a certain probability inversely proportional to a variable

I have two array lists and I would like to link an element from the first list to an element of the second list. Elements have a property, say A.
The condition is: an element of the first list with a high value of element.getA() prefers to link with an element of the second list with a low value of A.
I understand that for selecting an element according to a biased probability I can calculate the cumulative probabilities and then do something like this: Selecting nodes with probability proportional to trust
Let's see if this is clearer: think about the preferential attachment mechanism. In that case, a node links to another node with a probability which increases with the degree of the chosen node. I simply would like to hack the preferential attachment and bias the probability for a node to link to another node not only on a property of the second node, but also on a property of the first node. And I want this to be inverse, so that small nodes prefer to link to big nodes and big nodes prefer to link to small nodes.
Best regards,
Simone

[edited]
For each pair, calculate the difference (or absolute difference, or squared difference), then use that difference as a weight to select one pair.
Remove the pairs that are no longer valid and repeat.
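A single selection step of this scheme might look like the sketch below. It assumes the property A is available as a double per element, uses the absolute difference as the weight, and picks a pair by walking the cumulative weights; the names WeightedPairing and pickPair are hypothetical. Removing the chosen elements and repeating is left to the caller.

```java
import java.util.Random;

final class WeightedPairing {
    /**
     * Picks indices {i, j} into a and b with probability proportional to |a[i] - b[j]|,
     * so that dissimilar elements are more likely to be linked.
     */
    static int[] pickPair(double[] a, double[] b, Random rng) {
        double total = 0;
        for (double x : a) {
            for (double y : b) total += Math.abs(x - y);
        }
        if (total == 0) throw new IllegalStateException("every pair has weight zero");

        double target = rng.nextDouble() * total;   // point on the cumulative scale
        int[] last = null;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                double w = Math.abs(a[i] - b[j]);
                if (w == 0) continue;                // zero-weight pairs are never chosen
                last = new int[]{i, j};
                target -= w;
                if (target < 0) return last;
            }
        }
        return last;   // only reached through floating-point round-off
    }
}
```

Each call scans all pairs, so one selection costs O(|a| * |b|); if that is too slow, the cumulative weights could be precomputed and searched with a binary search instead.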

Time Complexity of Creating a Binary Tree

I am trying to create a tree from a source that provides the 2 nodes to be added to the tree and the node to which these 2 new nodes should be added. To find where this node is in the tree, I used an inorder traversal, which takes O(n). So if there are n nodes to be added to the tree, will the creation of the whole tree be O(n^2)? My constraint is that it should only take O(n) to create the tree.

Looking up a node in a balanced binary search tree is O(log(n)) because the tree has about log(n) levels (each level holds twice as many nodes as the level above it). Therefore creating the tree by inserting n elements is O(n log(n)).

You could keep references to each node of the tree in a HashMap [1], to get O(1) access to each node instead of the O(log(n)) which is typical of trees. That would make it possible to build the tree in O(n) time, because that HashMap lets you jump directly to a node instead of traversing there from the tree's root node.
[1] The key would be whatever the source uses for uniquely identifying the nodes (I'm assuming it to be an integer or string). The value would be a reference to the node in the tree. Note that the tree implementation must make all its nodes public, so you will probably need to write the tree yourself or find a suitable library (JDK's trees such as TreeMap keep their internal structure private, so they won't cut it).
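A minimal sketch of this idea, assuming the source identifies nodes by string ids and always names a parent that has already been added (the names TreeBuilder, Node and addChildren are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

final class TreeBuilder {
    static final class Node {
        final String id;
        Node left, right;
        Node(String id) { this.id = id; }
    }

    // id -> node, so that a parent is found without traversing the tree
    private final Map<String, Node> byId = new HashMap<>();
    final Node root;

    TreeBuilder(String rootId) {
        root = new Node(rootId);
        byId.put(rootId, root);
    }

    /** Attaches two new children to an existing node in expected O(1) time. */
    void addChildren(String parentId, String leftId, String rightId) {
        Node parent = byId.get(parentId);
        if (parent == null) {
            throw new IllegalArgumentException("unknown parent: " + parentId);
        }
        parent.left = new Node(leftId);
        parent.right = new Node(rightId);
        byId.put(leftId, parent.left);
        byId.put(rightId, parent.right);
    }
}
```

Each addChildren call does a constant number of hash table operations, so processing n additions takes expected O(n) time overall, at the cost of O(n) extra space for the map.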

For a binary search tree, the time complexity will typically be O(n log n) when the elements arrive in random (unsorted) order, and O(n^2) when they arrive sorted. This is because inserting each element of a sorted sequence into an unbalanced BST takes O(n) time, so n elements take O(n^2); for a balanced or almost balanced binary search tree, the maximum time for an insertion is log n, so n elements take n log n.
