I am implementing the Flame clustering algorithm as a way of learning a bit more about graphs and graph traversal. One of the first steps is constructing a k-nearest-neighbors graph, and I'm wondering what the fastest way would be of running through a list of nodes and connecting each one only to, say, its nearest five neighbors. My thought was that I would start at a node, iterate through the list of other nodes, and keep the closest ones in an array, making sure that everything past the top n is discarded. Now, I could do this by just sorting a list and keeping the top n entries, but I would much rather keep fewer things in memory, so I was wondering if there is a way to keep only the final array and update it as I iterate through, or if there is a more efficient way of generating a k-nearest-neighbors graph.
Also, please note, this is NOT a duplicate of K-Nearest Neighbour Implementation in Java. KNNG is distinct from KNN.
Place the first n nodes, sorted, in a List. Then iterate through the rest of the nodes; if a node fits in the current list (i.e. it is a top-n node), place it at the corresponding position in the list and discard the last element. If it doesn't fit in the top-n list, discard it.
for (Node neighborNode : otherNodes) {
    double dist = distanceMetric(neighborNode, currentNode);
    for (int i = 0; i < topNList.size(); i++) {
        if (dist < topNList.get(i).distance) {      // closer than the i-th entry kept so far
            topNList.remove(topNList.size() - 1);   // drop the current worst of the top n
            neighborNode.setDistance(dist);
            topNList.add(i, neighborNode);
            break;
        }
    }
}
I think the most efficient way would be to use a bounded priority queue, like
https://github.com/tdebatty/java-graphs#bounded-priority-queue
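If you'd rather not add a dependency, a rough sketch of the same idea with java.util.PriorityQueue looks like this (the Neighbor record, the cap handling, and all names are my own, not part of that library; assumes k >= 1):

import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical value type pairing a node id with its distance from the current node.
record Neighbor(int nodeId, double distance) {}

class BoundedNeighborQueue {
    private final int k;
    // Max-heap on distance: the head is always the *worst* of the k best candidates seen so far.
    private final PriorityQueue<Neighbor> heap =
            new PriorityQueue<>(Comparator.comparingDouble(Neighbor::distance).reversed());

    BoundedNeighborQueue(int k) { this.k = k; }

    void offer(Neighbor candidate) {
        if (heap.size() < k) {
            heap.add(candidate);
        } else if (candidate.distance() < heap.peek().distance()) {
            heap.poll();          // drop the current worst neighbor
            heap.add(candidate);  // keep the closer one
        }
    }

    PriorityQueue<Neighbor> neighbors() { return heap; }
}

Offering every other node to such a queue for each node gives the straightforward O(N^2 log k) baseline for building the graph.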
I'm attempting to write a program which can identify all nodes in a graph that don't share any common neighbors, and in which all vertices are contained within the various subsystems in the graph. Assume all nodes are numerically labeled for simplicity.
For example, in the graph of a cube, the two furthest corners share no common neighbors and are part of subsystems that together contain all vertices.
I'm looking to write a program that compares each potential subsystem against all other potential subsystems, regardless of the graph, number of nodes or sides, and finds groups of subsystems whose central nodes don't share common neighbors. For simplicity's sake, assume the graphs aren't usually symmetrical, unlike the cube example, as this introduces functionally equivalent systems. However, the number of nodes in a subsystem, or elements in an array, can vary.
The goal for the program is to find a set of central nodes whose neighbors are each unique to the subsystem, that is no neighbor appears in another subsystem. This would also mean that the total number of nodes in all subsystems, central nodes and neighbors together, would equal the total number of vertices in the graph.
My original plan was to use a 2D array, where rows act as stand-ins for the subsystems. It would compare individual elements in an array against all other elements in all other arrays. If two arrays contain no common elements, the compared array is indexed and its central node recorded; otherwise it is discarded for this iteration. After the program has finished iterating the 2D array against the first row, it adds up the number of elements from all recorded rows to see if all nodes in the graph are represented. So if a graph contains x nodes, and the number of elements in the recorded rows is less than x, then the program moves down one row to the next subsystem and compares all values in that row against all other values as before.
Eventually, this system should print out which nodes can make up a group of subsystems that encompass all vertices and whose central nodes share no common neighbors.
My limited exposure to CS makes a task like this daunting, as it's just my way of solving puzzles presented by my professor. I'd find the systems by hand through guess-and-check methods, but with a 60+ node array...
Thanks for any help; even simple pointers in the right direction would be very much appreciated.
I don't have a good solution (and maybe none exists; it sounds very close to vertex cover), so you may need to resort to backtracking. The idea is the following:
Keep a list of uncovered vertices and a list of potential central node candidates. Both lists initially contain all vertices. Then start placing a random central node from the candidate list. This will erase the node and its one-ring from the uncovered list and the node plus its one-ring and two-ring from the candidate list. Do this until the uncovered list is empty or you run out of candidates. If you make a mistake, revert the last step (and possibly more).
In pseudo-code, this looks as follows:
findSolution(uncoveredVertices : list, centralNodeCandidates : list, centralNodes : list)
    if uncoveredVertices is empty
        return centralNodes                 //we have found a valid partitioning
    if centralNodeCandidates is empty
        return [failure]                    //we cannot place more central nodes
    for every n in centralNodeCandidates
        newUncoveredVertices     <- uncoveredVertices \ { n } \ one-ring of n
        newCentralNodeCandidates <- centralNodeCandidates \ { n } \ one-ring of n \ two-ring of n
        newCentralNodes          <- centralNodes u { n }
        subProblemSolution = findSolution(newUncoveredVertices, newCentralNodeCandidates, newCentralNodes)
        if subProblemSolution is not [failure]
            return subProblemSolution
    next
    return [failure]                        //none of the possible routes to go yielded a valid solution
Here, \ is the set minus operator and u is set union.
There are several possible optimizations:
If you know the maximum number of nodes, you can represent the lists as bitmaps (for a maximum of 64 nodes, this even fits into a 64-bit integer); see the sketch after this list.
You may end up checking the same state multiple times. To avoid this, you may want to cache the states that resulted in failures. This is in the spirit of dynamic programming.
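For illustration, here is a rough sketch of the bitmap variant in Java (at most 64 vertices, each list packed into a single long; the oneRing adjacency-mask array and all names are assumptions of mine, not something from the question):

// Backtracking over bitmask sets; vertices are bit positions 0..63.
// oneRing[v] has a bit set for every neighbor of v (v itself excluded).
// Returns a bitmask of central nodes, or null if no valid placement exists.
static Long solve(long uncovered, long candidates, long centralNodes, long[] oneRing) {
    if (uncovered == 0) return centralNodes;          // every vertex is covered
    if (candidates == 0) return null;                 // no central node can be placed
    for (long rest = candidates; rest != 0; rest &= rest - 1) {
        int n = Long.numberOfTrailingZeros(rest);
        long self = 1L << n;
        long twoRing = 0;
        for (int v = 0; v < 64; v++) {                // union of the neighbors' neighborhoods
            if ((oneRing[n] >>> v & 1L) != 0) twoRing |= oneRing[v];
        }
        Long sub = solve(uncovered & ~(self | oneRing[n]),
                         candidates & ~(self | oneRing[n] | twoRing),
                         centralNodes | self,
                         oneRing);
        if (sub != null) return sub;
    }
    return null;
}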
My tree is represented by its edges and the root node. The edge list is undirected.
char[][] edges = new char[][]{
        new char[]{'D','B'},
        new char[]{'A','C'},
        new char[]{'B','A'}
};
char root = 'A';
The tree is
  A
 / \
B   C
|
D
How do I do depth first traversal on this tree? What is the time complexity?
I know time complexity of depth first traversal on linked nodes is O(n). But if the tree is represented by edges, I feel the time complexity is O(n^2). Am I wrong?
Giving code is appreciated, although I know it looks like a homework assignment.
The general template behind DFS looks something like this:
function DFS(node) {
    if (!node.visited) {
        node.visited = true;
        for (each edge {node, v}) {
            DFS(v);
        }
    }
}
If you have your edges represented as a list of all the edges in the graph, then you could implement the for loop by iterating across all the edges in the graph and, every time you find one with the current node as its source, following the edge to its endpoint and running a DFS from there. If you do this, then you'll do O(m) work per node in the graph (here, m is the number of edges), so the runtime will be O(mn), since you'll do this at most once per node in the graph. In a tree, the number of edges is always O(n), so for a tree the runtime is O(n^2).
That said, if you have a tree and there are only n edges, you can speed this up in a bunch of ways. First, you could consider doing an O(n log n) preprocessing step to sort the array of edges. Then, you can find all the edges leaving a given node by doing a binary search to find the first edge leaving the node, then iterating across the edges starting there to find just the edges leaving the node. This improves the runtime quite a bit: you do O(log n) work per node for the binary search, and then every edge gets visited only once. This means that the runtime is O(n log n). Since you've mentioned that the edges are undirected, you'll actually need to create two different copies of the edges array - one that's the original one, and one with the edges reversed - and should sort each one independently. The fact that DFS marks visited nodes along the way means that you don't need to do any extra bookkeeping here to figure out which direction you should go at each step, and this doesn't change the overall time complexity, though it does increase the space usage.
Alternatively, you could use a hashing-based solution. Before doing the DFS, iterate across the edges and convert them into a hash table whose keys are the nodes and whose values are lists of the edges leaving that node. This will take expected time O(n). You can then implement the "for each edge" step quite efficiently by just doing a hash table lookup to find the edges in question. This reduces the time to (expected) O(n), though the space usage goes up to O(n) as well. Since your edges are undirected, as you populate the table, just be sure to insert the edge in each direction.
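For concreteness, here is a sketch of that hashing approach applied to the exact representation from the question (char[][] edges, root 'A'); the class and helper names are just placeholders:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class EdgeListDfs {
    public static void main(String[] args) {
        char[][] edges = { {'D', 'B'}, {'A', 'C'}, {'B', 'A'} };
        char root = 'A';

        // Build an adjacency map in expected O(n); each undirected edge is inserted in both directions.
        Map<Character, List<Character>> adj = new HashMap<>();
        for (char[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }

        dfs(root, adj, new HashSet<>());
    }

    // Each node and each edge is touched a constant number of times: O(n) for a tree.
    static void dfs(char node, Map<Character, List<Character>> adj, Set<Character> visited) {
        if (!visited.add(node)) return;       // already visited
        System.out.println(node);             // "process" the node in preorder
        for (char next : adj.getOrDefault(node, List.of())) {
            dfs(next, adj, visited);
        }
    }
}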
I have two array lists and I would like to link elements from the first array list to elements of the second array list. The elements have a property, say A.
The condition is: an element of the first array with a high value of element.getA() prefers to link with an element of the second array with a low value of A.
I understand that for selecting an element according to a biased probability I can calculate the cumulative probabilities and then do something like this Selecting nodes with probability proportional to trust
Let's see if this is clearer: think about the preferential attachment mechanism. In that case, a node links to another node with a probability that increases with the degree of the chosen node. I simply would like to tweak preferential attachment and bias the probability of a node linking to another node not only by a property of the second node, but also by a property of the first node. And I want this to be inverse: small nodes prefer to link to big nodes, and big nodes prefer to link to small nodes.
Best regards,
Simone
For each pair, calculate the difference (or absolute difference, or squared difference). Then use that difference as the weight when selecting one pair.
Remove pairs that are no longer valid and repeat.
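A minimal sketch of that weighted pick, assuming a hypothetical Element type with getA() and using the absolute difference as the weight (roulette-wheel selection; all names are mine):

import java.util.List;
import java.util.Random;

class BiasedPairSelection {

    // Hypothetical element type standing in for the objects in the two array lists.
    record Element(double a) { double getA() { return a; } }

    // Picks one (i, j) pair with probability proportional to |first_i.getA() - second_j.getA()|.
    // Callers would link the returned pair, remove the two elements, and call this again.
    static int[] pickPair(List<Element> first, List<Element> second, Random rnd) {
        double total = 0;
        for (Element f : first)
            for (Element s : second)
                total += Math.abs(f.getA() - s.getA());

        double r = rnd.nextDouble() * total;
        double cumulative = 0;
        for (int i = 0; i < first.size(); i++) {
            for (int j = 0; j < second.size(); j++) {
                cumulative += Math.abs(first.get(i).getA() - second.get(j).getA());
                if (r <= cumulative) return new int[] {i, j};
            }
        }
        // Fallback for floating-point rounding near the end of the range: take the last pair.
        return new int[] {first.size() - 1, second.size() - 1};
    }
}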
I'm faced with a problem which requires a Queue data structure supporting fast k-th largest element finding.
The requirements of this data structure are as follows:
The elements in the queue are not necessarily integers, but they must be comparable to each other, i.e. we can tell which one is greater when we compare two elements (they can be equal as well).
The data structure must support enqueue (adds the element at the tail) and dequeue (removes the element at the head).
It can quickly find the k-th largest element in the queue; please note that k is not a constant.
You can assume that the operations enqueue, dequeue and k-th largest element finding all occur with the same frequency.
My idea is to use a modified balanced binary search tree. The tree is the same as an ordinary balanced binary search tree except that every node_i is augmented with another field n_i, which denotes the number of nodes contained in the subtree rooted at node_i. The aforementioned operations are supported as follows:
For simplicity, assume that all elements are distinct.
Enqueue(x): x is first inserted into the tree; suppose the corresponding node is node_t. We append the pair (x, pointer to node_t) to the queue.
Dequeue: suppose (e_1, node_1) is the element at the head, where node_1 is the pointer into the tree corresponding to e_1. We delete node_1 from the tree and remove (e_1, node_1) from the queue.
K-th largest element finding: suppose the root node is node_root and its two children are node_left and node_right (suppose they both exist). We compare K with the subtree counts; three cases may happen:
if K <= n_left, we find the K-th largest element in the left subtree of node_root;
if K > n_root - n_right, we find the (K - n_root + n_right)-th largest element in the right subtree of node_root;
otherwise node_root is the node we want.
The time complexity of all three operations is O(log N), where N is the number of elements currently in the queue.
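For concreteness, here is a sketch of the selection step on such an augmented node (field names are mine; it selects the K-th smallest under the usual left-is-smaller convention, and the K-th largest is just the (N - K + 1)-th smallest):

// Augmented BST node: 'count' is the number of nodes in the subtree rooted here.
class Node {
    int key;              // placeholder for the comparable element
    Node left, right;
    int count = 1;
}

class OrderStatistics {
    static int size(Node n) { return n == null ? 0 : n.count; }

    // K-th smallest (1-based) in the subtree rooted at 'root'.
    static Node select(Node root, int k) {
        int leftSize = size(root.left);
        if (k <= leftSize) return select(root.left, k);      // it lies in the left subtree
        if (k == leftSize + 1) return root;                   // the root itself
        return select(root.right, k - leftSize - 1);          // skip the left subtree and the root
    }
}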
How can I speed up the operations mentioned above? With what data structures and how?
Note - you cannot achieve better than O(log n) for all of them; at best you get to "choose" which operation you care about the most. (Otherwise, you could sort in O(n) by feeding the array to the DS and querying the 1st, 2nd, 3rd, ..., nth elements.)
Using a skip list instead of a balanced BST as the sorted structure can reduce dequeue complexity to O(1) average case. It does not affect the complexity of any other operation.
To remove from a skip list, all you need to do is get to the element using the pointer from the head of the queue, then follow the links up and remove it at each level. The expected number of nodes that need to be deleted is 1 + 1/2 + 1/4 + ... = 2.
Finding the Kth element can be achieved in O(log K) by starting from the leftmost node (rather than the root) and making your way up until you find you have "more sons than needed", then treating the node just found as the root, just like the algorithm in the question. Though this is better in asymptotic complexity, the constant factor is doubled.
I found an interesting paper:
Sliding-Window Top-k Queries on Uncertain Streams published in VLDB 2008 and cited by 71.
https://www.cse.ust.hk/~yike/wtopk.pdf
VLDB is one of the top conferences in the database research area, and the number of citations suggests the data structure actually works.
The paper looks pretty difficult, but if you really need to improve your data structure, I suggest you read this paper or the papers in its reference list.
You can also use a finger tree.
For example, a priority queue can be implemented by labeling the internal nodes with the minimum priority of their children, and an indexed list/array can be implemented by labeling nodes with the count of the leaves beneath them. Finger trees provide amortized O(1) cons, reversing and cdr, O(log n) append and split, and can be adapted to indexed or ordered sequences.
Also note that being a purely functional structure makes this a good choice for concurrent usage.
I'm seeking to display a fixed number of items on a web page according to their respective weight (represented by an Integer). The List where these items are found can be of virtually any size.
The first solution that comes to mind is to do a Collections.sort() and to get the items one by one by going through the List. Is there a more elegant solution though that could be used to prepare, say, the top eight items?
Just go for Collections.sort(..). It is efficient enough.
This algorithm offers guaranteed n log(n) performance.
You can try to implement something more efficient for your concrete case if you know some distinctive properties of your list, but that would not be justified. Furthermore, if your list comes from a database, for example, you can LIMIT it & order it there instead of in code.
Your options:
Do a linear search, maintaining the top N weights found along the way. This should be quicker than sorting a lengthy list if, for some reason, you can't reuse the sorting results between displays of the page (e.g. the list is changing quickly).
UPDATE: I stand corrected on the linear search necessarily being better than sorting. See Wikipedia article "Selection_algorithm - Selecting k smallest or largest elements" for better selection algorithms.
Manually maintain a List (the original one or a parallel one) sorted in weight order. You can use methods like Collections.binarySearch() to determine where to insert each new item (see the sketch after this list).
Maintain a List (the original one or a parallel one) sorted in weight order by calling Collections.sort() after each modification, batch modifications, or just before display (possibly maintaining a modification flag to avoid sorting an already sorted list).
Use a data structure that maintains sorted weight-order for you: priority queue, tree set, etc. You could also create your own data structure.
Manually maintain a second (possibly weight-ordered) data structure of the top N items. This data structure is updated anytime the original data structure is modified. You could create your own data structure to wrap the original list and this "top N cache" together.
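As an illustration of the Collections.binarySearch() option above, here is a minimal sketch (the Item record and its weight field are stand-ins for your actual type):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class WeightedList {
    // Stand-in for the displayed items; only the weight matters here.
    record Item(String name, int weight) {}

    private final List<Item> sorted = new ArrayList<>();
    private final Comparator<Item> byWeightDesc =
            Comparator.comparingInt(Item::weight).reversed();

    // Insert while preserving descending weight order.
    void add(Item item) {
        int pos = Collections.binarySearch(sorted, item, byWeightDesc);
        if (pos < 0) pos = -pos - 1;   // binarySearch returns (-(insertion point) - 1) when absent
        sorted.add(pos, item);
    }

    // The top N items are simply the first N elements of the sorted list.
    List<Item> topN(int n) {
        return sorted.subList(0, Math.min(n, sorted.size()));
    }
}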
You could use a max-heap.
If your data originates from a database, put an index on that column and use ORDER BY and TOP or LIMIT to fetch only the records you need to display.
Or a priority queue.
using dollar:
List<Integer> topTen = $(list).sort().slice(10).toList();
without using dollar you should sort it in descending order using Collections.sort() with Collections.reverseOrder(), then get the first n items using list.subList(0, n).
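Without dollar, that could look roughly like this (assuming a List<Integer> named list):

List<Integer> copy = new ArrayList<>(list);
copy.sort(Collections.reverseOrder());                  // largest first
List<Integer> topTen = copy.subList(0, Math.min(10, copy.size()));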
Since you say the list of items from which to extract these top N may be of any size, and so may be large I assume, I'd augment the simple sort() answers above (which are entirely appropriate for reasonably-sized input) by suggesting most of the work here is finding the top N -- then sorting those N is trivial. That is:
Queue<Integer> topN = new PriorityQueue<Integer>(n);
for (Integer item : input) {
    if (topN.size() < n) {
        topN.add(item);
    } else if (item > topN.peek()) {
        topN.add(item);
        topN.poll();
    }
}
List<Integer> result = new ArrayList<Integer>(n);
result.addAll(topN);
Collections.sort(result, Collections.reverseOrder());
The heap here (a min-heap) is at least bounded in size. There's no real need to make a heap out of all your items.
No, not really. At least not using Java's built-in methods.
There are clever ways to get the highest (or lowest) N number of items from a list quicker than an O(n*log(n)) operation, but that will require you to code this solution by hand. If the number of items stays relatively low (not more than a couple of hundred), sorting it using Collections.sort() and then grabbing the top N numbers is the way to go IMO.
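One such hand-coded route is quickselect (my example, not something spelled out above): partition as quicksort does, but recurse into only one side until the N-th largest value is in place. A rough sketch for an int[]:

import java.util.Random;

class QuickSelect {
    private static final Random RND = new Random();

    // Rearranges 'a' in place so that its first n entries are the n largest values
    // (in no particular order). Expected O(a.length) time, O(a.length^2) worst case.
    static void topN(int[] a, int n) {
        if (n <= 0 || n >= a.length) return;
        int k = n - 1;                         // final index of the n-th largest value
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = partitionDesc(a, lo, hi);
            if (p == k) return;
            if (p < k) lo = p + 1;             // the n-th largest lies right of the pivot
            else hi = p - 1;                   // the n-th largest lies left of the pivot
        }
    }

    // Partitions a[lo..hi] in descending order around a randomly chosen pivot
    // and returns the pivot's final index.
    private static int partitionDesc(int[] a, int lo, int hi) {
        swap(a, lo + RND.nextInt(hi - lo + 1), hi);
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] > pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}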
Depends on how many. Let's define n as the total number of keys, and m as the number you wish to display.
Sorting the entire thing: O(n log n)
Scanning the array each time for the next highest number: O(n*m)
So the question is - What's the relation between n to m?
If m < log n, scanning will be more efficient.
Otherwise, m >= log n, which means sorting will be better. (For the edge case of m = log n it doesn't actually matter, but sorting also gives you the benefit of, well, a sorted array, which is always nice.)
If the size of the list is N, and the number of items to be retrieved is K, you need to call Heapify on the list, which converts the list (which has to be indexable, e.g. an array) into a priority queue. (See heapify function in http://en.wikipedia.org/wiki/Heapsort)
Retrieving an item on the top of the heap (the max item) takes O(lg N) time. So your overall time would be:
O(N + k lg N)
which is better than O (N lg N) assuming k is much smaller than N.
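As a rough Java sketch of this answer (with one caveat: java.util.PriorityQueue only does an O(N) heapify through its PriorityQueue(Collection) constructor, which uses natural ordering, so building the reverse-ordered heap with addAll below costs O(N log N) rather than O(N)):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class HeapTopK {
    static List<Integer> topK(List<Integer> items, int k) {
        // Build a max-heap of all items (see the caveat above about addAll vs. true heapify).
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
        maxHeap.addAll(items);

        List<Integer> result = new ArrayList<>(k);
        for (int i = 0; i < k && !maxHeap.isEmpty(); i++) {
            result.add(maxHeap.poll());   // each poll is O(lg N), so this loop is O(k lg N)
        }
        return result;
    }
}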
If keeping a sorted array or using a different data structure is not an option, you could try something like the following. The O time is similar to sorting the large array but in practice this should be more efficient.
small_array = big_array.slice( 0, number_of_items_to_find );
small_array.sort();
least_found_value = small_array.get(0).value;

for ( item in big_array ) {          // needs to skip the first few items already copied
    if ( item.value > least_found_value ) {
        small_array.remove(0);
        small_array.insert_sorted(item);
        least_found_value = small_array.get(0).value;
    }
}
small_array could be an Object[] and the inner loop could be done with swapping instead of actually removing and inserting into an array.