Solving TSP using dynamic programming in Java

I've found many resources online discussing this and related topics, but I haven't found anything that really helps me know where to start with implementing this solution.
To clarify, starting from city 0, I need to visit every other city once, and return back to city 0.
I have an array such as this one:
0 1129 1417 1240 1951
1129 0 1100 800 2237
1417 1100 0 1890 3046
1240 800 1890 0 1558
1951 2237 3046 1558 0
Along with finding the optimal route, I need to also find the optimal partial routes along the way. For example, I'd start with routes of length 2, and end up printing out something like this:
S = {0,1}
C({0,1},1) = 1129
S = {0,2}
C({0,2},2) = 1417
S = {0,3}
C({0,3},3) = 1240
S = {0,4}
C({0,4},4) = 1951
Then I'd go to routes of length 3, and print something like this:
S = {0,1,2}
C({0,1,2},1) = 2517
C({0,1,2},2) = 2229
and so on...
To make this a dynamic programming solution, I assume I should be saving the shortest distance between any n nodes. The best way I've thought of to do that is with a HashMap, where the key would be an integer value made of every node included in that path, in ascending order (a path going through nodes 0>1>3>4 or 0>1>4>3 could be stored as '134'), and each key would map to a pair storing the path order as a List and the total distance as an integer.
At this point I would think I'd want to calculate all paths of distance 2, then all of distance 3, and then take the smallest few and use the hashmap to find the shortest path back for each, and compare.
Does this seem like it could work? Or am I completely on the wrong track?

You're sort of on track. Dynamic programming isn't usually the way to tackle a TSP in practice. What you're actually close to is computing a minimum spanning tree: a tree that connects all nodes using the smallest possible sum of edge weights. Two algorithms are frequently used for this: Prim's and Kruskal's. They produce something similar to your optimal partial routes list. I'd recommend you look at Prim's algorithm: https://en.wikipedia.org/wiki/Prim%27s_algorithm
The easiest way of approximately solving a TSP is to find the minimum spanning tree and then do a pre-order walk over that tree. This gives you an approximate travelling salesman solution and is known as the triangle-inequality approximation: provided the distances obey the triangle inequality, the tour is guaranteed to be no more than twice as long as an optimal one, but it can be computed much faster. This web page explains it fairly well: http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/AproxAlgor/TSP/tsp.htm
If you want a solution closer to optimal, you'll need to look at Christofides' algorithm, which is more complicated.

You are on the right track.
I think you're getting at the fact that the DP recursion itself only tells you the optimal cost for each S and j, not the node that attained that cost. The classical way to handle this is through "backtracking": once you find that, for example, C({0,1,2,3,4},-) = 10000, you figure out which node attained that cost; let's say it's 3. Then you figure out which node attained C({0,1,2,3,4},3); let's say it's 1. Then you figure out which node attained C({0,1,2,4},1); let's say it's 2, and so on. This is the classical way, and it avoids having to store a lot of intermediate data.
For DPs with a small state space, it's easier just to store all those optimizers along the way. In your case the state space is exponentially large, so storing all the optimizers might be expensive; but you already have to store an equally large data structure (C), so most likely you can afford to store the optimizers as well. Your intermediate idea -- store only a few of the best, and if the backtracking routine needs one you didn't store, recompute it the classical way -- seems reasonable, but I'm not sure it's worth the extra coding versus a pure backtracking approach.
One clarification: You don't actually want to "calculate all paths of distance 2, then all of distance 3, ...", but rather you want to enumerate all sets of size 2, then all sets of size 3, etc. Still exponential, but not as bad as enumerating all paths.
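To make this concrete, here is a minimal Held-Karp sketch in Java with the optimizers stored next to C, so the tour can be read back directly. The HashMap idea is replaced by a bitmask index (bit i of mask set means city i is in S); dist is a matrix like the one in the question, and all names are my own illustration, not a definitive implementation.

import java.util.*;

// Minimal Held-Karp sketch (bitmask DP) with stored optimizers.
class HeldKarp {
    static void solve(int[][] dist) {
        int n = dist.length;
        int full = 1 << n;
        int[][] C = new int[full][n];       // C[mask][j]: cheapest 0 -> j path visiting exactly mask
        int[][] parent = new int[full][n];  // the node before j on that cheapest path
        for (int[] row : C) Arrays.fill(row, Integer.MAX_VALUE / 2);
        C[1][0] = 0;                        // start at city 0 with only {0} visited

        for (int mask = 1; mask < full; mask++) {
            if ((mask & 1) == 0) continue;  // every subset S must contain city 0
            for (int j = 1; j < n; j++) {
                if ((mask & (1 << j)) == 0) continue;
                int without = mask ^ (1 << j);
                for (int k = 0; k < n; k++) {
                    if ((without & (1 << k)) == 0) continue;
                    int cand = C[without][k] + dist[k][j];
                    if (cand < C[mask][j]) {
                        C[mask][j] = cand;   // this is C(S, j) from the question
                        parent[mask][j] = k; // remember the optimizer
                    }
                }
            }
        }

        // Close the tour back to city 0 and pick the best final city.
        int best = Integer.MAX_VALUE, last = -1;
        for (int j = 1; j < n; j++) {
            if (C[full - 1][j] + dist[j][0] < best) {
                best = C[full - 1][j] + dist[j][0];
                last = j;
            }
        }

        // Walk the stored optimizers backwards to recover the tour.
        Deque<Integer> tour = new ArrayDeque<>();
        tour.push(0);                        // the final return to city 0
        for (int mask = full - 1, j = last; j != 0; ) {
            tour.push(j);
            int k = parent[mask][j];
            mask ^= (1 << j);
            j = k;
        }
        tour.push(0);                        // the starting city
        System.out.println("cost " + best + ", tour " + tour);
    }
}

Printing C[mask][j] grouped by the popcount of mask would reproduce the C(S, j) listing from the question; for n cities the table has 2^n * n entries, which is exactly the exponential state space mentioned above.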

Related

Condition to terminate BFS

I have this assignment where I am given a list of tuples, each containing 2 Strings, like this:
[ ("...","...") , ("...","...") , ("...","...") ... ]
I have to calculate the shortest path which will lead to an extreme-string.
An extreme-string is defined as a tuple of 2 strings where the first string is equal to the second string.
I know this might sound confusing so let me set an example.
Given:
The list [("0","100") , ("01","00") , ("110","11")]
with indices 0,1,2
The shortest path is: [2,1,2,0]
The extreme-string is equal to: "110011100"
Step-by-step explanation:
Starting with the tuple at index 2, the initial pair is: "110","11"
Appending the tuple at index 1, the next pair is: "11001","1100"
Appending the tuple at index 2, the next pair is: "11001110","110011"
Appending the tuple at index 0, the final pair is: "110011100","110011100"
So say you begin with tuple ("X","Y") and then pick tuple ("A","B"); the result is ("XA","YB").
The way I approached this problem was using BFS, which I have already implemented and which seems right to me, but there is an issue I am dealing with.
If the input is something like :
[("1","111")]
then the algorithm will never terminate, as it will always be in a state like "111..." - "111111111...".
Checking for this specific input is not a good idea, as there are many inputs that can reproduce this behaviour.
Having an upper bound on the number of iterations is also not a good idea, because in some cases a finite result may actually exist just beyond that bound.
Any insight would be really useful.
Since it's an assignment I can't really solve it for you, but I'll try to give tips:
BFS sounds great to me as well.
One thing that differentiates BFS from, say, DFS is that you place the elements of level N into a queue (as opposed to a stack). Since a queue is FIFO, you'll process all the elements of level N before any element of level N + 1. So if a solution exists at some finite depth, BFS will find it, and it will be a shortest one (although the queue might occupy a lot of memory).
The interesting part is what exactly you put into the queue and how you organize the traversal algorithm. This is something that I feel you've already solved, or at least have a direction for. So think about my previous paragraph and hopefully you'll come to the solution ;)
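For readers coming back to this after the assignment, here is one possible shape for that queue-based traversal; the state representation is my own choice, not necessarily the intended one. The observation it relies on: after any sequence of appends, one string must be a prefix of the other (otherwise the branch is dead), so a state only needs the leftover suffix (the "overhang") and which side is ahead.

import java.util.*;

// Sketch of a queue-based traversal (all names are mine). A state keeps the
// overhang and which side is ahead; the visited set prunes repeated states.
// NOTE: with inputs like [("1","111")] the overhang grows without bound, so
// the state space is infinite and extra termination logic is still needed.
class ExtremeStringBfs {
    private record State(String overhang, boolean firstAhead, List<Integer> path) {}

    static List<Integer> shortestPath(String[][] tuples) {
        Deque<State> queue = new ArrayDeque<>();   // FIFO: level N before level N + 1
        Set<String> visited = new HashSet<>();
        queue.add(new State("", true, List.of()));
        while (!queue.isEmpty()) {
            State s = queue.poll();
            if (!s.path().isEmpty() && s.overhang().isEmpty()) return s.path(); // extreme-string
            for (int i = 0; i < tuples.length; i++) {
                // The side that is ahead contributes its overhang before its new piece.
                String x = s.firstAhead() ? s.overhang() + tuples[i][0] : tuples[i][0];
                String y = s.firstAhead() ? tuples[i][1] : s.overhang() + tuples[i][1];
                String overhang;
                boolean firstAhead;
                if (x.startsWith(y)) { overhang = x.substring(y.length()); firstAhead = true; }
                else if (y.startsWith(x)) { overhang = y.substring(x.length()); firstAhead = false; }
                else continue;                     // neither is a prefix: dead branch
                if (visited.add((firstAhead ? "F:" : "S:") + overhang)) {
                    List<Integer> path = new ArrayList<>(s.path());
                    path.add(i);
                    queue.add(new State(overhang, firstAhead, path));
                }
            }
        }
        return null;  // every reachable state explored: no extreme-string exists
    }
}

On the example above, shortestPath(new String[][]{{"0","100"},{"01","00"},{"110","11"}}) comes back with [2, 1, 2, 0].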

Why store the points in a binary tree?

This question covers a software algorithm, from On topic.
I am working on an interview question from Amazon Software Question, specifically: "Given a set of points (x,y) and an integer "n", return n points which are close to the origin".
Here is the sample high-level pseudocode answer to this question, from Sample Answer:
Step 1: Design a class called point which has three fields - int x, int y, int distance
Step 2: For all the points given, find the distance between them and origin
Step 3: Store the values in a binary tree
Step 4: Heap sort
Step 5: print the first n values from the binary tree
I agree with steps 1 and 2, because it makes sense in terms of object-oriented design to have one software bundle of data, Point, encapsulate the fields x, y and distance (encapsulation).
Can someone explain the design decisions from 3 to 5?
Here's how I would do steps of 3 to 5
Step 3: Store all the points in an array
Step 4: Sort the array with respect to distance (I would use some built-in sort here, like Arrays.sort())
Step 5: With the array sorted in ascending order, I print off the first n values
Why did the author of that response use a more complicated data structure, a binary tree, and not something simpler like the array I used? I know what a binary tree is: a hierarchical data structure of nodes, each with two pointers. In his algorithm, would you have to use a BST?
First, I would not say that having Point(x, y, distance) is good design or encapsulation. distance is not really part of a point; it can be computed from x and y. In terms of design, I would rather have a function, i.e. a static method on Point or a helper class Points:
double distance(Point a, Point b)
Then for the specific question, I actually agree with your solution, to put the data in an array, sort this array and then extract the N first.
What the example may be hinting at is that heapsort often uses a binary tree structure laid out inside the array to be sorted, as explained here:
The heap is often placed in an array with the layout of a complete binary tree.
Of course, if the distance to the origin is not stored in the Point for performance reasons, it has to be put with the corresponding Point object in the array, or replaced by any information that allows getting the Point object back from its sorted distance (a reference or an index), e.g.
List<Pair<Long, Point>> distancesToOrigin = new ArrayList<>();
to be sorted with a Comparator<Pair<Long, Point>>
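Putting that together, and as one concrete reading of the heap hint (my own illustration, not necessarily what the sample answer meant): a bounded max-heap selects the n closest points without fully sorting the array. Point here is a hypothetical record.

import java.util.*;

// Keep a max-heap of the n closest points seen so far, evicting the
// farthest; this selects the n closest in O(N log n) without a full sort.
record Point(int x, int y) {
    long squaredDistanceToOrigin() { return (long) x * x + (long) y * y; }
}

class HeapSelect {
    static List<Point> nClosestToOrigin(List<Point> points, int n) {
        PriorityQueue<Point> heap = new PriorityQueue<>(
                Comparator.comparingLong(Point::squaredDistanceToOrigin).reversed());
        for (Point p : points) {
            heap.add(p);
            if (heap.size() > n) heap.poll();  // drop the farthest of the n + 1
        }
        return new ArrayList<>(heap);          // the n closest, in no particular order
    }
}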
It is not necessary to use a BST. However, it is good practice to use a BST when you need a structure that keeps itself sorted. I do not see the need to both use a BST and heapsort it (somehow). You could use just a BST and retrieve the first n points. You could also use an array, sort it and take the first n points.
If you want to sort an array of type Point, you could have Point implement the Comparable interface and override its compareTo method.
You never have to use any particular data structure; by working out what your needs are, you can easily determine the optimal structure.
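A sketch of that suggestion (a hypothetical Point; ordering by squared distance avoids the square root):

// Hypothetical Point whose natural ordering is distance from the origin,
// so a plain Arrays.sort(points) puts the closest points first.
class Point implements Comparable<Point> {
    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    long squaredDistance() { return (long) x * x + (long) y * y; }

    @Override
    public int compareTo(Point other) {
        return Long.compare(squaredDistance(), other.squaredDistance());
    }
}
// Usage: Arrays.sort(points); the first n elements are then the n closest.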
The approach described in this post is more complex than needed for such a question. As you noted, simply sorting by distance will suffice. However, to help clear up your confusion about what the sample answer's author was trying to get at, consider the k-nearest-neighbors problem, which can be solved with a k-d tree, a structure that applies space partitioning to a k-dimensional dataset. For 2-dimensional space, that is indeed a binary tree. This tree is inherently sorted and doesn't need any "heap sorting".
It should be noted that building the k-d tree will take O(n log n), and is only worth the cost if you need to do repeated nearest neighbor searches on the structure. If you only need to perform one search to find k nearest neighbors from the origin, it can be done with a naive O(n) search.
How to build a k-d tree, straight from Wiki:
One adds a new point to a k-d tree in the same way as one adds an element to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the "left" or "right" side of the splitting plane. Once you get to the node under which the child should be located, add the new point as either the left or right child of the leaf node, again depending on which side of the node's splitting plane contains the new node.
Adding points in this manner can cause the tree to become unbalanced, leading to decreased tree performance. The rate of tree performance degradation is dependent upon the spatial distribution of tree points being added, and the number of points added in relation to the tree size. If a tree becomes too unbalanced, it may need to be re-balanced to restore the performance of queries that rely on the tree balancing, such as nearest neighbour searching.
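In code, the insertion described above stays small. A minimal 2-d tree sketch (hypothetical names, no re-balancing):

// Minimal 2-d tree insertion: depth % 2 picks the splitting axis.
class KdNode {
    double[] point;       // {x, y}
    KdNode left, right;
    KdNode(double[] p) { point = p; }
}

class KdTree {
    private KdNode root;

    void insert(double[] p) {
        root = insert(root, p, 0);
    }

    // axis 0 compares x, axis 1 compares y
    private KdNode insert(KdNode node, double[] p, int depth) {
        if (node == null) return new KdNode(p);
        int axis = depth % 2;
        if (p[axis] < node.point[axis]) {
            node.left = insert(node.left, p, depth + 1);
        } else {
            node.right = insert(node.right, p, depth + 1);
        }
        return node;
    }
}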
Once you have built the tree, you can find the k nearest neighbors to some point (the origin in your case) in O(k log n) time.
Straight from Wiki:
Searching for a nearest neighbour in a k-d tree proceeds as follows:
Starting with the root node, the algorithm moves down the tree recursively, in the same way that it would if the search point were being inserted (i.e. it goes left or right depending on whether the point is lesser than or greater than the current node in the split dimension).
Once the algorithm reaches a leaf node, it saves that node point as the "current best"
The algorithm unwinds the recursion of the tree, performing the following steps at each node:
If the current node is closer than the current best, then it becomes the current best.
The algorithm checks whether there could be any points on the other side of the splitting plane that are closer to the search point than the current best. In concept, this is done by intersecting the splitting hyperplane with a hypersphere around the search point that has a radius equal to the current nearest distance. Since the hyperplanes are all axis-aligned this is implemented as a simple comparison to see whether the difference between the splitting coordinate of the search point and current node is lesser than the distance (overall coordinates) from the search point to the current best.
If the hypersphere crosses the plane, there could be nearer points on the other side of the plane, so the algorithm must move down the other branch of the tree from the current node looking for closer points, following the same recursive process as the entire search.
If the hypersphere doesn't intersect the splitting plane, then the algorithm continues walking up the tree, and the entire branch on the other side of that node is eliminated.
When the algorithm finishes this process for the root node, then the search is complete.
This is a pretty tricky algorithm that I would hate to need to describe as an interview question! Fortunately the general case here is more complex than is needed, as you pointed out in your post. But I believe this approach may be close to what your (wrong) sample answer was trying to describe.

Fast string retrieval based on metric distance in Java

Given an arbitrary string s, I would like a method to quickly retrieve all strings S ⊆ M from a large set of strings M (where |M| > 1 million), where every string in S has edit distance < t (some maximum threshold) from s.
At worst, S may be empty if no strings in M match this criteria, and at best, S = {s} (an exact match). For any case in between, I completely expect that S may be quite large.
In general, I expect to have the maximum edit distance threshold fixed (e.g., 2), and need to perform this operation very many times over arbitrary strings s, thus the need for an efficient method, as naively iterating and testing all strings would be too expensive.
While I have used edit distance as an example metric, I would like to use other metrics as well, such as the Jaccard index.
Can anyone make a suggestion about an existing Java implementation which can achieve this, or point me to the right algorithms and data structures for solving this problem?
UPDATE #1
I have since learned that metric trees are precisely the kind of structure I am after: they exploit the distance metric to organise subsets of the strings in M based on their distances from each other. Vantage-point trees, BK-trees and other similar metric tree data structures and algorithms all seem ideal for this kind of problem. Now, to find easy-to-use implementations in Java...
UPDATE #2
Using a combination of this bk-tree and this Levenshtein distance implementation, I'm successfully able to retrieve subsets against arbitrary strings from a set (M) of one million strings with retrieval times of around 10ms.
BK-trees are designed for exactly such a case. They work with any metric distance, such as Levenshtein distance or the Jaccard index.
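To make that concrete, here is a minimal BK-tree sketch over Levenshtein distance (my own illustration, not the implementations linked below): each child edge is keyed by its distance to the parent word, and by the triangle inequality a search with threshold t only needs to visit children keyed within [d - t, d + t].

import java.util.*;

// Minimal BK-tree over Levenshtein distance (illustrative sketch).
class BkTree {
    private static class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>();
        Node(String w) { word = w; }
    }

    private Node root;

    void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) return;                        // already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    // All words within distance t of the query.
    List<String> search(String query, int t) {
        List<String> result = new ArrayList<>();
        if (root != null) search(root, query, t, result);
        return result;
    }

    private void search(Node node, String query, int t, List<String> out) {
        int d = levenshtein(query, node.word);
        if (d <= t) out.add(node.word);
        // Triangle inequality: only children keyed in [d - t, d + t] can match.
        for (int k = Math.max(1, d - t); k <= d + t; k++) {
            Node child = node.children.get(k);
            if (child != null) search(child, query, t, out);
        }
    }

    // Two-row Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}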
Although I never tried it myself, it might be worth looking at a Levenshtein Automaton. I once bookmarked this article, which looks rather elaborate and provides several code snippets:
Damn Cool Algorithms: Levenshtein Automata
As already mentioned by H W, you will not be able to avoid checking each word in your dictionary. However, the automaton will speed up calculating the distance. Combine this with an efficient data structure for your dictionary (e.g. a Trie, as mentioned in the Wikipedia article), and you might be able to accelerate your current approach.
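As a sketch of that combination (names are mine, and this is the simpler row-per-trie-node trick rather than a full Levenshtein automaton): walk the dictionary trie while maintaining one row of the edit-distance table per node, and prune a branch as soon as its row minimum exceeds the threshold.

import java.util.*;

// Trie-pruned fuzzy search: shared prefixes share DP rows, and whole
// subtrees are skipped once no descendant can be within maxDist.
class TrieFuzzySearch {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        String word;                      // non-null if a dictionary word ends here
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.word = word;
    }

    List<String> search(String query, int maxDist) {
        List<String> results = new ArrayList<>();
        int[] firstRow = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); j++) firstRow[j] = j;
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            search(e.getValue(), e.getKey(), query, firstRow, maxDist, results);
        }
        return results;
    }

    private void search(Node node, char c, String query, int[] prevRow,
                        int maxDist, List<String> out) {
        int cols = query.length() + 1;
        int[] row = new int[cols];
        row[0] = prevRow[0] + 1;
        int rowMin = row[0];
        for (int j = 1; j < cols; j++) {
            int cost = query.charAt(j - 1) == c ? 0 : 1;
            row[j] = Math.min(Math.min(row[j - 1] + 1, prevRow[j] + 1), prevRow[j - 1] + cost);
            rowMin = Math.min(rowMin, row[j]);
        }
        if (node.word != null && row[cols - 1] <= maxDist) out.add(node.word);
        if (rowMin <= maxDist) {          // otherwise no descendant can match
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                search(e.getValue(), e.getKey(), query, row, maxDist, out);
            }
        }
    }
}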

Java.util.ArrayList.sort() sorting algorithm

I was looking at the source code of the sort() method of the java.util.ArrayList on grepcode. They seem to use insertion sort on small arrays (of size < 7) and merge sort on large arrays. I was just wondering if that makes a lot of difference given that they use insertion sort only for arrays of size < 7. The difference in running time will be hardly noticeable on modern machines.
I have read this in Cormen:
Although merge sort runs in O(n log n) worst-case time and insertion sort runs in O(n^2) worst-case time, the constant factors in insertion sort can make it faster in practice for small problem sizes on many machines. Thus, it makes sense to coarsen the leaves of the recursion by using insertion sort within merge sort when subproblems become sufficiently small.
If I had designed the sorting algorithm for some component I needed, I would consider using insertion sort for larger sizes (maybe up to size < 100) before the difference in running time, compared to merge sort, becomes evident.
My question is what is the analysis behind arriving at size < 7?
The difference in running time will be hardly noticeable on modern machines.
How long it takes to sort small arrays becomes very important when you realize that the overall sorting algorithm is recursive, and the small array sort is effectively the base case of that recursion.
I don't have any inside info on how the number seven got chosen. However, I'd be very surprised if that wasn't done as the result of benchmarking the competing algorithms on small arrays, and choosing the optimal algorithm and threshold based on that.
P.S. It is worth pointing out that Java 7 uses Timsort by default.
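To make the "coarsened recursion" from the Cormen quote concrete, here is a minimal hybrid sketch (my own illustration; the cutoff of 7 mirrors the threshold in the legacy java.util.Arrays merge sort, and as said above, the right value is really a benchmarking question):

import java.util.Arrays;

// Merge sort that falls back to insertion sort below a small cutoff.
class HybridSort {
    private static final int INSERTION_SORT_THRESHOLD = 7;

    static void sort(int[] a) {
        mergeSort(a, 0, a.length);
    }

    // Sorts a[lo, hi).
    private static void mergeSort(int[] a, int lo, int hi) {
        if (hi - lo <= INSERTION_SORT_THRESHOLD) {
            insertionSort(a, lo, hi);      // the "coarsened leaf" of the recursion
            return;
        }
        int mid = (lo + hi) >>> 1;
        mergeSort(a, lo, mid);
        mergeSort(a, mid, hi);
        merge(a, lo, mid, hi);
    }

    private static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i < hi; i++) {
            int key = a[i], j = i - 1;
            while (j >= lo && a[j] > key) { a[j + 1] = a[j]; j--; }
            a[j + 1] = key;
        }
    }

    private static void merge(int[] a, int lo, int mid, int hi) {
        int[] buf = Arrays.copyOfRange(a, lo, mid);   // copy of the left half
        int i = 0, j = mid, k = lo;
        while (i < buf.length && j < hi) a[k++] = (buf[i] <= a[j]) ? buf[i++] : a[j++];
        while (i < buf.length) a[k++] = buf[i++];
        // any remaining a[j, hi) elements are already in place
    }
}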
I am posting this for people who visit this thread in the future, and to document my own research. I stumbled across this excellent link in my quest to find the answer to the mystery of choosing 7:
Tim Peters’s description of the algorithm
You should read the section titled "Computing minrun".
To give the gist: minrun is the cutoff size of the array below which the algorithm should switch to insertion sort. Hence, we will always have sorted runs of size "minrun" on which we will need to run the merge operation to sort the entire array.
In java.util.ArrayList.sort(), "minrun" is chosen to be 7, but as far as my understanding of the above document goes, that busts the myth: it shows minrun should be near a power of 2, less than 256 and more than 8. Quoting from the document:
At 256 the data-movement cost in binary insertion sort clearly hurt, and at 8 the increase in the number of function calls clearly hurt. Picking some power of 2 is important here, so that the merges end up perfectly balanced (see next section).
The point I am making is that "minrun" can be any power of 2 (or near a power of 2) less than 64, without hindering the performance of Timsort.
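For reference, the "Computing minrun" section boils down to a few lines. This is the computation as given in listsort.txt (CPython cuts off at 64, while java.util.TimSort uses the same shape with MIN_MERGE = 32): take the six most significant bits of n, adding 1 if any of the remaining bits are set.

// minrun computation from listsort.txt.
class MinRun {
    static int minRunLength(int n) {
        int r = 0;              // becomes 1 if any 1 bits are shifted off
        while (n >= 64) {
            r |= n & 1;
            n >>= 1;
        }
        return n + r;
    }
}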
http://en.wikipedia.org/wiki/Timsort
"Timsort is a hybrid sorting algorithm, derived from merge sort and insertion sort, designed to perform well on many kinds of real-world data... The algorithm finds subsets of the data that are already ordered, and uses the subsets to sort the data more efficiently. This is done by merging an identified subset, called a run, with existing runs until certain criteria are fulfilled."
About number 7:
"... Also, it is seen that galloping is beneficial only when the initial element is not one of the first seven elements of the other run. This also results in MIN_GALLOP being set to 7. To avoid the drawbacks of galloping mode, the merging functions adjust the value of min-gallop. If the element is from the array currently under consideration (that is, the array which has been returning the elements consecutively for a while), the value of min-gallop is reduced by one. Otherwise, the value is incremented by one, thus discouraging entry back to galloping mode. When this is done, in the case of random data, the value of min-gallop becomes so large, that the entry back to galloping mode never takes place.
In the case where merge-hi is used (that is, merging is done right-to-left), galloping needs to start from the right end of the data, that is the last element. Galloping from the beginning also gives the required results, but makes more comparisons than required. Thus, the algorithm for galloping includes the use of a variable which gives the index at which galloping should begin. Thus the algorithm can enter galloping mode at any index and continue thereon as mentioned above, as in, it will check at the next index which is offset by 1, 3, 7,...., (2k - 1).. and so on from the current index. In the case of merge-hi, the offsets to the index will be -1, -3, -7,...."

What's the simplest algorithm/solution for a single pair shortest path through a real-weighted undirected graph?

I need to find the shortest path through an undirected graph whose nodes carry real (positive and negative) weights. These weights are like resources, which you can gain or lose by entering the node.
The total cost (resource sum) of the path isn't very important, but it must be more than 0, and the length has to be the shortest possible.
For example consider a graph like so:
A-start node; D-end node
A(+10)--B( 0 )--C(-5 )
    \     |     /
      \   |   /
D(-5 )--E(-5 )--F(+10)
The shortest path would be A-E-F-E-D
Dijkstra's algorithm alone doesn't do the trick, because it can't handle negative values. So, I thought about a few solutions:
The first one uses Dijkstra's algorithm to calculate the length of the shortest path from each node to the exit node, ignoring the weights. This could be used as a heuristic value, as in A*. I'm not sure this solution would work, and it's also very costly. I also thought about implementing the Floyd–Warshall algorithm, but I'm not sure how.
Another solution was to calculate the shortest path with Dijkstra's algorithm ignoring the weights, and if the resulting path's resource sum is less than zero, go through each node on the path to find a neighbouring node which could quickly increase the resource sum, and add it to the path (several times if needed). This solution won't work if there is a node that would be enough to increase the resource sum but lies farther away from the calculated shortest path.
For example:
A- start node; E- end node
A(+10)--B(-5 )--C(+40)
    \
D(-5 )--E(-5 )
Could you help me solve this problem?
EDIT: If, when calculating the shortest path, you reach a point where the sum of the resources is equal to zero, that path is not valid, since you can't go on if there's no more petrol.
Edit: I didn't read the question well enough; the problem is more advanced than a regular single-source shortest path problem. I'm leaving this post up for now just to give you another algorithm that you might find useful.
The Bellman-Ford algorithm solves the single-source shortest-path problem, even in the presence of edges with negative weight. However, it does not handle negative cycles (a circular path in the graph whose weights sum to a negative number). If your graph contains negative cycles, you are probably in trouble, because I believe that makes the problem NP-complete (it corresponds to the longest simple path problem).
This doesn't seem like an elegant solution, but given the ability to create cyclic paths I don't see a way around it. I would just solve it iteratively. Using the second example: start with a point at A and give it A's value. Move one 'turn': now I have two points, one at B with a value of 5 and one at D also with a value of 5. Move again: now I have 4 points to track -- C: 45, A: 15, A: 15, and E: 0. It might be that the one at E can oscillate and become valid, so we can't toss it out yet. Move and accumulate, and so on. The first time you reach the end node with a positive value you are done (though there may be additional equivalent paths that arrive on the same turn).
Obviously this is problematic in that the number of points to track rises pretty quickly, and I assume your actual graph is much more complex than the example.
I would do it similarly to what Mikeb suggested: do a breadth-first search over the graph of possible states, i.e. (position, fuel left) pairs.
Using your example graph:
[Figure omitted: the search tree over states, where octagons mark states that ran out of fuel and boxes mark nodes whose children are omitted for space reasons.]
Searching this graph breadth-first is guaranteed to give you the shortest route that actually reaches the goal if such a route exists. If it does not, you will have to give up after a while (after x nodes searched, or maybe when you reach a node with a score greater than the absolute value of all negative scores combined), as the graph can contain infinite loops.
You have to make sure not to abort immediately on finding the goal if you also want to find the cheapest path (fuel-wise), because you might find more than one path of the same length, but with different costs.
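Here is what that state search could look like in Java (a sketch; gain, adj, fuelCap and all other names are mine). BFS order makes the first time the goal is popped a fewest-edges route, and capping fuel (at, say, the sum of the absolute values of all negative weights, as suggested above) keeps the state space finite.

import java.util.*;

// BFS over (node, fuel) states; a path is valid only while fuel stays positive.
class FuelBfs {
    private record State(int node, int fuel) {}

    static List<Integer> shortestValidPath(int[] gain, List<List<Integer>> adj,
                                           int start, int goal, int fuelCap) {
        Map<State, State> parent = new HashMap<>();  // also serves as the visited set
        Deque<State> queue = new ArrayDeque<>();
        State init = new State(start, Math.min(gain[start], fuelCap));
        if (init.fuel() <= 0) return null;           // out of petrol before moving
        parent.put(init, null);
        queue.add(init);
        while (!queue.isEmpty()) {
            State s = queue.poll();
            if (s.node() == goal) {                  // first hit = fewest edges
                List<Integer> path = new ArrayList<>();
                for (State p = s; p != null; p = parent.get(p)) path.add(p.node());
                Collections.reverse(path);
                return path;
            }
            for (int next : adj.get(s.node())) {
                int fuel = Math.min(s.fuel() + gain[next], fuelCap);
                State t = new State(next, fuel);
                if (fuel > 0 && !parent.containsKey(t)) {
                    parent.put(t, s);
                    queue.add(t);
                }
            }
        }
        return null;                                 // no valid route within the fuel cap
    }
}

On the first example this revisits E with different fuel levels, which is exactly what allows the A-E-F-E-D detour to be found.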
Try adding the absolute value of the minimum weight (in this case 5) to all weights. That will avoid negative cyclic paths.
Current shortest-path algorithms calculate the shortest path to every node, because they combine partial solutions at some nodes to improve the paths at other nodes. There is no way to do it for only one node.
Good luck
