Binary Search Tree of Strings

I have a question about exactly how a binary search tree of strings works. I know and have implemented binary search trees of integers: compare the new data with the parent data, then branch left if it is less than or equal, or right if it is greater. However, I am a little confused about how to implement this with nodes of strings.
With integers or characters I can just pass an array to the insert method of the tree I programmed, and it builds the tree nodes correctly. My question is how you would do this with an array of strings. How would you get the strings to branch off correctly in the tree? For example, if I had an array of questions, how would I branch the BST correctly so that I would eventually get to the correct answer?
For example look at the following trivial tree example.
                                   land animal?
              have tentacles?-----------^-----------indoor animal
     have claws?-----^-----jellyfish         live in jungle?-----^-----does it bark?
eat plankton?---^---lobster                  bear---^---lion           cat---^---dog
shark---^---whale
How would you populate a tree such as this so that the nodes end up where you want them? I am trying to make a BST for troubleshooting, and I am confused about how to populate the nodes of strings so they appear in the correct positions. Do you need to hard-code the nodes?

Update 2, to build a binary decision tree:
A binary decision tree can be thought of as a bunch of questions that yield boolean responses about facets of leaf nodes: the facet either exists / holds true or it does not. That is, for every descendant of a particular node/edge we must be able to say "this question/answer holds" (answers can be "true" or "false"). For instance, a bark is a facet of a (normal) dog, but tentacles are not a facet of a whale. In the presented tree, the false edge always leads to the left subtree: this is a convention to avoid labeling each edge with true/false or Y/N.
The tree can only be built from existing/external knowledge that allows one to answer each question for every animal.
Here is a rough algorithm that can be used to build such a tree:
Start with a set of possible animals, call this A, and a set of questions, call this Q.
Pick a question, q, from Q for which count(True(q, a in A)) is closest to that of count(False(q, a in A)) - if the resulting tree is a balanced binary tree these counts will always be equal for the best question to ask.
Remove q from Q and use it as the question to ask for the current node. Put all False(q,a) into the set of animals (A') available to the left child node and put all True(q,a) into the set of animals (A'') available to the right child node.
Following each edge/branch (false=left, true=right), pick a suitable question from the remaining Q and repeat (using A' or A'' for A, as appropriate).
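The steps above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name DecisionTreeSketch, the yesAnimals map (question -> set of animals for which the answer is true) and the classify helper are hypothetical names introduced here, and the external knowledge is assumed to be supplied up front in that map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the rough algorithm: pick the most evenly splitting question, then recurse.
class DecisionTreeSketch {
    static final class Node {
        final String label; // a question for inner nodes, an animal name for leaves
        Node no;            // false = left
        Node yes;           // true = right
        Node(String label) { this.label = label; }
        boolean isLeaf() { return no == null && yes == null; }
    }

    // yesAnimals is the assumed external knowledge: question -> animals answering "true".
    static Node build(List<String> animals, List<String> questions, Map<String, Set<String>> yesAnimals) {
        if (animals.isEmpty()) return null;
        if (animals.size() == 1) return new Node(animals.get(0));

        // Pick q whose true-count is closest to its false-count.
        String best = null;
        int bestGap = Integer.MAX_VALUE;
        for (String q : questions) {
            int yes = 0;
            for (String a : animals) {
                if (yesAnimals.getOrDefault(q, Set.of()).contains(a)) yes++;
            }
            int no = animals.size() - yes;
            if (yes == 0 || no == 0) continue; // this question separates nothing here
            int gap = Math.abs(yes - no);
            if (gap < bestGap) { bestGap = gap; best = q; }
        }
        // No remaining question distinguishes these animals: keep them together in one leaf.
        if (best == null) return new Node(String.join(" or ", animals));

        // Split A into A' (false, left) and A'' (true, right).
        Set<String> yesSet = yesAnimals.get(best);
        List<String> falseSet = new ArrayList<>();
        List<String> trueSet = new ArrayList<>();
        for (String a : animals) {
            if (yesSet.contains(a)) trueSet.add(a); else falseSet.add(a);
        }
        List<String> remaining = new ArrayList<>(questions);
        remaining.remove(best); // remove q from Q

        Node node = new Node(best);
        node.no = build(falseSet, remaining, yesAnimals);
        node.yes = build(trueSet, remaining, yesAnimals);
        return node;
    }

    // Walks the tree; yesAnswers holds the questions the user answered "true".
    static String classify(Node root, Set<String> yesAnswers) {
        Node cur = root;
        while (!cur.isLeaf()) cur = yesAnswers.contains(cur.label) ? cur.yes : cur.no;
        return cur.label;
    }
}
```

Calling build with the animals, the questions and the facts table returns the root question; classify then follows false=left / true=right edges until it reaches a leaf.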
(Of course, there are many more complete/detailed/accurate resources found online as course material or whitepapers. Not to mention a suitable selection of books at most college campuses ..)
Update, for a [binary] decision tree:
In this particular case (which is clear with the added diagram) the graph is based on the "yes" or "no" response to each question, and these responses are the edges between the nodes. That is, the tree is not built using an ordering of the string values themselves. In this case it might make sense to always have the left branch be "false" and the right branch be "true", although each node could have more edges/children if non-binary responses are allowed.
The decision tree must be "trained" (google search). That is, the graph must be built initially based on the questions/responses which is unlike a BST that is based merely on ordering between nodes. The initial graph building cannot be done from merely an array of questions as the edges do not follow an intrinsic ordering.
Initial response, for a binary search tree:
The same way it does for integers: the algorithm does not change.
Consider a function, compareTo(a,b) that will return -1, 0 or 1 for a < b, a == b, and a > b, respectively.
The type of a and b does not matter (as long as both have the same type and that type supports ordering) when implementing a function with this contract: the comparison is "raw" for integers and uses the host language's corresponding string comparison for string types.
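As a minimal sketch of that point in Java, the insert logic below is the integer version with the comparison swapped for String.compareTo. The class name StringBst and the inOrder helper are illustrative names, not anything from the question. Note that String.compareTo only promises a negative, zero or positive result, not exactly -1, 0 or 1, so the code tests the sign.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a BST keyed by strings; StringBst is an illustrative name.
class StringBst {
    private static final class Node {
        final String value;
        Node left;
        Node right;
        Node(String value) { this.value = value; }
    }

    private Node root;

    // Same rule as the integer version: <= parent goes left, > parent goes right.
    void insert(String value) {
        root = insert(root, value);
    }

    private static Node insert(Node node, String value) {
        if (node == null) return new Node(value);
        if (value.compareTo(node.value) <= 0) {
            node.left = insert(node.left, value);
        } else {
            node.right = insert(node.right, value);
        }
        return node;
    }

    boolean contains(String value) {
        Node cur = root;
        while (cur != null) {
            int cmp = value.compareTo(cur.value);
            if (cmp == 0) return true;
            cur = cmp < 0 ? cur.left : cur.right;
        }
        return false;
    }

    // In-order traversal yields the strings in lexicographic order.
    List<String> inOrder() {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        if (node == null) return;
        collect(node.left, out);
        out.add(node.value);
        collect(node.right, out);
    }
}
```

Inserting an array of strings one by one works exactly as it does for integers; the ordering is lexicographic, which is why it cannot express the yes/no structure of the decision tree discussed above.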

Should I test an exact result of algorithm or just test some elements of the result?

I've written a BFS algorithm and I'd like to test the algorithm.
I've written tests in 2 approaches because I realized that for example the way of storing adjacent vertices may change and the order will be different so the result will be different but not necessarily incorrect.
Test of full path:
@Test
void traverse_UndirectedGraph_CommonTraverse() {
    BreadthFirstSearchTest<String> breadthFirstSearch = new BreadthFirstSearchTest<>(undirectedGraph);
    assertIterableEquals(Lists.newArrayList("A", "B", "E"), breadthFirstSearch.traverse("A", "E"));
}
Test if the path contains an initial vertex and a terminal vertex:
@Test
void traverse_UndirectedGraph_CommonTraverse() {
    BreadthFirstSearchTest<String> breadthFirstSearch = new BreadthFirstSearchTest<>(undirectedGraph);
    List<String> path = breadthFirstSearch.traverse("A", "E");
    assertEquals("A", path.get(0));
    assertEquals("E", path.get(path.size() - 1));
}
Is either of these two approaches correct?
If not, how would you test such an algorithm?

Is either of these two approaches correct?
Probably. But that is hard to say without exactly understanding your full requirements and your context, like the classes/data structures your search is relying on.
If not, how would you test such an algorithm?
I would follow TDD.
Meaning: you start with writing tests first.
To be precise:
1. you write one simple test
2. you ensure the test fails
3. you then write just enough "production" code so that your test passes
4. you maybe refactor your code base (to improve its quality)
5. go back to step 1
In other words: you develop your algorithm while gradually walking from small, simple tests, to more advanced scenarios.
Beyond that, you can also look at this from a true "tester" perspective. Meaning: you totally ignore the implementation. Instead, you look at the problem, and at the contract that the production code should follow. And then you try to find examples for all important cases, and most importantly: edge cases. You write those down, and then run them against your implementation.
(Most likely, your two test cases are way too simple, and you would need many more.)

For cases that have at most one traversal, the simplest verification is by exact match.
For cases that have multiple valid traversals, verification could be either by matching against the enumerated traversals, or by verifying the breadth-first property of the traversal.
For cases which have many valid traversals, verifying the breadth-first property seems to be necessary.
Working from the problem as stated:
Key features are that the graph is un-directed, and the search is breadth first.
No other characteristics of the graph are specified. The graph is assumed to possibly have cycles, and is assumed to not necessarily be connected. For simplicity, at most one edge is present between nodes, and no edge is present from a node to itself.
As basics, a traversal which is obtained by a breadth first search must be a subgraph of the searched graph. That is, each edge of the traversal must be an edge of the searched graph. Also, the initial node of the traversal must be the beginning node of the search and the final node of the traversal must be the target node.
In each case, the search must not get into an infinite loop, and must obtain a breadth first result. Or, must indicate that a traversal is not possible.
Testing should demonstrate a variety of cases, for example, traversal of a list, a tree, a loop, a bipartite graph, or of a complete graph.
One test methodology builds a collection of test graphs (enumerating at least the variety of cases described above), and builds a collection of test cases for each of the graphs. The test cases would supply the initial and final nodes of the case, and would supply the collection of valid traversals.
Supplying the collection of valid results is easy if there is zero, one, or perhaps a handful of valid paths. For particular graphs, there can be many traversals, and as an alternative there might need to be a way to verify the "breadth-first-ness" of a traversal, as opposed to enumerating the possible traversals.
For example:
A <-> B1, B2 <-> C1, C2 <-> D1, D2 <-> E
Here A <-> B1, B2 means that there is an edge between A and B1 and an edge between A and B2. Similarly, B1, B2 <-> C1, C2 represents the complete bipartite graph of B1 and B2 with C1 and C2.
There are eight valid breadth-first traversals from A to E.
There are traversals which are not valid breadth first traversals, for example:
( A, B1, C1, B2, C2, D1, E )
Also for example, for the simple graph:
A <-> B, C
B <-> C
A breadth first traversal from A to C must yield ( A, C ) and not ( A, B, C ). A depth first traversal may obtain either ( A, C ) or ( A, B, C ) depending on whether the traversal steps from A to B first, or steps from A to C first.
One characterization is this: if minimum distances are assigned to nodes, starting with the initial node of the traversal, then a breadth-first traversal must never step from a node to a node that is closer to the initial node.
Labeling the first example with distances gives:
A(0) <-> B1(1), B2(1) <-> C1(2), C2(2) <-> D1(3), D2(3) <-> E(4)
Similarly labeling the second candidate traversal gives:
( A(0), B1(1), C1(2), B2(1), C2(2), D1(3), E(4) )
This is not a valid breadth-first traversal because the edge C1 -> B2 decreases the distance from the initial node.
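A property check along these lines could look like the sketch below. It is an assumption-laden illustration, not the asker's code: BfsPathCheck, distances and isBreadthFirstPath are hypothetical names, the graph is assumed to be an adjacency map of strings, and the check is deliberately stricter than the characterization above. It requires every step to follow an edge and to increase the minimum distance from the start by exactly one, which rejects both the backward step C1 -> B2 and the ( A, B, C ) path in the triangle example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical verifier for the "breadth-first-ness" of a returned path.
class BfsPathCheck {
    // Minimum number of edges from start to every reachable node (plain BFS).
    static Map<String, Integer> distances(Map<String, List<String>> graph, String start) {
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            for (String next : graph.getOrDefault(node, List.of())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(node) + 1);
                    queue.add(next);
                }
            }
        }
        return dist;
    }

    // True if path starts at start, ends at target, uses only graph edges,
    // and every step moves exactly one level further from start.
    static boolean isBreadthFirstPath(Map<String, List<String>> graph, List<String> path,
                                      String start, String target) {
        if (path.isEmpty()) return false;
        if (!path.get(0).equals(start)) return false;
        if (!path.get(path.size() - 1).equals(target)) return false;

        Map<String, Integer> dist = distances(graph, start);
        for (int i = 1; i < path.size(); i++) {
            String from = path.get(i - 1);
            String to = path.get(i);
            if (!graph.getOrDefault(from, List.of()).contains(to)) return false; // not an edge
            Integer d = dist.get(to);
            if (d == null || d != dist.get(from) + 1) return false; // sideways or backward step
        }
        return true;
    }
}
```

A test can then assert isBreadthFirstPath(graph, result, "A", "E") instead of comparing against one fixed list, so it keeps passing if the order of adjacent vertices changes.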

Generating hierarchy from flat data

I need to generate a hierarchy from flat data. This question is not for homework or for an interview test, although I imagine it'd make a good example for either one. I've seen this and this and this, none of them exactly fit my situation.
My data is as follows. I have a list of objects. Each object has a breadcrumb and a text. Examples are below:
Object 1:
---------
breadcrumb: [Person, Manager, Hourly, New]
text: hello world
Object 2:
---------
breadcrumb: [Person, Manager, Salary]
text: hello world again
And I need to convert that to a hierarchy:
Person
|--Manager
   |--Hourly
      |--New
         |--hello world
   |--Salary
      |--hello world again
I'm doing this in Java but any language would work.

You need a trie data structure, where each Node holds its children in a
List<Node>
1. The trie itself should contain one Node, the root, which is initially empty.
2. When a new sequence arrives, iterate over its items, looking for the matching value among the children of the current Node and moving forward whenever a matching item is found. This finds the longest prefix already in the trie that is shared with the given sequence.
3. If the longest common prefix doesn't cover the entire sequence, use the remaining items to build a chain of nodes in which each node has only one child (the next item), and attach it as a child of the node at which you stopped in step (2).
As you can see, this is not entirely trivial, and the implementation code can be long and not obvious. Unfortunately, the JDK doesn't have a standard trie implementation, but you can look for an existing one or write your own.
For more details, see https://en.wikipedia.org/wiki/Trie#Algorithms
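For the breadcrumb data in the question, a compact sketch of this idea might look as follows. BreadcrumbTrie, childNamed and render are hypothetical names, and storing the text as one more leaf child is an assumption taken from the hierarchy shown in the question. The prefix search and the chain building are folded into one loop: each crumb is either found among the current children or created.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical trie over breadcrumb items; each node keeps its children in a List<Node>.
class BreadcrumbTrie {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }

        // Linear scan of the children; returns null when no child has that name.
        Node childNamed(String wanted) {
            for (Node child : children) {
                if (child.name.equals(wanted)) return child;
            }
            return null;
        }
    }

    final Node root = new Node(""); // empty root

    void add(List<String> breadcrumb, String text) {
        Node cur = root;
        for (String crumb : breadcrumb) {
            Node next = cur.childNamed(crumb);
            if (next == null) {            // past the longest existing prefix: extend the chain
                next = new Node(crumb);
                cur.children.add(next);
            }
            cur = next;
        }
        cur.children.add(new Node(text));  // assumption: the text hangs off the last crumb as a leaf
    }

    // Renders the hierarchy with two spaces of indentation per level.
    String render() {
        StringBuilder sb = new StringBuilder();
        for (Node child : root.children) render(child, 0, sb);
        return sb.toString();
    }

    private static void render(Node node, int depth, StringBuilder sb) {
        sb.append("  ".repeat(depth)).append(depth == 0 ? "" : "|--").append(node.name).append('\n');
        for (Node child : node.children) render(child, depth + 1, sb);
    }
}
```

Adding both example objects produces a single Person root whose Manager child has the Hourly and Salary branches. A map from name to child would make lookups faster than the list scan, at the cost of losing insertion order unless a LinkedHashMap is used.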

Why store the points in a binary tree?

This question covers a software algorithm, so it should be on topic.
I am working on an Amazon software interview question,
specifically "Given a set of points (x,y) and an integer "n", return n number of points which are close to the origin"
Here is the sample high-level pseudocode answer to this question:
Step 1: Design a class called point which has three fields - int x, int y, int distance
Step 2: For all the points given, find the distance between them and origin
Step 3: Store the values in a binary tree
Step 4: Heap sort
Step 5: print the first n values from the binary tree
I agree with steps 1 and 2 because it makes sense in terms of object-oriented design to have one bundle of data, Point, encapsulate the fields x, y and distance.
Can someone explain the design decisions from 3 to 5?
Here's how I would do steps of 3 to 5
Step 3: Store all the points in an array
Step 4: Sort the array with respect to distance (I use a built-in sort here, like Arrays.sort)
Step 5: With the array sorted in ascending order, I print off the first n values
Why did the author of that response use a more complicated data structure, a binary tree, and not something simpler like the array that I used? I know what a binary tree is: a hierarchical data structure of nodes with two pointers. In his algorithm, would you have to use a BST?

First, I would not say that having Point(x, y, distance) is good design or encapsulation. distance is not really part of a point; it can be computed from x and y. In terms of design, I would rather have a function, i.e. a static method on Point or on a helper class Points:
double distance(Point a, Point b)
Then for the specific question, I actually agree with your solution, to put the data in an array, sort this array and then extract the N first.
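A minimal sketch of that array-and-sort approach, assuming integer coordinates and a hypothetical ClosestPoints helper class (none of these names come from the question). It compares squared distances, which gives the same order as the true distance and avoids the square root.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: sort by distance to the origin, then take the first n.
class ClosestPoints {
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }

        // The distance is derived from x and y rather than stored as a field.
        long squaredDistanceToOrigin() {
            return (long) x * x + (long) y * y;
        }
    }

    static List<Point> closest(List<Point> points, int n) {
        List<Point> sorted = new ArrayList<>(points); // leave the caller's list untouched
        sorted.sort(Comparator.comparingLong(Point::squaredDistanceToOrigin));
        int count = Math.max(0, Math.min(n, sorted.size()));
        return new ArrayList<>(sorted.subList(0, count));
    }
}
```

This is O(m log m) for m points, which is usually fine for an interview-sized answer.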
What the example may be hinting at is that heapsort often uses a binary tree structure laid out inside the array being sorted, as explained here:
The heap is often placed in an array with the layout of a complete binary tree.
Of course, if the distance to the origin is not stored in the Point, then for performance reasons it has to be kept alongside the corresponding Point object in the array, together with anything that allows the Point to be recovered from the sorted distance (a reference or an index), e.g.
List<Pair<Long, Point>> distancesToOrigin = new ArrayList<>();
to be sorted with a Comparator<Pair<Long, Point>>

It is not necessary to use BST. However, it is a good practice to use BST when needing a structure that is self-sorted. I do not see the need to both use BST and heapsort it (somehow). You could use just BST and retrieve the first n points. You could also use an array, sort it and use the first n points.
If you want to sort an array of type Point, Point could implement the Comparable<Point> interface and provide its compareTo method.
You never have to choose any data structures, but by determining the needs you have, you would also easily determine the optimum structure.

The approach described in this post is more complex than needed for such a question. As you noted, simple sorting by distance will suffice. However, to help explain your confusion about what your sample answer author was trying to get at, maybe consider the k nearest neighbors problem which can be solved with a k-d tree, a structure that applies space partitioning to the k-d dataset. For 2-dimensional space, that is indeed a binary tree. This tree is inherently sorted and doesn't need any "heap sorting."
It should be noted that building the k-d tree will take O(n log n), and is only worth the cost if you need to do repeated nearest neighbor searches on the structure. If you only need to perform one search to find k nearest neighbors from the origin, it can be done with a naive O(n) search.
How to build a k-d tree, straight from Wiki:
One adds a new point to a k-d tree in the same way as one adds an element to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the "left" or "right" side of the splitting plane. Once you get to the node under which the child should be located, add the new point as either the left or right child of the leaf node, again depending on which side of the node's splitting plane contains the new node.
Adding points in this manner can cause the tree to become unbalanced, leading to decreased tree performance. The rate of tree performance degradation is dependent upon the spatial distribution of tree points being added, and the number of points added in relation to the tree size. If a tree becomes too unbalanced, it may need to be re-balanced to restore the performance of queries that rely on the tree balancing, such as nearest neighbour searching.
Once you have built the tree, you can find k nearest neighbors to some point (the origin in your case) in O(k log n) time.
Straight from Wiki:
Searching for a nearest neighbour in a k-d tree proceeds as follows:
Starting with the root node, the algorithm moves down the tree recursively, in the same way that it would if the search point were being inserted (i.e. it goes left or right depending on whether the point is lesser than or greater than the current node in the split dimension).
Once the algorithm reaches a leaf node, it saves that node point as the "current best"
The algorithm unwinds the recursion of the tree, performing the following steps at each node:
If the current node is closer than the current best, then it becomes the current best.
The algorithm checks whether there could be any points on the other side of the splitting plane that are closer to the search point than the current best. In concept, this is done by intersecting the splitting hyperplane with a hypersphere around the search point that has a radius equal to the current nearest distance. Since the hyperplanes are all axis-aligned this is implemented as a simple comparison to see whether the difference between the splitting coordinate of the search point and current node is lesser than the distance (overall coordinates) from the search point to the current best.
If the hypersphere crosses the plane, there could be nearer points on the other side of the plane, so the algorithm must move down the other branch of the tree from the current node looking for closer points, following the same recursive process as the entire search.
If the hypersphere doesn't intersect the splitting plane, then the algorithm continues walking up the tree, and the entire branch on the other side of that node is eliminated.
When the algorithm finishes this process for the root node, then the search is complete.
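The search steps quoted above can be sketched for the 2-dimensional case as follows. This is a simplified illustration under several assumptions: KdTreeSketch is a hypothetical class, the tree is built by plain insertion with no re-balancing, the splitting axis alternates between x and y by depth, and only the single nearest point is returned. Finding k neighbours would need a bounded collection of current bests instead of one.

```java
// Hypothetical 2-d tree with insertion and single nearest-neighbour search.
class KdTreeSketch {
    private static final class Node {
        final double[] point; // {x, y}
        Node left;
        Node right;
        Node(double[] point) { this.point = point; }
    }

    private Node root;

    void insert(double x, double y) {
        root = insert(root, new double[] {x, y}, 0);
    }

    private static Node insert(Node node, double[] p, int depth) {
        if (node == null) return new Node(p);
        int axis = depth % 2; // 0 splits on x, 1 splits on y
        if (p[axis] < node.point[axis]) {
            node.left = insert(node.left, p, depth + 1);
        } else {
            node.right = insert(node.right, p, depth + 1);
        }
        return node;
    }

    // Returns the stored point closest to (x, y), or null for an empty tree.
    double[] nearest(double x, double y) {
        return nearest(root, new double[] {x, y}, 0, null);
    }

    private static double[] nearest(Node node, double[] target, int depth, double[] best) {
        if (node == null) return best;
        int axis = depth % 2;
        double diff = target[axis] - node.point[axis];
        Node near = diff < 0 ? node.left : node.right; // the side an insert would take
        Node far = diff < 0 ? node.right : node.left;

        // Descend first, then compare the current node while unwinding.
        best = nearest(near, target, depth + 1, best);
        if (best == null || squaredDistance(node.point, target) < squaredDistance(best, target)) {
            best = node.point;
        }
        // Visit the other side only if the splitting line is nearer than the current best.
        if (diff * diff < squaredDistance(best, target)) {
            best = nearest(far, target, depth + 1, best);
        }
        return best;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }
}
```

The comparison diff * diff < squaredDistance(best, target) is the axis-aligned form of the hypersphere test: it works on squared values, so no square root is needed.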
This is a pretty tricky algorithm that I would hate to need to describe as an interview question! Fortunately the general case here is more complex than is needed, as you pointed out in your post. But I believe this approach may be close to what your (wrong) sample answer was trying to describe.

loading binary tree from string (parent-left-right)

I have a binary tree that looks like this
the object that represents it looks like this (java)
public class TreeNode {
private String value = "";
private TreeNode aChild;
private TreeNode bChild;
....
}
I want to read the data and build the tree from a string.
So I wrote some small method to serialize it and I have it like this
(parent-left-right)
0,null,O#1,left,A#2,left,C#3,left,D#4,left,E#4,right,F#1,right,B#
Then I read it and I have it as a list - objects in this order O,A,C,D,E,F,B
And now my question is: how do I build the tree?
By iterating and putting the nodes on a stack or a queue?
Should I serialize in a different order?
(Basically, I want to learn the best practices for building a tree from string data.)
Can you refer me to a link on that subject?

Given your second string representation, there is no way to retrieve the original tree. So unless any tree with that sequence is acceptable, you'll have to include more information in your string. One possible way would be to represent null references in some fashion. Another would be to use parentheses or similar.
Given your first representation, restoring the data is still possible. One algorithm exploiting the level information would be the following:
Maintain a reference x to the current position in your tree
For every node n you want to add, move that reference x up in your tree as long as the level of x is no less than the level of n
Check that now the level of x is exactly one less than the level of n
Make x the parent of n, and n the next child of x
Move x to now point at n
This works if you have parent links in your nodes. If you don't, then you can maintain a list of the most recent node for every level. x would then correspond to the last element of that list, and moving x up the tree would mean removing the last element from the list. The level of x would be the length of the list.
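The list-based variant can be sketched like this. It is an illustration with assumptions spelled out: TreeLoader and parse are hypothetical names, each record is assumed to be "level,side,value" terminated by '#' exactly as in the question's string, and the list holds the most recent node for every level.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical loader for records of the form "level,side,value#".
class TreeLoader {
    static final class TreeNode {
        final String value;
        TreeNode aChild; // left child
        TreeNode bChild; // right child
        TreeNode(String value) { this.value = value; }
    }

    static TreeNode parse(String data) {
        List<TreeNode> path = new ArrayList<>(); // path.get(i) = most recent node at level i
        TreeNode root = null;
        for (String record : data.split("#")) {
            if (record.isEmpty()) continue;
            String[] parts = record.split(",", 3);
            int level = Integer.parseInt(parts[0]);
            String side = parts[1];
            TreeNode node = new TreeNode(parts[2]);

            // Move up: drop every remembered node at this level or deeper.
            while (path.size() > level) path.remove(path.size() - 1);
            // The remaining last element must sit exactly one level above the new node.
            if (path.size() != level) {
                throw new IllegalArgumentException("level skipped at record: " + record);
            }

            if (level == 0) {
                root = node;
            } else {
                TreeNode parent = path.get(level - 1);
                if (side.equals("left")) parent.aChild = node; else parent.bChild = node;
            }
            path.add(node); // the new node becomes the current position
        }
        return root;
    }
}
```

Because the records are in parent-left-right order, a single pass is enough and no parent links are needed in the nodes.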

Your serialization is not well explained, especially regarding how you represent missing nodes. There are several ways, such as representing the tree structure with ()s or by using the binary tree in an array technique. Both of these can be serialized easily. Take a look at Efficient Array Storage for Binary Tree for further explanations.

Using Binary Trees to find Anagrams

I am currently trying to create a method that uses a binary tree that finds anagrams of a word inputted by the user.
If the tree does not contain any other anagram for the word (i.e., if the key was not in the tree or the only element in the associated linked list was the word provided by the user), the message "no anagram found " gets printed
For example, if key "opst" appears in the tree with an associated linked list containing the words "spot", "pots", and "tops", and the user gave the word "spot", the program should print "pots" and "tops" (but not spot).
public boolean find(K thisKey, T thisElement){
return find(root, thisKey, thisElement);
}
public boolean find(Node current, K thisKey, T thisElement){
if (current == null)
return false;
else{
int comp = current.key.compareTo(thisKey);
if (comp>0)
return find(current.left, thisKey, thisElement);
else if(comp<0)
return find(current.right, thisKey, thisElement);
else{
return current.item.find(thisElement);
}
}
}
While I created this method to find if the element provided is in the tree (and the associated key), I was told not to reuse this code for finding anagrams.
K is a generic type that extends Comparable and represents the Key, T is a generic type that represents an item in the list.
If extra methods I've done are required, I can edit this post, but I am absolutely lost. (Just need a pointer in the right direction)

It's a little unclear what exactly is tripping you up (beyond "I've written a nice find method but am not allowed to use it."), so I think the best thing to do is start from the top.
I think you will find that once you get your data structured in just the right way, the actual algorithms will follow relatively easily (many computer science problems share this feature.)
You have three things:
1) Many linked lists, each of which contains the set of anagrams of some set of letters. I am assuming you can generate these lists as you need to.
2) A binary tree that maps Strings (keys) to lists of anagrams generated from those strings. Again, I'm assuming that you are able to perform basic operations on these trees: adding elements, finding elements by key, etc.
3) A user-inputted String.
Insight: The anagrams of a group of letters form an equivalence class. This means that any member of an anagram list can be used as the key associated with the list. Furthermore, it means that you don't need to store in your tree multiple keys that point to the same list (provided that we are a bit clever about structuring our data; see below).
In concrete terms, there is no need to have both "spot" and "opts" as keys in the tree pointing to the same list, because once you can find the list using any anagram of "spot", you get all the anagrams of "spot".
Structuring your data cleverly: Given our insight, assume that our tree contains exactly one key for each unique set of anagrams. So "opts" maps to {"opts", "pots", "spot", etc.}. What happens if our user gives us a String that we're not using as the key for its set of anagrams? How do we figure out that if the user types "spot", we should find the list that is keyed by "opts"?
The answer is to normalize the data stored in our data structures. This is a computer-science-y way of saying that we enforce arbitrary rules about how we store the data. (Normalizing data is a useful technique that appears repeatedly in many different computer science domains.) The first rule is that we only ever have ONE key in our tree that maps to a given linked list. Second, what if we make sure that each key we actually store is predictable--that is we know that we should search for "opts" even if the user types "spot"?
There are many ways to achieve this predictability. One simple one is to make sure that the letters of every key are in alphabetical order, so the set containing "spot" is keyed by "opst" rather than by a member such as "opts". The sorted string is unique for each set of anagrams, even though it need not be a word itself. Consistently enforcing this rule makes it easy to search the tree: no matter what string the user gives us, the key we want is the string formed from alphabetizing the user's input.
Putting it together: I'll provide the high-level algorithm here to make this a little more concrete.
1) Get a String from the user (hold on to this String, we'll need it later)
2) Turn this string into a search key that follows our normalization scheme
(You can do this in the constructor of your "K" class, which ensures that you will never have a non-normalized key anywhere in your program.)
3) Search the tree for that key, and get the linked list associated with it. This list contains every anagram of the user's input String.
4) Print every item in the list that isn't the user's original string (see why we kept the string handy?)
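The four steps can be sketched as below. This is not the asker's tree: java.util.TreeMap (a red-black tree) stands in for the hand-written binary tree, and AnagramIndex, normalize and anagramsOf are hypothetical names. Lower-casing the input is an extra assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical index: sorted letters -> linked list of words with exactly those letters.
class AnagramIndex {
    private final TreeMap<String, List<String>> tree = new TreeMap<>();

    // Step 2: the normalized key is the word's letters in alphabetical order.
    static String normalize(String word) {
        char[] letters = word.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    void add(String word) {
        tree.computeIfAbsent(normalize(word), key -> new LinkedList<>()).add(word);
    }

    // Steps 3 and 4: look up the list by normalized key, then skip the user's own word.
    List<String> anagramsOf(String word) {
        List<String> result = new ArrayList<>();
        for (String candidate : tree.getOrDefault(normalize(word), List.of())) {
            if (!candidate.equals(word)) result.add(candidate);
        }
        return result;
    }
}
```

An empty result corresponds to the "no anagram found" case, covering both a missing key and a list that contains only the user's word.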
Takeaways:
Frequently, your data will have some special features that allow you to be clever. In this case it is the fact that any member of an anagram list can be the sole key we store for that list.
Normalizing your data gives you predictability and allows you to reason about it effectively. How much more difficult would the "find" algorithm be if each key could be an arbitrary member of its anagram list?
Corollary: Getting your data structures exactly right (What am I storing? How are the pieces connected? How is it represented?) will make it much easier to write your algorithms.

What about sorting the characters in each word, and then comparing the sorted results?
