Easy DAWG creation algorithm?


I need to create a DAWG (http://en.wikipedia.org/wiki/Directed_acyclic_word_graph) structure for my Scrabble player given the word list in a file. I'm using Java. I need to do it only once and then store it in a file or files. I've seen so far 2 approaches: 1) build a Trie and reduce it to a DAWG or 2) build a DAWG right away. Since I need to do it only once I guess I just want the easiest algorithm to implement that does it. Speed and memory requirements don't matter.
Also I want to know how I should store the structure in memory at runtime and how I should save it in a file? The DAWG is basically a graph which suggests using some nodes and edges/pointers of some very simple classes written by me but I saw implementations using array and offsets (in this array) which seems complicated and illegible. This time I care both about memory size (at runtime and of the saved file) and speed of loading the DAWG/using the DAWG.

The easiest and most efficient DAWG construction algorithm is defined in this paper, and requires the set of words the DAWG is to represent to be sorted. Given that you plan on constructing a DAWG from a pre-existing word list, that list may already be sorted, or can be for this purpose.
I've cursorily transcribed the pseudocode of the algorithm in a more "programmer-friendly" format than that in which it is given in the paper (disclaimer: I may have made some transcription errors; you should probably take a look at the original to determine if there are any) :
Given:
    startState is the state from which traversal of the DAWG is to start
    register is a map of representations (hint: hashes) OF graphs which extend
        from states in the DAWG TO said states
While there is newWord in wordList
    Get newWord from wordList
    Determine longestPrefix of newWord, starting from startState, which already exists in DAWG
    Get longestPrefixEndState, the state which the sequence of transitions defined by longestPrefix leads to
    Get suffix of newWord, the substring of newWord after longestPrefix
    if longestPrefixEndState has children
        replace_or_register(longestPrefixEndState)
    endIf
    Create the sequence of transitions and states defined by suffix, starting from longestPrefixEndState
endWhile
replace_or_register(startState)
function replace_or_register(argState)
    Get greatestCharEndState of argState, the state which the lexicographically-greatest-char-labelled-transition in the outgoing transition set of argState leads to
    if greatestCharEndState has children
        replace_or_register(greatestCharEndState)
    endIf
    if there exists state in DAWG that is in the register and is equivalent (has an identical graph extending from it) to greatestCharEndState
        Redefine the transition that extends from argState to greatestCharEndState, as one that extends from argState to state
        Delete greatestCharEndState
    else
        add greatestCharEndState to the register
    endIf
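A minimal Java sketch of the pseudocode above, assuming the word list is already sorted. The names DawgBuilder, State, contains and stateCount are invented for this illustration and are not taken from the paper or from MDAG; the equals/hashCode shortcut (comparing children by identity) relies on the assumption that children are always registered before their parent is looked up.

```java
import java.util.*;

class DawgBuilder {
    static final class State {
        final TreeMap<Character, State> edges = new TreeMap<>();
        boolean accepting;

        // Two states are equivalent if they agree on "accepting" and have the same
        // labelled edges to the SAME child objects. Identity is enough for children
        // because they have already been replaced by their registered representative.
        @Override public boolean equals(Object o) {
            if (!(o instanceof State)) return false;
            State s = (State) o;
            if (accepting != s.accepting || edges.size() != s.edges.size()) return false;
            for (Map.Entry<Character, State> e : edges.entrySet()) {
                if (s.edges.get(e.getKey()) != e.getValue()) return false;
            }
            return true;
        }

        @Override public int hashCode() {
            int h = accepting ? 1 : 0;
            for (Map.Entry<Character, State> e : edges.entrySet()) {
                h = 31 * h + e.getKey() * 17 + System.identityHashCode(e.getValue());
            }
            return h;
        }
    }

    final State start = new State();
    private final Map<State, State> register = new HashMap<>();

    // Assumption: sortedWords is in ascending lexicographic (char) order.
    static DawgBuilder build(List<String> sortedWords) {
        DawgBuilder d = new DawgBuilder();
        for (String w : sortedWords) d.add(w);
        if (!d.start.edges.isEmpty()) d.replaceOrRegister(d.start);
        return d;
    }

    private void add(String word) {
        State s = start;
        int i = 0;
        // Follow the longest prefix of the word that already exists.
        while (i < word.length() && s.edges.containsKey(word.charAt(i))) {
            s = s.edges.get(word.charAt(i++));
        }
        if (!s.edges.isEmpty()) replaceOrRegister(s);
        // Append the remaining suffix as a fresh chain of states.
        for (; i < word.length(); i++) {
            State next = new State();
            s.edges.put(word.charAt(i), next);
            s = next;
        }
        s.accepting = true;
    }

    private void replaceOrRegister(State s) {
        Map.Entry<Character, State> last = s.edges.lastEntry();
        State child = last.getValue();
        if (!child.edges.isEmpty()) replaceOrRegister(child);
        State existing = register.get(child);
        if (existing != null) s.edges.put(last.getKey(), existing); // reuse equivalent state
        else register.put(child, child);
    }

    boolean contains(String word) {
        State s = start;
        for (int i = 0; i < word.length() && s != null; i++) s = s.edges.get(word.charAt(i));
        return s != null && s.accepting;
    }

    // Counts distinct reachable state objects (by identity), start state included.
    int stateCount() {
        Set<State> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<State> todo = new ArrayDeque<>();
        todo.push(start);
        seen.add(start);
        while (!todo.isEmpty()) {
            for (State next : todo.pop().edges.values()) {
                if (seen.add(next)) todo.push(next);
            }
        }
        return seen.size();
    }
}
```

For the sorted list "tap", "taps", "top", "tops" this sketch should end up with the "ap(s)" and "op(s)" branches sharing their tail states, which is the whole point of the minimization step.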
Given that you are using Java, you can take advantage of the Serializable interface to handle all of your serialization & deserialization needs.
If you're interested in an existing DAWG implementation in Java which implements the algorithm above, check out MDAG, which also provides several nifty features which other implementations do not (including adding and removing strings on-the-fly), and is maintained by me!

I once had to implement such a structure in C for one of my clients. The final structure is loaded from an XML file describing the character set and the DAWG; another process created the XML file from a word list.
Step 1: the structure used to build the first DAWG, which is serialized to an XML file
We used :
typedef struct _s_build_node build_node_t;
struct _s_build_node {
    char_t letter;
    build_node_t* first_child;
    build_node_t* right_sibling;
    hash_t hash;
    size_t depth;
    size_t ID;
};
typedef struct _s_build_dawg {
    charset_t charset;
    build_node_t* all_nodes; // an array of all the created nodes
    build_node_t* root;
} build_dawg_t;
Siblings are ordered ascending; the end-of-word special character is less than any other character.
The algorithm is quite simple :
// create the build dawg
foreach word in wordlist
    insert(dawg, word)
// compact the dawg
compact(dawg)
// generate the xml file
xml_dump(dawg)
In order to compact the dawg, we computed a hash value for each node. Two nodes with the same hash can be factorized. This part can be tricky. Only the node with the lowest depth is kept, the others are deleted and their parents now point to the one kept.
Once compacted, we assign a unique ID to each node (via BFS; IDs are between 0 and N-1, where N is the number of nodes in the compacted DAWG). The XML file simply describes the compacted structure:
<dawg>
    <charset ....>
        ...
    </charset>
    <node ID="node_id" letter="letter" first_child="first_child_ID" next_sibling="next_sibling_id" />
    <node .... />
    <node .... />
    <node .... />
</dawg>
Step 2: the final DAWG
The structure is a little bit simpler:
typedef struct {
    char_t letter;
    size_t first_child;
    size_t next_sibling;
} node_t;
typedef struct {
    node_t nodes[];
    ... whatever you need ...
} dawg_t;
Here root is dawg.nodes[0], and first_child/next_sibling is an index in the nodes array. Creating such a struct is easy from the xml file. The main drawback is that any wordlist modification triggers the generation of a new xml file.
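Since the question is about Java, here is a rough sketch of how the same array-and-index layout might be queried there. It assumes three parallel arrays instead of an array of structs, -1 as the "no node" index, and '\0' as the end-of-word character; ArrayDawg, END and NONE are names made up for this example.

```java
class ArrayDawg {
    static final char END = '\0'; // assumed end-of-word marker, sorts before any letter
    static final int NONE = -1;   // assumed "no child / no sibling" index

    private final char[] letter;
    private final int[] firstChild;
    private final int[] nextSibling;

    ArrayDawg(char[] letter, int[] firstChild, int[] nextSibling) {
        this.letter = letter;
        this.firstChild = firstChild;
        this.nextSibling = nextSibling;
    }

    // Walks the sibling list of node's children looking for the given letter.
    private int child(int node, char c) {
        for (int k = firstChild[node]; k != NONE; k = nextSibling[k]) {
            if (letter[k] == c) return k;
        }
        return NONE;
    }

    // Node 0 is the root; a word is present if its path ends in an END child.
    boolean contains(String word) {
        int node = 0;
        for (int i = 0; i < word.length(); i++) {
            node = child(node, word.charAt(i));
            if (node == NONE) return false;
        }
        return child(node, END) != NONE;
    }
}
```

A hand-built instance holding the words "a" and "at" would use letter = {' ', 'a', END, 't', END}, firstChild = {1, 2, -1, 4, -1} and nextSibling = {-1, -1, 3, -1, -1}. Three primitive arrays like these are compact in memory and are simple to write to and read from a binary file.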

Related

Generating hierarchy from flat data

I need to generate a hierarchy from flat data. This question is not for homework or for an interview test, although I imagine it'd make a good example for either one. I've seen this and this and this, none of them exactly fit my situation.
My data is as follows. I have a list of objects. Each object has a breadcrumb and a text. Examples are below:
Object 1:
---------
breadcrumb: [Person, Manager, Hourly, New]
text: hello world
Object 2:
---------
breadcrumb: [Person, Manager, Salary]
text: hello world again
And I need to convert that to a hierarchy:
Person
|--Manager
|--Hourly
|--New
|--hello world
|--Salary
|--hello world again
I'm doing this in Java but any language would work.

You need a trie data structure, where each Node holds its children in a
List<Node>
1. The trie itself should contain one Node, the root, which is initially empty.
2. When a new sequence arrives, iterate over its items, looking for each item among the children of the current Node and moving forward whenever it is found. This finds the longest prefix already in the trie that is shared with the given sequence.
3. If the longest common prefix doesn't cover the entire sequence, use the remaining items to build a chain of nodes where each node has only one child (the next item), and attach it as a child of the node at which you stopped in step (2).
You see, this is not so easy. Implementation code would be long and not obvious. Unfortunately, JDK doesn't have standard trie implementations, but you can try to find some existing or write your own.
For more details, see https://en.wikipedia.org/wiki/Trie#Algorithms
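The steps above could be sketched roughly as follows. This is only an illustration under assumptions: BreadcrumbTrie, Node, texts and render are invented names, and the text of each object is simply stored on the node where its breadcrumb ends.

```java
import java.util.ArrayList;
import java.util.List;

class BreadcrumbTrie {
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        final List<String> texts = new ArrayList<>(); // texts whose breadcrumb ends here

        Node(String label) { this.label = label; }

        Node childNamed(String name) {
            for (Node c : children) {
                if (c.label.equals(name)) return c;
            }
            return null;
        }
    }

    final Node root = new Node(""); // empty root, as in step 1

    // Steps 2 and 3 in one loop: reuse existing children, create missing ones.
    void add(List<String> breadcrumb, String text) {
        Node cur = root;
        for (String part : breadcrumb) {
            Node next = cur.childNamed(part);
            if (next == null) {
                next = new Node(part);
                cur.children.add(next);
            }
            cur = next;
        }
        cur.texts.add(text);
    }

    // Renders the hierarchy with two spaces of indentation per level.
    String render() {
        StringBuilder sb = new StringBuilder();
        for (Node c : root.children) render(c, 0, sb);
        return sb.toString();
    }

    private static void render(Node n, int depth, StringBuilder sb) {
        indent(depth, sb).append(n.label).append('\n');
        for (String t : n.texts) indent(depth + 1, sb).append(t).append('\n');
        for (Node c : n.children) render(c, depth + 1, sb);
    }

    private static StringBuilder indent(int depth, StringBuilder sb) {
        for (int i = 0; i < depth; i++) sb.append("  ");
        return sb;
    }
}
```

Adding the two example objects from the question and calling render() should give the Person / Manager / Hourly / New hierarchy with each text nested under its last breadcrumb item.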

Binary Search Tree of Strings

I had a question of exactly how a binary search tree of strings works. I know and have implemented binary search trees of integers by checking if the new data <= parent data then by branching left if its less or right if its greater. However I am a little confused on how to implement this with nodes of strings.
With the integers or characters I can just insert in an array into my insert method of the tree i programmed and it builds the tree nodes correctly. My question is how you would work this with an array of strings. How would you get the strings to branch off correctly in the tree? For example if I had an array of questions how would I be able to branch the BST correctly so I would eventually get to the correct answer.
For example look at the following trivial tree example.
                                             land animal?
                        have tentacles?------------^-------------indoor animal
               have claws?-----^----jellyfish       live in jungle?----^----does it bark?
   eat plankton?----^----lobster                   bear----^----lion       cat----^----dog
shark----^----whale
How would you populate a tree such as this so that the nodes end up where you want them? I am trying to make a BST for troubleshooting and I am confused about how to populate the nodes of strings so they appear in the correct positions. Do you need to hard code the nodes?

Update 2, to build a binary decision tree:
A binary decision tree can be thought of as a bunch of questions that yield boolean responses about facets of leaf nodes - the facet either exists / holds true or it does not. That is, for every descendent of a particular node/edge we must be able to say "this question/answer holds" (answers can be "true" or "false"). For instance, a bark is a facet of a (normal) dog, but tentacles are not a facet of a Whale. In the presented tree, the false edge always leads to the left subtree: this is a convention to avoid labeling each edge with true/false or Y/N.
The tree can only be built from existing/external knowledge that allows one to answer each question for every animal.
Here is a rough algorithm that can be used to build such a tree:
1. Start with a set of possible animals, call this A, and a set of questions, call this Q.
2. Pick a question, q, from Q for which count(True(q, a in A)) is closest to that of count(False(q, a in A)) - if the resulting tree is a balanced binary tree these counts will always be equal for the best question to ask.
3. Remove q from Q and use it as the question to ask for the current node. Put all False(q,a) into the set of animals (A') available to the left child node and put all True(q,a) into the set of animals (A'') available to the right child node.
4. Following each edge/branch (false=left, true=right), pick a suitable question from the remaining Q and repeat (using A' or A'' for A, as appropriate).
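As a rough Java illustration of that algorithm (a sketch, not a full decision-tree learner): the external knowledge is assumed to be a map from each animal to the set of questions whose answer is "yes" for it, and DecisionTreeBuilder, Node, facts, no and yes are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DecisionTreeBuilder {
    static final class Node {
        String label; // a question for inner nodes, the remaining animal(s) for leaves
        Node no;      // left branch: answer is false
        Node yes;     // right branch: answer is true
    }

    // facts.get(animal) = questions that hold true for that animal (assumed input shape).
    static Node build(List<String> animals, List<String> questions,
                      Map<String, Set<String>> facts) {
        Node node = new Node();
        String best = null;
        int bestGap = Integer.MAX_VALUE;
        for (String q : questions) {
            int yes = 0;
            for (String a : animals) {
                if (facts.get(a).contains(q)) yes++;
            }
            int no = animals.size() - yes;
            if (yes == 0 || no == 0) continue; // question does not split this set
            int gap = Math.abs(yes - no);
            if (gap < bestGap) {
                bestGap = gap;
                best = q;
            }
        }
        if (best == null) { // nothing left to ask: this is a leaf
            node.label = String.join(" / ", animals);
            return node;
        }
        List<String> yesSet = new ArrayList<>();
        List<String> noSet = new ArrayList<>();
        for (String a : animals) {
            (facts.get(a).contains(best) ? yesSet : noSet).add(a);
        }
        List<String> rest = new ArrayList<>(questions);
        rest.remove(best);
        node.label = best;
        node.no = build(noSet, rest, facts);
        node.yes = build(yesSet, rest, facts);
        return node;
    }
}
```

Skipping questions that leave one side empty is a choice made for this sketch; it keeps the recursion from producing nodes with a useless question.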
(Of course, there are many more complete/detailed/accurate resources found online as course material or whitepapers. Not to mention a suitable selection of books at most college campuses ..)
Update, for a [binary] decision tree:
In this particular case (which is clear with the added diagram) the graph is based on the "yes" or "no" responses to the questions, and those responses are the edges between the nodes. That is, the tree is not built using an ordering of the string values themselves. In this case it might make sense to always have the left branch be "false" and the right branch be "true", although each node could have more edges/children if non-binary responses are allowed.
The decision tree must be "trained" (google search). That is, the graph must be built initially based on the questions/responses which is unlike a BST that is based merely on ordering between nodes. The initial graph building cannot be done from merely an array of questions as the edges do not follow an intrinsic ordering.
Initial response, for a binary search tree:
The same way it does for integers: the algorithm does not change.
Consider a function, compareTo(a,b) that will return -1, 0 or 1 for a < b, a == b, and a > b, respectively.
Then consider that the type of neither a nor b matter (as long as they are the same) when implementing a function with this contract if such a type supports ordering: it will be "raw" for integers and use the host language's corresponding string comparison for string types.
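For instance, a string BST in Java can be sketched with String.compareTo doing the ordering; the insert logic is the same as for integers. StringBst and its method names are illustrative only, and the "less than or equal goes left" rule follows the convention described in the question.

```java
import java.util.ArrayList;
import java.util.List;

class StringBst {
    private static final class Node {
        final String value;
        Node left, right;
        Node(String value) { this.value = value; }
    }

    private Node root;

    void insert(String s) { root = insert(root, s); }

    private static Node insert(Node n, String s) {
        if (n == null) return new Node(s);
        // compareTo orders strings lexicographically; <= 0 goes left, as in the question.
        if (s.compareTo(n.value) <= 0) n.left = insert(n.left, s);
        else n.right = insert(n.right, s);
        return n;
    }

    boolean contains(String s) {
        Node n = root;
        while (n != null) {
            int cmp = s.compareTo(n.value);
            if (cmp == 0) return true;
            n = cmp < 0 ? n.left : n.right;
        }
        return false;
    }

    // In-order traversal yields the strings in sorted order.
    List<String> inOrder() {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node n, List<String> out) {
        if (n == null) return;
        collect(n.left, out);
        out.add(n.value);
        collect(n.right, out);
    }
}
```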

loading binary tree from string (parent-left-right)

I have a binary tree that looks like this
the object that represents it looks like this (java)
public class node {
    private String value = "";
    private TreeNode aChild;
    private TreeNode bChild;
    ....
}
I want to read the data and build the tree from a string.
So I wrote some small method to serialize it and I have it like this
(parent-left-right)
0,null,O#1,left,A#2,left,C#3,left,D#4,left,E#4,right,F#1,right,B#
Then I read it and I have it as a list - objects in this order O,A,C,D,E,F,B
And now my question is - how do I build the tree?
iterating and putting it on a stack, queue ?
should I serialize on a different order ?
(basically I want to learn the best practices for building a tree from string data)
can you refer me to a link on that subject ?

Given your second string representation, there is no way to retrieve the original tree. So unless any tree with that sequence is acceptable, you'll have to include more information in your string. One possible way would be representing null references in some fashion. Another would be using parentheses or similar.
Given your first representation, restoring the data is still possible. One algorithm exploiting the level information would be the following:
Maintain a reference x to the current position in your tree
For every node n you want to add, move that reference x up in your tree as long as the level of x is no less than the level of n
Check that now the level of x is exactly one less than the level of n
Make x the parent of n, and n the next child of x
Move x to now point at n
This works if you have parent links in your nodes. If you don't, then you can maintain a list of the most recent node for every level. x would then correspond to the last element of that list, and moving x up the tree would mean removing the last element from the list. The level of x would be the length of the list.
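A sketch of the list-per-level variant in Java, assuming the serialized form from the question: records separated by '#', each record being "level,side,value" with side equal to "left" or "right" (or "null" for the root). TreeLoader and parse are invented names, and TreeNode is a simplified stand-in for the question's class.

```java
import java.util.ArrayList;
import java.util.List;

class TreeLoader {
    static final class TreeNode {
        final String value;
        TreeNode aChild; // left child
        TreeNode bChild; // right child
        TreeNode(String value) { this.value = value; }
    }

    static TreeNode parse(String data) {
        // path.get(k) is the most recently read node at level k.
        List<TreeNode> path = new ArrayList<>();
        TreeNode root = null;
        for (String record : data.split("#")) {
            if (record.isEmpty()) continue;
            String[] fields = record.split(",");
            int level = Integer.parseInt(fields[0]);
            TreeNode n = new TreeNode(fields[2]);
            // "Move x up" until it sits exactly one level above the new node.
            while (path.size() > level) path.remove(path.size() - 1);
            if (path.size() != level) {
                throw new IllegalArgumentException("level skipped at record: " + record);
            }
            if (level == 0) {
                root = n;
            } else {
                TreeNode parent = path.get(level - 1);
                if (fields[1].equals("left")) parent.aChild = n;
                else parent.bChild = n;
            }
            path.add(n); // x now points at n
        }
        return root;
    }
}
```

Because each record carries its side, a node with only a right child is restored correctly without any explicit null markers.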

Your serialization is not well explained, especially regarding how you represent missing nodes. There are several ways, such as representing the tree structure with ()s or by using the binary tree in an array technique. Both of these can be serialized easily. Take a look at Efficient Array Storage for Binary Tree for further explanations.

Data structure to represent nodes lengths of paths between them?

Okay I am new to Java, and I'm asking this question because I'm sure there is a better simple way to deal with this and the more experienced folk out there may be able to give me some pointers.
I have a graph of cities with lengths of paths between them. I am trying to construct an algorithm using Java to go from a start city to a destination city, finding the shortest path. Each city will have a name and map coordinates. More specifically I will be using the A* algorithm, but that is (probably) not important to my question.
My issue is I am trying to figure out a good way to represent the nodes and the paths between them with the length.
The easiest way I could think of was to create a huge 2 dimensional square array with each city represented by an index, where the connecting cities can be represented by where they intersect in the array. I assigned an index # to each city. In the array values, 0's would go where there is no connection, and the distance would go where there is a connection.
I will also have a city subclass with an "index" attribute, with the value of its index in the array. The downside to this is to figure out which cities have connections, there have to be extra steps to lookup what the city's index is in the array, and also having to lookup which connecting city has the connecting index.
Is there a better way to represent this?

An alternative would be a Node structure that stores pointers to all of its adjacent nodes.
E.g.
if you have something like this in your data structure
    A  B  C
A   /  0  1
B   0  /  1
C   1  1  /
in the new structure it would be
A: [C]
B: [C]
C: [A, B]
Compared to your 2D array approach, checking whether two nodes are connected takes longer this way, but it uses less space.

Consider...
class Node {
    List<Link> link;
    String cityName;
}
class Link {
    Node destinationCity;
    Long distance;
}
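To show how those two classes might be used, here is a small self-contained sketch. CityGraph, connect, city and distanceBetween are hypothetical additions, the distance is a primitive long rather than Long, and roads are assumed to be two-way.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CityGraph {
    static final class Node {
        final String cityName;
        final List<Link> links = new ArrayList<>();
        Node(String cityName) { this.cityName = cityName; }
    }

    static final class Link {
        final Node destinationCity;
        final long distance;
        Link(Node destinationCity, long distance) {
            this.destinationCity = destinationCity;
            this.distance = distance;
        }
    }

    private final Map<String, Node> cities = new HashMap<>();

    // Looks a city up by name, creating it on first use, so no index bookkeeping is needed.
    Node city(String name) {
        return cities.computeIfAbsent(name, Node::new);
    }

    // Assumption: every road can be travelled in both directions.
    void connect(String a, String b, long distance) {
        Node na = city(a);
        Node nb = city(b);
        na.links.add(new Link(nb, distance));
        nb.links.add(new Link(na, distance));
    }

    // Returns the length of the direct link from a to b, or -1 if there is none.
    long distanceBetween(String a, String b) {
        Node na = cities.get(a);
        if (na == null) return -1;
        for (Link l : na.links) {
            if (l.destinationCity.cityName.equals(b)) return l.distance;
        }
        return -1;
    }
}
```

A search such as A* only ever needs "the neighbours of the current city and their distances", which is exactly what iterating over a node's links gives you.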

Using Binary Trees to find Anagrams

I am currently trying to create a method that uses a binary tree that finds anagrams of a word inputted by the user.
If the tree does not contain any other anagram for the word (i.e., if the key was not in the tree or the only element in the associated linked list was the word provided by the user), the message "no anagram found " gets printed
For example, if key "opst" appears in the tree with an associated linked list containing the words "spot", "pots", and "tops", and the user gave the word "spot", the program should print "pots" and "tops" (but not spot).
public boolean find(K thisKey, T thisElement){
    return find(root, thisKey, thisElement);
}
public boolean find(Node current, K thisKey, T thisElement){
    if (current == null)
        return false;
    else{
        int comp = current.key.compareTo(thisKey);
        if (comp>0)
            return find(current.left, thisKey, thisElement);
        else if(comp<0)
            return find(current.right, thisKey, thisElement);
        else{
            return current.item.find(thisElement);
        }
    }
}
While I created this method to find if the element provided is in the tree (and the associated key), I was told not to reuse this code for finding anagrams.
K is a generic type that extends Comparable and represents the Key, T is a generic type that represents an item in the list.
If extra methods I've done are required, I can edit this post, but I am absolutely lost. (Just need a pointer in the right direction)

It's a little unclear what exactly is tripping you up (beyond "I've written a nice find method but am not allowed to use it."), so I think the best thing to do is start from the top.
I think you will find that once you get your data structured in just the right way, the actual algorithms will follow relatively easily (many computer science problems share this feature.)
You have three things:
1) Many linked lists, each of which contains the set of anagrams of some set of letters. I am assuming you can generate these lists as you need to.
2) A binary tree that maps Strings (keys) to lists of anagrams generated from those strings. Again, I'm assuming that you are able to perform basic operations on these trees--adding elements, finding elements by key, etc.
3) A user-inputted String.
Insight: The anagrams of a group of letters form an equivalence class. This means that any member of an anagram list can be used as the key associated with the list. Furthermore, it means that you don't need to store in your tree multiple keys that point to the same list (provided that we are a bit clever about structuring our data; see below).
In concrete terms, there is no need to have both "spot" and "opts" as keys in the tree pointing to the same list, because once you can find the list using any anagram of "spot", you get all the anagrams of "spot".
Structuring your data cleverly: Given our insight, assume that our tree contains exactly one key for each unique set of anagrams. So "opts" maps to {"opts", "pots", "spot", etc.}. What happens if our user gives us a String that we're not using as the key for its set of anagrams? How do we figure out that if the user types "spot", we should find the list that is keyed by "opts"?
The answer is to normalize the data stored in our data structures. This is a computer-science-y way of saying that we enforce arbitrary rules about how we store the data. (Normalizing data is a useful technique that appears repeatedly in many different computer science domains.) The first rule is that we only ever have ONE key in our tree that maps to a given linked list. Second, what if we make sure that each key we actually store is predictable--that is we know that we should search for "opts" even if the user types "spot"?
There are many ways to achieve this predictability--one simple one is to make sure that the letters of every key are in alphabetical order. Then, we know that every set of anagrams will be keyed by the (unique!) string formed by sorting its letters, for example "opst" for "spot", "pots" and "tops"; that key does not itself have to be a word. Consistently enforcing this rule makes it easy to search the tree--we know that no matter what string the user gives us, the key we want is the string formed from alphabetizing the user's input.
Putting it together: I'll provide the high-level algorithm here to make this a little more concrete.
1) Get a String from the user (hold on to this String, we'll need it later)
2) Turn this string into a search key that follows our normalization scheme
(You can do this in the constructor of your "K" class, which ensures that you will never have a non-normalized key anywhere in your program.)
3) Search the tree for that key, and get the linked list associated with it. This list contains every anagram of the user's input String.
4) Print every item in the list that isn't the user's original string (see why we kept the string handy?)
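The four steps can be sketched in Java as below. Note the assumptions: java.util.TreeMap stands in for the hand-written binary tree from the question, ArrayList stands in for the linked lists, and AnagramIndex, normalize and anagramsOf are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

class AnagramIndex {
    // Sorted-letter key -> every stored word with exactly those letters.
    private final TreeMap<String, List<String>> tree = new TreeMap<>();

    // Step 2: the normalization scheme, letters in alphabetical order.
    static String normalize(String word) {
        char[] letters = word.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    void add(String word) {
        tree.computeIfAbsent(normalize(word), k -> new ArrayList<>()).add(word);
    }

    // Steps 3 and 4: look up the list by normalized key, skip the user's own word.
    List<String> anagramsOf(String word) {
        List<String> result = new ArrayList<>();
        for (String w : tree.getOrDefault(normalize(word), Collections.emptyList())) {
            if (!w.equals(word)) result.add(w);
        }
        return result;
    }
}
```

If anagramsOf returns an empty list, the program would print "no anagram found"; this covers both the missing key and the list that contains only the user's word.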
Takeaways:
Frequently, your data will have some special features that allow you to be clever. In this case it is the fact that any member of an anagram list can be the sole key we store for that list.
Normalizing your data gives you predictability and allows you to reason about it effectively. How much more difficult would the "find" algorithm be if each key could be an arbitrary member of its anagram list?
Corollary: Getting your data structures exactly right (What am I storing? How are the pieces connected? How is it represented?) will make it much easier to write your algorithms.

What about sorting the characters in each word, and then comparing the results?
