Generate deletions, insertions, substitutions, and transpositions for a string

I am implementing a spell checker algorithm. I have constructed a Trie that stores my words for quick searching.
When an input string is passed in, I want to generate every potential deletion, insertion, substitution, and transposition of that string at an edit distance of 1. Using this superset I can then look each candidate up in my Trie and offer the user "did you mean?" type results.
I have looked online and most solutions mention calculating the Levenshtein distance. That only works if you already know both strings and want to find the edit distance between them.
Suggestions?
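
One way to build the candidate set described in the question is to split the word at every position and apply each of the four edits at that split. The sketch below is only an illustration under stated assumptions: it assumes a lowercase a-z alphabet, and the names EditCandidates and edits1 are hypothetical, not taken from the question. Each returned candidate would then be looked up in the Trie.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class EditCandidates {
    // Assumption: words only use lowercase ASCII letters.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    /** Returns every distinct string at edit distance 1 from word. */
    static Set<String> edits1(String word) {
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String head = word.substring(0, i);
            String tail = word.substring(i);
            // Deletion: drop the first character of tail.
            if (!tail.isEmpty()) {
                out.add(head + tail.substring(1));
            }
            // Transposition: swap the first two characters of tail.
            if (tail.length() > 1) {
                out.add(head + tail.charAt(1) + tail.charAt(0) + tail.substring(2));
            }
            for (char c : ALPHABET.toCharArray()) {
                // Substitution: replace the first character of tail with c.
                if (!tail.isEmpty()) {
                    out.add(head + c + tail.substring(1));
                }
                // Insertion: put c between head and tail.
                out.add(head + c + tail);
            }
        }
        // Substituting a letter with itself (or swapping equal letters)
        // reproduces the input, which is distance 0, so drop it.
        out.remove(word);
        return out;
    }
}
```

For a word of length n this produces on the order of 54n + 25 strings before deduplication, so checking each one against the Trie stays cheap for typical word lengths.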

I would use a two-pass algorithm:
Pass 1
Compute the distance for all words that start with the same letter as the word being checked. This will be fast. You can stop the depth-first search once the number of characters exceeds the length of the checked word + 2 (at that point it is obviously a different word).
Display the results of pass 1, e.g. by marking the word with a red underline.
Pass 2
Look through all words and stop at length + 3 or 4.
Update the results found in pass 1.

Related

Depth first search or backtrack recursion for finding all possible combination of letters in a crossword puzzle/boggle board?

What would be the time complexity? I just want to avoid this being O(n!). Would using depth first search be time complexity O(n^2), as for each letter it may have to go through all other letters worst case?
I guess I'm not sure if I'm thinking about this the right way.
When I say use depth first search, I mean starting a depth first search from the first letter, then starting another from the second letter, and so on.
Is that necessary?
Note:
The original problem is to find all possible words in a crossword/boggle board. I'm thinking of using the trie data structure to find if a word is in the dictionary, but am thinking about ways of generating the words themselves.
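
As a rough illustration of the approach the question describes, the sketch below starts a depth first search from every cell and walks the trie in step with the board, so a branch is abandoned as soon as no dictionary word has the current prefix. The worst case is still exponential in the path length, but the prefix cut-off prunes most branches with a real dictionary. Everything here is an assumption for illustration: the class BoardSearch, the minimal Node trie, and the Boggle-style rule that the eight surrounding cells are adjacent and a cell is used at most once per word.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class BoardSearch {
    /** Minimal trie node; a hypothetical stand-in for the asker's trie. */
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isWord;
    }

    static Node buildTrie(String... words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.isWord = true;
        }
        return root;
    }

    /** Runs one DFS per starting cell and collects every dictionary word found. */
    static Set<String> findWords(char[][] board, Node root) {
        Set<String> found = new TreeSet<>();
        boolean[][] used = new boolean[board.length][board[0].length];
        for (int r = 0; r < board.length; r++) {
            for (int c = 0; c < board[0].length; c++) {
                dfs(board, r, c, root, new StringBuilder(), used, found);
            }
        }
        return found;
    }

    private static void dfs(char[][] b, int r, int c, Node node, StringBuilder path,
                            boolean[][] used, Set<String> found) {
        if (r < 0 || c < 0 || r >= b.length || c >= b[0].length || used[r][c]) {
            return;
        }
        Node child = node.next.get(b[r][c]);
        if (child == null) {
            return; // no dictionary word starts with this prefix
        }
        used[r][c] = true;
        path.append(b[r][c]);
        if (child.isWord) {
            found.add(path.toString());
        }
        // Assumption: all eight surrounding cells count as adjacent.
        for (int dr = -1; dr <= 1; dr++) {
            for (int dc = -1; dc <= 1; dc++) {
                if (dr != 0 || dc != 0) {
                    dfs(b, r + dr, c + dc, child, path, used, found);
                }
            }
        }
        // Backtrack so other paths can reuse this cell.
        path.setLength(path.length() - 1);
        used[r][c] = false;
    }
}
```

Starting from every cell is needed because a word may begin anywhere on the board; the trie makes most of those starts terminate after a step or two.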

Following the discussion above, here is my answer:
Definition: a trieX is a sub trie, with words of length X only.
Since we have a trie with all words in the desired language, we can also get the appropriate trieX.
Say the crossword puzzle has w words. We create an array of length w where each entry is the root of a trieX, with X being the length of the relevant word. This gives us the list of possible words for each blank word.
Then we iterate over the intersections between words and eliminate words that cannot be placed. When no more changes are made, we stop.
Two remarks:
1. In order to improve performance, we start by adding either long words, or VERY short ones. What is short or long? Have a look at this and this.
2. Elimination of words from the trieX's can also be done by checking dependencies between words (if THIS word is here, then THAT word can't be there, etc.). This is more complicated, so if anyone wants to add some ideas on how to do this easily, please do.

Best approach to solve Word Chain

I am trying to solve this problem in CodeEval.
In this challenge we suggest you to play in the known game "Word
chain" in which players come up with words that begin with the letter
that the previous word ended with. The challenge is to determine the
maximum length of a chain that can be created from a list of words.
Example:
Input:
soup,sugar,peas,rice
Output:
4
Explanation: We can form a chain of 4 words like this: "soup->peas->sugar->rice".
Constraints:
The length of a list of words is in range [4, 35].
A word in a list of words is represented by a random lowercase ascii string with the length of [3, 7] letters.
There is no repeating words in a list of words.
My attempt: My approach is to model the words as a graph, such that each word in the input is a node and there is a directed edge from word i to word j if the last character of word i equals the first character of word j.
After that I run BFS from each node and compute the distance to the node farthest from it. The final result is the maximum of these values over all nodes.
But this approach is not giving me a full score. Hence, my question is how to solve this problem correctly and efficiently?

My reputation is less than 50, so I can't post a comment...
If the total number of words is at most 20, we can solve this with dynamic programming over bitmasks.
Make dp[20][1<<20], where dp[i][j] means you are currently at word i and have visited the words in bitmask j.
For more than 20 words I still don't have a good idea. Maybe we need some randomized algorithm.
My idea is to use DFS with some optimization, because 35 is not too big. I think that is enough to solve the problem.
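
The bitmask idea above could be sketched roughly as follows. This is a minimal illustration, not a full solution for 35 words: the table has 2^n rows, so it is only practical for about 20 words, as the answer says. The class and method names (WordChain, longestChain) are made up for this example, and the table is stored as ends[mask][i] rather than dp[i][mask]. Note that BFS finds shortest paths, whereas this problem needs the longest chain that uses each word at most once, which is why the set of used words has to be part of the state.

```java
final class WordChain {
    /** Longest chain using each word at most once; feasible only for small lists. */
    static int longestChain(String[] words) {
        int n = words.length;
        // ends[mask][i] is true when the words in mask can be ordered
        // into a valid chain whose last word is words[i].
        boolean[][] ends = new boolean[1 << n][n];
        for (int i = 0; i < n; i++) {
            ends[1 << i][i] = true; // a single word is a chain of length 1
        }
        int best = 0;
        // Transitions only add bits, so increasing mask order is safe.
        for (int mask = 1; mask < (1 << n); mask++) {
            for (int i = 0; i < n; i++) {
                if (!ends[mask][i]) {
                    continue;
                }
                best = Math.max(best, Integer.bitCount(mask));
                char last = words[i].charAt(words[i].length() - 1);
                for (int j = 0; j < n; j++) {
                    boolean unused = (mask & (1 << j)) == 0;
                    if (unused && words[j].charAt(0) == last) {
                        ends[mask | (1 << j)][j] = true;
                    }
                }
            }
        }
        return best;
    }
}
```

For the example input, `WordChain.longestChain(new String[]{"soup", "sugar", "peas", "rice"})` returns 4, matching the chain soup->peas->sugar->rice.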

See the solution mentioned here: Detecting when matrix multiplication is possible
The solution to your problem is pretty much the same. Create a directed graph such that every word adds an edge from its first letter to its last letter.
Then find an Euler path ( http://en.wikipedia.org/wiki/Euler_path ) in that graph.
EDIT: I see that you are not assured of using all words and you need the longest path in the graph ( http://en.wikipedia.org/wiki/Longest_path_problem ). This problem is NP-hard in general.

See the solution described in "word chain in core java".
The page gives a solution in core Java that follows this process:
Load the dictionary items for a given word length into memory.
Get the next eligible list of words from memory for the given word.
Another approach uses the Hadoop MapReduce framework; it is described in detail in "word chain using map-reduce".

Inverted indexing

I'm working on inverted indexing and my question is: in the final step, should we return the total number of documents the word appeared in, or each document number?
For example:
If the word "Hello" appeared in 3 documents (document A, document B, and document C), should I return 3 or A, B, C?

An Index implies it will give you a lookup to something, not just a number. A frequency count would give you a count of the number of occurrences of a word.
BTW You can get the number from the A,B,C but not the other way around.
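
To make the point concrete, here is a small sketch of an index that maps each word to the set of documents containing it; the count is then just the size of that set. The class and method names (InvertedIndex, addDocument, documentsContaining, documentFrequency) are hypothetical, and the tokenizer is a deliberately naive assumption: lowercase the text and split on non-word characters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class InvertedIndex {
    // word -> sorted set of ids of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    void addDocument(String docId, String text) {
        // Naive tokenizer (assumption): lowercase, split on non-word characters.
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** The lookup an index is expected to provide: which documents contain the word. */
    Set<String> documentsContaining(String word) {
        return postings.getOrDefault(word.toLowerCase(), new TreeSet<>());
    }

    /** The count can always be derived from the document set. */
    int documentFrequency(String word) {
        return documentsContaining(word).size();
    }
}
```

With documents A, B, and C all containing "hello", `documentsContaining("Hello")` yields the set A, B, C and `documentFrequency("Hello")` yields 3.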

That's totally up to you!
If you just need to return the total number of documents a certain word appears in, then you won't even need an inverted index. All you would need is a mapping from words to counts. That would take much less computation and space than an inverted index.
If you're working on an exercise in Information Retrieval (or doing some proof of concept, etc.), it seems to me that you would also need to return the documents where a given word was found. That is Boolean retrieval.

Returning a Subset of Strings from 10000 ascii strings

My college course is nearly over, so I have started preparing for job interviews, and I came across this interview question while preparing:
You have a set of 10000 ascii strings (loaded from a file)
A string is input from stdin.
Write a pseudocode that returns (to stdout) a subset of strings in (1) that contain the same distinct characters (regardless of order) as
input in (2). Optimize for time.
Assume that this function will need to be invoked repeatedly. Initializing the string array once and storing it in memory is okay.
Please avoid solutions that require looping through all 10000 strings.
Can anyone provide general pseudocode or an algorithm for solving this problem? I am scratching my head thinking about the solution. I am mostly familiar with Java.

Here is an algorithm whose lookup time does not depend on the number of stored strings (it is O(1) in that respect)!
Initialization:
For each string, sort its characters and remove duplicates, e.g. "trees" becomes "erst".
Load the sorted key into a trie, and add a reference to the original string to the list stored at the node where the key ends.
Search:
Sort the input string and remove duplicates, the same way as during initialization.
Follow the trie using the characters of that key; at the end node, return all strings referenced there.
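
A minimal sketch of this scheme follows. Note the substitution: it uses a HashMap keyed by the sorted distinct characters instead of the trie from the answer, which gives the same lookup behaviour with less code. The names DistinctCharIndex, key, and lookup are invented for this example. A lookup costs time proportional to the input length only; it never scans the stored strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class DistinctCharIndex {
    // sorted distinct characters -> all source strings that share that key
    private final Map<String, List<String>> byKey = new HashMap<>();

    /** One-time initialization over all source strings. */
    DistinctCharIndex(List<String> strings) {
        for (String s : strings) {
            byKey.computeIfAbsent(key(s), k -> new ArrayList<>()).add(s);
        }
    }

    /** Sorted distinct characters of s, e.g. "trees" becomes "erst". */
    static String key(String s) {
        TreeSet<Character> chars = new TreeSet<>();
        for (char c : s.toCharArray()) {
            chars.add(c);
        }
        StringBuilder sb = new StringBuilder();
        for (char c : chars) {
            sb.append(c);
        }
        return sb.toString();
    }

    /** Returns the stored strings with exactly the same distinct characters as input. */
    List<String> lookup(String input) {
        return byKey.getOrDefault(key(input), Collections.emptyList());
    }
}
```

For instance, with "trees", "reset", "tree", and "rest" loaded, `lookup("terse")` returns "trees", "reset", and "rest", because all of them reduce to the key "erst", while "tree" reduces to "ert" and is left out.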

They say optimise for time, so I guess we're safe to use as much space as we want.
In that case, you could do an initial pass over the 10000 strings and build a mapping from each unique character present in them to the set of indices of the strings that contain it. That way you can ask the mapping: which strings contain character 'x'? Call this mapping M. (Order: O(nm), where n is the number of strings and m is their maximum length.)
To optimise for time again, you could reduce the stdin input string to its unique characters and put them in a queue, Q. (Order: O(p), where p is the length of the input string.)
Start a new set, say S, and let S = M[Q.extractNextItem].
Now you can loop over the rest of the unique characters and find which strings contain all of them.
While (Q is not empty) (loops O(p)) {
S = S intersect M[Q.extractNextItem] (close to O(1) only if your set implementation makes intersection that cheap)
}
Voila, return S. Note that S holds every string that contains all of the input characters, so strings with extra characters still have to be filtered out to get exactly the same distinct characters.
Total time: O(mn + p + p*1) = O(mn + p), assuming constant-time intersections.
(Still early in the morning here, I hope that time analysis was right)

As Bohemian says, a trie tree is definitely the way to go!
This sounds like the way an address book lookup would work on a phone. Start punching digits in, and then filter the address book based on the number representation as well as any of the three (or actually more if using international chars) letters that number would represent.

Algorithm to choose random letters for word search game that allows many words to be spelled

I'm making a boggle-like word game. The user is given a grid of letters like this:
O V Z W X
S T A C K
Y R F L Q
The user picks out a word using any adjacent chains of letters, like the word "STACK" across the middle line. The letters used are then replaced by the machine e.g. (new letters in lowercase):
O V Z W X
z e x o p
Y R F L Q
Notice you can now spell "OVeRFLoW" by using the new letters. My problem is: What algorithm can I use to pick new letters that maximizes the number of long words the user can spell? I want the game to be fun and involve spelling e.g. 6 letter words sometimes but, if you pick bad letters, games involve the user just spelling 3 letter words and not getting a chance to find larger words.
For example:
You could just randomly pick new letters from the alphabet. This does not work well.
Likewise, I found picking randomly but using the letter frequencies from Scrabble didn't work well. This works better in Scrabble I think as you are less constrained about the order you use the letters in.
I tried having a set of lists, each representing one of the dice from the Boggle game, with each letter picked from a random die side (I also wonder whether I can legally use this data in a product). I didn't notice this working well. I imagine the Boggle dice sides were chosen in some sensible manner, but I cannot find how this was done.
Some ideas I've considered:
Make a table of how often letter pairs occur together in the dictionary. For the sake of argument, say E is seen next to A 30% of the time. When picking a new letter, I would randomly pick a letter based on the frequency of this letter occurring next to a randomly chosen adjacent letter on the grid. For example, if the neighboring letter was E, the new letter would be "A" 30% of the time. This should mean there are lots of decent pairs to use scattered around the map. I could maybe improve this by making probability tables of a letter occurring between two other letters.
Somehow do a search for what words can be spelt on the current grid, taking the new letters to be wildcards. I would then replace the wildcards with letters that allowed the biggest words to be spelt. I'm not sure how you would do this efficiently however.
Any other ideas are appreciated. I wonder if there is a common way to solve this problem and what other word games use.
Edit: Thanks for the great answers so far! I forgot to mention, I'm really aiming for low memory/CPU requirements if possible. I'm probably going to use the SOWPODS dictionary (about 250,000 words) and my grid will be about 6 x 6.

Here's a simple method:
Write a fast solver for the game using the same word list that the player will use. Generate say 100 different possible boards at random (using letter frequencies is probably a good idea here, but not essential). For each board calculate all the words that can be generated and score the board based on the number of words found or the count weighted by word length (i.e. the total sum of word lengths of all words found). Then just pick the best scoring board from the 100 possibilities and give that to the player.
Also instead of always picking the highest scoring board (i.e. the easiest board) you could have different score thresholds to make the game more difficult for experts.

A minor variation on the letter-pair approach: use the frequency of letter pairs in long words - say 6 letters or longer - since that's your objective. You could also develop a weighting that included all adjacent letters, not just a random one.
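
A possible shape for this variation is sketched below. It counts adjacent letter pairs only in words of at least minLength letters, then picks a new letter with probability proportional to how often it appears next to all of the given neighbouring letters. The specifics are assumptions made for the example rather than anything from the answer: the class name PairWeightedPicker, lowercase a-z input, symmetric pair counts (because a neighbour on the grid may come before or after the new letter in a word), and a base weight of 1 so that no letter is ever impossible.

```java
import java.util.List;
import java.util.Random;

final class PairWeightedPicker {
    // pairCounts[a][b]: how often letters a and b are adjacent in long words
    private final int[][] pairCounts = new int[26][26];

    /** Assumption: dictionary words contain only lowercase a-z. */
    PairWeightedPicker(List<String> dictionary, int minLength) {
        for (String w : dictionary) {
            if (w.length() < minLength) {
                continue; // only long words shape the weights
            }
            for (int i = 0; i + 1 < w.length(); i++) {
                int a = w.charAt(i) - 'a';
                int b = w.charAt(i + 1) - 'a';
                // Counted in both directions, so a doubled letter such as
                // "ll" adds 2 to pairCounts[l][l].
                pairCounts[a][b]++;
                pairCounts[b][a]++;
            }
        }
    }

    /** Picks a letter weighted by its pair counts with every neighbouring letter. */
    char pick(char[] neighbours, Random rng) {
        int[] weight = new int[26];
        int total = 0;
        for (int c = 0; c < 26; c++) {
            weight[c] = 1; // base weight keeps every letter possible
            for (char n : neighbours) {
                weight[c] += pairCounts[n - 'a'][c];
            }
            total += weight[c];
        }
        // Roulette-wheel selection over the 26 weights.
        int roll = rng.nextInt(total);
        for (int c = 0; c < 26; c++) {
            roll -= weight[c];
            if (roll < 0) {
                return (char) ('a' + c);
            }
        }
        throw new IllegalStateException("weights did not cover the roll");
    }
}
```

The table is a fixed 26 x 26 array of ints, so it fits the low memory/CPU goal mentioned in the question; the dictionary is only needed once, when the counts are built.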

This wordgame I slapped up a while back, which behaves very similarly to what you describe, uses English frequency tables to select letters, but decides first whether to generate a vowel or consonant, allowing me to ensure a given rate of vowels on the board. This seems to work reasonably well.

You should look up n-grams and Markov models.
Your first idea is very loosely related to Markov algorithms.
Basically, if you have a large text corpus, say of 1000 words, you can analyse each letter and build a table of the probability that a certain letter follows the current letter.
For example, I know that the letter Q appears only 40 times in my 1000 words (4000 letters in total). Then I calculate which letters probably follow it using my Markov hash table.
For example,
QU happens 100% of the time, so I know that if Q is randomly chosen by your application, I need to make sure that the letter U is also included.
Then, the letter "I" is used 50% of the time, "A" 25% of the time and "O" 25% of the time.
It's actually really complicated to explain and I bet there are other explanations out there which are much better than this.
But the idea is that, given a legitimately large text corpus, you can create a chain of X letters which are probably consistent with the English language and thus should be easy for users to make words out of.
You can choose how far to look ahead via the value of n in the n-gram: the higher the number, the easier you can make your game. For example, an n-gram of two would probably make it very hard to create words over 6 letters, but an n-gram of 4 would make it very easy.
Wikipedia explains it really badly, so I wouldn't follow that.
Take a look at this Markov generator:
http://www.haykranen.nl/projects/markov/demo/

I do not know about a precanned algorithm for this, but...
There is a dictionary file in UNIX, and I imagine there is something similar available on other platforms (maybe even in the java libraries? - google it). Anyways, use the files the spell checker uses.
After they spell a word and it drops out, you have existing letters and blank spaces.
1) From each existing letter, go right, left, up, down (you will need to understand recursive algorithms). As long as the string you have built so far is found at the start of words or backwards from the end of words in the dictionary file, continue. When you come across a blank space, count the frequency of the letters you need next. Use the most frequent letters.
It will not guarantee a word as you have not checked the corresponding ending or beginning, but I think it would be much easier to implement than an exhaustive search and get pretty good results.

I think this will get you a step closer to your destination: http://en.wikipedia.org/wiki/Levenshtein_distance

You might look at this Java implementation of the Jumble algorithm to find sets of letters that permute to multiple dictionary words:
$ java -jar dist/jumble.jar | sort -nr | head
11 Orang Ronga angor argon goran grano groan nagor orang organ rogan
10 Elaps Lepas Pales lapse salep saple sepal slape spale speal
9 ester estre reest reset steer stere stree terse tsere
9 caret carte cater crate creat creta react recta trace
9 Easter Eastre asteer easter reseat saeter seater staree teaser
9 Canari Carian Crania acinar arnica canari carina crania narica
8 leapt palet patel pelta petal plate pleat tepal
8 laster lastre rastle relast resalt salter slater stelar
8 Trias arist astir sitar stair stria tarsi tisar
8 Trema armet mater metra ramet tamer terma trame
...
