I am working on a spellchecker, and one of the methods I use to suggest corrections is inserting multiple characters into the word. This allows words like exmpl to be corrected to example.
Here's the actual code:
public static Set<String> multiInsert(String word, int startIndex) {
    Set<String> result = new HashSet<>();
    // List of characters to insert
    String correction = "azertyuiopqsdfghjklmwxcvbnùûüéèçàêë";
    for (int i = startIndex; i <= word.length(); i++) {
        for (int j = i + 1; j <= word.length(); j++) {
            for (int k = 0; k < correction.length(); k++) {
                String newWord = word.substring(0, j) + correction.charAt(k) + word.substring(j);
                result.addAll(multiInsert(newWord, startIndex + 2));
                if (dico.contains(newWord)) result.add(newWord);
            }
        }
    }
    return result;
}
The problem with this function is that it takes A LOT of time to process, especially when the word is long or when there are many words to correct. Is there any better way to implement this function or optimize it?
What's making it slow is that you're testing strings that are not in the dictionary.
There are lots more possible misspellings than there are words in the dictionary.
You need to be guided by the dictionary.
This is the general spelling correction problem.
I've programmed it several times.
In a nutshell, the method is to store the dictionary as a trie, and do bounded depth-first walk of the trie.
At each step, you keep track of the distance between the word in the trie and the original word.
Whenever that distance exceeds the bound, you prune the search.
So you do it in cycles, increasing the bound each time.
First you do it with a bound of 0, so it will only find an exact match.
That is equivalent to ordinary trie search.
If that did not yield a match, do the walk again with a bound of 1.
That will find all dictionary words that are distance 1 from the original word.
If that didn't yield any, increase the bound to 2, and so on.
(What constitutes an increment of distance is any transformation you choose, like insertion, deletion, replacement, or more general re-writings.)
In the worst case, the performance is bounded by the true distance times the dictionary size; short of that, it is exponential in the true distance. Since each walk costs a roughly constant factor times the previous walk, the time is dominated by the final walk, so the earlier walks do not add much time.
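Purely as illustration, here is a minimal sketch of that bounded walk over a map-based dictionary trie. The class and method names are mine (not from any library); it counts insertions, deletions and replacements as unit-cost edits. You would build the trie once from dico and call suggest with an increasing bound:

import java.util.*;

// Plain map-based dictionary trie.
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    boolean isWord;
}

class BoundedTrieSearch {

    // Increase the bound until at least one dictionary word is found.
    static Set<String> suggest(TrieNode root, String word, int maxBound) {
        for (int bound = 0; bound <= maxBound; bound++) {
            Set<String> hits = withinDistance(root, word, bound);
            if (!hits.isEmpty()) return hits;
        }
        return Collections.emptySet();
    }

    // All dictionary words within `bound` insertions/deletions/replacements of `word`.
    static Set<String> withinDistance(TrieNode root, String word, int bound) {
        Set<String> out = new HashSet<>();
        walk(root, word, 0, bound, new StringBuilder(), out);
        return out;
    }

    private static void walk(TrieNode node, String word, int i, int budget,
                             StringBuilder path, Set<String> out) {
        if (budget < 0) return;                                    // prune: bound exceeded
        if (i == word.length() && node.isWord) out.add(path.toString());
        if (i < word.length()) {
            // deletion: drop one character of the misspelled word, stay on this node
            walk(node, word, i + 1, budget - 1, path, out);
        }
        for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
            path.append(e.getKey());
            if (i < word.length()) {
                int cost = (e.getKey() == word.charAt(i)) ? 0 : 1; // exact match or replacement
                walk(e.getValue(), word, i + 1, budget - cost, path, out);
            }
            // insertion: emit a trie character without consuming one from the word
            walk(e.getValue(), word, i, budget - 1, path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }
}

The bound-0 pass is an ordinary trie lookup, and each later pass only runs if the previous one found nothing, exactly as described above.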
There is an advantage to organizing the dictionary as a trie, since a trie is just a particular form of Finite State Machine.
You can add sub-machines to it to handle common prefixes and suffixes, without massively expanding the dictionary.
Consider these words: nation, national, nationalism, nationalistic, nationalistical, nationalisticalism, ...
Such words may not be common, but they are not impossible.
A suffix trie handles them easily.
The same goes for prefixes like pre-, post-, un-, de-, in-, etc.
You can have a look at Jazzy, which is a Java spellchecker API.
You might also want to consider Fuzzy String Matching.
I am hoping to build a tree in which each node is an English word and a branch of leaves forms a sentence; that is, a sentence tree.
I was thinking of using a Trie, but I am having trouble inserting the nodes; I am not sure how to determine the level of the nodes. In a Trie all the nodes are characters, so it is possible to use something like the insert method below. But having words is different.
Does it make sense? I am open to other data structures as well. The goal is to create a dictionary/corpus which stores a bunch of English sentences. Users can use the first couple of words to look up the whole sentence. I am most proficient in Java, but I also know Python and R, so suggestions in those languages are welcome if they are easier for my purposes.
Thank you!
void insert(String key) {
    int level;
    int length = key.length();
    int index;
    TrieNode pCrawl = root;

    for (level = 0; level < length; level++) {
        index = key.charAt(level) - 'a';
        if (pCrawl.children[index] == null)
            pCrawl.children[index] = new TrieNode();
        pCrawl = pCrawl.children[index];
    }

    // mark last node as leaf
    pCrawl.isEndOfWord = true;
}
A little late, but maybe I can help a bit even now.
A trie sorts each level by unique key. Traditionally this is a character from a string, and the value stored at the final location is the string itself.
Tries can be much more than this. If I understand you correctly then you wish to sort sentences by their constituent words.
At each level of your trie you look at the next word and seek its position in the list of children, rather than looking at the next character. Unfortunately all the traditional implementations show sorting by character.
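To make that concrete, here is a minimal sketch of a trie keyed by whole words (my own illustration, not the repo code mentioned below); sentences are stored at the node where their last word ends:

import java.util.*;

// A trie whose edges are whole words rather than characters.
class SentenceTrie {
    private static class Node {
        Map<String, Node> children = new TreeMap<>(); // TreeMap keeps children sorted by word
        String sentence;                              // non-null if a sentence ends here
    }

    private final Node root = new Node();

    void add(String sentence) {
        Node cur = root;
        for (String word : sentence.split("\\s+")) {
            cur = cur.children.computeIfAbsent(word, w -> new Node());
        }
        cur.sentence = sentence;
    }

    // All stored sentences that start with the given words.
    List<String> withPrefix(String prefix) {
        Node cur = root;
        for (String word : prefix.split("\\s+")) {
            cur = cur.children.get(word);
            if (cur == null) return Collections.emptyList();
        }
        List<String> out = new ArrayList<>();
        collect(cur, out);
        return out;
    }

    private void collect(Node node, List<String> out) {
        if (node.sentence != null) out.add(node.sentence);
        for (Node child : node.children.values()) collect(child, out);
    }
}

With the sample sentences shown further down, withPrefix("This is a") would return the two "This is a ..." sentences, analogous to the sentencesWithPrefix output below.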
I have a solution for you, or rather two. The first is to use my Java source code trie. This sorts any object (in your case the string containing your sentence) by an Enumeration of integers. You would need to map your words to integers (store the words in a trie and give each a unique number), and then write an enumerator that returns the word integers for a sentence. That would work. (Do not use a hash for the word -> integer conversion, as two words can give the same hash.)
The second solution is to take my code and instead of comparing integers compare the words as strings. This would take more work, but looks entirely feasible. In fact, I have had a suspicion that my solution can be made more generic by replacing Enumeration of Integer with an Enumeration of Comparable. If you wish to do this, or collaborate in doing this I would be interested. Heck, I may even do it myself for the fun of it.
The resultant trie would have generic type
Trie<K extends Comparable, T>
and would store instances of T against a sequence of K. The coder would need to define a method
Iterator<K extends Comparable> getIterator(T t)
EDIT:
It was actually remarkably easy to generalise my code to use Comparable instead of Integer, although there are plenty of warnings that I am using the raw type Comparable rather than a parameterised Comparable<T>. Maybe I will sort those out another day.
SentenceSorter sorter = new SentenceSorter();
sorter.add("This is a sentence.");
sorter.add("This is another sentence.");
sorter.add("A sentence that should come first.");
sorter.add("Ze last sentence");
sorter.add("This is a sentence that comes somewhere in the middle.");
sorter.add("This is another sentence entirely.");
Then listing sentences by:
Iterator<String> it = sorter.iterator();
while (it.hasNext()) {
System.out.println(it.next());
}
gives
A sentence that should come first.
This is a sentence that comes somewhere in the middle.
This is a sentence.
This is another sentence entirely.
This is another sentence.
Note that the sentence split includes the full stop with the word, and that affects the sort. You could improve upon this.
We can show that we are sorting by words rather than characters:
it = sorter.sentencesWithPrefix("This is a").iterator();
while (it.hasNext()) {
System.out.println(it.next());
}
gives
This is a sentence that comes somewhere in the middle.
This is a sentence.
whereas
it = sorter.sentencesWithPrefix("This is another").iterator();
while (it.hasNext()) {
System.out.println(it.next());
}
gives
This is another sentence entirely.
This is another sentence.
Hope that helps - the code is all up on the above-mentioned repo, and freely available under Apache 2.
I solved this problem from codefights:
Note: Write a solution with O(n) time complexity and O(1) additional space complexity, since this is what you would be asked to do during a real interview.
Given an array a that contains only numbers in the range from 1 to a.length, find the first duplicate number for which the second occurrence has the minimal index. In other words, if there are more than 1 duplicated numbers, return the number for which the second occurrence has a smaller index than the second occurrence of the other number does. If there are no such elements, return -1.
int firstDuplicate(int[] a) {
    HashSet<Integer> z = new HashSet<>();
    for (int i : a) {
        if (z.contains(i)) {
            return i;
        }
        z.add(i);
    }
    return -1;
}
My solution passed all of the tests. However I don't understand how my solution met the O(1) additional space complexity requirement. The size of the hashtable is directly proportional to the input so I would think it is O(n) space complexity. Did codefights incorrectly test my algorithm or am I misunderstanding something?
Your code doesn’t have O(1) auxiliary space complexity, since that hash set can grow up to size n if given an array of all different elements.
My guess is that the online testing infrastructure didn’t check memory usage or otherwise checked memory usage incorrectly. If you want to meet the space constraints, you’ll need to go back and try solving the problem a different way.
As a hint, think about reordering the array elements.
If you are able to modify the incoming array, you can solve the problem in O(n) time without using extra memory.
public static int getFirstDuplicate(int... arr) {
    for (int i = 0; i < arr.length; i++) {
        int val = Math.abs(arr[i]);
        // a negative value at index val-1 means val has been seen before
        if (arr[val - 1] < 0)
            return val;
        // mark val as seen by flipping the sign of the element at index val-1
        arr[val - 1] = -arr[val - 1];
    }
    return -1;
}
This is technically incorrect, for two reasons.
Firstly, depending on the values in the array, there may be overhead when the ints become Integers and are added to the HashSet.
Secondly, while the additional memory is largely the overhead associated with a HashSet, that overhead is linearly proportional to the size of the set. (Note that I am not counting the elements in this, as they are already present in the array.)
Usually, these memory constraints are tested by setting a limit on the amount of memory the solution can use, and I would expect a solution like this to fall below that threshold.
The problem is to find the frequency of each element of an array of reals.
double[] a = new double[n];
int[] freq = new int[n];
I have come up with two solutions:
First solution O(n^2):
for (int i = 0; i < a.length; i++) {
    if (freq[i] != -1) {          // a[i] has not been counted as a duplicate yet
        freq[i] = 1;
        for (int j = i + 1; j < a.length; j++) {
            if (a[i] == a[j]) {
                freq[i]++;
                freq[j] = -1;     // mark a[j] as already counted
            }
        }
    }
}
Second solution O(n log n):
quickSort(a, 0, a.length - 1);
int j = 0;
freq[j] = 1;
for (int i = 0; i < a.length - 1; i++) {
    if (a[i] == a[i + 1]) {
        freq[j]++;
    } else {
        j = i + 1;
        freq[j] = 1;
    }
}
Is there any faster algorithm for this problem (O(n) maybe)?
Thank you in advance for any help you can provide.
Let me start by saying that checking doubles for exact equality is not good practice. For more details see: What every programmer should know about floating point.
You should use more robust double comparisons.
Now that we are done with that, let's face your problem.
You are dealing with a variation of the Element Distinctness Problem, with floating point numbers.
Generally speaking, under the algebraic tree computation model, one cannot do better than Omega(n log n) (references in this thread: https://stackoverflow.com/a/7055544/572670).
However, if you are going to stick with the exact-equality checks on doubles (please don't), you can use a stronger model and a hash table to achieve an O(n) solution, by maintaining a hash-table-based histogram (implemented as a HashMap<Double, Integer>) of the elements; when you are done, scan the histogram to read off the frequencies.
(Please don't do it)
There is a more complex way to achieve O(n) time based on hashing, even when dealing with floating points. It is based on adding each element to multiple entries of the hash table and using a hash function that maps a range of values [x - delta/2, x + delta/2) to the same hash value (so it hashes in chunks: [x1, x2) -> h1, [x2, x3) -> h2, [x3, x4) -> h3, ...). You then create a hash table where an element x is hashed to three values: x - 3/4*delta, x, and x + 3/4*delta.
This guarantees that when you later check an (approximately) equal value, it will find a match in at least one of the three places you put the element.
This is significantly more complex to implement, but it should work. A variant of this can be found in Cracking the Coding Interview, mathematical chapter, question 6. (Just make sure you look at the 5th edition; the answer in the 4th edition was wrong and was fixed in the newer edition.)
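For what it's worth, here is a rough sketch of that bucketing idea, as a variant that checks the neighbouring buckets at lookup time instead of inserting every value three times. The delta parameter and all names are mine; values closer than delta are treated as equal (note that such approximate equality is not transitive, so the grouping is heuristic):

import java.util.*;

class ApproxHistogram {
    private final double delta;
    private final Map<Long, List<double[]>> buckets = new HashMap<>(); // entry = {value, count}

    ApproxHistogram(double delta) { this.delta = delta; }

    private long bucketOf(double x) { return Math.round(x / delta); }

    void add(double x) {
        long b = bucketOf(x);
        // any value within delta of x must land in x's bucket or one of its two neighbours
        for (long nb = b - 1; nb <= b + 1; nb++) {
            for (double[] entry : buckets.getOrDefault(nb, Collections.emptyList())) {
                if (Math.abs(entry[0] - x) < delta) { entry[1]++; return; }
            }
        }
        buckets.computeIfAbsent(b, k -> new ArrayList<>()).add(new double[]{x, 1});
    }
}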
As another side note, you don't need to implement your own sort. Use Arrays.sort()
If your doubles have already been rounded appropriately and you are confident there isn't a representation error to worry about, you can use a hash map like
Map<Double, Long> freqCount = DoubleStream.of(reals).boxed()
.collect(Collectors.groupingBy(d -> d, Collectors.counting()));
This uses quite a bit of memory, but is O(n).
The alternative is to use the following as a first pass
NavigableMap<Double, Long> freqCount = DoubleStream.of(reals).boxed()
.collect(Collectors.groupingBy(d -> d, TreeMap::new, Collectors.counting()));
This will count all the values which are exactly the same, and you can use a grouping strategy to combine double values which are almost the same, but should be considered equal for your purposes. This is O(N log N)
Using a Trie would perform in pretty much linear time, because insertions are going to be extremely fast (roughly proportional to the length of the number's string representation).
Sorting and counting is definitely way too slow if all you need is the frequencies. Your friend is the trie: https://en.wikipedia.org/wiki/Trie
If you were using a Trie, you would convert each number into a String (simple enough in Java). The complexity of an insertion into a Trie varies slightly based on the implementation, but in general it will be proportional to the length of the String.
If you need an implementation of a Trie, I suggest looking into Robert Sedgwick's implementation for his Algorithm's course here:
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
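For illustration, counting with that class might look roughly like this (assuming its put/get API; converting each double to its string form is my own choice):

// Rough usage sketch: one string key per distinct value, count stored as the value.
TrieST<Integer> counts = new TrieST<>();
for (double x : a) {
    String key = Double.toString(x);
    Integer c = counts.get(key);
    counts.put(key, c == null ? 1 : c + 1);
}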
I am reading about the problem of Edit Distance between 2 strings.
It can be solved with dynamic programming using the edit distance recurrence. What I cannot understand is its usefulness.
First of all, how is this any different from knowing the longest common subsequence of the 2 strings?
If the idea is to pick the string with the smallest edit distance, you might as well use the maximum LCS among the strings. Right?
Additionally, when we actually write code to do the replacement, it would be similar to the following:
if (a.length == b.length) {
    for (int i = 0; i < a.length; i++) {
        a[i] = b[i];
    }
} else {
    a = new char[b.length];
    for (int i = 0; i < a.length; i++) {
        a[i] = b[i];
    }
}
I mean, just replace the characters. Is there any difference between always doing the assignment, versus first checking whether the characters are the same and only assigning if they differ? Aren't both constant-time operations?
What am I misunderstanding about this problem?
Edit Distance and LCS are related by a simple formula if no substitution is allowed in editing (or if substitution is twice as expensive as insertion or deletion):
ed(x,y) = x.length + y.length - 2*lcs(x,y).length
If substitution is a separate unit-cost operation, then ED can be less than that. For example, with x = "abc" and y = "abd", the LCS is "ab", so the no-substitution edit distance is 3 + 3 - 2*2 = 2 (delete 'c', insert 'd'), while allowing substitution gives distance 1. This is important in practice since we want a way to create shorter diff files: not just asymptotically bounded up to a constant factor, but actually the smallest possible ones.
Edit: shorter diff files are probably not a concern here; the diffs produced with and without substitution won't differ much in size. There are more interesting applications, like ranking correction suggestions in a spell checker (this is based on a comment by @nhahtdh below).
Edit distance is quite different from LCS.
The edit distance is the minimal number of edit operations you need to transform one string into the other. A very popular example is the Levenshtein distance, which has the edit operations:
insert one character
delete one character
replace one character
All of these operations are weighted with a cost of 1.
But there are many other operations and cost functions possible.
E.g. you could also allow the operation: swap two adjacent characters.
Edit distance is used, for example, to align DNA sequences (or protein sequences).
If the strings have lengths n and m, the complexity is:
time: O(n*m)
space: O(min(n,m))
It can become worse for complex cost functions.
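For reference, here is a minimal sketch (my own, not from any library) of that DP with unit costs, using two rows so the space matches the O(min(n,m)) bound above:

// Levenshtein distance with unit-cost insert/delete/replace.
static int editDistance(String a, String b) {
    if (a.length() < b.length()) { String t = a; a = b; b = t; }  // make b the shorter string
    int[] prev = new int[b.length() + 1];
    int[] cur  = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;            // distance from the empty prefix
    for (int i = 1; i <= a.length(); i++) {
        cur[0] = i;
        for (int j = 1; j <= b.length(); j++) {
            int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
            cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
        }
        int[] t = prev; prev = cur; cur = t;                      // reuse the two rows
    }
    return prev[b.length()];
}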
I'm trying to find an efficient algorithm for identifying a recurring sequence of characters. Let's say the sequence must be a minimum of 3 characters, and the algorithm should only return the maximum-length sequence. The dataset could potentially be thousands of characters. Also, I only want to know about the sequence if it's repeated, let's say, 3 times.
As an example:
ASHEKBSHEKCSHEDSHEK
"SHEK" occurs 3 times and would be identified. "SHE" occurs 4 times, but isn't identified since "SHEK" is the maximum length sequence that contains that sequence.
Also, no "seed" sequence is fed to the algorithm, it must find them autonomously.
Thanks in advance,
j
Try to create a suffix array for the string.
Online builder: http://allisons.org/ll/AlgDS/Strings/Suffix/
Check the beginnings of consecutive entries in the suffix array for a common prefix.
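As a rough sketch of that idea (my own names, with a naive O(n^2 log n) suffix sort, which is fine for a few thousand characters), the longest substring occurring at least k times can be read off from windows of k consecutive suffixes:

import java.util.*;

class RepeatFinder {
    static String longestRepeated(String s, int k) {
        int n = s.length();
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        // sort suffixes lexicographically (naive comparison for brevity)
        Arrays.sort(sa, (x, y) -> s.substring(x).compareTo(s.substring(y)));

        String best = "";
        for (int i = 0; i + k - 1 < n; i++) {
            // in a sorted suffix array, the common prefix of a window of k consecutive
            // suffixes equals the common prefix of its first and last entries
            int len = commonPrefix(s, sa[i], sa[i + k - 1]);
            if (len > best.length()) best = s.substring(sa[i], sa[i] + len);
        }
        return best;
    }

    private static int commonPrefix(String s, int a, int b) {
        int len = 0;
        while (a + len < s.length() && b + len < s.length()
                && s.charAt(a + len) == s.charAt(b + len)) len++;
        return len;
    }
}

For the example above, longestRepeated("ASHEKBSHEKCSHEDSHEK", 3) yields "SHEK".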
Looks like a job for the Rabin-Karp algorithm; see its Wikipedia entry.
If you consider that there are on the order of n(n+1)/2 possible substrings, and you aren't looking for simply a match but for the substring with the most matches, I think your algorithm will have a terrible theoretical complexity if it is to be correct and complete.
However, you might get some practical speed using a Trie. The algorithm would go something like this:
For each offset into the string...
For each length of sub-string starting at that offset...
Insert it into the trie. Each node in the trie has a data value (a "count" integer) that you increment when you visit the node.
Once you've built up the trie to model your data, delete all the sub-trees whose root counts are below some threshold (3 in your case).
Those remaining paths should be few enough in number for you to efficiently sort-and-pick the ones you want.
I suggest this as a starting point because the Trie is built to manipulate common prefixes and, as a side-effect, will compress your dataset.
A personal choice I would make would be to identify the location of the sub-strings as a separate process after identifying the ones I want. Otherwise you are going to store every substring location, and that will explode your memory. Your computation is already pretty complex.
Hope this makes some sense! Good luck!
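If it helps, here is a rough sketch of that counting-trie approach (names are mine): insert every suffix into a character trie, bumping a count at each node, so each node's count is the number of occurrences of the substring spelled on the path to it. It builds O(n^2) nodes, which echoes the memory caveat above.

import java.util.*;

class CountingTrie {
    Map<Character, CountingTrie> children = new HashMap<>();
    int count;

    static String longestWithAtLeast(String s, int minOccurrences) {
        CountingTrie root = new CountingTrie();
        for (int i = 0; i < s.length(); i++) {          // one insertion pass per offset
            CountingTrie cur = root;
            for (int j = i; j < s.length(); j++) {
                cur = cur.children.computeIfAbsent(s.charAt(j), c -> new CountingTrie());
                cur.count++;
            }
        }
        return deepest(root, new StringBuilder(), "", minOccurrences);
    }

    // depth-first search for the deepest node whose count meets the threshold;
    // once a child's count drops below the threshold, its whole subtree is pruned
    private static String deepest(CountingTrie node, StringBuilder path, String best, int min) {
        for (Map.Entry<Character, CountingTrie> e : node.children.entrySet()) {
            if (e.getValue().count < min) continue;
            path.append(e.getKey());
            if (path.length() > best.length()) best = path.toString();
            best = deepest(e.getValue(), path, best, min);
            path.deleteCharAt(path.length() - 1);
        }
        return best;
    }
}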
Consider the following algorithm, where:
str is the string of events
T(i) is the suffix tree for the substring str(0..i).
T(i+1) is quickly obtained from T(i), for example using an online suffix-tree construction such as Ukkonen's algorithm.
For each character position i in the input string str, traverse a path starting at the root of T(i), along edges labeled with successive characters from the input, beginning at position i + 1. This path determines a repeating string. If the path is longer than the previously found paths, record the new max length and the position i + 1. Update the suffix tree with str[i+1] and repeat for the next position.
Something like this pseudocode:
max.len = 0
max.off = -1
T = update_suffix_tree(nil, str[0])
for i = 1 to len(str)
    r = root(T)
    j = i + 1
    while j < len(str) and r.child(str[j]) != nil
        r = r.child(str[j])
        ++j
    if j - i - 1 > max.len
        max.len = j - i - 1
        max.off = i + 1
    T = update_suffix_tree(T, str[i+1])
In the kth iteration, the inner while loop executes at most n - k times and the suffix tree update is O(k), hence the loop body's complexity is O(n); it is executed n - 1 times, therefore the whole algorithm's complexity is O(n^2).