I'm trying to find an efficient algorithm for identifying a recurring sequence of characters. Let's say the sequence must be a minimum of 3 characters, and only the maximum-length sequence should be returned. The dataset could be thousands of characters. Also, I only want to know about a sequence if it's repeated, let's say, 3 times.
As an example:
ASHEKBSHEKCSHEDSHEK
"SHEK" occurs 3 times and would be identified. "SHE" occurs 4 times, but isn't identified since "SHEK" is the maximum length sequence that contains that sequence.
Also, no "seed" sequence is fed to the algorithm, it must find them autonomously.
Try building a suffix array for the string.
Online builder: http://allisons.org/ll/AlgDS/Strings/Suffix/
Then check how far the beginnings of consecutive entries in the suffix array match: a substring repeated k times shows up as k adjacent suffixes sharing a common prefix.
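For a dataset of a few thousand characters, a naive version of this is good enough: materialize every suffix, sort them, and compare each suffix with the one minRepeats - 1 positions later in the sorted order. Here's a minimal Java sketch of that idea (class and method names are my own; overlapping occurrences are counted as repeats):

import java.util.*;

public class RepeatedSequence {

    static String longestRepeated(String s, int minRepeats, int minLen) {
        String[] suffixes = new String[s.length()];
        for (int i = 0; i < s.length(); i++)
            suffixes[i] = s.substring(i);
        Arrays.sort(suffixes);                          // the suffix array, as explicit strings

        String best = "";
        // A substring repeated k times shows up as k adjacent suffixes sharing a prefix,
        // so check the common prefix of suffixes i and i + minRepeats - 1.
        for (int i = 0; i + minRepeats - 1 < suffixes.length; i++) {
            int lcp = commonPrefix(suffixes[i], suffixes[i + minRepeats - 1]);
            if (lcp >= minLen && lcp > best.length())
                best = suffixes[i].substring(0, lcp);
        }
        return best;
    }

    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        System.out.println(longestRepeated("ASHEKBSHEKCSHEDSHEK", 3, 3));   // prints SHEK
    }
}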
Looks like a job for Rabin-Karp (see the Wikipedia entry).
If you consider that there are roughly n(n+1)/2 possible substrings of a length-n string, and you aren't looking for a simple match but for the substring with the most matches, I think your algorithm will have terrible theoretical complexity if it is to be correct and complete.
However, you might get some practical speed using a Trie. The algorithm would go something like this:
For each offset into the string...
For each sub-string length...
Insert the sub-string into the trie. Each node in the trie has a data value (a "count" integer) that you increment each time you visit the node.
Once you've built up the trie over your data, delete all the sub-trees whose root counts are below some threshold (3 in your case).
Those remaining paths should be few enough in number for you to efficiently sort-and-pick the ones you want.
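Here's a rough Java sketch of that counting-trie idea (the class name and the maxLen cap on sub-string length are my own additions, the latter just to keep memory bounded):

import java.util.*;

public class SubstringTrie {

    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count;                                      // how many sub-strings pass through here
    }

    static String longestFrequent(String s, int minLen, int minCount, int maxLen) {
        Node root = new Node();
        for (int i = 0; i < s.length(); i++) {          // for each offset...
            Node n = root;
            for (int j = i; j < s.length() && j - i < maxLen; j++) {   // ...and each length
                n = n.children.computeIfAbsent(s.charAt(j), k -> new Node());
                n.count++;                              // increment the count on every visit
            }
        }
        return findBest(root, new StringBuilder(), minLen, minCount, "");
    }

    // Depth-first walk that skips sub-trees whose counts fall below the threshold.
    static String findBest(Node n, StringBuilder path, int minLen, int minCount, String best) {
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            if (e.getValue().count < minCount) continue;               // prune below threshold
            path.append(e.getKey());
            if (path.length() >= minLen && path.length() > best.length())
                best = path.toString();
            best = findBest(e.getValue(), path, minLen, minCount, best);
            path.deleteCharAt(path.length() - 1);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestFrequent("ASHEKBSHEKCSHEDSHEK", 3, 3, 10));   // prints SHEK
    }
}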
I suggest this as a starting point because the Trie is built to manipulate common prefixes and, as a side-effect, will compress your dataset.
A personal choice I would make would be to identify the location of the sub-strings as a separate process after identifying the ones I want. Otherwise you are going to store every substring location, and that will explode your memory. Your computation is already pretty complex.
Hope this makes some sense! Good luck!
Consider the following algorithm, where:
str is the string of events
T(i) is the suffix tree for the substring str(0..i).
T(i+1) is quickly obtained from T(i), for example using this algorithm
For each character position i in the input string str, traverse a path starting at the root of T(i), along edges labeled with successive characters from the input, beginning at position i + 1. This path determines a repeated string. If the path is longer than the previously found paths, record the new maximum length and the position i + 1.
Update the suffix tree with str[i+1] and repeat for the next position.
Something like this pseudocode:
max.len = 0
max.off = -1
T = update_suffix_tree(nil, str[0])
for i = 1 to len(str)
    r = root(T)
    j = i + 1
    while j < len(str) and r.child(str[j]) != nil
        r = r.child(str[j])
        ++j
    if j - i - 1 > max.len
        max.len = j - i - 1
        max.off = i + 1
    T = update_suffix_tree(T, str[i+1])
In the k-th iteration, the inner while loop is executed for at most n - k iterations and the suffix tree construction step is O(k), hence the loop body's complexity is O(n) and it's executed n - 1 times, therefore the whole algorithm's complexity is O(n^2).
I am working on a spellchecker, and one of the methods I use to suggest corrections is inserting multiple characters into the word. This allows words like exmpl to be corrected to example.
Here's the actual code:
public static Set<String> multiInsert(String word, int startIndex) {
    Set<String> result = new HashSet<>();
    // List of characters to insert
    String correction = "azertyuiopqsdfghjklmwxcvbnùûüéèçàêë";
    for (int i = startIndex; i <= word.length(); i++) {
        for (int j = i + 1; j <= word.length(); j++) {
            for (int k = 0; k < correction.length(); k++) {
                String newWord = word.substring(0, j) + correction.charAt(k) + word.substring(j);
                result.addAll(multiInsert(newWord, startIndex + 2));
                if (dico.contains(newWord)) result.add(newWord);
            }
        }
    }
    return result;
}
The problem with this function is that it takes A LOT of time to process especially when the word is long or when I have too many words to correct. Is there any better way to implement this function or optimize it?
What's making it slow is you're testing for strings that are not in the dictionary.
There are lots more possible misspellings than there are words in the dictionary.
You need to be guided by the dictionary.
This is the general spelling correction problem.
I've programmed it several times.
In a nutshell, the method is to store the dictionary as a trie, and do bounded depth-first walk of the trie.
At each step, you keep track of the distance between the word in the trie and the original word.
Whenever that distance exceeds the bound, you prune the search.
So you do it in cycles, increasing the bound each time.
First you do it with a bound of 0, so it will only find an exact match.
That is equivalent to ordinary trie search.
If that did not yield a match, do the walk again with a bound of 1.
That will find all dictionary words that are distance 1 from the original word.
If that didn't yield any, increase the bound to 2, and so on.
(What constitutes an increment of distance is any transformation you choose, like insertion, deletion, replacement, or more general re-writings.)
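Here's a minimal Java sketch of the trie plus bounded depth-first walk described above (class and method names are my own; the cost model is plain insert/delete/substitute at cost 1 each):

import java.util.*;

public class TrieSpell {

    static class Node {
        Map<Character, Node> children = new HashMap<>();
        String word;                                    // non-null if a dictionary word ends here
    }

    private final Node root = new Node();

    void add(String w) {
        Node n = root;
        for (char c : w.toCharArray())
            n = n.children.computeIfAbsent(c, k -> new Node());
        n.word = w;
    }

    // Iterative deepening: bound 0 first (exact match), then 1, 2, ... until something is found.
    Set<String> correct(String input, int maxBound) {
        for (int bound = 0; bound <= maxBound; bound++) {
            Set<String> found = new HashSet<>();
            walk(root, input, 0, 0, bound, found);
            if (!found.isEmpty()) return found;
        }
        return Collections.emptySet();
    }

    // Bounded depth-first walk: dist is the edit distance spent so far; prune past the bound.
    private void walk(Node n, String s, int i, int dist, int bound, Set<String> out) {
        if (dist > bound) return;                                       // prune the search
        if (n.word != null && dist + (s.length() - i) <= bound)
            out.add(n.word);                                            // delete the leftover input
        if (i < s.length())
            walk(n, s, i + 1, dist + 1, bound, out);                    // delete s[i]
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            walk(e.getValue(), s, i, dist + 1, bound, out);             // insert e.getKey()
            if (i < s.length()) {
                int cost = (e.getKey() == s.charAt(i)) ? 0 : 1;         // match or substitute
                walk(e.getValue(), s, i + 1, dist + cost, bound, out);
            }
        }
    }

    public static void main(String[] args) {
        TrieSpell t = new TrieSpell();
        for (String w : new String[] { "example", "exam", "ample" }) t.add(w);
        System.out.println(t.correct("exmpl", 3));      // prints [example] (found at bound 2)
    }
}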
The performance is bounded by the true distance times the dictionary size.
Short of that, it is exponential in the true distance. Since each walk costs a factor times the previous walk, the time is dominated by the final walk, so the prior walks do not add much time.
There is an advantage to organizing the dictionary as a trie, since a trie is just a particular form of Finite State Machine.
You can add sub-machines to it to handle common prefixes and suffixes, without massively expanding the dictionary.
Consider these words: nation, national, nationalism, nationalistic, nationalistical, nationalisticalism, ...
Such words may not be common, but they are not impossible.
A suffix trie handles them easily.
Similarly prefixes like pre-, post-, un-, de-, in-, etc.
You can have a look at Jazzy which is a Java spellchecker API.
You might also want to consider Fuzzy String Matching.
I want to count the number of occurrences of a particular phrase, for example "stackoverflow forums", in a set of documents. Suppose D represents the set of documents containing both terms.
Now, suppose I have the following data structure:
A[numTerms][numMatchedDocuments][numOccurInADocument]
where numMatchedDocuments is the size of D and numOccurInADocument indexes the occurrences of a particular term in a particular document, for example:
A[stackoverflow][document1][occurrence1] = 3;
means the term "stackoverflow" occurs in "document1" and its first occurrence is at position 3.
Then I pick the term that occurs the least and loop over all its positions to check whether "forums" occurs at position + 1 relative to each position of "stackoverflow". In other words, if I find "forums" at position 4 then that is a phrase and I've found a match for it.
The matching is straightforward per document and runs reasonably fast, but when the number of documents exceeds 2,000,000 it gets very slow. I've distributed it over cores and it gets faster, of course, but I wonder if there is an algorithmically better way of doing this.
Pseudo-code:
boolean docPhrase = true;
int numOfTerms = 2;   // 0 for "stackoverflow" and 1 for "forums"
for (int d = 0; d < D.size(); d++) {   // D is a set containing the matched documents
    int minId = getTheLeastOccuringTerm();
    for (int i = 0; i < A[minId][d].length; i++) {   // for every position of the least occurring term
        for (int t = 0; t < numOfTerms; t++) {       // for every term
            int id = BinarySearch(A[t][d], A[minId][d][i] - minId + t);
            if (id < 0) docPhrase = false;
        }
    }
}
As I mentioned in comments, Suffix Array can solve this sort of problem. I answered a similar question ( Fastest way to search a list of names in C# ) with a simple c# implementation of a Suffix Array.
The basic idea is that you have an array of index pairs that point to a document index and a position within that document. Each index pair represents the string that starts at that point in the document and continues to the end of the document, but the actual documents and their contents exist only once in your original store. The Suffix Array is just an array of these index pairs, with a pair for every position in every document. You then sort the Suffix Array in the order of the text the pairs point to. Once sorted, you can very quickly find any phrase among any of the documents with a simple Binary Search on the Suffix Array. Constructing (mainly sorting) the Suffix Array can be time consuming, but once constructed it is very fast to search. It's fairly easy on memory since the actual document contents exist only once.
It would be trivial to extend it to returning counts of phrase matches within each document.
This is a little different than the classic description of a Suffix Array where they are usually talking about the Suffix Array operating over one single, very large string. But the changes to make it work for an array of strings/documents is not that large, although it can increase the amount of memory consumed by the Suffix Array depending on the maximum number of documents and the maximum document length, and how you encode the index pairs.
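Here's a small Java sketch of that multi-document variant (names are my own; it sorts explicit (document, position) index pairs with a lazy suffix comparator, which is fine as a demonstration but not the fastest way to build a Suffix Array):

import java.util.*;

public class MultiDocSuffixArray {

    private final String[] docs;
    private final List<int[]> entries = new ArrayList<>();   // each entry = {docIndex, position}

    public MultiDocSuffixArray(String[] docs) {
        this.docs = docs;
        for (int d = 0; d < docs.length; d++)
            for (int p = 0; p < docs[d].length(); p++)
                entries.add(new int[] { d, p });
        entries.sort(this::compareSuffixes);                  // sort by the text the pairs point to
    }

    // Lexicographic comparison of the suffixes the two index pairs refer to.
    private int compareSuffixes(int[] a, int[] b) {
        String da = docs[a[0]], db = docs[b[0]];
        int i = a[1], j = b[1];
        while (i < da.length() && j < db.length()) {
            int c = Character.compare(da.charAt(i), db.charAt(j));
            if (c != 0) return c;
            i++; j++;
        }
        return Integer.compare(da.length() - a[1], db.length() - b[1]);
    }

    // Compare a suffix against the phrase, treating a full prefix match as equal.
    private int compareToPhrase(int[] e, String phrase) {
        String doc = docs[e[0]];
        int i = e[1];
        for (int k = 0; k < phrase.length(); k++, i++) {
            if (i == doc.length()) return -1;                 // the suffix ran out first
            int c = Character.compare(doc.charAt(i), phrase.charAt(k));
            if (c != 0) return c;
        }
        return 0;                                             // phrase is a prefix of this suffix
    }

    // Count phrase occurrences across all documents: binary search for the start of the
    // matching range, then walk forward while the prefix still matches.
    public int count(String phrase) {
        int lo = 0, hi = entries.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareToPhrase(entries.get(mid), phrase) < 0) lo = mid + 1;
            else hi = mid;
        }
        int n = 0;
        while (lo < entries.size() && compareToPhrase(entries.get(lo), phrase) == 0) { n++; lo++; }
        return n;
    }

    public static void main(String[] args) {
        MultiDocSuffixArray sa = new MultiDocSuffixArray(new String[] {
            "stackoverflow forums are busy", "I read the stackoverflow forums daily" });
        System.out.println(sa.count("stackoverflow forums"));   // prints 2
    }
}

Grouping the matching range by document index would give the per-document counts mentioned above.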
My college is ending, so I have started preparing for job interviews, and I came across this question while preparing:
You have a set of 10000 ascii strings (loaded from a file)
A string is input from stdin.
Write pseudocode that returns (to stdout) the subset of strings in (1) that contain the same distinct characters (regardless of order) as the input in (2). Optimize for time.
Assume that this function will need to be invoked repeatedly. Initializing the string array once and storing it in memory is okay.
Please avoid solutions that require looping through all 10000 strings.
Can anyone provide a general pseudocode/algorithm for how to solve this problem? I am scratching my head thinking about the solution. I am mostly familiar with Java.
Here is an O(1) algorithm!
Initialization:
For each string, sort its characters and remove duplicates - eg "trees" becomes "erst"
Load the sorted string into a trie, adding a reference to the original word to the list of words stored at each node traversed
Search:
Sort the input string the same way as the source strings were sorted during initialization
Follow the trie using the sorted characters; at the end node, return all the words referenced there
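The same idea can be sketched with a plain HashMap keyed by the whole sorted, de-duplicated signature instead of walking a trie character by character (a simplification of the answer above; names are my own, and references are kept only under the full signature):

import java.util.*;

public class SignatureIndex {

    private final Map<String, List<String>> index = new HashMap<>();

    // "trees" -> "erst": sorted, de-duplicated characters.
    private static String signature(String s) {
        SortedSet<Character> set = new TreeSet<>();
        for (char c : s.toCharArray()) set.add(c);
        StringBuilder sb = new StringBuilder();
        for (char c : set) sb.append(c);
        return sb.toString();
    }

    // One-time initialization over the 10000 strings.
    public SignatureIndex(List<String> words) {
        for (String w : words)
            index.computeIfAbsent(signature(w), k -> new ArrayList<>()).add(w);
    }

    // Per query: compute the input's signature and return whatever is stored under that key.
    public List<String> lookup(String input) {
        return index.getOrDefault(signature(input), Collections.emptyList());
    }

    public static void main(String[] args) {
        SignatureIndex idx = new SignatureIndex(Arrays.asList("trees", "steer", "reset", "tree"));
        System.out.println(idx.lookup("terse"));   // prints [trees, steer, reset] ("tree" has no 's')
    }
}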
They say optimise for time, so I guess we're safe to abuse space as much as we want.
In that case, you could do an initial pass over the 10000 strings and build a mapping from each unique character present in them to the set of indices of the strings containing it. That way you can ask the mapping: which strings contain character 'x'? Call this mapping M. (Order: O(nm), where n is the number of strings and m is their maximum length.)
To optimise in time again, you could reduce the stdin input string to unique characters, and put them in a queue, Q. (order O(p), p is the length of the input string)
Start a new set, say S. Then let S = M[Q.extractNextItem], the set of strings containing the first unique character.
Now you could loop over the rest of the unique characters and find which sets contain all of them.
While (Q is not empty) {   // loops O(p) times
    S = S intersect M[Q.extractNextItem]   // close to O(1), depending on your set implementation
}
voila, return S.
Total time: O(mn + p + p*1) = O(mn + p)
(Still early in the morning here, I hope that time analysis was right)
As Bohemian says, a trie is definitely the way to go!
This sounds like the way an address book lookup would work on a phone. Start punching digits in, and then filter the address book based on the number representation as well as any of the three (or actually more if using international chars) letters that number would represent.
How can I design an algorithm which can return the 10 most frequently used words in a document in O(n) time? If additional space can be used.
I can parse the document and place the words in a hash map with their counts. But then I have to sort the values to get the most frequent ones, and I'd need a mapping from value back to key, which can't be maintained directly since values may repeat.
So how can I solve this?
Here's a simple algorithm:
Read one word at a time through the document. O(n)
Build a HashTable using each word. O(n)
Use the word as the key. O(1)
Use the number of times you've seen this word as the value. O(1)
(e.g. If you are adding the key to the hashtable, then value is 1; if you already have the key in the hashtable, increase its associated value by 1) O(1)
Create a pair of arrays of size 10 (i.e. String words[10] / int count[10], or use a Pair class); use this pair to keep track of the 10 most frequent words and their word counts in the next step. O(1)
Iterate through the completed HashTable once: O(n)
If the current word has a higher word count than an entry in the array pair, replace that particular entry and shift everything down 1 slot. O(1)
Output the pair of arrays. O(1)
O(n) Run-time.
O(n) Storage for the HashTable + Arrays
(Side note: you can just think of a HashTable as a dictionary: a way to store key:value pairs where keys are unique. Technically, Java's HashMap is unsynchronized while Hashtable is synchronized.)
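A minimal Java sketch of those steps (the method name topTen and the small sorted buffer are my own choices; each buffer update touches at most 10 slots, so it stays O(1) per unique word):

import java.util.*;

public class TopWords {

    public static List<Map.Entry<String, Integer>> topTen(Iterable<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words)                                // O(n): count every word
            counts.merge(w, 1, Integer::sum);

        // Fixed-size buffer of the 10 best entries, kept sorted by count descending.
        List<Map.Entry<String, Integer>> top = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int pos = top.size();
            while (pos > 0 && top.get(pos - 1).getValue() < e.getValue()) pos--;
            if (pos < 10) {
                top.add(pos, e);
                if (top.size() > 10) top.remove(10);
            }
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("a b a c a b d".split(" "));
        System.out.println(topTen(doc));                      // e.g. [a=3, b=2, c=1, d=1]
    }
}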
It may be done in O(n) if you use the correct data structure.
Consider a Node, consisting of 2 things:
A counter (initially set to 0).
An array of 255 (or whatever number of characters) pointers to Node. All the pointers are initially set to NULL.
Create a root node. Define a "current" Node pointer, set it to root node initially.
Then walk through all the characters of the document and do the following:
If the next character is not a space - pick the appropriate pointer from the array of the current node. If it's NULL - allocate a new Node. The current Node pointer is updated.
If it's a space (or whatever word delimiter) - increment the counter of the "current" Node, then reset the "current" Node pointer to point to the root node.
This way you build a tree in O(n). Every element (both inner nodes and leaves) denotes a specific word, together with its counter.
Then traverse the tree to find the node with the largest counter. That's also O(n), since the number of elements in the tree is not bigger than O(n).
Update:
The last step is not mandatory. Actually the most common word may be updated during the character processing.
The following is a pseudo-code:
struct Node
{
    size_t m_Counter;
    Node* m_ppNext[255];
    Node* m_pPrev;

    Node(Node* pPrev) : m_Counter(0)
    {
        m_pPrev = pPrev;
        memset(m_ppNext, 0, sizeof(m_ppNext));
    }
    ~Node()
    {
        for (int i = 0; i < _countof(m_ppNext); i++)
            if (m_ppNext[i])
                delete m_ppNext[i];
    }
};

Node root(NULL);
Node* pPos = &root;
Node* pBest = &root;
char c;

while (0 != (c = GetNextDocumentCharacter()))
{
    if (c == ' ')
    {
        if (pPos != &root)
        {
            pPos->m_Counter++;
            if (pBest->m_Counter < pPos->m_Counter)
                pBest = pPos;
            pPos = &root;
        }
    }
    else
    {
        Node*& pNext = pPos->m_ppNext[c - 1];
        if (!pNext)
            pNext = new Node(pPos);
        pPos = pNext;
    }
}

// pBest points to the most common word. Using pBest->m_pPrev we can walk back
// through its characters in reverse order.
The fastest approach is to use a radix tree. You can store the count of words at the leaf of the radix tree. Keep a separate list of the 10 most frequent words and their occurrence count along with a variable that stores the threshold value necessary to get into this list. Update this list as items are added to the tree.
I would use a ArrayList and a HashTable.
Here is the algorithm I am thinking of,
Loop through all the words in the document:
    if (HashTable.contains(word))
        increment count for that word in the HashTable;
    else
        ArrayList.add(word);
        HashTable.add(word);
        word count in HashTable = 1;
After looping through the whole document:
    Loop through the ArrayList<word>:
        Retrieve the word count for that word from the HashTable;
        Keep a list of the top 10 words;
The running time should be O(n) to construct the HashTable and ArrayList.
Making the top 10 list should be O(m), where m is the number of unique words.
O(n+m) where n>>m --> O(n)
Maintaining a map of (word,count) will be O(n).
Once the map is built, iterate over the keys and retrieve ten most common keys.
O(n) + O(n)
But I'm not entirely happy with this solution because of the extra memory needed.
% java BinarySearch 1.txt < 2.txt
If I have two text files (1.txt and 2.txt), where 2.txt contains values not in 1.txt, how does binary search work to give us those values? If the arguments to BinarySearch are a key and a sorted array, I don't see how this applies.
Here is the code for the binary search:
import java.util.Arrays;

public class BinarySearch {

    // precondition: array a[] is sorted
    public static int rank(int key, int[] a) {
        int lo = 0;
        int hi = a.length - 1;
        while (lo <= hi) {
            // Key is in a[lo..hi] or not present.
            int mid = lo + (hi - lo) / 2;
            if      (key < a[mid]) hi = mid - 1;
            else if (key > a[mid]) lo = mid + 1;
            else return mid;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] whitelist = In.readInts(args[0]);
        Arrays.sort(whitelist);

        // read key; print if not in whitelist
        while (!StdIn.isEmpty()) {
            int key = StdIn.readInt();
            if (rank(key, whitelist) == -1)
                StdOut.println(key);
        }
    }
}
According to Wikipedia, and from what I'd understood: A binary search or half-interval search algorithm finds the position of a specified value (the input "key") within a sorted array.
So how is it working to find uncommon values in two text files?
As I understand the question, you want to know how this program works when it (correctly) determines an entry in 2.txt is NOT in 1.txt. That has a pretty simple answer.
This algorithm sorts the array whitelist. It initializes the lo pointer to element 0 and the hi pointer to element whitelist.length - 1, which is the last element in whitelist. The array segment is the whole array for the first iteration. The array must be ordered or sorted for this to work.
For each successive iteration, if the value is not found in the middle of the current array segment, the logic determines whether the value has to be in the half-segment above the middle or the half-segment below the middle. That half-segment, excluding the old middle element, becomes the new search segment for the next iteration. The algorithm adjusts the hi and lo pointers to close in, one half of the remaining segment of the array at a time, on where the searched for value has to be, if it is in the array.
Eventually, for a search value not in the array, hi and lo (and therefore mid) will converge to the same single element and it will be the last segment of the array searched, a segment of just one element. If that element doesn't have the search value, then, depending on the search value and that element's value either hi will become mid - 1 or lo will become mid + 1. Either way, the while continuation condition will become false because lo <= hi is no longer true. The new remaining search segment now has a negative size. This can be interpreted as meaning that if a return doesn't occur before the while terminates, then the search didn't find the value in any previous segment and there is no remaining segment to search. Therefore, the search value can't be in the array.
The implementation given in this question works. I've tested it using the Princeton.edu stdlib that contains the In and StdIn classes used here. I compiled it and ran it from a command line, piping in the second text file on stdin. I don't think I would implement this application like this except as a demonstration of binary search, perhaps for a class or for examining the technique.
Here is some further background on why a binary search is used. The reason to use a binary search is to obtain a worst case 2*logBase2(n) execution complexity with an average 1.5*logBase2(n) complexity. A binary search for a value not in the array will always be the worst case of 2*logBase2(n) compares.
A binary search is vastly superior to a linear search that just starts at one end of the array and checks every element until it finds a match or reaches the end of the array. The average search is about n/2 compares, depending on the distribution of values in the array. A linear search for a value not in the array will always have the worst case of n compares.
In a binary search, each pair of compares eliminates half the possibilites. An array of 1024 entries can be searched in a maximum of 20 compares. Compare that to a 1024 maximum for a linear search. Squaring the size of the searched array only doubles the number of compares for a binary search. A binary search can search an array with 1,048,576 entries with a maximum of 40 compares. Compare that to a linear search maximum of 1,048,576.
The basic binary search algorithm given in the question can be very useful with objects that inherit from a sorted or ordered collection and where you have to implement your own compare and search method to overload the inherited methods. As long as you have a compare that determines less, greater and equal amongst objects, and the collection is ordered or sorted according to that compare, you can use this basic binary search algorithm to search the collection.
while (!StdIn.isEmpty()) {              // while the input file (or standard input) isn't empty
    int key = StdIn.readInt();          // get the next integer
    if (rank(key, whitelist) == -1)     // use binary search to look for that integer
        StdOut.println(key);            // print it when it's not found
}
The code executes N binary searches, where N is the number of integers in the standard-input file.
The complexity is O(n log n) + O(m log n), where n and m are the sizes of the two files: n for the whitelist and m for the other. This works well if the whitelist is much smaller than the other file. If not, it would probably be better to sort both files and compare them using something like the merge step of merge sort.
I think building a hash table would be better than a modified merge-sort approach for comparing large files containing only ints. All you have to do is read the first file (which the program already does) and, while reading, put the ints into a hash table. Then read the next file one int at a time, as the loop in main does, and check whether the table contains that int. I have assumed a well-behaved hash table, so you might need to handle collisions.
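A minimal sketch of that approach in plain Java (no Princeton stdlib; the class name is my own), invoked the same way as the original, e.g. java HashWhitelist 1.txt < 2.txt:

import java.io.*;
import java.util.*;

public class HashWhitelist {

    public static void main(String[] args) throws IOException {
        // Load the whitelist (1.txt) into a hash set: O(n) expected.
        Set<Integer> whitelist = new HashSet<>();
        try (Scanner file = new Scanner(new File(args[0]))) {
            while (file.hasNextInt()) whitelist.add(file.nextInt());
        }
        // Read the second file from stdin and print every int not in the whitelist:
        // O(1) expected per lookup instead of O(log n) per binary search.
        Scanner stdin = new Scanner(System.in);
        while (stdin.hasNextInt()) {
            int key = stdin.nextInt();
            if (!whitelist.contains(key))
                System.out.println(key);
        }
    }
}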