I have to make a Java program which finds all repeating sub-strings of length n in a given String. The input string is extremely long, and a brute-force approach takes too much time.
I already tried:
Presently I am finding each sub-string separately and checking for repetitions of that sub-string using the KMP algorithm. This, too, is taking too much time.
What is a more efficient approach for this problem?
1) You should look at using a suffix tree data structure.
Suffix Tree
This data structure can be built in O(N * log N) time
(or even in O(N) time using Ukkonen's algorithm),
where N is the length of the input string.
It then allows for solving many (otherwise) difficult
tasks in O(M) time where M is the size/length of the pattern.
So even though I haven't tried your particular problem, I am pretty sure that
if you use a suffix tree and a smart formulation of your problem, then the
problem can be solved in reasonable time.
2) A very good book on these (and related) subjects is this one:
Algorithms on Strings, Trees and Sequences
It's not really easy to read though unless you're well-trained in algorithms.
But OK, reading such things is the only way to get well-trained ;)
3) I suggest you have a quick look at this algorithm too.
Aho-Corasick Algorithm
I am not sure, though - this one might be somewhat
off-topic with respect to your particular problem.
I am going to take @peter.petrov's suggestion and enhance it by explaining how one can actually use a suffix tree to solve the problem:
1. Create a suffix tree from the string, let it be `T`.
2. Find all nodes of depth `n` in the tree, let that set of nodes be `S`. This can be done using DFS, for example.
3. For each node `v` in `S`, do the following:
3.1. Do a DFS and count the number of terminals `v` leads to. Let this number be `count`.
3.2. If `count > 1`, yield the substring associated with `v` (the path from the root to `v`), together with `count`.
Note that this algorithm treats every substring of length n by adding its node to the set S, and from there it finds how many times each was actually a substring by counting the number of terminals that node leads to.
This means that the complexity of the problem is O(Creation + Traversal): you first create the tree and then you traverse it (it's easy to see that steps 2-3 visit each node in the tree at most once). Since the traversal is obviously "faster" than the creation of the tree, this leaves you with O(Creation), which, as @peter.petrov pointed out, is O(|S|) or O(|S| log |S|) depending on your algorithm of choice (here |S| is the length of the input string).
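The suffix tree is the general solution; for the fixed length n in the question, though, a much simpler baseline is to count every window of length n in a HashMap (roughly O(N*n) time because each window is hashed). This is not the suffix-tree method above, just a minimal sketch of the counting idea, with names of my own choosing:

```java
import java.util.HashMap;
import java.util.Map;

public class RepeatedSubstrings {
    // Count every substring of length n; any substring with count > 1 repeats.
    static Map<String, Integer> findRepeats(String s, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        counts.values().removeIf(c -> c <= 1); // keep only the repeating ones
        return counts;
    }
}
```

For very long inputs a rolling hash (Rabin-Karp style) would avoid re-hashing each window from scratch, but the HashMap version is usually fast enough to try first.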
Related
The questions below were recently asked in an interview:
You are given an array of integers with every element repeated twice except one element, which occurs only once; you need to find the unique element with O(n log n) time complexity. Suppose the array is {2,47,2,36,3,47,36}; the output should be 3 here. I said we could perform merge sort (as it takes O(n log n)) and then check adjacent elements, but he said it would take O(n log n) + O(n). I also said we could use a HashMap to keep counts of the elements, but again he said no, as we would have to iterate over the HashMap again to get the result. After some research, I learned that using the XOR operation gives the output in O(n). Is there any better solution, other than sorting, which can give the answer in O(n log n) time?
On our smartphones we can open many apps at a time. When we look at which apps are currently open, we see a list where the most recently opened app is at the front, and we can remove or close an app from anywhere in the list. Is there some Collection available in Java which can perform all these tasks in a very efficient way? I said we could use a LinkedList or LinkedHashMap, but he was not convinced. What would be the best Collection to use?
Firstly, if the interviewer used big-O notation and expected an O(n log n) solution, there's nothing wrong with your answer. We know that O(x + y) = O(max(x, y)); therefore, although your algorithm is O(n log n + n), it's fine to just call it O(n log n). However, finding the element that appears once in a sorted array can actually be achieved in O(log n) using binary search. As a hint, exploit odd and even indices while performing the search. Also, if the interviewer expected an O(n log n) solution, the objection to the extra traversal is absurd. The hash map solution is already O(n), and if there's a problem with it, it's the requirement of extra space. For this reason, the best approach is to use XOR, as you mentioned. There are still some more O(n) solutions, but they're not better than the XOR solution.
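Both approaches mentioned above can be sketched in a few lines of Java (method names are mine). The XOR version works on the unsorted array because pairs cancel out; the binary-search version uses the odd/even-index hint on the sorted array:

```java
public class UniqueElement {
    // O(n): XOR of all elements; equal pairs cancel, leaving the unique one.
    static int findUnique(int[] a) {
        int x = 0;
        for (int v : a) x ^= v;
        return x;
    }

    // O(log n) on a SORTED array: before the unique element, pairs start at
    // even indices; after it, they start at odd indices.
    static int findUniqueSorted(int[] a) {
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (mid % 2 == 1) mid--;            // align mid to an even index
            if (a[mid] == a[mid + 1]) lo = mid + 2; // unique element is to the right
            else hi = mid;                      // unique element is at mid or to the left
        }
        return a[lo];
    }
}
```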
To me, a LinkedList is suitable for this task as well. We want to remove from any location and also use some stack operations (push, pop, peek). A custom stack can be built from a LinkedList.
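As a minimal sketch of the recent-apps behavior on top of a LinkedList (class and method names are my own):

```java
import java.util.LinkedList;

public class RecentApps {
    private final LinkedList<String> apps = new LinkedList<>();

    // Opening an app moves it to the front (most recent first).
    public void open(String app) {
        apps.remove(app);   // if it was already open, take it out of its spot...
        apps.addFirst(app); // ...and push it to the front
    }

    // An app can be closed from anywhere in the list.
    public void close(String app) {
        apps.remove(app);
    }

    public String mostRecent() {
        return apps.peekFirst();
    }
}
```

Note that `remove(Object)` is O(n) on a LinkedList; if the interviewer wanted O(1) removal from the middle, a LinkedHashMap (access-ordered) or a hand-rolled doubly linked list with a HashMap of nodes would be the follow-up answer.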
What would be the time complexity? I just want to avoid this being O(n!). Would using depth-first search give a time complexity of O(n^2), since for each letter it may have to go through all the other letters in the worst case?
I guess I'm not sure if I'm thinking about this the right way.
When I say use depth-first search, I mean starting a depth-first search from the first letter, then starting one from the second letter, and so on.
Is that necessary?
Note:
The original problem is to find all possible words in a crossword/boggle board. I'm thinking of using the trie data structure to find if a word is in the dictionary, but am thinking about ways of generating the words themselves.
Following the discussion above, here is my answer:
Definition: a trieX is a sub-trie that contains words of length X only.
Since we have a trie with all words in the desired language, we can also get the appropriate trieX.
We say that the crossword puzzle has w words, so we create an array of length w where each entry is the root of the trieX for X the length of the relevant word. This gives us the list of candidate words for each blank word.
Then we iterate over the intersections between words and eliminate words that cannot be placed. When no more changes occur - we stop.
Two remarks:
1. In order to improve performance, we start by adding either long words or VERY short ones. What counts as short or long? Have a look at this and this.
2. Elimination of words from the trieXs can also be done by checking dependencies between words (if THIS word is here, then THAT word can't be there, etc.). This is more complicated, so if anyone wants to add some ideas on how to do this easily - please do.
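As a minimal sketch of the trie side of this (lowercase a-z only; class and method names are my own), here is a trie that supports the trieX view above by collecting the stored words of a given length X:

```java
import java.util.ArrayList;
import java.util.List;

public class Trie {
    private final Trie[] next = new Trie[26]; // one child slot per letter a-z
    private boolean isWord;

    public void insert(String w) {
        Trie node = this;
        for (char c : w.toCharArray()) {
            int i = c - 'a';
            if (node.next[i] == null) node.next[i] = new Trie();
            node = node.next[i];
        }
        node.isWord = true;
    }

    // Collect all stored words of exactly the given length (the "trieX" view).
    public List<String> wordsOfLength(int len) {
        List<String> out = new ArrayList<>();
        collect(this, new StringBuilder(), len, out);
        return out;
    }

    private static void collect(Trie node, StringBuilder prefix, int len, List<String> out) {
        if (prefix.length() == len) {
            if (node.isWord) out.add(prefix.toString());
            return; // no need to descend past the target length
        }
        for (int i = 0; i < 26; i++) {
            if (node.next[i] != null) {
                prefix.append((char) ('a' + i));
                collect(node.next[i], prefix, len, out);
                prefix.deleteCharAt(prefix.length() - 1); // backtrack
            }
        }
    }
}
```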
I'm comparing song titles, usually in Latin script (although not always). My aim is an algorithm that gives a high score if the two song titles appear to be the same title and a very low score if they have nothing in common.
I already had code (Java) for this using Lucene and a RAMDirectory - however, using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to https://github.com/nickmancol/simmetrics which has many nice algorithms for comparing two strings:
https://github.com/nickmancol/simmetrics/tree/master/src/main/java/uk/ac/shef/wit/simmetrics/similaritymetrics
BlockDistance
ChapmanLengthDeviation
ChapmanMatchingSoundex
ChapmanMeanLength
ChapmanOrderedNameCompoundSimilarity
CosineSimilarity
DiceSimilarity
EuclideanDistance
InterfaceStringMetric
JaccardSimilarity
Jaro
JaroWinkler
Levenshtein
MatchingCoefficient
MongeElkan
NeedlemanWunch
OverlapCoefficient
QGramsDistance
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
Soundex
but I'm not well versed in these algorithms. What would be a good choice?
I think Lucene uses CosineSimilarity in some form, so that is my starting point but I think there might be something better.
Specifically, the algorithm should work on short strings and should understand the concept of words, i.e. spaces should be treated specially. Good matching of Latin script is most important, but good matching of other scripts such as Korean and Chinese is relevant as well; I expect those would need a different algorithm because of the way they treat spaces.
They're all good. They work on different properties of strings and have different matching properties. What works best for you depends on what you need.
I'm using JaccardSimilarity to match names. I chose JaccardSimilarity because it was reasonably fast and, for short strings, excelled at matching names with common typos while quickly degrading the score for anything else. It gives extra weight to spaces and is insensitive to word order. I needed this behavior because the impact of a false positive was much higher than that of a false negative, spaces could be typos (but not often), and word order was not that important.
Note that this was done in combination with a simplifier that removes diacritics and a mapper that maps the remaining characters to the a-z range. This is passed through a normalizer that standardizes all word-separator symbols to a single space. Finally, the names are parsed to pick out initials as well as pre-, inner-, and suffixes. This is because names have a structure and format to them that is rather resistant to plain string comparison.
To make your choice, you need to make a list of the criteria you want and then look for an algorithm that satisfies those criteria. You can also make a reasonably large test set and run all the algorithms on it to see what the trade-offs are with respect to time, true positives, false positives, false negatives, true negatives, the classes of errors your system should handle, etc.
If you are still unsure of your choice, you can also set up your system to switch the exact comparison algorithm at run time. This allows you to do an A/B test and see which algorithm works best in practice.
TL;DR: which algorithm you want depends on what you need; if you don't know what you need, make sure you can change it later on and run tests on the fly.
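For readers who want to see the core idea without pulling in simmetrics, here is a minimal word-token Jaccard sketch (my own simplified version; the library's implementation differs in its tokenization details). It is word-order insensitive, as described above:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Jaccard {
    // Jaccard similarity over word tokens: |A intersect B| / |A union B|.
    static double similarity(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }
}
```

To get typo tolerance as well (rather than exact word matches), the simmetrics version can be run over character q-grams instead of whole words.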
You likely need to solve a string-to-string correction problem. The Levenshtein distance algorithm is implemented in many languages. Before running it, I'd remove all spaces from the strings, because they don't carry any significant information but may influence the difference between the two strings. For string search, prefix trees are also useful; you can have a look in this direction as well, for example here or here. This was already discussed on SO. If spaces are that significant in your case, just assign a greater weight to them.
Each algorithm is going to focus on a similar, but slightly different, aspect of the two strings. Honestly, it depends entirely on what you are trying to accomplish. You say that the algorithm needs to understand words, but should it also understand interactions between those words? If not, you can just split each string on spaces and compare each word in the first string to each word in the second. If they share a word, the commonality score of the two strings would increase.
In this way, you could create your own algorithm that focused only on what you were concerned with. If you want to test another algorithm that someone else made, you can find examples online and run your data through to see how accurate the estimated commonality is with each.
I think http://jtmt.sourceforge.net/ would be a good place to start.
Interesting. Have you thought about a radix sort?
http://en.wikipedia.org/wiki/Radix_sort
The concept behind radix sort is that it is a non-comparative integer sorting algorithm: it sorts data with integer keys by grouping the keys by their individual digits. If you convert each string into an array of characters, each character is a number of no more than 3 digits, so k = 3 (the maximum number of digits) and n = the number of strings to compare. This sorts on one character position of all your strings at a time. You then have another factor, s = the length of the longest string. Your worst case for sorting would be 3*n*s and the best case (3 + n) * s. Check out some radix sort examples for strings here:
http://algs4.cs.princeton.edu/51radix/LSD.java.html
http://users.cis.fiu.edu/~weiss/dsaajava3/code/RadixSort.java
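As a hedged sketch of the LSD variant used in those links (fixed-width strings over an extended-ASCII alphabet; class and method names are my own):

```java
public class LsdSort {
    // LSD radix sort for fixed-width strings; R = 256 (extended ASCII).
    static void sort(String[] a, int width) {
        int R = 256;
        String[] aux = new String[a.length];
        // Sort stably on each character position, rightmost first.
        for (int d = width - 1; d >= 0; d--) {
            int[] count = new int[R + 1];
            for (String s : a) count[s.charAt(d) + 1]++;       // frequency counts
            for (int r = 0; r < R; r++) count[r + 1] += count[r]; // counts -> start indices
            for (String s : a) aux[count[s.charAt(d)]++] = s;  // stable distribute
            System.arraycopy(aux, 0, a, 0, a.length);          // copy back
        }
    }
}
```

Stability of each per-character pass is what makes the final result fully sorted; that is why the distribute step must preserve the input order within each bucket.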
Did you take a look at the Levenshtein distance?
int org.apache.commons.lang.StringUtils.getLevenshteinDistance(String s, String t)
Find the Levenshtein distance between two Strings.
This is the number of changes needed to change one String into
another, where each change is a single character modification
(deletion, insertion or substitution).
The previous implementation of the Levenshtein distance algorithm was
from http://www.merriampark.com/ld.htm
Chas Emerick has written an implementation in Java, which avoids an
OutOfMemoryError which can occur when my Java implementation is used
with very large strings. This implementation of the Levenshtein
distance algorithm is from http://www.merriampark.com/ldjava.htm
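For reference, the classic two-row dynamic-programming computation behind such an implementation can be sketched like this (my own minimal version, not the commons-lang code):

```java
public class Levenshtein {
    // Edit distance: minimum number of single-character insertions,
    // deletions, or substitutions to turn s into t. Keeps only two DP rows,
    // which is the same memory-saving trick the commons-lang version uses.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j; // empty s -> j insertions
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // empty t -> i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp; // roll the rows
        }
        return prev[t.length()];
    }
}
```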
Anyway, I'm curious to know what you choose in this case!
The A* algorithm is known to be complete. However, all the implementations I have found on the web seem to return only the first (optimal) solution.
For example, this implementation:
A* algorithm implementation
Since the algorithm always expands the node with the minimum f value, and the implementations seem to stop as soon as an expanded node is a solution, how would one adapt the aforementioned code so as to output all (or the first n) paths that lead to a goal, without taking into account duplicate actions (that is, paths that repeat the same action over and over)?
For all paths, it probably makes much more sense to use breadth-first search. Alternatively, you can try Dijkstra's algorithm if you want to find the top n shortest paths.
It's complete which means it will find a solution if one exists, but the algorithm specifically only returns one path. A breadth-first search will find all non-cyclical paths between two nodes, however: http://en.wikipedia.org/wiki/Breadth-first_search
Update - here is a k-shortest-paths algorithm, which returns a list of the n (or in this case, k) shortest paths in order from shortest to longest: http://code.google.com/p/k-shortest-paths/
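To enumerate all simple (cycle-free) paths between two nodes, rather than just the shortest one, the usual approach is backtracking depth-first search; a minimal sketch on an adjacency-list graph (all names are my own):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AllPaths {
    // Collect every simple path from start to goal; the visited set
    // prevents revisiting a node, which rules out cyclic paths.
    static List<List<String>> allPaths(Map<String, List<String>> graph,
                                       String start, String goal) {
        List<List<String>> result = new ArrayList<>();
        Deque<String> path = new ArrayDeque<>();
        path.addLast(start);
        Set<String> visited = new HashSet<>();
        visited.add(start);
        dfs(graph, start, goal, path, visited, result);
        return result;
    }

    private static void dfs(Map<String, List<String>> graph, String node, String goal,
                            Deque<String> path, Set<String> visited,
                            List<List<String>> result) {
        if (node.equals(goal)) {
            result.add(new ArrayList<>(path)); // snapshot the current path
            return;
        }
        for (String next : graph.getOrDefault(node, List.of())) {
            if (visited.add(next)) {           // skip nodes already on the path
                path.addLast(next);
                dfs(graph, next, goal, path, visited, result);
                path.removeLast();             // backtrack
                visited.remove(next);
            }
        }
    }
}
```

Sorting the resulting paths by cost then gives the first n paths, though for large graphs a dedicated k-shortest-paths algorithm (like the link above) is far more efficient.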
We have a problem where we want to do substring searches over a large number [1MM - 10MM] of strings ("model numbers"), quickly identifying every model number that contains a given substring. Model numbers are short strings such as:
ABB1924DEW
WTW9400PDQB
GLEW1874
The goal is simple: given a substring, quickly find all the model numbers that match it. For example (on the universe of model numbers above), if we searched for the string "EW", the function would return GLEW1874 and ABB1924DEW (because both contain the substring EW).
The data structure also needs to support quick searches for model numbers that start with a given substring and/or end with a given substring. For example, we need to be able to quickly do searches like WTW...B (which would match WTW9400PDQB because it starts with WTW and ends with B).
What I am looking for is an in-memory data structure that does these searches very efficiently. Ideally, there would also be a nice (simple) Java implementation already done somewhere that we could use. Simple (and fast) is better than uber-complicated and slightly faster. The naive algorithm (just loop over all part numbers doing a substring search on each) is too slow for our purposes; we are looking for something much faster (preprocessing is okay).
So, what is the textbook data structure/algorithm for this problem?
What you need is a suffix tree. I don't know of a Java library to recommend, so you might have to implement one yourself.
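If implementing a full suffix tree is too much, a simpler (but more memory-hungry) sketch of the same idea is to index every suffix of every model number in a sorted map: a substring of a model number is always a prefix of one of its suffixes, so a "contains" query becomes a prefix range scan. This is an illustration with names of my own choosing, not a production structure (space is quadratic in the length of each model number, which is tolerable only because model numbers are short):

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SubstringIndex {
    // Maps each suffix of each indexed model number to the model numbers
    // that produced it, kept sorted so prefix ranges are contiguous.
    private final TreeMap<String, Set<String>> suffixes = new TreeMap<>();

    public void add(String model) {
        for (int i = 0; i < model.length(); i++) {
            suffixes.computeIfAbsent(model.substring(i), k -> new TreeSet<>())
                    .add(model);
        }
    }

    // All model numbers containing `sub` anywhere.
    public Set<String> containing(String sub) {
        Set<String> result = new TreeSet<>();
        // Every suffix starting with `sub` falls in [sub, sub + '\uffff').
        for (Set<String> models : suffixes.subMap(sub, sub + Character.MAX_VALUE).values()) {
            result.addAll(models);
        }
        return result;
    }
}
```

The starts-with / ends-with queries from the question fall out of the same structure: "starts with P" is a prefix scan restricted to the whole-string suffixes, and "ends with S" is an exact lookup of the suffix S. A suffix tree gives the same queries in linear space.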