Regexp for Word similarity "n letter difference" - java

Assume I have a word like this: mert. I want to search for all 1-letter-difference combinations for that word. aert, ert, meat, mmert, merst, merts etc. are all applicable. So my regular expression is like:
[a-z]{0,2}ert OR m[a-z]{0,2}rt OR me[a-z]{0,2}t OR mer[a-z]{0,2}
So for an n-letter difference, I just replace the 2 with n+1 and I can get all combinations.
My question is this: is there any shorter way of writing this regexp?
Thanks
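Edit: for reference, a minimal sketch (not part of the pattern above, just an illustration) of how this family of patterns could be generated for an arbitrary n, assuming lowercase words:
import java.util.StringJoiner;

// Builds the alternation programmatically: one "[a-z]{0,n+1}" window per position.
static String buildPattern(String word, int n) {
    StringJoiner alternation = new StringJoiner("|");
    for (int i = 0; i < word.length(); i++) {
        alternation.add(word.substring(0, i) + "[a-z]{0," + (n + 1) + "}" + word.substring(i + 1));
    }
    return alternation.toString(); // e.g. "[a-z]{0,2}ert|m[a-z]{0,2}rt|me[a-z]{0,2}t|mer[a-z]{0,2}"
}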

Please check this solution; I have tested the code below and it seems to work.
/**
 * The function will return the list of words matched with nth_difference.
 *
 * @param pattern search pattern
 * @param data input data
 * @param nth_difference difference
 * @return the list of matched words
 */
static List<String> getNthDifferenceWords(String pattern, String[] data, int nth_difference) {
    Map<Character, Integer> frequencyTable = new HashMap<>();
    List<String> matchedWords = new ArrayList<>();
    // Code complexity: O(n), where n is the pattern length
    for (int i = 0; i < pattern.length(); ++i) {
        frequencyTable.put(pattern.charAt(i), 1);
    }
    // Code complexity: O(m), where m is the size of the entire input
    for (String input : data) {
        int matchCounter = 0;
        for (int j = 0; j < input.length(); ++j) {
            if (frequencyTable.containsKey(input.charAt(j))) {
                ++matchCounter;
            }
        }
        //System.out.println("matched=" + matchCounter);
        if (input.length() <= pattern.length() && (matchCounter == pattern.length() - nth_difference)) {
            matchedWords.add(input);
        }
        if ((input.length() - pattern.length() == 1) && (matchCounter >= input.length() - nth_difference)) {
            matchedWords.add(input);
        }
    }
    return matchedWords;
}
public static void main(String[] args) {
    int nth_difference = 1;
    String pattern = "mert";
    String[] data = new String[]{"aert", "ert", "meat", "mmert", "merst", "merts", "meritos"};
    System.out.println(getNthDifferenceWords(pattern, data, nth_difference));

    nth_difference = 2;
    pattern = "merit";
    data = new String[]{"aert", "ert", "meat", "mmert", "merst", "merts", "demerit", "merito", "meritos"};
    System.out.println(getNthDifferenceWords(pattern, data, nth_difference));
}

For a 1-letter difference, pre-build a table in the following way. Build a 2-column lexicon with the 'word' in the second column, and the following in the first column: One position at a time, remove one letter from the word.
Example: "meat" is the word; here are the rows for it in the table:
`col1`   `col2`
------   ------
meat     meat
eat      meat
mat      meat
met      meat
mea      meat
For "meet" (note the dup letter):
meet     meet
eet      meet
met      meet   -- only needed once
mee      meet
Then test in a similar way. When searching for "mert", do
WHERE col1 IN ('mert', 'ert', 'mrt', 'met', 'mer')
Note that you will get both "meat" and "meet" from the above example. Note also what will happen with "met" and "meets".
And, it checks for simple transpositions. Searching for "meta":
WHERE col1 IN ('meta', 'eta', 'mta', 'mea', 'met')
will find "meat", "meet" (and other words like met, mean, ...) Arguably, "meta" -> "mean" is a 2-letter distance, but oh well.
Checking your test cases-- mert vs
aert -- via "ert"
ert -- via "ert"
meat -- via "met"
mmert -- via "mert"
merst -- via "mert"
merts -- via "mert"
Meanwhile, have PRIMARY KEY(col1, col2), INDEX(col2) on that table.
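A minimal Java sketch of the same idea, using an in-memory map instead of a SQL table (the method names here are illustrative, not from the answer above):
import java.util.*;

// Build the deletion lexicon: for every word, map the word itself and each
// one-letter-removed variant back to the original word (col1 -> col2 above).
static Map<String, Set<String>> buildDeletionIndex(List<String> words) {
    Map<String, Set<String>> index = new HashMap<>();
    for (String word : words) {
        index.computeIfAbsent(word, k -> new HashSet<>()).add(word);
        for (int i = 0; i < word.length(); i++) {
            String variant = word.substring(0, i) + word.substring(i + 1);
            index.computeIfAbsent(variant, k -> new HashSet<>()).add(word);
        }
    }
    return index;
}

// Lookup: generate the same variants for the query and union the hits;
// this is the IN (...) clause above expressed in Java.
static Set<String> lookupOneLetterOff(Map<String, Set<String>> index, String query) {
    Set<String> hits = new HashSet<>(index.getOrDefault(query, Collections.emptySet()));
    for (int i = 0; i < query.length(); i++) {
        String variant = query.substring(0, i) + query.substring(i + 1);
        hits.addAll(index.getOrDefault(variant, Collections.emptySet()));
    }
    return hits;
}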

Find top k frequent words in real time data stream

I am trying to solve an algorithms problem using a Java TreeSet.
The problem is as follows:
Find the top k frequent words in a real-time data stream.
Implement three methods for Topk Class:
TopK(k). The constructor.
add(word). Add a new word.
topk(). Get the current top k frequent words.
My thought was to use a HashMap to remember frequencies and a TreeSet as a buffer.
My implementation passed most of the cases, except this one:
TopK(10)
add("aw")
add("fb")
add("fb")
topk()
The answer is supposed to be [fb, aw] but now it's [fb, aw, fb].
However, my code passed test cases like:
TopK(10)
add("iiiiii")
add("fb")
add("fb")
topk()
and
TopK(10)
add("fb")
add("fb")
topk()
I have no idea what's wrong, so I printed some values when the comparator was called. It gave me this:
aw aw
11111111
fb aw
33333333
fb aw
33333333
fb aw
222222222
fb aw
222222222
Which means the second "fb" was compared to "aw" twice and then the comparator was done. I have spent hours debugging and found nothing so far.
Here is my implementation:
public class TopK {
    int size;
    HashMap<String, Integer> map;
    TreeSet<String> seen;

    public TopK(int k) {
        // do initialization if necessary
        size = k;
        seen = new TreeSet<String>(new Comparator<String>() {
            @Override
            public int compare(String str1, String str2) {
                System.out.println(str1 + " " + str2);
                if (str1.equals(str2)) {
                    System.out.println("11111111");
                    return 0;
                }
                // important! https://www.jiuzhang.com/qa/7646/
                // from 128 onwards boxed Integers are no longer the same object, so unbox to int before comparing
                int number1 = map.get(str1);
                int number2 = map.get(str2);
                if (number1 != number2) {
                    System.out.println("222222222");
                    return map.get(str1) - map.get(str2);
                } else {
                    System.out.println("33333333");
                    return str2.compareTo(str1);
                }
            }
        });
        map = new HashMap<String, Integer>();
    }

    /*
     * @param word: A string
     * @return: nothing
     */
    public void add(String word) {
        // write your code here
        if (!map.containsKey(word)) {
            map.put(word, 0);
        }
        map.put(word, map.get(word) + 1);
        if (seen.contains(word)) {
            seen.remove(word);
            seen.add(word);
        } else {
            seen.add(word);
            if (seen.size() > size) {
                seen.pollFirst();
            }
        }
    }

    /*
     * @return: the current top k frequent words.
     */
    public List<String> topk() {
        // Write your code here
        List<String> results = new ArrayList<String>();
        Iterator it = seen.iterator();
        while (it.hasNext()) {
            String str = (String) it.next();
            results.add(0, str);
        }
        return results;
    }
}
Our first clue is this case:
aw
fb
fb
fails, but:
iiiii
fb
fb
succeeds.
This can only happen because of the line return str2.compareTo(str1); - when the numbers of appearances are equal, the elements are ordered by (reversed) string comparison (you can check this easily - please do).
The only explanation I can think of is that the contains method of Java's TreeSet has the "optimization" of searching only where the element should be according to the comparator - if the set has an ordering and the element is not where it should be, it is treated as not existing in the TreeSet (think of an array that has to be sorted: checking for a number runs in log(n) rather than scanning the whole array, so if the number sits in the wrong position you will miss it).
Notice that you change the place where the element should be before calling contains. So let's look at the 2nd iteration: we have fb and aw, both with a value of 1 in the map. In the TreeSet they are stored as [fb, aw] (because of the string compare, as explained before). Now you change the map so that fb has a value of 2 -> it should now be in the last place, but contains compares it to aw and decides it belongs after aw - yet aw is the last element, so it concludes fb does not exist, and add simply inserts it again -> that is why you see two comparisons between fb and aw: one for the contains and one for the add.
Hope that was understandable....
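A minimal sketch of one way to fix add() based on this (assuming the rest of the class stays the same): take the stale entry out of the TreeSet before touching the map, so the set is never searched with an out-of-date ordering:
public void add(String word) {
    boolean tracked = map.containsKey(word);
    // If the word already has a count, remove it from the TreeSet BEFORE the
    // count changes, so the set is searched under the ordering it was inserted with.
    if (tracked) {
        seen.remove(word);
    }
    map.put(word, tracked ? map.get(word) + 1 : 1);
    seen.add(word);
    if (seen.size() > size) {
        seen.pollFirst(); // drop the current smallest entry
    }
}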

Find all valid words when given a string of characters (Recursion / Binary Search)

I'd like some feedback on a method I tried to implement that isn't working 100%. I'm making an Android app for practice where the user is given 20 random letters. The user then uses these letters to make a word of whatever size. It then checks a dictionary to see if it is a valid English word.
The part that's giving me trouble is showing a "hint". If the user is stuck, I want to display the possible words that can be made. I initially thought recursion. However, with 20 letters this can take quite a long time to execute. So, I also implemented a binary search to check if the current recursion path is a prefix of anything in the dictionary. I do get valid hints output, however it's not returning all possible words. Do I have a mistake in my recursion thinking? Also, is there a recommended, faster algorithm? I've seen a method in which you check each word in the dictionary and see if your characters can make it. However, I'd like to know how effective my method is vs. that one.
private static void getAllWords(String letterPool, String currWord) {
    // Add to possibleWords when a valid word is found
    if (letterPool.equals("")) {
        //System.out.println("");
    } else if (currWord.equals("")) {
        for (int i = 0; i < letterPool.length(); i++) {
            String curr = letterPool.substring(i, i + 1);
            String newLetterPool = letterPool.substring(0, i) + letterPool.substring(i + 1);
            if (dict.contains(curr)) {
                possibleWords.add(curr);
            }
            boolean prefixInDic = binarySearch(curr);
            if (!prefixInDic) {
                break;
            } else {
                getAllWords(newLetterPool, curr);
            }
        }
    } else {
        // Every time we add a letter to currWord, delete it from letterPool.
        // Attach the new letter to curr and then check if it is in dict.
        for (int i = 0; i < letterPool.length(); i++) {
            String curr = currWord + letterPool.substring(i, i + 1);
            String newLetterPool = letterPool.substring(0, i) + letterPool.substring(i + 1);
            if (dict.contains(curr)) {
                possibleWords.add(curr);
            }
            boolean prefixInDic = binarySearch(curr);
            if (!prefixInDic) {
                break;
            } else {
                getAllWords(newLetterPool, curr);
            }
        }
    }
}
private static boolean binarySearch(String word) {
    int max = dict.size() - 1;
    int min = 0;
    int currIndex = 0;
    boolean result = false;
    while (min <= max) {
        currIndex = (min + max) / 2;
        if (dict.get(currIndex).startsWith(word)) {
            result = true;
            break;
        } else if (dict.get(currIndex).compareTo(word) < 0) {
            min = currIndex + 1;
        } else if (dict.get(currIndex).compareTo(word) > 0) {
            max = currIndex - 1;
        } else {
            result = true;
            break;
        }
    }
    return result;
}
The simplest way to speed up your algorithm is probably to use a Trie (a prefix tree)
Trie data structures offer two relevant methods: isWord(String) and isPrefix(String), both of which take O(n) comparisons to determine whether a word or prefix exists in a dictionary (where n is the number of letters in the argument). This is really fast because it doesn't matter how large your dictionary is.
For comparison, your method for checking if a prefix exists in your dictionary using binary search is O(n*log(m)) where n is the number of letters in the string and m is the number of words in the dictionary.
I coded up a similar algorithm to yours using a Trie and compared it to the code you posted (with minor modifications) in a very informal benchmark.
With 20-char input, the Trie took 9ms. The original code didn't complete in reasonable time so I had to kill it.
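For illustration, here is a minimal Trie sketch with the two operations mentioned above (an assumption of how such a class might look, not the actual benchmark code):
import java.util.HashMap;
import java.util.Map;

// Minimal prefix tree over lowercase words.
class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    boolean isWord(String word) {
        Node node = find(word);
        return node != null && node.isWord;
    }

    boolean isPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Walks the tree one character at a time; O(length of s) regardless of dictionary size.
    private Node find(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }
}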
Edit:
As to why your code doesn't return all hints, you don't want to break if the prefix is not in your dict. You should continue to check the next prefix instead.
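In other words, the loop body could look like this (a sketch, applied to both loops in getAllWords):
boolean prefixInDic = binarySearch(curr);
if (prefixInDic) {
    getAllWords(newLetterPool, curr);
}
// no break here: even if curr is a dead end, the next letter choice may not be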
Is there a recommended, faster algorithm?
See Wikipedia article on "String searching algorithm", in particular the section named "Algorithms using a finite set of patterns", where "finite set of patterns" is your dictionary.
The Aho–Corasick algorithm listed first might be a good choice.

java regular expressions: performance and alternative

Recently I have had to search a number of string values to see which ones match a certain pattern. Neither the number of string values nor the pattern itself is known until a search term has been entered by the user. The problem is that I have noticed that each time my application runs the following line:
if (stringValue.matches(regexPattern)) {
    // do something so simple
}
it takes about 40 microseconds. Needless to say, when the number of string values exceeds a few thousand, it'll be too slow.
The pattern is something like:
"A*B*C*D*E*F*"
where A~F are just examples here, but the pattern is something like the above. Please note that the pattern actually changes per search. For example "A*B*C*" may change to "W*D*G*A*".
I wonder if there is a better substitution for the above pattern or, more generally, an alternative for java regular expressions.
Regular expressions in Java are compiled into an internal data structure. This compilation is the time-consuming process. Each time you invoke the method String.matches(String regex), the specified regular expression is compiled again.
So you should compile your regular expression only once and reuse it:
Pattern pattern = Pattern.compile(regexPattern);
for (String value : values) {
    Matcher matcher = pattern.matcher(value);
    if (matcher.matches()) {
        // your code here
    }
}
Consider the following (quick and dirty) test:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test3 {
// time that tick() was called
static long tickTime;
// called at start of operation, for timing
static void tick () {
tickTime = System.nanoTime();
}
// called at end of operation, prints message and time since tick().
static void tock (String action) {
long mstime = (System.nanoTime() - tickTime) / 1000000;
System.out.println(action + ": " + mstime + "ms");
}
// generate random strings of form AAAABBBCCCCC; a random
// number of characters each randomly repeated.
static List<String> generateData (int itemCount) {
Random random = new Random();
List<String> items = new ArrayList<String>();
long mean = 0;
for (int n = 0; n < itemCount; ++ n) {
StringBuilder s = new StringBuilder();
int characters = random.nextInt(7) + 1;
for (int k = 0; k < characters; ++ k) {
char c = (char)(random.nextInt('Z' - 'A') + 'A');
int rep = random.nextInt(95) + 5;
for (int j = 0; j < rep; ++ j)
s.append(c);
mean += rep;
}
items.add(s.toString());
}
mean /= itemCount;
System.out.println("generated data, average length: " + mean);
return items;
}
// match all strings in items to regexStr, do not precompile.
static void regexTestUncompiled (List<String> items, String regexStr) {
tick();
int matched = 0, unmatched = 0;
for (String item:items) {
if (item.matches(regexStr))
++ matched;
else
++ unmatched;
}
tock("uncompiled: regex=" + regexStr + " matched=" + matched +
" unmatched=" + unmatched);
}
// match all strings in items to regexStr, precompile.
static void regexTestCompiled (List<String> items, String regexStr) {
tick();
Matcher matcher = Pattern.compile(regexStr).matcher("");
int matched = 0, unmatched = 0;
for (String item:items) {
if (matcher.reset(item).matches())
++ matched;
else
++ unmatched;
}
tock("compiled: regex=" + regexStr + " matched=" + matched +
" unmatched=" + unmatched);
}
// test all strings in items against regexStr.
static void regexTest (List<String> items, String regexStr) {
regexTestUncompiled(items, regexStr);
regexTestCompiled(items, regexStr);
}
// generate data and run some basic tests
public static void main (String[] args) {
List<String> items = generateData(1000000);
regexTest(items, "A*");
regexTest(items, "A*B*C*");
regexTest(items, "E*C*W*F*");
}
}
Strings are random sequences of 1-8 characters with each character occurring 5-100 consecutive times (e.g. "AAAAAAGGGGGDDFFFFFF"). I guessed based on your expressions.
Granted this might not be representative of your data set, but the timing estimates for applying those regular expressions to 1 million randomly generated strings of average length 208 each, on my modest 2.3 GHz dual-core i5, were:
Regex       Uncompiled    Precompiled
A*          0.564 sec     0.126 sec
A*B*C*      1.768 sec     0.238 sec
E*C*W*F*    0.795 sec     0.275 sec
Actual output:
generated data, average length: 208
uncompiled: regex=A* matched=6004 unmatched=993996: 564ms
compiled: regex=A* matched=6004 unmatched=993996: 126ms
uncompiled: regex=A*B*C* matched=18677 unmatched=981323: 1768ms
compiled: regex=A*B*C* matched=18677 unmatched=981323: 238ms
uncompiled: regex=E*C*W*F* matched=25495 unmatched=974505: 795ms
compiled: regex=E*C*W*F* matched=25495 unmatched=974505: 275ms
Even without the speedup of precompiled expressions, and even considering that the results vary wildly depending on the data set and regular expression (and even considering that I broke a basic rule of proper Java performance tests and forgot to prime HotSpot first), this is very fast, and I still wonder if the bottleneck is truly where you think it is.
After switching to precompiled expressions, if you still are not meeting your actual performance requirements, do some profiling. If you find your bottleneck is still in your search, consider implementing a more optimized search algorithm.
For example, assuming your data set is like my test set above: if your data set is known ahead of time, reduce each item in it to a smaller string key by removing repetitive characters, e.g. for "AAAAAAABBBBCCCCCCC", store it in a map of some sort keyed by "ABC". When a user searches for "ABC*" (presuming your regexes are in that particular form), look for "ABC" items. Or whatever. It highly depends on your scenario.
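A rough sketch of that pre-indexing idea (illustrative names, and it assumes the strings really are runs of repeated characters as in the test data above):
import java.util.*;

// Collapse runs of repeated characters: "AAAAAAABBBBCCCCCCC" -> "ABC".
static String collapseRuns(String s) {
    StringBuilder key = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        if (i == 0 || s.charAt(i) != s.charAt(i - 1)) {
            key.append(s.charAt(i));
        }
    }
    return key.toString();
}

// Pre-index the data set by its collapsed key; a search for an "A*B*C*"-style
// pattern can then become a map lookup instead of a regex scan over every item.
static Map<String, List<String>> indexByKey(List<String> items) {
    Map<String, List<String>> index = new HashMap<>();
    for (String item : items) {
        index.computeIfAbsent(collapseRuns(item), k -> new ArrayList<>()).add(item);
    }
    return index;
}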

A more efficient way of finding English words that are one letter off from each other

I wrote a little program that tries to find a connection between two equal-length English words. Word A will transform into Word B by changing one letter at a time; each newly created word has to be an English word.
For example:
Word A = BANG
Word B = DUST
Result:
BANG -> BUNG ->BUNT -> DUNT -> DUST
My process:
Load an English word list (consisting of 109582 words) into a Map<Integer, List<String>> _wordMap = new HashMap();, where the key is the word length.
The user puts in 2 words.
createGraph creates a graph.
Calculate the shortest path between those 2 nodes.
Print out the result.
Everything works perfectly fine, but I am not satisfied with the time it took in step 3.
See:
Completely loaded 109582 words!
CreateMap took: 30 milsecs
CreateGraph took: 17417 milsecs
(HOISE : HORSE)
(HOISE : POISE)
(POISE : PRISE)
(ARISE : PRISE)
(ANISE : ARISE)
(ANILE : ANISE)
(ANILE : ANKLE)
The wholething took: 17866 milsecs
I am not satisfied with the time it takes to create the graph in step 3; here's my code for it (I am using JGraphT for the graph):
private List<String> _wordList = new ArrayList(); // list of all 109582 English words
private Map<Integer, List<String>> _wordMap = new HashMap(); // Map grouping all the words by their length()
private UndirectedGraph<String, DefaultEdge> _wordGraph =
        new SimpleGraph<String, DefaultEdge>(DefaultEdge.class); // Graph used to calculate the shortest path from one node to the other.

private void createGraph(int wordLength) {
    long before = System.currentTimeMillis();
    List<String> words = _wordMap.get(wordLength);
    for (String word : words) {
        _wordGraph.addVertex(word); // adds a node
        for (String wordToTest : _wordList) {
            if (isSimilar(word, wordToTest)) {
                _wordGraph.addVertex(wordToTest); // adds another node
                _wordGraph.addEdge(word, wordToTest); // connects 2 nodes if they are one letter off from each other
            }
        }
    }
    System.out.println("CreateGraph took: " + (System.currentTimeMillis() - before) + " milsecs");
}
private boolean isSimilar(String wordA, String wordB) {
    if (wordA.length() != wordB.length()) {
        return false;
    }
    int matchingLetters = 0;
    if (wordA.equalsIgnoreCase(wordB)) {
        return false;
    }
    for (int i = 0; i < wordA.length(); i++) {
        if (wordA.charAt(i) == wordB.charAt(i)) {
            matchingLetters++;
        }
    }
    if (matchingLetters == wordA.length() - 1) {
        return true;
    }
    return false;
}
My question:
How can I improve my algorithm in order to speed up the process?
For any redditors that are reading this, yes I created this after seeing the thread from /r/askreddit yesterday.
Here's a starting thought:
Create a Map<String, List<String>> (or a Multimap<String, String> if you're using Guava), and for each word, "blank out" one letter at a time, and add the original word to the list for that blanked-out word. So you'd end up with:
.ORSE => NORSE, HORSE, GORSE (etc)
H.RSE => HORSE
HO.SE => HORSE, HOUSE (etc)
At that point, given a word, you can very easily find all the words it's similar to - just go through the same process again, but instead of adding to the map, just fetch all the values for each "blanked out" version.
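A minimal sketch of building and querying that map (assuming lowercase words; the names are illustrative):
import java.util.*;

// Map each "blanked out" form (one letter replaced by '.') to the words that produce it.
static Map<String, List<String>> buildBlankIndex(List<String> words) {
    Map<String, List<String>> index = new HashMap<>();
    for (String word : words) {
        for (int i = 0; i < word.length(); i++) {
            String blanked = word.substring(0, i) + '.' + word.substring(i + 1);
            index.computeIfAbsent(blanked, k -> new ArrayList<>()).add(word);
        }
    }
    return index;
}

// All words one letter away from the given word (the word itself also shows up, so filter it out).
static Set<String> neighbours(Map<String, List<String>> index, String word) {
    Set<String> result = new HashSet<>();
    for (int i = 0; i < word.length(); i++) {
        String blanked = word.substring(0, i) + '.' + word.substring(i + 1);
        result.addAll(index.getOrDefault(blanked, Collections.emptyList()));
    }
    result.remove(word);
    return result;
}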
You probably need to run it through a profiler to see where most of the time is taken, especially since you are using library classes - otherwise you might put in a lot of effort but see no significant improvement.
You could lowercase all the words before you start, to avoid the equalsIgnoreCase() on every comparison. In fact, this is an inconsistency in your code - you use equalsIgnoreCase() initially, but then compare chars in a case-sensitive way: if (wordA.charAt(i) == wordB.charAt(i)). It might be worth eliminating the equalsIgnoreCase() check entirely, since this is doing essentially the same thing as the following charAt loop.
You could change the comparison loop so it finishes early when it finds more than one different letter, rather than comparing all the letters and only then checking how many are matching or different.
(Update: this answer is about optimizing your current code. I realize, reading your question again, that you may be asking about alternative algorithms!)
You can have the list of words of the same length sorted, and then use a loop nesting of the kind for (int i = 0; i < n; ++i) for (int j = i + 1; j < n; ++j) { }.
And in isSimilar, count the differences and return false as soon as you reach 2.
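For example, isSimilar could be rewritten along these lines (a sketch of the early-exit idea, assuming equal-length, already-lowercased words):
private boolean isSimilar(String wordA, String wordB) {
    if (wordA.length() != wordB.length()) {
        return false;
    }
    int differences = 0;
    for (int i = 0; i < wordA.length(); i++) {
        if (wordA.charAt(i) != wordB.charAt(i)) {
            if (++differences > 1) {
                return false; // bail out as soon as a second difference is found
            }
        }
    }
    return differences == 1; // identical words (0 differences) are not "similar"
}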

java counting repeated key value pairs?

I have a Tuple class:
private class Tuple {
    private int fileno;
    private int position;

    public Tuple(int fileno, int position) {
        this.fileno = fileno;
        this.position = position;
    }
}
I also have a HashMap which references lists of these tuples:
Map<String, List<Tuple>> index = new HashMap<String, List<Tuple>>();
Now there is a scenario where I need to count how many times each word occurs in each file. The data is as follows:
abc 10.txt
abc 10.txt
abc 10.txt
abc 12.txt
abc 12.txt
ghost 15.txt
and so on....
Now how do I count the number of occurrences of the above? It is probably easy, but I've been coding for so long and I'm also new to Java.
I also learnt that duplicate keys can't go into a HashMap! Thanks.
To add data to list:
List<Tuple> idx = index.get(word);
if (idx == null) {
idx = new LinkedList<Tuple>();
index.put(word, idx);
}
idx.add(new Tuple(fileno, pos));
The above code just dumps the data; now I will compare it with words from a String array.
All i need at end is like this:
abc 10.txt count - 3
abc 12.txt count - 2
ghost 15.txt count - 1
I'm not sure if the map helps, or whether I need to use a list again or write a function to do this? Thanks!
I solved the above problem with simple condition statements! Thanks @codeguru
/*
 * Consider all cases and update wc as we go along.
 * Lessons learnt: - a Map does not handle duplicate keys
 *                 - a List alone does not work
 *                 - spent half a day figuring this out
 */
// compare string contents with equals(), not ==
if (wordInstance == null && fileNameInstance == null) {
    wordInstance = wordOccurence;
    fileNameInstance = files.get(t.fileno);
}
if (wordInstance.equals(wordOccurence) && fileNameInstance.equals(files.get(t.fileno))) {
    wc++;
}
if (wordInstance.equals(wordOccurence) && !fileNameInstance.equals(files.get(t.fileno))) {
    wc = 0;
    fileNameInstance = files.get(t.fileno);
    wc++;
}
if (!wordInstance.equals(wordOccurence) && fileNameInstance.equals(files.get(t.fileno))) {
    wc = 0;
    wordInstance = wordOccurence;
    wc++;
}
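(For comparison, a more compact way to get the same per-word, per-file counts would be a nested counting map built straight from the index; a minimal sketch with illustrative names, assuming it lives in the same class as Tuple and index:)
// Count occurrences per (word, fileno) pair from the posting lists.
static Map<String, Map<Integer, Integer>> countOccurrences(Map<String, List<Tuple>> index) {
    Map<String, Map<Integer, Integer>> counts = new HashMap<>();
    for (Map.Entry<String, List<Tuple>> entry : index.entrySet()) {
        Map<Integer, Integer> perFile = counts.computeIfAbsent(entry.getKey(), k -> new HashMap<>());
        for (Tuple t : entry.getValue()) {
            perFile.merge(t.fileno, 1, Integer::sum);
        }
    }
    return counts;
}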
I think you may be making this more complicated than it needs to be. I suggest that you take a step back from the computer. First you should take a simple input example and work out, by hand, what the output should be. Next, write a few sentences (in your native language) describing the steps you took to work out this output. From there, you need to refine your description until you can easily translate it into Java.
