Search for words in a dictionary - data structures and approaches

I'm writing an application and I'm faced with the task of finding possible words in a dictionary based on an input string and a description of what to search for.
The dictionary is a text file (one word per row) and contains around 220,000 words.
An input string can consist of four things:
Normal characters A-Z
Joker *. This can be any character A-Z
Vowel @. The character must be a vowel
Consonant #. The character must be a consonant
For example, the input string *AT@# should return words like "rated", "satin", "later" etc. but not the word "ratio" because it doesn't end with a consonant.
A description is used to tell how the input string should appear in the word. It can be:
Words that begin with. *AT@# as input returns words like "material".
Words that end with. *AT@# as input returns words like "refrigerator".
Words that contain. *AT@# as input returns words like "catered".
Words that fit. *AT@# as input returns words like "hater".
The first thing to figure out is the best data structure for the dictionary. Since I have the descriptions to think about, I'm not sure a tree structure is the best way to go. It seems to be good for prefix searching and I can probably create another tree for reversed words to handle suffix searching. I'm not sure about words that contain a sequence of characters though. A tree doesn't feel right. On the other hand I can't think of anything else.
Which data structures should I use for each of my descriptions?
I'm also thinking about creating a regular expression based on the input string and the description and then matching it against every string in the dictionary. However, I haven't used regular expressions before, so I don't know how expensive this is.
Thanks in advance!
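The regular-expression idea from the question could look roughly like the sketch below. It rests on several assumptions: `@` is taken as the vowel marker and `#` as the consonant marker (the two markers have to differ for the examples to work), `Y` is counted as a consonant, and the names `PatternSearch`, `Mode`, `toRegex` and `search` are made up for illustration. Every word is tested once, so each query is a linear scan over the dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class PatternSearch {
    // Hypothetical names for the four descriptions in the question.
    enum Mode { BEGINS_WITH, ENDS_WITH, CONTAINS, FITS }

    // Translate the input string into a regex for the given description.
    static Pattern toRegex(String input, Mode mode) {
        StringBuilder sb = new StringBuilder();
        for (char c : input.toUpperCase(Locale.ROOT).toCharArray()) {
            switch (c) {
                case '*': sb.append("[A-Z]"); break;              // joker: any letter
                case '@': sb.append("[AEIOU]"); break;            // assumed vowel marker
                case '#': sb.append("[B-DF-HJ-NP-TV-Z]"); break;  // consonant (Y included by assumption)
                default:
                    if (c < 'A' || c > 'Z') {
                        throw new IllegalArgumentException("Unexpected character: " + c);
                    }
                    sb.append(c);                                 // normal character
            }
        }
        String core = sb.toString();
        int flags = Pattern.CASE_INSENSITIVE;
        switch (mode) {
            case BEGINS_WITH: return Pattern.compile(core + "[A-Z]*", flags);
            case ENDS_WITH:   return Pattern.compile("[A-Z]*" + core, flags);
            case CONTAINS:    return Pattern.compile("[A-Z]*" + core + "[A-Z]*", flags);
            default:          return Pattern.compile(core, flags); // FITS
        }
    }

    // Linear scan: test every dictionary word against the compiled pattern.
    static List<String> search(List<String> dictionary, String input, Mode mode) {
        Pattern pattern = toRegex(input, mode);
        List<String> hits = new ArrayList<>();
        for (String word : dictionary) {
            if (pattern.matcher(word).matches()) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("rated", "ratio", "material", "refrigerator", "catered", "hater");
        System.out.println(search(dict, "*AT@#", Mode.FITS)); // prints [rated, hater]
    }
}
```

The pattern is compiled once per query and reused for every word. Whether a full scan of about 220,000 words is fast enough for the application would have to be measured.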

In one of my classes we used a trie data structure to store a dictionary. Each node of the trie has a string that is just its letter, and it has children representing any letter that could follow it based on the words in the dictionary.
For example, if the letter of the first trie node was 'a' and apple, abraham and acorn were in the dictionary, the node would have child nodes of 'p', 'b' and 'c'. Each node also has a boolean that denotes whether or not it is the final letter of any word the dictionary contains. You then check the word's presence in the dictionary by comparing the first and then subsequent letters in your input word with the available child nodes. The advantage is that, in the worst case, a lookup takes about 26 comparisons per letter of the word you are searching for, regardless of how many words the dictionary holds.
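A minimal sketch of such a trie is shown below, under a few assumptions: the root is an empty node rather than a letter node, children are kept in a `HashMap` instead of being scanned one by one, and the names `Trie`, `insert`, `contains` and `hasPrefix` are illustrative rather than taken from the class mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

class Trie {
    private static class Node {
        // One child per letter that can follow this node in some dictionary word.
        final Map<Character, Node> children = new HashMap<>();
        // True if a dictionary word ends at this node.
        boolean endOfWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    // Exact lookup: every letter must exist and the last node must end a word.
    boolean contains(String word) {
        Node node = find(word);
        return node != null && node.endOfWord;
    }

    // Prefix lookup: it is enough that the path exists.
    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    private Node find(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }
}
```

The `hasPrefix` method covers the "begins with" description directly. A second trie built from reversed words would be one way to handle "ends with", as the question suggests.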

Related

RegEx for matching two words with two capital letters

I am creating a business card parser and am running into issues with the regex. I read each line from my file into a String s.
I need to be able to grab a line that contains exactly two words and only two capital letters, and that does not contain certain words. Below is the regex I have used in the past, which works, but I want to turn it into an if/else statement using .matches and !.matches:
else if ((!s.matches(".*\\b(Technologies|Engineer|Systems|Developer|Company|INC|Analyst|Computers|Technology|#)\\b.*") && (s.matches("^(?!(.*[A-Z]){3,})[a-zA-Z]+ [a-zA-Z]+$"))))
{
getName();
}

I'm not sure whether this RegEx is what you are looking for.
Input
Technologies Word Word word
Engineer Word Word word
Systems Word word word
Developer Word word word
Company Word word word
INC Word Word Word
Analyst Word word word
Computers Word word word
Technology Word word word
If not, you can use that same tool to design a RegEx; you only need to add {2} at the end to repeat the group twice.
To exclude certain words, you may not need a second match; you can add the list you wish as a negative lookahead at the beginning of the same RegEx:
^(?!Technologies|Engineer|Anything|Else|You|Wish)([A-Z][a-z]+\s){2}
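In Java the same idea might be sketched as follows. This is a variant, not the exact expression above: the lookahead rejects the listed words anywhere in the line, each of the two words is assumed to start with one capital letter followed by lowercase letters, and the names `NameLine` and `looksLikeName` are hypothetical.

```java
import java.util.regex.Pattern;

class NameLine {
    // Negative lookahead rejects lines containing any listed word,
    // then exactly two capitalised words separated by one whitespace are required.
    private static final Pattern NAME = Pattern.compile(
        "^(?!.*\\b(?:Technologies|Engineer|Systems|Developer|Company|INC|Analyst|Computers|Technology)\\b)"
        + "[A-Z][a-z]+\\s[A-Z][a-z]+$");

    static boolean looksLikeName(String line) {
        return NAME.matcher(line).matches();
    }
}
```

With this shape a single `matches` call replaces the combined `!s.matches(...) && s.matches(...)` test. Names with an inner capital, such as "McDonald", would need a looser word pattern.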

Regular Expression to find words separated with space, backtracking

I have to find words separated by spaces. What is the best practice for doing this with the least backtracking?
I found this solution:
Regex: \d+\s([a-zA-Z]+\\s{0,1}){1,} in a sentence
Input: 1234 this is words in a sentence
So "this is words" has to be matched by the regex ([a-zA-Z]+\\s{0,1}){1,}, and "in a sentence" has to be matched by the constant words in the regex.
But in this case regex101.com reports 4156 steps in the debugger, which is catastrophic backtracking. Is there any way to avoid it?
I have another, more complicated example that takes 86000 steps and does not validate.
The main problem is that I have to find all words separated by spaces, while the regex itself also contains words separated by spaces (the constants). This is where I get catastrophic backtracking.
I have to do this in Java.

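You want to find words separated by spaces, so you should require one or more spaces after each word. Making the space optional lets the engine split a single word between repetitions of the group in many different ways, which is what causes the backtracking. You can use this instead, which takes just 37 steps.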
\d+\s([a-zA-Z]+\s+)+in a sentence
See demo.
https://regex101.com/r/tD0dU9/4
For Java, double every backslash in the string literal, i.e. \d becomes \\d.
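As a Java string literal, the expression above would look like this. It is a small sketch and the class and method names are made up.

```java
import java.util.regex.Pattern;

class SentenceCheck {
    // Same regex as above; each backslash is doubled for the Java string literal.
    private static final Pattern SENTENCE =
        Pattern.compile("\\d+\\s([a-zA-Z]+\\s+)+in a sentence");

    static boolean matches(String input) {
        return SENTENCE.matcher(input).matches();
    }
}
```

For the input from the question, `SentenceCheck.matches("1234 this is words in a sentence")` returns true.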

You could try splitting the String into a String array and then counting the members of the array that match your definition of a word (i.e. skipping anything that is only whitespace or punctuation).
String[] mySplitString = myOriginalString.trim().split("\\s+");
int words = 0; // running word count
for (int x = 0; x < mySplitString.length; x++) {
    // Replace "\\w.*" with your own regex for a word
    if (mySplitString[x].matches("\\w.*")) words++;
}
mySplitString is an array of Strings that have been split from the original string. All whitespace characters are removed, and the substrings that were before, after, or in between whitespace are placed into the new String array. The for-loop runs through the split String array, checks that each array member starts with a word character (a letter, digit, or underscore) and, if so, adds it to the total word count.

If I understood it right, you want to match any word separated by spaces plus the sentence "in a sentence".
You can try the following solution:
(in a sentence)|(\S+)
As seen in this example on regex101.
The regex matches in 61 steps.
You might have problems with punctuation after the "in a sentence" sentence. Run some tests.
I hope this was helpful.

Find the whole word from a Sentence with matching String

I am trying to extract whole words from a sentence.
For example:
Text : This is a question about programming language.
Search Text is : about pro
Result should be : about programming
Basically, I want to get the whole words from the sentence.
I also referred to "How to find a whole word in a String in java", but it searches for matching words, not characters.
I would really appreciate your help.
Thanks

Do it with a regex, something like:
about pro.*?\b
This matches "about pro", then as few characters as possible up to the next word boundary (for example, just before whitespace or a punctuation mark). This way you don't have to create multiple substrings (which is a costly operation).
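In Java this could be sketched as below. It is a variant of the expression above, not the same pattern: `\w*` is used on both sides so that a search text starting in the middle of a word is also expanded, `Pattern.quote` treats the search text literally, and the names `WholeWordFinder` and `expand` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordFinder {
    // Returns the first occurrence of the search text, widened to whole words,
    // or null if the search text does not occur.
    static String expand(String text, String search) {
        Pattern pattern = Pattern.compile("\\w*" + Pattern.quote(search) + "\\w*");
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }
}
```

For the example in the question, `expand("This is a question about programming language.", "about pro")` returns "about programming".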

Regular expression to capture repeated word (more than one 2 repetition in text)

I would like to write a program in Java to capture words that are repeated more than 2 times in a text.
The number of repetitions can be 3, 4, 5 or more.
The repetitions may be spread throughout the text, in no particular order.
I need to keep the repetition count as well.
It should not be case sensitive.
For instance:
the blue book over The red pen is the biggest book I ever seen.
Result: the:3
What would be the proper regular expression pattern for this?

Rather than trying to solve this problem with a regex, I would suggest the following algorithm:
Split your sentence into words (using white space) and store their lowercase versions in a List<String>.
Declare a map as HashMap<String, Integer>.
Iterate over your word List and keep storing the words in the map.
If the map has no entry for the word, add one with key=word, value=1.
Otherwise increment the value by 1, giving you the frequency of each word.
Every time a frequency goes above 2, store that word in your output HashSet<String>.
At the end of the loop, just print the HashSet<String>.
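The steps above can be sketched roughly as follows. Two deliberate deviations, both assumptions on my part: the text is split on non-word characters rather than only on white space so that "seen." counts as "seen", and the result is a map of word to count instead of a HashSet, because the question also asks for the number of repetitions. The names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class RepeatedWords {
    // Returns every word that occurs more than twice, with its count.
    static Map<String, Integer> repeatedMoreThanTwice(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Lowercase first so that "The" and "the" are counted together.
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        // Keep only the words seen more than twice; TreeMap gives a stable order.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 2) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "the blue book over The red pen is the biggest book I ever seen.";
        System.out.println(repeatedMoreThanTwice(text)); // prints {the=3}
    }
}
```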

There is no need for regexes, except for splitting the text into words. Next you just have to use a Map, with the key being the word and the value being the number of repetitions.
When done, you just scan the Map to find the most repeated word.

How to match one word or 2,3,4,5-consecutive words between two documents?

I have two text documents and want to get the word matches between the two documents. The words can match anywhere - for instance, word#5 of doc1 can match word#3 and word#67 of doc2; and then word#23 of doc1 can again match word#3 and word#67 of doc2 - so I want all the matches. Also, aside from one-word matches I want to similarly get consecutive multiple (2-word, 3-word ....15-word etc) word matches between the two documents. How should I approach this in Java? I have been looking at regular expressions but am still not convinced on the exact approach.

First, you need to split the document into bunches of n words (1 word, 2 words, 3 words, ..., n words); those bunches are called n-grams.
Second, create a Set of n-grams from document A. Then, for each n-gram from document B, check if it is in the set.
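A minimal sketch of these two steps, assuming words are lowercased and split on non-word characters (so "didn't" becomes two tokens) and using hypothetical names `NGramMatcher`, `ngrams` and `common`:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NGramMatcher {
    // Step 1: split the text into words and build all runs of n consecutive words.
    static List<String> ngrams(String text, int n) {
        List<String> words = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            grams.add(String.join(" ", words.subList(i, i + n)));
        }
        return grams;
    }

    // Step 2: put the n-grams of document A in a Set, then look up each n-gram of document B.
    static Set<String> common(String docA, String docB, int n) {
        Set<String> fromA = new HashSet<>(ngrams(docA, n));
        Set<String> shared = new LinkedHashSet<>();
        for (String gram : ngrams(docB, n)) {
            if (fromA.contains(gram)) {
                shared.add(gram);
            }
        }
        return shared;
    }
}
```

Calling `common` once for each n from 1 to 15 would cover the phrase lengths mentioned in the question. The sketch only reports which n-grams are shared; recording word positions would need a Map from n-gram to a list of indexes instead of a Set.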

I suggest you maintain a TreeSet of single words for each document; looping through the first TreeSet and checking for matches against the second should accomplish your task.
For the multiple-word part, use the same trick with groups of two words, e.g.
word1 word2 word3 yay!
take word1 word2 and put it in the TreeSet, then take word2 word3 and do the same. You could use regexes to remove punctuation, so the algorithm should consist of three steps:
cleaning the documents of punctuation
"indexing" the words and the groups of consecutive words
comparison phase
About point 1, be careful, because once punctuation is removed these two phrases, for example, become the same:
I ate, the cat didn't, I did
I ate the cat, didn't I? did!
