RegEx for matching two words with two capital letters - java

I am creating a business card parser and am running into issues with the regex. I have a line that I am reading from my file, held in String s.
I need to match a line that contains exactly two words and only two capital letters, and that does not contain certain words. Below is the regex I have used in the past that works, but I want to build this else-if statement from a .matches check and a !.matches check:
else if (!s.matches(".*\\b(Technologies|Engineer|Systems|Developer|Company|INC|Analyst|Computers|Technology|#)\\b.*")
        && s.matches("^(?!(.*[A-Z]){3,})[a-zA-Z]+ [a-zA-Z]+$"))
{
    getName();
}

I'm not sure whether this regex is what you are looking for.
Input
Technologies Word Word word
Engineer Word Word word
Systems Word word word
Developer Word word word
Company Word word word
INC Word Word Word
Analyst Word word word
Computers Word word word
Technology Word word word
If not, you can use the same tool to design a regex; you only need to add {2} after a group to repeat it twice.
To exclude certain words, you may not need a second match; you can add the list of words as a negative lookahead at the beginning of the same regex:
^(?!Technologies|Engineer|Anything|Else|You|Wish)([A-Z][a-z]+\s){2}
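As a rough sketch of how the exclusion list and the two-word check might live in a single pattern, the snippet below combines a negative lookahead for the unwanted words with a lookahead that counts capital letters. The class name `NameLineFilter` and the method `isNameLine` are hypothetical, and the pattern is a variant of the ones above (it rejects a listed word anywhere in the line and requires exactly two capitals), not the exact regex from the question or the answer.

```java
import java.util.regex.Pattern;

class NameLineFilter {
    // Hypothetical helper: one pattern instead of a matches / !matches pair.
    private static final Pattern NAME_LINE = Pattern.compile(
        // Reject the line if any listed word appears as a whole word.
        "^(?!.*\\b(?:Technologies|Engineer|Systems|Developer|Company|INC|Analyst|Computers|Technology)\\b)"
        // Require exactly two capital letters in the whole line.
        + "(?=(?:[^A-Z]*[A-Z]){2}[^A-Z]*$)"
        // Exactly two alphabetic words separated by a single space.
        + "[a-zA-Z]+ [a-zA-Z]+$");

    static boolean isNameLine(String s) {
        return NAME_LINE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNameLine("John Smith"));        // true
        System.out.println(isNameLine("Software Engineer")); // false: contains a listed word
        System.out.println(isNameLine("JOhn Smith"));        // false: three capital letters
    }
}
```

Note that a name such as "McDonald Smith" has three capitals and would be rejected by this sketch, so the capital-letter rule may need loosening for real business cards.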

Related

Java Regex, how to stop matching when a certain symbol is found

I recently got into regex stuff. There is something that is bugging me really badly.
How can I set my regex to match certain words only IF THERE IS NOT a specific symbol at the end of them?
For example, say I have a text with some normal words and some words that end with a capital letter. How do I get my regex to detect a word only if that word doesn't end with a capital letter?
Just some sample texT with wordS. ThiS should be Matched.
From this I want my regex to match all the words except "texT", "wordS" and "ThiS".
Thank you in advance for any help :)
You could do it in a few ways. One involves word boundaries, \b.
\b matches only positions that are
between a word character (letter, digit, or underscore) and a non-word character
at the start or end of the input, next to a word character
So to make sure we are matching whole words, we can surround \w+ with \b, as in \b\w+\b.
Now, to make sure the words found don't end with an uppercase letter, we can simply require that the last character be lowercase (in the range a-z).
So we can rewrite our regex as \b\w+[a-z]\b
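A minimal Java sketch of that regex applied to the sample sentence from the question; the class name `LowercaseEndingWords` and the method `find` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LowercaseEndingWords {
    // Whole words whose last character is a lowercase letter.
    private static final Pattern WORD = Pattern.compile("\\b\\w+[a-z]\\b");

    static List<String> find(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Just, some, sample, with, should, be, Matched]
        System.out.println(find("Just some sample texT with wordS. ThiS should be Matched."));
    }
}
```

Because `\w+[a-z]` needs at least two characters, one-letter words such as "a" are skipped; `\b\w*[a-z]\b` would accept them as well.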

Regular Expression to find words separated with space, backtracking

I have to find words separated by spaces. What is the best practice for doing this with the least backtracking?
I found this solution:
Regex: \d+\s([a-zA-Z]+\\s{0,1}){1,} in a sentence
Input: 1234 this is words in a sentence
So I have to match "this is words" using the regex ([a-zA-Z]+\\s{0,1}){1,}, and "in a sentence" using the constant words in the regex.
But in this case the regex101.com debugger reports 4156 steps and flags catastrophic backtracking. Is there any way to avoid it?
I have another, more complicated example where it takes 86000 steps and does not validate.
The main problem is that I have to find all words separated by spaces, but at the same time the regex itself contains constant words separated by spaces. This is where I get catastrophic backtracking.
I have to do this using Java.
You want to find words separated by spaces, so you should require one or more whitespace characters after each word. You can use this instead, which takes just 37 steps:
\d+\s([a-zA-Z]+\s+)+in a sentence
See demo.
https://regex101.com/r/tD0dU9/4
For Java, double every backslash in the string literal, i.e. \d becomes \\d.
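A small sketch of that regex as a Java string literal with the backslashes doubled. The class name `SentenceMatch` and the method `matches` are hypothetical.

```java
import java.util.regex.Pattern;

class SentenceMatch {
    // \d+\s([a-zA-Z]+\s+)+in a sentence, escaped for a Java string literal.
    private static final Pattern P =
        Pattern.compile("\\d+\\s([a-zA-Z]+\\s+)+in a sentence");

    static boolean matches(String input) {
        return P.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("1234 this is words in a sentence")); // true
        System.out.println(matches("1234 in a sentence"));               // false: no words before the constant
    }
}
```

Requiring `\s+` after every word means the engine has only one way to split a run of letters into words, which is what keeps the step count low.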
You could try splitting the String into a String array, then counting the members of the array that match your definition of a word (skipping, for example, whitespace or punctuation):
int words = 0; // number of tokens that look like words
// Split on runs of whitespace so consecutive spaces do not yield empty tokens
String[] mySplitString = myOriginalString.split("\\s+");
for (int x = 0; x < mySplitString.length; x++) {
    if (mySplitString[x].matches("\\w.*" /* your regex for a word here */)) words++;
}
mySplitString is an array of Strings split from the original string. The whitespace is removed, and the substrings before, after, or between runs of whitespace are placed into the new String array. The for loop runs through the array, checks that each member is a word (here, that it starts with a letter, digit, or underscore), and adds it to the total word count.
If I understood it right, you want to match any word separated by spaces, plus the phrase "in a sentence".
You can try the following solution:
(in a sentence)|(\S+)
As seen in this example on regex101, the regex matches in 61 steps.
You might have problems with punctuation after the "in a sentence" phrase, so run some tests.
I hope this was helpful.
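To show how the two groups of that alternation could be told apart in Java, here is a short sketch. The class name `ConstantOrWord` and the `[constant]` label are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConstantOrWord {
    // Group 1: the constant phrase; group 2: any other run of non-whitespace.
    private static final Pattern P = Pattern.compile("(in a sentence)|(\\S+)");

    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                out.add("[constant] " + m.group(1)); // the phrase matched as a unit
            } else {
                out.add(m.group(2));                 // an ordinary word
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [1234, this, is, words, [constant] in a sentence]
        System.out.println(tokens("1234 this is words in a sentence"));
    }
}
```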

Find the whole word from a Sentence with matching String

I am trying to extract whole words from a sentence based on a partial search string.
For example:
Text : This is a question about programming language.
Search Text is : about pro
Result should be : about programming
Basically I want to get the whole words from the sentence.
I also looked at "How to find a whole word in a String in java", but it searches for matching words, not partial runs of characters.
I would really appreciate your help.
Thanks
Do it with a regex, something like:
about pro.*?\b
This will match "about pro", then as few characters as possible, up to the next word boundary (the position just before whitespace or a punctuation mark). This way you don't have to make multiple substrings (which is a costly operation).
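A hedged Java sketch of the same idea for an arbitrary search string. It swaps `.*?\b` for `\w*` on both sides, so a search that starts or ends in the middle of a word is widened to whole words, and it uses `Pattern.quote` so the search text is treated literally. The class name `WholeWordSearch` and the method `expand` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordSearch {
    // Returns the first occurrence of search, widened to whole words, or null if absent.
    static String expand(String text, String search) {
        Pattern p = Pattern.compile("\\w*" + Pattern.quote(search) + "\\w*");
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "This is a question about programming language.";
        System.out.println(expand(text, "about pro")); // about programming
        System.out.println(expand(text, "out pro"));   // about programming
    }
}
```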

Search for words in a dictionary - data structures and approaches

I'm writing an application and I'm faced with the task to find possible words in a dictionary based on an input string and a description of what to search for.
The dictionary is a text file (one word per row) and contains around 220,000 words.
An input string can consist of four things:
Normal characters A-Z
Joker *. This can be any character A-Z
Vowel @. The character must be a vowel
Consonant #. The character must be a consonant
For example, the input string *AT@# should return words like "rated", "satin", "later" etc. but not the word "ratio" because it doesn't end with a consonant.
A description is used to tell how the input string should appear in the word. It can be:
Words that begin with. *AT@# as input returns words like "material".
Words that end with. *AT@# as input returns words like "refrigerator".
Words that contain. *AT@# as input returns words like "catered"
Words that fit. *AT@# as input returns words like "hater".
The first thing to figure out is the best data structure for the dictionary. Since I have the descriptions to think about, I'm not sure a tree structure is the best way to go. It seems to be good for prefix searching and I can probably create another tree for reversed words to handle suffix searching. I'm not sure about words that contain a sequence of characters though. A tree doesn't feel right. On the other hand I can't think of anything else.
Which data structures shall I use for each of my descriptions?
I'm also thinking about creating a regular expression based on the input string and the description and then matching it against every string in the dictionary. However, I haven't used regular expressions before, so I don't know how expensive this is.
Thanks in advance!
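The regular-expression idea could look roughly like the sketch below, which translates an input string into a Java `Pattern` and scans the word list once. It assumes `@` marks a vowel and `#` a consonant, treats Y as a consonant, and compares in upper case; the names `DictionaryPattern`, `Mode`, `compile`, and `search` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class DictionaryPattern {
    enum Mode { BEGINS_WITH, ENDS_WITH, CONTAINS, FITS }

    // Translate the input string into a regex over upper-case letters.
    static Pattern compile(String input, Mode mode) {
        StringBuilder core = new StringBuilder();
        for (char c : input.toUpperCase(Locale.ROOT).toCharArray()) {
            switch (c) {
                case '*': core.append("[A-Z]"); break;             // joker
                case '@': core.append("[AEIOU]"); break;           // vowel (assumed marker)
                case '#': core.append("[B-DF-HJ-NP-TV-Z]"); break; // consonant
                default:  core.append(Pattern.quote(String.valueOf(c)));
            }
        }
        String regex;
        switch (mode) {
            case BEGINS_WITH: regex = core + ".*"; break;
            case ENDS_WITH:   regex = ".*" + core; break;
            case CONTAINS:    regex = ".*" + core + ".*"; break;
            default:          regex = core.toString(); // FITS: the whole word
        }
        return Pattern.compile(regex);
    }

    // One linear pass over the dictionary, reusing the compiled pattern.
    static List<String> search(List<String> dictionary, String input, Mode mode) {
        Pattern p = compile(input, mode);
        List<String> hits = new ArrayList<>();
        for (String word : dictionary) {
            if (p.matcher(word.toUpperCase(Locale.ROOT)).matches()) {
                hits.add(word);
            }
        }
        return hits;
    }
}
```

The pattern is compiled once per query and each word is checked with a single `matches` call, so the cost grows linearly with the dictionary size; whether that is fast enough for about 220,000 words is best measured on the real file.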
In one of my classes we used a trie data structure to store a dictionary. Each node of the trie holds a single letter and has children representing every letter that could follow it, based on the words in the dictionary.
For example, if the letter of the first trie node was 'a' and apple, abraham and acorn were in the dictionary, the node would have child nodes 'p', 'b' and 'c'. Each node also has a boolean that denotes whether or not it is the final letter of a word the dictionary contains. You then check a word's presence in the dictionary by comparing its first and then subsequent letters with the available child nodes. The advantage is that the worst possible performance is 26 comparisons per letter of the word you are searching for.
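A minimal sketch of such a trie in Java, assuming a map of children per node rather than a fixed 26-slot array. The class and method names (`Trie`, `insert`, `contains`, `hasPrefix`) are illustrative, not taken from the answer.

```java
import java.util.HashMap;
import java.util.Map;

class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord; // true if a dictionary word ends at this node
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    boolean contains(String word) {
        Node node = walk(word);
        return node != null && node.endOfWord;
    }

    boolean hasPrefix(String prefix) {
        return walk(prefix) != null;
    }

    // Follow the letters of s from the root; null if the path does not exist.
    private Node walk(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }
}
```

After inserting "apple", "abraham" and "acorn", `contains("apple")` is true, `contains("app")` is false, and `hasPrefix("ab")` is true. A second trie built from reversed words would support the "ends with" search in the same way.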

Extract string without last char if vowel

With regular expressions, how can I extract the whole word except the last character if it is a vowel?
Inputs:
ansia
bello
ansid
Expected output for each:
ansi
bell
ansid
This is what I tried, but it only works if I have a single vowel at the end:
^(.*[^aeiou])
This is similar to what @Sotirios Delimanolis wrote in his comment, but it uses word boundaries so that it works if you have multiple words on a line.
\b(\w+?)[aeiou]?\b
This works in the following way :
1) \b matches the start of a word. This will work for the first word on a line or a word preceded by a non-word character (a word character is any alpha-numeric character).
2) (\w+?) matches and captures the part of the word you care about.
2a) \w matches any word character.
2b) + makes the \w be matched one or more times
2c) ? makes the + match as few characters as possible. This is important because if there is a vowel at the end of the word we do not want to match it in the capturing group but instead let (3) take care of it.
3) [aeiou]? matches but does not capture a vowel character if one is present
3a) [aeiou] matches a vowel
3b) ? makes the [aeiou] be matched zero or one times
4) \b matches the end of the word. This will work for a word at the end of a line or a word followed by a non-word character.
You said that the tool you are using uses the Java regex implementation and ansid isn't working for you with my regex. I have tested it with pure Java and it seems to be working for me:
Pattern pattern = Pattern.compile("\\b(\\w+?)[aeiou]?\\b");
Matcher matcher = pattern.matcher("ansia ansid cake cat dog");
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
prints
ansi
ansid
cak
cat
dog
Try the regex (\b[a-zA-Z]+?(?=[aeiou]\b))|(\b[a-zA-Z]+?[^aeiou]\b). This captures either a word ending in a consonant OR a word ending in a vowel, and omits the vowel at the end.
This pattern worked for me:
^(.*?)(?=[aeiou]$|$)
If the input can contain several words on one line, as pointed out in another answer, use this pattern instead:
\b([a-z]+?)(?=[aeiou]\b|\b)
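Both lookahead patterns can be tried from Java with a sketch like the following; the class name `FinalVowel` and the methods `strip` and `stripAll` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FinalVowel {
    // Whole-string version: lazily capture up to a final vowel or the end.
    private static final Pattern LINE = Pattern.compile("^(.*?)(?=[aeiou]$|$)");
    // Per-word version for several lowercase words on one line.
    private static final Pattern WORDS = Pattern.compile("\\b([a-z]+?)(?=[aeiou]\\b|\\b)");

    static String strip(String word) {
        Matcher m = LINE.matcher(word);
        return m.find() ? m.group(1) : word;
    }

    static List<String> stripAll(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = WORDS.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(strip("ansia"));                 // ansi
        System.out.println(strip("ansid"));                 // ansid
        System.out.println(stripAll("ansia bello ansid"));  // [ansi, bell, ansid]
    }
}
```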
