Java Regex for counting syllables

I am writing a regex pattern to count all the syllables in a word but I'm having trouble ignoring the case when an "e" is alone at the end of the word.
My pattern right now is:
[aeiouy]+[^$e]
I have been given certain rules that are not completely precise, but I need to follow them for the exercise. The rules are the following:
A syllable is a contiguous sequence of vowels, except for a lonely vowel "e" at the end of a word; the vowels are "aeiouy". For example, the word "sentence" should have only 2 syllables but my pattern counts 3, and the word "there" should have only one syllable but my pattern counts 2.
Thanks in advance for any help!
Edit: With Yassin's example I've noticed that the main issue arises when the "e" is followed by another character (question mark, comma, etc.): the regex then counts another syllable.

Since you're having problems with words that end in "e" and are followed by periods, commas, etc.,
here is a solution, demonstrated on a 12-syllable sentence.
We exclude an "e" that is followed by any of the characters listed in the class below.
Solution
Pattern p = Pattern.compile("[aeiouy]+[^$e(,.:;!?)]");
Matcher m = p.matcher("This is a sentence:this is another sentence.");
int syllables = 0;
while (m.find()) {
    syllables++;
}
System.out.println(syllables);
Output
12
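Two limitations of the pattern above are worth noting: it needs a character after each vowel group, so a single vowel at the very end of the input is not counted, and a final "e" followed by a space is still counted. A minimal sketch of an alternative, assuming the rules exactly as stated in the question, uses a negative lookahead to skip a lone "e" that sits at a word boundary. The class and method names are hypothetical and not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SyllableCounter {
    // A run of vowels, unless the run would start with an "e"
    // that is immediately followed by a word boundary (a lone final "e").
    private static final Pattern SYLLABLE =
            Pattern.compile("(?!e\\b)[aeiouy]+", Pattern.CASE_INSENSITIVE);

    public static int countSyllables(String text) {
        Matcher m = SYLLABLE.matcher(text);
        int syllables = 0;
        while (m.find()) {
            syllables++;
        }
        return syllables;
    }

    public static void main(String[] args) {
        System.out.println(countSyllables("sentence"));   // 2
        System.out.println(countSyllables("there is"));   // 2
        System.out.println(countSyllables("This is a sentence:this is another sentence.")); // 12
    }
}
```

Under these rules a word such as "the" yields 0 syllables; whether that needs a special case depends on the exercise.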

Related

Match consecutive single characters as whole word

While filtering a list of strings, I want to treat consecutive single characters as a whole word,
e.g. for the strings below:
'm g road'
'some a b c d limited'
The first string should match if the user types
"mg" or "m g" or "m g road" or "mg road"
The second string should match if the user types
"some abcd" or "some a b c d" or "abcd" or "a b c d"
How can I do that? Can I achieve this using regex?
I can already handle the order of whole words by searching for the words one by one, but I'm not sure how to treat consecutive single characters as a single word,
e.g. I can handle "mg road" or "road mg" by searching for "mg" and "road" one by one.
EDIT
To make the requirement clearer, here is my test case:
@Test
public void testRemoveSpaceFromConsecutiveSingleCharacters() throws Exception {
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("some a b c d limited").equals("some abcd limited"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("m g road").equals("mg road"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("bank a b c").equals("bank abc"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("bank a b c limited n a").equals("bank abc limited na"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("c road").equals("c road"));
}

1.) Strip out the spaces between space-surrounded single letters in both stringtocheck and userinput.
.replaceAll("(?<=\\b\\w) +(?=\\w\\b)","")
(?<=\b\w) is a lookbehind that checks the spaces are preceded by a \b word boundary and a \w word character
(?=\w\b) is a lookahead that checks the spaces are followed by a \w word character and a \b word boundary
See demo at regexplanet (click Java)
2.) Check if stringtocheck .contains userinput.
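The two steps above can be sketched as follows. This is a minimal sketch: the class name and the matches helper are hypothetical, while the method name removeSpaceFromConsecutiveSingleCharacters is taken from the test case in the question.

```java
public class SingleCharJoiner {
    // Step 1: remove runs of spaces that sit between two single-character words.
    public static String removeSpaceFromConsecutiveSingleCharacters(String s) {
        return s.replaceAll("(?<=\\b\\w) +(?=\\w\\b)", "");
    }

    // Step 2: normalise both strings, then do a plain substring check.
    public static boolean matches(String stringToCheck, String userInput) {
        return removeSpaceFromConsecutiveSingleCharacters(stringToCheck)
                .contains(removeSpaceFromConsecutiveSingleCharacters(userInput));
    }

    public static void main(String[] args) {
        System.out.println(removeSpaceFromConsecutiveSingleCharacters("some a b c d limited")); // some abcd limited
        System.out.println(matches("m g road", "mg road"));                                     // true
        System.out.println(matches("some a b c d limited", "a b c d"));                         // true
    }
}
```

Note that contains is a plain substring check, so it also accepts partial words; add word-boundary handling if that matters for your filter.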

It sounds like you simply want to ignore whitespace. You can easily do this by stripping the whitespace from both the target string and the user input before looking for a match.

You basically want each search term to be modified so that it allows intervening spaces, so
"abcd" becomes the regex "\ba ?b ?c ?d\b"
To achieve this, do this to each word before matching:
word = "\\b" + word.replaceAll("(?<=.)(?=.)", " ?") + "\\b";
The word boundaries \b are necessary to avoid matching "comma bcd" or "abc duck".
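As a rough illustration of that transformation, the sketch below builds the pattern for one search term and runs it against a target string. The class and method names are hypothetical, and it assumes the term contains only letters or digits, so that no regex escaping is required.

```java
import java.util.regex.Pattern;

public class SpacedTermMatcher {
    // Turns "abcd" into the regex \ba ?b ?c ?d\b.
    // Assumption: the term has no regex metacharacters that would need escaping.
    public static Pattern toSpacedPattern(String word) {
        return Pattern.compile("\\b" + word.replaceAll("(?<=.)(?=.)", " ?") + "\\b");
    }

    public static boolean containsTerm(String target, String word) {
        return toSpacedPattern(word).matcher(target).find();
    }

    public static void main(String[] args) {
        System.out.println(toSpacedPattern("abcd").pattern());             // \ba ?b ?c ?d\b
        System.out.println(containsTerm("some a b c d limited", "abcd"));  // true
        System.out.println(containsTerm("comma bcd", "abcd"));             // false
        System.out.println(containsTerm("abc duck", "abcd"));              // false
    }
}
```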

This regex will match all single characters separated by one or more spaces
(^(\w\s+)+)|(\s+\w)+$|((\s+\w)+\s+)

The following regex (in multiline mode) could help you out:
^(?<first>\w+)(?<chars>(?:.(?!(?:\b\w{2,}\b)))*)
# assure that it is the beginning of the line
# capture as many word characters as possible in the first group "first"
# the construction afterwards consumes everything up to (not including)
# a word which has at least two characters...
# ... and saves it to the group called "chars"
You would only need to replace the whitespaces in the second group (aka "chars").
See a demo on regex101.com

str = str.replaceAll("\\s","");

Regular Expression to find words separated with space, backtracking

I have to find words separated by spaces. What is the best practice for doing it with the least backtracking?
I found this solution:
Regex: \d+\s([a-zA-Z]+\s{0,1}){1,} in a sentence
Input: 1234 this is words in a sentence
So "this is words" has to be matched by the regex part ([a-zA-Z]+\s{0,1}){1,}, and "in a sentence" has to be matched by the constant words in the regex.
But in this case the regex101.com debugger reports 4156 steps, which is catastrophic backtracking. Is there any way to avoid it?
I have another, more complicated example where it takes 86000 steps and does not validate.
The main problem is that I have to find all words separated by spaces, but at the same time the regex itself contains words separated by spaces (constants). This is where I get catastrophic backtracking.
I have to do this using Java.

You want to find words separated by spaces, so you should require one or more spaces between them. You can use this instead, which takes just 37 steps:
\d+\s([a-zA-Z]+\s+)+in a sentence
See demo.
https://regex101.com/r/tD0dU9/4
For Java, double every backslash, i.e. \d becomes \\d.
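In Java source, the escaped form of that regex could look like the sketch below. The second pattern is an assumed variant that is not part of the answer: it uses possessive quantifiers (++) so that a word or a run of spaces, once matched, is not split up again during backtracking. Whether it lowers the step count for your real patterns would need measuring. The class and field names are hypothetical.

```java
import java.util.regex.Pattern;

public class WordsBeforeConstant {
    // The answer's regex with every backslash doubled for a Java string literal.
    public static final Pattern ANSWER =
            Pattern.compile("\\d+\\s([a-zA-Z]+\\s+)+in a sentence");

    // Assumed variant: possessive inner quantifiers; the outer + can still
    // give back whole words so that the constant suffix can match.
    public static final Pattern POSSESSIVE =
            Pattern.compile("\\d+\\s(?:[a-zA-Z]++\\s++)+in a sentence");

    public static void main(String[] args) {
        String input = "1234 this is words in a sentence";
        System.out.println(ANSWER.matcher(input).matches());     // true
        System.out.println(POSSESSIVE.matcher(input).matches()); // true
        System.out.println(POSSESSIVE.matcher("1234 this is words in a paragraph").matches()); // false
    }
}
```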

You could try splitting the String into a String array, then counting the members of the array that match your definition of a word, skipping those that do not (e.g. whitespace or punctuation):
String[] mySplitString = myOriginalString.split(" ");
int words = 0; // running total of array members that count as words
for (int x = 0; x < mySplitString.length; x++) {
    // Replace "\\w.*" with your own regex for a word
    if (mySplitString[x].matches("\\w.*")) words++;
}
mySplitString is an array of Strings split from the original string. The spaces are removed, and the substrings that were before, after, or between spaces are placed into the new String array. The for-loop runs through the split array, checks that each member starts with a word character (a letter, digit, or underscore), and adds it to the total word count.

If I understood correctly, you want to match any word separated by spaces, plus the phrase "in a sentence".
You can try the following solution:
(in a sentence)|(\S+)
As seen in this example on regex101: Example
The regex matches in 61 steps.
You might have problems with punctuation after the phrase "in a sentence", so run some tests.
I hope this was helpful.

Split Strings in Java

I'm trying to clean up a very noisy (due to OCR) dataset of names and email addresses, and one problem is multiple names in one entry, for example
"Fenner, Robert: Fishbume, Howard" should be "Fenner, Robert" and "Fishbume, Howard"
or "Fendrich, Karen N., Ricci, Vincent" should be "Fendrich, Karen N." and "Ricci, Vincent"
How could I use regex to find entries where two strings, each of which itself contains a comma, are separated by a comma or colon, and then split the entry?
Other variations of the problem:
"'Emily Phaup ' Ryan, Thomas M" -> "Emily Phaup", "Ryan, Thomas M"
"A Lilly, Alisia Rudd, Andrew McComb, Daniel Lisbon, David Compton"
->"A Lilly", "Alisia Rudd", "Andrew McComb", "Daniel Lisbon", "David Compton"
"Abigail.Perlmangus.pm.com Jay.Poole#us.pm.com" -> "Abigail.Perlmangus.pm.com", "Jay.Poole#us.pm.com"
and a couple more.
I know that it might not be possible to separate all of these occurrences (especially without accidentally separating correct names), but separating some of them would definitely help.
EDIT: I guess my question is a bit too broad, so I'll narrow it down a bit:
Is there a way to find Strings with the format "string1,string2, string3,string4" (the strings can contain any kind of characters and whitespace) and split them into two separate strings: "string1,string2" and "string3,string4"?
Could someone also give me some pointers on how to do it? I'm quite inexperienced with regex.

Well, I would try something like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Group 1: word, separator (, or : or end), optional whitespace, word
        String regex = "(\\w+(,|:|$)\\s*\\w+)(,|:|$)";
        Pattern pattern = Pattern.compile(regex);
        String[] tests = {
            "Fenner, Robert: Fishbume, Howard",
            "string1, string2, string3, string4"
        };
        for (String test : tests) {
            Matcher matcher = pattern.matcher(test);
            while (matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }
}
Output:
Fenner, Robert
Fishbume, Howard
string1, string2
string3, string4
This won't work for all your cases, but it answers your last edit.
What I've done is search for word characters (\w+) followed by either , or : or the end of the string, then any whitespace and more word characters, followed again by , or : or the end of the string.
Regex detail
(\w+(,|:|$)\s*\w+)(,|:|$)
1st Capturing group (\w+(,|:|$)\s*\w+)
  \w+ match any word character [a-zA-Z0-9_]
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
  2nd Capturing group (,|:|$)
    1st Alternative: ,
      , matches the character , literally
    2nd Alternative: :
      : matches the character : literally
    3rd Alternative: $
      $ assert position at end of the string
  \s* match any white space character [\r\n\t\f ]
    Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
  \w+ match any word character [a-zA-Z0-9_]
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
3rd Capturing group (,|:|$)
  1st Alternative: ,
    , matches the character , literally
  2nd Alternative: :
    : matches the character : literally
  3rd Alternative: $
    $ assert position at end of the string

My honest recommendation is to take a representative sample to an online Regex calculator and play with it until you can stomach the output.
As you've noted, the input is not regular enough to really harness Regex. But you may be able to hack it down a little bit at least. There's probably not gonna be a one true perfect answer to that nastiness.

Extract string without last char if vowel

With regular expressions, how can I extract the whole word except the last character if it is a vowel?
Inputs:
ansia
bello
ansid
Expected output for each:
ansi
bell
ansid
This is what I tried, but it only works if I have a single vowel at the end:
^(.*[^aeiou])

Similar to what @Sotirios Delimanolis wrote in his comment, but using word boundaries so that it works if you have multiple words in a line.
\b(\w+?)[aeiou]?\b
This works in the following way:
1) \b matches the start of a word. This will work for the first word on a line or a word preceded by a non-word character (a word character is any alphanumeric character or an underscore).
2) (\w+?) matches and captures the part of the word you care about.
   2a) \w matches any word character.
   2b) + makes the \w be matched one or more times.
   2c) ? makes the + match as few characters as possible. This is important because if there is a vowel at the end of the word we do not want to match it in the capturing group but instead let (3) take care of it.
3) [aeiou]? matches but does not capture a vowel character if one is present.
   3a) [aeiou] matches a vowel.
   3b) ? makes the [aeiou] be matched zero or one times.
4) \b matches the end of the word. This will work for a word at the end of a line or a word followed by a non-word character.
You said that the tool you are using uses the Java regex implementation and ansid isn't working for you with my regex. I have tested it with pure Java and it seems to be working for me:
Pattern pattern = Pattern.compile("\\b(\\w+?)[aeiou]?\\b");
Matcher matcher = pattern.matcher("ansia ansid cake cat dog");
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
prints
ansi
ansid
cak
cat
dog

Try the regex (\b[a-zA-Z]+?(?=[aeiou]\b))|(\b[a-zA-Z]+?[^aeiou]\b). This captures either a word ending in a consonant OR a word ending in a vowel, and omits the vowel at the end.

This pattern worked for me:
^(.*?)(?=[aeiou]$|$)
Demo
If the input is a line that can contain several words, as pointed out below,
use this pattern instead:
\b([a-z]+?)(?=[aeiou]\b|\b)
Demo
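A minimal sketch of how the first, whole-string pattern might be used from Java, assuming each input is a single word with no line breaks. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FinalVowelStripper {
    // Lazily capture everything up to an optional final vowel.
    private static final Pattern WORD = Pattern.compile("^(.*?)(?=[aeiou]$|$)");

    public static String withoutFinalVowel(String word) {
        Matcher m = WORD.matcher(word);
        // Fall back to the input unchanged if the pattern does not match.
        return m.find() ? m.group(1) : word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"ansia", "bello", "ansid"}) {
            System.out.println(withoutFinalVowel(w)); // ansi, bell, ansid
        }
    }
}
```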

Java literate text word parsing regexp

At first I was happy with [A-Za-z]+
Now I need to parse words that end with the letter "s", but I should skip words whose first 2 or more letters are upper-case.
I tried something like [\n\\ ][A-Za-z]{0,1}[a-z]*s[ \\.\\,\\?\\!\\:]+ but its first part, [\n\\ ], for some reason doesn't match the beginning of the line.
Here is an example:
the text is Denis goeS to school every day!
but the only parsed word is goeS
Any ideas?

What about
\b[A-Z]?[a-z]*s\b
The \b is a word boundary, which I assume is what you wanted. The ? is the shorter form of {0,1}.

Try this:
Pattern p = Pattern.compile("\\b([A-Z]?[a-z]*[sS])\\b");
Matcher m = p.matcher("Denis goeS to school every day!");
while (m.find()) {
    System.out.println(m.group(1));
}
The regex matches every word that starts with at most one upper-case letter, contains only lower-case letters in the middle, and ends in either s or S.
In your example this would match Denis and goeS. If you want to match only an upper-case S, change the expression to "\\b([A-Z]?[a-z]*[S])\\b", which would match goeS and GoeS but not GOeS, gOeS or goES.
