How to calculate syllables in text with regex and Java

How to calculate syllables in text with regex and Java - java

I have text as a String and need to calculate number of syllables in each word. I've tried to split all text into array of words and than processed each word separately. I used regular expressions for that. But pattern for syllables doesn't work as it should. Please advice how to change it to calculate correct number of syllables. My initial code.
public int getNumSyllables()
{
String[] words = getText().toLowerCase().split("[a-zA-Z]+");
int count=0;
List <String> tokens = new ArrayList<String>();
for(String word: words){
tokens = Arrays.asList(word.split("[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*"));
count+= tokens.size();
}
return count;
}

This question is from a Java Course of UCSD, am I right?
I think you should provide enough information for this question, so that it won't confused people who want to offer some help. And here I have my own solution, which already been tested by the test case from the local program, also the OJ from UCSD.
You missed some important information about the definition of syllable in this question. Actually I think the key point of this problem is how should you deal with the e. For example, let's say there is a combination of te. And if you put te in the middle of a word, of course it should be counted as a syllable; However if it's at the end of a word, the e should be thought as a silent e in English, so it should not be thought as a syllable.
That's it. And I would like to write down my thought with some pseudo code:
if(last character is e) {
if(it is silent e at the end of this word) {
remove the silent e;
count the rest part as regular;
} else {
count++;
} else {
count it as regular;
}
}
You may find that I am not only using regex to deal with this problem. Actually I have thought about it: can this question really be done only using regex? My answer is: nope, I don't think so. At least now, with the knowledge UCSD gives us, it's too difficult to do that. Regex is a powerful tool, it can map the desired characters very fast. However regex is missing some functionality. Take the te as example again, regex won't be able to think twice when it is facing the word like teate (I made up this word just for example). If our regex pattern would count the first te as syllable, then why the last te not?
Meanwhile, UCSD actually have talked about it on the assignment paper:
If you find yourself doing mental gymnastics to come up with a single regex to count syllables directly, that's usually an indication that there's a simpler solution (hint: consider a loop over characters--see the next hint below). Just because a piece of code (e.g. a regex) is shorter does not mean it is always better.
The hint here is that, you should think this problem together with some loop, combining with regex.
OK, I should finally show my code now:
protected int countSyllables(String word)
{
// TODO: Implement this method so that you can call it from the
// getNumSyllables method in BasicDocument (module 1) and
// EfficientDocument (module 2).
int count = 0;
word = word.toLowerCase();
if (word.charAt(word.length()-1) == 'e') {
if (silente(word)){
String newword = word.substring(0, word.length()-1);
count = count + countit(newword);
} else {
count++;
}
} else {
count = count + countit(word);
}
return count;
}
private int countit(String word) {
int count = 0;
Pattern splitter = Pattern.compile("[^aeiouy]*[aeiouy]+");
Matcher m = splitter.matcher(word);
while (m.find()) {
count++;
}
return count;
}
private boolean silente(String word) {
word = word.substring(0, word.length()-1);
Pattern yup = Pattern.compile("[aeiouy]");
Matcher m = yup.matcher(word);
if (m.find()) {
return true;
} else
return false;
}
You may find that besides from the given method countSyllables, I also create two additional methods countit and silente. countit is for counting the syllables inside the word, silente is trying to figure it out that is this word end with a silent e. And it should also be noticed that the definition of not silent e. For example, the should be consider not silent e, while ate is considered silent e.
And here is the status my code has already passed the test, from both local test case and OJ from UCSD:
And from OJ the test result:
P.S: It should be fine to use something like [^aeiouy] directly, because the word is parsed before we call this method. Also change to lowercase is necessary, that would save a lot of work dealing with the uppercase. What we want is only the number of syllables.
Talking about number, an elegant way is to define count as static, so the private method could directly use count++ inside. But now it's fine.
Feel free to contact me if you still don't get the method of this question :)

Using the concept of user5500105, I have developed the following method to calculate the number of Syllables in a word. The rules are:
consecutive vowels are counted as 1 syllable. eg. "ae" "ou" are 1 syllable
Y is considered as a vowel
e at the end is counted as syllable if e is the only vowel: eg: "the" is one syllable, since "e" at the end is the only vowel while "there" is also 1 syllable because "e" is at the end and there is another vowel in the word.
public int countSyllables(String word) {
ArrayList<String> tokens = new ArrayList<String>();
String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(word.toLowerCase());
while (m.find()) {
tokens.add(m.group());
}
//check if e is at last and e is not the only vowel or not
if( tokens.size() > 1 && tokens.get(tokens.size()-1).equals("e") )
return tokens.size()-1; // e is at last and not the only vowel so total syllable -1
return tokens.size();
}

This gives you a number of syllables vowels in a word:
public int getNumVowels(String word) {
String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(word.toLowerCase());
int count = 0;
while (m.find()) {
count++;
}
return count;
}
You can call it on every word in your string array:
String[] words = getText().split("\\s+");
for (String word : words ) {
System.out.println("Word: " + word + ", vowels: " + getNumVowels(word));
}
Update: as freerunner noted, calculating the number of syllables is more complicated than just counting vowels. One need to take into account combinations like ou, ui, oo, the final silent e and possibly something else. As I am not a native English speaker, I am not sure what the correct algorithm would be.

This is how I do it. This is about as simple an algorithm I could come up with.
public static int syllables(String s) {
final Pattern p = Pattern.compile("([ayeiou]+)");
final String lowerCase = s.toLowerCase();
final Matcher m = p.matcher(lowerCase);
int count = 0;
while (m.find())
count++;
if (lowerCase.endsWith("e"))
count--;
return count < 0 ? 1 : count;
}
I use this in combination with a soundex function to determine if words sound alike. The syllable count improves accuracy of my soundex function.
Note: This is strictly for counting the syllables in a word. I assume you can parse your input for words using something like java.util.StringTokenizer.

Your line
String[] words = getText().toLowerCase().split("[a-zA-Z]+");
is splitting ON words, and returning only the space between words! You want to split on the space between words, as follows:
String[] words = getText().toLowerCase().split("\\s+");

you can do it as the following :
public int getNumSyllables()
{
return getSyllables(getTokens("[a-zA-Z]+"));
}
protected List<String> getWordTokens(String word,String pattern)
{
ArrayList<String> tokens = new ArrayList<String>();
Pattern tokSplitter = Pattern.compile(pattern);
Matcher m = tokSplitter.matcher(word);
while (m.find()) {
tokens.add(m.group());
}
return tokens;
}
private int getSyllables(List<String> tokens)
{
int count=0;
for(String word : tokens)
if(word.toLowerCase().endsWith("e") && getWordTokens(word.toLowerCase().substring(0, word.length()-1), "[aeiouy]+").size() > 0)
count+=getWordTokens(word.toLowerCase().substring(0, word.length()-1), "[aeiouy]+").size();
else
count+=getWordTokens(word.toLowerCase(), "[aeiouy]+").size();
return count;
}

I count the the separately, then split the text based on words which are ended with e.
Then counting the syllables, here is my implementation:
int syllables = 0;
word = word.toLowerCase();
if(word.contains("the ")){
syllables ++;
}
String[] split = word.split("e!$|e[?]$|e,|e |e[),]|e$");
ArrayList<String> tokens = new ArrayList<String>();
Pattern tokSplitter = Pattern.compile("[aeiouy]+");
for (int i = 0; i < split.length; i++) {
String s = split[i];
Matcher m = tokSplitter.matcher(s);
while (m.find()) {
tokens.add(m.group());
}
}
syllables += tokens.size();
I've testesd an all test cases are passed.

You are using method split incorrectly. This method recieve separator. Need write something like this:
String[] words = getText().toLowerCase().split(" ");
But if you want to count the number of syllables, it is enough to count the number of vowels:
String input = "text";
Set<Character> vowel = new HashSet<>();
vowel.add('a');
vowel.add('e');
vowel.add('i');
vowel.add('o');
vowel.add('u');
int count = 0;
for (char c : input.toLowerCase().toCharArray()) {
if (vowel.contains(c)){
count++;
}
}
System.out.println("count = " + count);

Related

Check if String contains exact key word

I have a List of Strings. For each string i wish to see if the first occurrence of the word "joe" is present. I am separating by white space as I do not wish to count the word "joey", for example.
My current code counts every occurrence of the word "joe", how do I edit it so it only counts the first occurrence of the word, and then moves onto the next string in the list.
public int counter(List<String> comments) {
int count = 0;
String word = "joe";
for (String comment : comments) {
String a[] = comment.split(" ");
for (int j = 0; j < a.length; j++) {
if (word.equals(a[j])) {
count++;
}
}
System.out.println(comment);
}
System.out.println("count is " + count);
return count;
}
EDIT
str.add("the hello my the name is joe the this joe is a test");
str.add("i was walking down joe then suddenly joe said hi");
I want my code to return 2 for this (joe has appeared in each String)

You can use regular expressions to check whether the entire String contains the word without the need to split it into individual words first.
A regular expression that matches the word "joe" but not "joey" would be the following:
\bjoe\b
The \b matches the bounds of a word, so the whole expression matches the start of a word, then the word, which must be joe and then the end of the word.
In Java this could be realized using the matches(pattern) method on a String:
"hello joe, how are you?".matches(".*\\bjoe\\b.*");
Note that the matches function requires the regular expression to match the entire string to return true, so we have to add .* at the start and the end, which will match any number of arbitrary characters. (The . matches an arbitrary character, the * signals that you want to match the preceding subexpression an arbitrary number of times)
This regular expression has the advantage, that it still works with punctuation. Just splitting on spaces would fail to recognize joe in the String "hello joe, how are you?"
To wrap it all up, this would be the entire solution:
public int countMatches(List<String> comments) {
int numberOfMatches = 0;
for (String comment : comments) {
if (comment.matches(".*\\bjoe\\b.*")) {
numberOfMatches++;
}
}
return numberOfMatches;
}
If you want to match an arbitrary search term, you have to be careful, because some characters have a special meaning in regular expressions. I recommend using Pattern.quote (import java.util.regex.Pattern;):
String pattern = ".*\\b" + Pattern.quote(word) + "\\b.*";
Then you can match comments with comment.matches(pattern).

A regular expression also works and makes this code a little shorter.
public int counter(List<String> comments) {
String regex = "(.* )?joe( .*)?";
return (int) comments.stream().filter(s -> s.matches(regex)).count();
}
Edit:
The regex of #Paelle is slightly better, use .*\\bjoe\\b.* instead.

In your code you just need to add break; line just after count++; line.
Something like the following:
public int counter(List<String> comments) {
int count = 0;
String word = "joe";
for (String comment : comments) {
String a[] = comment.split(" ");
for (int j = 0; j < a.length; j++) {
if (word.equals(a[j])) {
count++;
break;
}
}
System.out.println(comment);
}
System.out.println("count is " + count);
return count;
}

How to preserve the punctuation when converting words to Pig Latin?

I've been working on a Java program to convert English words to Pig Latin. I've done all the basic rules such as appending -ay, -way, etc., and special cases like question -> estionquay, rhyme -> ymerhay, and I also dealt with capitalization (Thomas -> Omasthay). However, I have one problem that I can't seem to solve: I need to preserve before-and-after punctuation. For example, What? -> Atwhay? Oh!->Ohway! "hello" -> "ellohay" and "Hello!" -> "Ellohay!" This is not a duplicate by the way, I've checked tons of pig latin questions and I cannot seem to figure out how to do it.
Here is my code so far (I've removed all the punctuation but can't figure out how to put it back in):
public static String scrub(String s)
{
String punct = ".,?!:;\"(){}[]<>";
String temp = "";
String pigged = "";
int index, index1, index2, index3 = 0;
for(int i = 0; i < s.length(); i++)
{
if(punct.indexOf(s.charAt(i)) == -1) //if s has no punctuation
{
temp+= s.charAt(i);
}
} //temp equals word without punctuation
pigged = pig(temp); //pig is the piglatin-translator method that I have already written,
//didn't want to put it here because it's almost 200 lines
for(int x = 0; x < s.length(); x++)
{
if(s.indexOf(punct)!= -1)//punctuation exists
{
index = x;
}
}
}
I get that in theory you could search the string for punctuation and that it should be near the beginning or end, so you would have to store the index and replace it after it is "piglatenized", but I keep getting confused about the for loop part. if you do index = x inside the for-loop, you're just replacing index every time the loop runs.
Help would be appreciated greatly! Also, please keep in mind I can't use shortcuts, I can use String methods and such but not things like Collections or ArrayLists (not that you'd need them here), I have to reinvent the wheel, basically. By the way, in case it wasn't clear, I already have the translating-to-piglatin thing down. I only need to preserve the punctuation before and after translating.

If you are allowed to use regular expressions, you can use the following code.
String pigSentence(String sentence) {
Matcher m = Pattern.compile("\\p{L}+").matcher(sentence);
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(pig(m.group()));
}
m.appendTail();
return sb.toString();
}
In plain English, the above code is:
for each word in the sentence:
replace it with pig(word)
But if regular expressions are forbidden, you can try this:
String pigSentence(String sentence) {
char[] chars = sentence.toCharArray();
int i = 0, len = chars.length;
StringBuilder sb = new StringBuilder();
while (i < len) {
while (i < len && !Character.isLetter(chars[i]))
sb.append(chars[i++]);
int wordStart = i;
while (i < len && Character.isLetter(chars[i]))
i++;
int wordEnd = i;
if (wordStart != wordEnd) {
String word = sentence.substring(wordStart, wordEnd - wordStart);
sb.append(pig(word));
}
}
return sb.toString();
}

What you need to do is: remove punctuation if it exists, convert to pig latin, add punctuation back.
Assuming punctuation is always and the end of the string, You can check for punctuation with the following:
String punctuation = "";
for (int i = str.length() - 1; i > 0; i--) {
if (!Character.isLetter(str.charAt(i))) {
punctuation = str.charAt(i) + punctuation;
} else {
break; // Found all punctuation
}
}
str = str.substring(0, str.length() - punctuation.length()); // Remove punctuation
// Convert str to pig latin
// Append punctuation to str

I'd find it troublesome to handle punctuation separate from the translation. For punctuation at the very beginning or very end, you can save them and tag them back on after translating.
But if you remove the punctuations from the middle of the word, it will be rather difficult to replace them back to their correct location. Their indices change from the original word to the pigged word, and by a variable amount. (For some a random example, consider "Hel'lo" and "Quest'ion". The apostrophe shifts left by either 1 or 2, and you won't know which.)
How does your translation method handle punctuation? Do you really need to remove all punctuation before passing it to the translator? I'd suggest having your pigging method handle at least the punctuation in the middle of the word.

Java: Finding the number of word matches in a given string

I am trying to find the number of word matches for a given string and keyword combination, like this:
public int matches(String keyword, String text){
// ...
}
Example:
Given the following calls:
System.out.println(matches("t", "Today is really great, isn't that GREAT?"));
System.out.println(matches("great", "Today is really great, isn't that GREAT?"));
The result should be:
0
2
So far I found this: Find a complete word in a string java
This only returns if the given keyword exists but not how many occurrences. Also, I am not sure if it ignores case sensitivity (which is important for me).
Remember that substrings should be ignored! I only want full words to be found.
UPDATE
I forgot to mention that I also want keywords that are separated via whitespace to match.
E.g.
matches("today is", "Today is really great, isn't that GREAT?")
should return 1

Use a regular expression with word boundaries. It's by far the easiest choice.
int matches = 0;
Matcher matcher = Pattern.compile("\\bgreat\\b", Pattern.CASE_INSENSITIVE).matcher(text);
while (matcher.find()) matches++;
Your milage may vary on some foreign languages though.

How about taking advantage of indexOf ?
s1 = s1.toLowerCase(Locale.US);
s2 = s2.toLowerCase(Locale.US);
int count = 0;
int x;
int y = s2.length();
while((x=s1.indexOf(s2)) != -1){
count++;
s1 = s1.substr(x,x+y);
}
return count;
Efficient version
int count = 0;
int y = s2.length();
for(int i=0; i<=s1.length()-y; i++){
int lettersMatched = 0;
int j=0;
while(s1[i]==s2[j]){
j++;
i++;
lettersMatched++;
}
if(lettersMatched == y) count++;
}
return count;
For more efficient solution, you will have to modify KMP algorithm a little. Just google it, its simple.

well,you can use "split" to separate the words and find if there exists a word matches exactly.
hope that helps!

one option would be RegEx. Basically it sounds like you are looking to match a word with any punctuation on the left or right. so:
" great."
" great!"
" great "
" great,"
"Great"
would all match, but
"greatest"
wouldn't

Find whole words without regex

I need to find whole words in a sentence, but without using regular expressions. So if I wanted to find the word "the" in this sentence: "The quick brown fox jumps over the lazy dog", I'm currently using:
String text = "the, quick brown fox jumps over the lazy dog";
String keyword = "the";
Matcher matcher = Pattern.compile("\\b"+keyword+"\\b").matcher(text);
Boolean contains = matcher.find();
but if I used:
Boolean contains = text.contains(keyword);
and pad the keyword with a space, it won't find the first "the" in the sentence, both because it doesn't have surround whitespaces and the punctuations.
To be clear, I'm building an Android app, and I'm getting memory leaks and it might be because I'm using a regular-expression in a ListView, so it's performing a regular-expression match X number of times, depending on the items in the Listview.

If you needed to check for multiple words and do it without regular expressions you could use StringTokenizer with a space as the delimiter.
You could then build a custom search method. Otherwise, the other solutions using String.contains() or String.indexOf() qualify.

What you do is search for "the". Then for each match you test to see if the surrounding characters are white space (or punctuation), or if the match is at the beginning / end of the string respectively.

public int findWholeWorld(final String text, final String searchString) {
return (" " + text + " ").indexOf(" " + searchString + " ");
}
This will give you the index of the first occurrence of the word "the" or -1 if the word "the" doesn't exist.

Split the string on space, and then see if the resulting array contains your word.

Simply iterate over the characters and keep storing them in a char buffer. Every time you see a whitespace, empty the buffer into a list of words and go on till you reach the end.

In the comments of the StringTokenizer.class:
StringTokenizer is a legacy class that is retained for
compatibility reasons although its use is discouraged in new code. It is
recommended that anyone seeking this functionality use the split
method of String or the java.util.regex package instead.
The following example illustrates how the String.split
method can be used to break up a string into its basic tokens:
String[] result = "this is a test".split("\\s");
for (int x=0; x<result.length; x++)
System.out.println(result[x]);
prints the following output:
this
is
a
test
Iterate through your resulting string array and test for equality and keep a count.
for (String s : result)
{
count++;
}
If this is a homework assignment, tell your lecturer to read up on Java, times have changed. I remember having the exact same stupid questions during school and it does nothing to prepare you for the real world.

I have a project that requires whole word matching, but I can't use regular expressions(because regular expressions escape some keywords), I tried to write my own code to simulate it with non-regular expressions (\bxxx\b), I only know C# and it worked fine.
public static class Finder
{
public static bool Find(string? input, string? pattern, bool isMatchCase = false, bool isMatchWholeWord = false, bool isMatchRegex = false)
{
if (String.IsNullOrWhiteSpace(input) || String.IsNullOrWhiteSpace(pattern))
{
return false;
}
if (!isMatchCase && !isMatchRegex)
{
input = input.ToLower();
pattern = pattern.ToLower();
}
if (isMatchWholeWord && !isMatchRegex)
{
int len = pattern.Length;
int suffix = 0;
while (true)
{
int start = input.IndexOf(pattern, suffix);
if (start == -1)
{
return false;
}
int end = start + len - 1;
int prefix = start - 1;
suffix = end + 1;
bool isPrefixMatched, isSuffixMatched;
if (start == 0)
{
isPrefixMatched = true;
}
else
{
isPrefixMatched = IsWord(input[prefix]) != IsWord(input[start]);
}
if (end == input.Length - 1)
{
isSuffixMatched = true;
}
else
{
isSuffixMatched = IsWord(input[suffix]) != IsWord(input[end]);
}
if (isPrefixMatched && isSuffixMatched)
{
return true;
}
}
}
if (isMatchRegex)
{
if (isMatchWholeWord)
{
if (!pattern.StartsWith(#"\b"))
{
pattern = $#"\b{pattern}";
}
if (!pattern.EndsWith(#"\b"))
{
pattern = $#"{pattern}\b";
}
}
return Regex.IsMatch(input, pattern, isMatchCase ? RegexOptions.None : RegexOptions.IgnoreCase);
}
return input.Contains(pattern);
}
private static bool IsWord(char ch)
{
return Char.IsLetterOrDigit(ch) || ch == '_';
}
}

Count no. of words using Regular expressions in java

How to count the number of times each word appear in a String in Java using Regular Expression?

I don't think a regex can solve your problem completely.
You want to
split a string into words, a regular expression can do this for a very simple definition of word, "parts of a string seperated by whitespace or punctuation", which is not a very good definition even if you just stick to English text
Count the number of occurances of each word derived from step 1. To do that you must store some kind of Mapping, and regexes neither store nor count.
A workable approach could be to
split the inputstring (by either regex or other means) into an array of word-strings
iterate over the array, and building a Map to keep count of each word
iterate over the map to output a list of words and the number of occurances.
If your input is limited to English you still have to consider how you want your algorithm to behave in case of things like they're <->they are etc and compound words. Add other languages to the mix for additional kinds of headaches (different ways of writing the same word, words split into parts, difference in writing depending on where in a sentence the word occurs, etc)

I would split your task into a) identify words and b) count number of each unique word in text.
a) could be solved with splitting the text with a regex.
b) could be solved by building a map with the result from a).
String text = "I like good mules. Mules are good :)";
String[] words = text.split("([\\W\\s]+)");
Map<String, Integer> counts = new HashMap<String, Integer>();
for (String word: words) {
if (counts.containsKey(word)) {
counts.put(word, counts.get(word) + 1);
} else {
counts.put(word, 1);
}
}
result: {Mules=1, are=1, good=2, mules=1, like=1, I=1}

Pattern p = Pattern.compile("\\babba\\b");
Matcher m = p.matcher("abba is abba with abbabba and abba doing abba");
int count = 0;
while(m.find()){
count++;
}
System.out.println(count); //4

Using Guava, this is a one-liner:
Multiset<String> countOfEachWord =
HashMultiset.create(Splitter.on(" ").omitEmptyStrings().split(myString));
then to get the count of "dog" for example you would say:
countOfEachWord.count("dog")

Must you use a regex? If not this might help:
public static int count(final String string, final String substring)
{
int count = 0;
int idx = 0;
while ((idx = string.indexOf(substring, idx)) != -1)
{
idx++;
count++;
}
return count;
}

int CountWords(String t){
return t.split("([[a-z][A-Z][0-9][\\Q-\\E]]+)",-1).length+(t.replaceAll("([[a-z][A-Z][0-9][\\W]]*)", "")).length()-1;
}
English Words(chemical names)+Chinese words

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to calculate syllables in text with regex and Java - java

Your line String[] words = getText().toLowerCase().split("[a-zA-Z]+"); is splitting ON words, and returning only the space between words! You want to split on the space between words, as follows: String[] words = getText().toLowerCase().split("\\s+");

Related

Check if String contains exact key word

How to preserve the punctuation when converting words to Pig Latin?

Java: Finding the number of word matches in a given string

Find whole words without regex

Count no. of words using Regular expressions in java

Categories

Resources