Count no. of words using Regular expressions in java - java

How can I count the number of times each word appears in a String in Java using a regular expression?

I don't think a regex can solve your problem completely.
You want to
1. Split a string into words. A regular expression can do this only for a very simple definition of "word" ("parts of a string separated by whitespace or punctuation"), which is not a very good definition even if you stick to English text.
2. Count the number of occurrences of each word derived from step 1. To do that you must store some kind of mapping, and regexes neither store nor count.
A workable approach could be to
1. split the input string (by regex or other means) into an array of word strings,
2. iterate over the array, building a Map that keeps a count for each word,
3. iterate over the map to output the list of words and the number of occurrences of each.
If your input is limited to English, you still have to decide how your algorithm should behave for things like "they're" vs. "they are" and for compound words. Add other languages to the mix for additional kinds of headaches (different ways of writing the same word, words split into parts, spelling that depends on where in a sentence the word occurs, etc.).

I would split your task into a) identify words and b) count number of each unique word in text.
a) could be solved with splitting the text with a regex.
b) could be solved by building a map with the result from a).
String text = "I like good mules. Mules are good :)";
String[] words = text.split("([\\W\\s]+)");
Map<String, Integer> counts = new HashMap<String, Integer>();
for (String word: words) {
if (counts.containsKey(word)) {
counts.put(word, counts.get(word) + 1);
} else {
counts.put(word, 1);
}
}
result: {Mules=1, are=1, good=2, mules=1, like=1, I=1}
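The result above counts "Mules" and "mules" separately. If they should be counted together, one possible variant is sketched below. It assumes that lowercasing the whole text is acceptable and that a "word" is simply a run of \w characters; the class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Counts words case-insensitively; a "word" here is a run of \w characters.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys give a stable print order
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split() yields a leading "" when the text starts with a separator
            }
            counts.merge(word, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {are=1, good=2, i=1, like=1, mules=2}
        System.out.println(countWords("I like good mules. Mules are good :)"));
    }
}
```

Map.merge replaces the containsKey/put branches; the TreeMap is only there to make the output order predictable.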

Pattern p = Pattern.compile("\\babba\\b");
Matcher m = p.matcher("abba is abba with abbabba and abba doing abba");
int count = 0;
while(m.find()){
count++;
}
System.out.println(count); //4

Using Guava, this is a one-liner:
Multiset<String> countOfEachWord =
HashMultiset.create(Splitter.on(" ").omitEmptyStrings().split(myString));
then to get the count of "dog" for example you would say:
countOfEachWord.count("dog")

Must you use a regex? If not this might help:
public static int count(final String string, final String substring)
{
int count = 0;
int idx = 0;
while ((idx = string.indexOf(substring, idx)) != -1)
{
idx++;
count++;
}
return count;
}

int CountWords(String t){
return t.split("([[a-z][A-Z][0-9][\\Q-\\E]]+)",-1).length+(t.replaceAll("([[a-z][A-Z][0-9][\\W]]*)", "")).length()-1;
}
English Words(chemical names)+Chinese words

Related

How do I get the last substring starting with +,-,*,/?

If I have expression in a string variable like this 20+567-321, so how can I extract last number 321 from it where operator can be +,-,*,/
If the string expression is just 321, I have to get 321, here there is no operator in the expression
You can do this by splitting your string based on your operators as following:
String[] result = myString.split("[-+*/]");
[-+*/] is a regex character class that specifies the characters at which your string should be split. Here, result[result.length-1] is your required string.
EDIT
As suggested by @ElliotFrisch, a - inside a character class should be escaped unless it is the first or last character. Note also that | has no special meaning inside [...], so it should not be used as a separator there. The following pattern therefore also works:
String[] result = myString.split("[+\\-*/]");
Here is the list of characters that need to be escaped.
Link.
This looks like an assignment for learning programming and algorithms, and I doubt that splitting with a regex is efficient when only the last substring is required.
Start from the end of the string and iterate backwards, at most length-of-the-string times.
Declare an empty string, say Result.
While looping, if one of the operators is found, return Result; otherwise prepend the traversed character to Result.
If the loop ends without finding an operator, return Result.
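The backwards scan described above could be sketched as follows. This is a minimal variant, not the only way to do it: instead of prepending characters one by one, it remembers where the scan stopped and takes a single substring. The class and method names are hypothetical.

```java
class LastOperand {
    // Returns the part of expr after the last +, -, * or /,
    // or the whole string if it contains no operator.
    static String lastOperand(String expr) {
        int i = expr.length() - 1;
        // walk backwards until an operator is found or the string is exhausted
        while (i >= 0 && "+-*/".indexOf(expr.charAt(i)) < 0) {
            i--;
        }
        return expr.substring(i + 1);
    }

    public static void main(String[] args) {
        System.out.println(lastOperand("20+567-321")); // prints 321
        System.out.println(lastOperand("321"));        // prints 321
    }
}
```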
String[] output = s.split("[-+*/]");
String ans = output[output.length-1];
The assumption here is that there are no spaces and the string contains only numbers and arithmetic operators.
[-+*/] is a regular expression that matches only the characters listed inside the square brackets; we split on those characters. The - is placed first because in the middle of a character class, as in [+-/*], it would define a range (here + through /, which also includes , and .).
If you wanna do it with StringTokenizer:
public static void main(String args[])
{
String expression = "20+567-321";
StringTokenizer tokenizer = new StringTokenizer(expression, "+-*/");
int count = tokenizer.countTokens();
if( count > 0){
for(int i=0; i< count; i++){
if(i == count - 1 ){
System.out.println(tokenizer.nextToken());
}else{
tokenizer.nextToken();
}
}
}
}
Recall you can specify multiple delimiters in StringTokenizer.

How to count a word having an apostrophe as two separate words using Java regular expressions

I have a string which is having a word with an apostrophe.
Ex- He is a very very good boy, isn't he?
public class Solution {
public static void main(String[] args) {
String s = "He is a very very good boy, isn't he?";
String[] words = s.split("\\s+");
int itemCount = words.length;
System.out.println(itemCount);
for (int i = 0; i < itemCount; i++) {
String word = words[i];
System.out.println(word);
}
}
}
Output I'm getting is 9 words. But I want the count as 10, by separating isn't as 2 words. How to do it using the above Regular Expression?
It would be more reliable to use the \w construct:
Pattern p = Pattern.compile("(\\w)+");
Matcher m = p.matcher("He is a very very good boy, isn't he?");
while (m.find()) {
System.out.println(m.group(0));
}
Otherwise, you need to handle too many situations manually, for instance: "He's a very good boy.Isn't he?".
You can try using \p{Punct}, which matches punctuation characters such as ? and ! so that they are treated as separators:
String s = "He is a very very good boy, isn't he?";
String[] words = s.split("[\\p{Punct}\\s]+");
int itemCount = words.length;
System.out.println(itemCount);
for (int i = 0; i < itemCount; i++) {
String word = words[i];
System.out.println(word);
}
Split on non-word chars:
String[] words = s.split("\\W+");
I think you want isn't to be treated as is not, and so counted as 2 separate words rather than one.
You can use alternation (|) in the split regular expression:
\\s+|'t
This handles only 't, and it avoids splitting phrases like my friend's birthday, where the apostrophe should not produce another word.
But that's not the end of the story. There are a lot of other contractions that should be considered in such an expression, e.g.
't : isn't, aren't, wasn't, weren't, wouldn't, didn't etc.
's : it's, that's, etc. (This is the difficult one)
'd : I'd, you'd etc.
'll : I'll, they'll etc.
...
So ultimately the following regular expression will solve roughly 90% of the word-counting problem.
\\s+|'t|'d|'ll
The problem with 's (apostrophe S) is that it also follows nouns, as in Dog's, Cat's etc., where it shows possession, and these should not be counted as two separate words. On the other hand, 's is sometimes used to write It is, That is (It's, That's) etc. You can add further expressions to the existing regular expression to differentiate between contractions and possessive apostrophes.
Note: This is only for counting words. It splits isn't into isn and an empty string (the text between 't and the following whitespace); the 't itself is removed.
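To see how the contraction pattern above behaves, here is a small sketch that only counts the pieces produced by split. The class and method names are made up. One caveat that follows from how String.split works: trailing empty strings are dropped, so a contraction at the very end of the input (for example "I can't") is counted as one word, not two.

```java
class ContractionSplit {
    // Counts "words" by splitting on whitespace and on the contraction
    // suffixes 't, 'd and 'll; possessive 's is deliberately left alone.
    static int countWords(String s) {
        return s.split("\\s+|'t|'d|'ll").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("He is a very very good boy, isn't he?")); // prints 10
        System.out.println(countWords("my friend's birthday"));                  // prints 3
    }
}
```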

How to calculate syllables in text with regex and Java

I have text as a String and need to calculate the number of syllables in each word. I've tried to split the text into an array of words and then process each word separately, using regular expressions. But my pattern for syllables doesn't work as it should. Please advise how to change it so that it calculates the correct number of syllables. My initial code:
public int getNumSyllables()
{
String[] words = getText().toLowerCase().split("[a-zA-Z]+");
int count=0;
List <String> tokens = new ArrayList<String>();
for(String word: words){
tokens = Arrays.asList(word.split("[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*"));
count+= tokens.size();
}
return count;
}
This question is from a UCSD Java course, am I right?
I think you should provide enough information in the question, so that it doesn't confuse people who want to help. Here is my own solution, which has been tested against the test cases from the local program and also the OJ from UCSD.
You left out some important information about the definition of a syllable in this question. I think the key point of the problem is how to deal with the letter e. For example, take the combination te. If te is in the middle of a word, it should of course be counted as a syllable. However, if it is at the end of a word, the e should be treated as a silent e in English, so it should not be counted as a syllable.
That's it. Here is my idea as pseudo code:
if(last character is e) {
if(it is silent e at the end of this word) {
remove the silent e;
count the rest part as regular;
} else {
count++;
}
} else {
count it as regular;
}
You may notice that I am not using only regex to deal with this problem. I did think about it: can this really be done using only a regex? My answer is: no, I don't think so. At least with the knowledge UCSD has given us so far, it's too difficult. Regex is a powerful tool and can match the desired characters very fast, but it is missing some functionality. Take te as an example again: a regex cannot treat the two occurrences differently in a word like teate (I made this word up as an example). If our regex pattern counts the first te as a syllable, then why not the last te?
Meanwhile, UCSD actually talks about this in the assignment text:
If you find yourself doing mental gymnastics to come up with a single regex to count syllables directly, that's usually an indication that there's a simpler solution (hint: consider a loop over characters--see the next hint below). Just because a piece of code (e.g. a regex) is shorter does not mean it is always better.
The hint here is that you should approach this problem with a loop, combined with regex.
OK, here is my code:
protected int countSyllables(String word)
{
// TODO: Implement this method so that you can call it from the
// getNumSyllables method in BasicDocument (module 1) and
// EfficientDocument (module 2).
int count = 0;
word = word.toLowerCase();
if (word.charAt(word.length()-1) == 'e') {
if (silente(word)){
String newword = word.substring(0, word.length()-1);
count = count + countit(newword);
} else {
count++;
}
} else {
count = count + countit(word);
}
return count;
}
private int countit(String word) {
int count = 0;
Pattern splitter = Pattern.compile("[^aeiouy]*[aeiouy]+");
Matcher m = splitter.matcher(word);
while (m.find()) {
count++;
}
return count;
}
private boolean silente(String word) {
word = word.substring(0, word.length()-1);
Pattern yup = Pattern.compile("[aeiouy]");
Matcher m = yup.matcher(word);
if (m.find()) {
return true;
} else
return false;
}
Besides the given method countSyllables, I also created two additional methods, countit and silente. countit counts the syllables inside the word; silente tries to figure out whether the word ends with a silent e. Note also the definition of a non-silent e: for example, the e in the is considered not silent, while the e in ate is considered silent.
My code has passed the tests, both the local test cases and the OJ from UCSD.
P.S.: It should be fine to use something like [^aeiouy] directly, because the word is parsed before we call this method. Converting to lowercase is also necessary, as it saves a lot of work dealing with uppercase letters. What we want is only the number of syllables.
Speaking of the number, an elegant way would be to make count static, so the private methods could use count++ directly. But it's fine as it is.
Feel free to contact me if you still don't understand the approach :)
Using the concept of user5500105, I have developed the following method to calculate the number of Syllables in a word. The rules are:
Consecutive vowels are counted as 1 syllable, e.g. "ae" and "ou" are 1 syllable each.
Y is considered a vowel.
An e at the end is counted as a syllable only if it is the only vowel. E.g. "the" is one syllable, since the "e" at the end is the only vowel, while "there" is also 1 syllable, because the final "e" is not counted when there is another vowel in the word.
public int countSyllables(String word) {
ArrayList<String> tokens = new ArrayList<String>();
String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(word.toLowerCase());
while (m.find()) {
tokens.add(m.group());
}
//check if e is at last and e is not the only vowel or not
if( tokens.size() > 1 && tokens.get(tokens.size()-1).equals("e") )
return tokens.size()-1; // e is at last and not the only vowel so total syllable -1
return tokens.size();
}
This gives you the number of vowel groups in a word:
public int getNumVowels(String word) {
String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(word.toLowerCase());
int count = 0;
while (m.find()) {
count++;
}
return count;
}
You can call it on every word in your string array:
String[] words = getText().split("\\s+");
for (String word : words ) {
System.out.println("Word: " + word + ", vowels: " + getNumVowels(word));
}
Update: as freerunner noted, calculating the number of syllables is more complicated than just counting vowels. One needs to take into account combinations like ou, ui, oo, the final silent e and possibly other cases. As I am not a native English speaker, I am not sure what the correct algorithm would be.
This is how I do it. This is about as simple an algorithm as I could come up with.
public static int syllables(String s) {
final Pattern p = Pattern.compile("([ayeiou]+)");
final String lowerCase = s.toLowerCase();
final Matcher m = p.matcher(lowerCase);
int count = 0;
while (m.find())
count++;
if (lowerCase.endsWith("e"))
count--;
return count <= 0 ? 1 : count; // every word has at least one syllable, e.g. "the"
}
I use this in combination with a soundex function to determine if words sound alike. The syllable count improves accuracy of my soundex function.
Note: This is strictly for counting the syllables in a word. I assume you can parse your input for words using something like java.util.StringTokenizer.
Your line
String[] words = getText().toLowerCase().split("[a-zA-Z]+");
is splitting ON words, and returning only the space between words! You want to split on the space between words, as follows:
String[] words = getText().toLowerCase().split("\\s+");
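The difference between the two split calls is easy to check on a tiny input. The sketch below uses a made-up sample string and hypothetical helper names; it only demonstrates what each pattern returns.

```java
import java.util.Arrays;

class SplitDirectionDemo {
    // Splits ON the words: the letters are the separators, so only the gaps remain.
    static String[] splitOnWords(String text) {
        return text.split("[a-zA-Z]+");
    }

    // Splits on the whitespace between words: the words themselves remain.
    static String[] splitOnWhitespace(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        // Two elements: an empty string and a single space
        System.out.println(splitOnWords("hello world").length);                 // prints 2
        System.out.println(Arrays.toString(splitOnWhitespace("hello world")));  // prints [hello, world]
    }
}
```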
you can do it as the following :
public int getNumSyllables()
{
return getSyllables(getTokens("[a-zA-Z]+"));
}
protected List<String> getWordTokens(String word,String pattern)
{
ArrayList<String> tokens = new ArrayList<String>();
Pattern tokSplitter = Pattern.compile(pattern);
Matcher m = tokSplitter.matcher(word);
while (m.find()) {
tokens.add(m.group());
}
return tokens;
}
private int getSyllables(List<String> tokens)
{
int count=0;
for(String word : tokens)
if(word.toLowerCase().endsWith("e") && getWordTokens(word.toLowerCase().substring(0, word.length()-1), "[aeiouy]+").size() > 0)
count+=getWordTokens(word.toLowerCase().substring(0, word.length()-1), "[aeiouy]+").size();
else
count+=getWordTokens(word.toLowerCase(), "[aeiouy]+").size();
return count;
}
I count the word "the" separately, then split the text at words that end with e.
Then I count the syllables. Here is my implementation:
int syllables = 0;
word = word.toLowerCase();
if(word.contains("the ")){
syllables ++;
}
String[] split = word.split("e!$|e[?]$|e,|e |e[),]|e$");
ArrayList<String> tokens = new ArrayList<String>();
Pattern tokSplitter = Pattern.compile("[aeiouy]+");
for (int i = 0; i < split.length; i++) {
String s = split[i];
Matcher m = tokSplitter.matcher(s);
while (m.find()) {
tokens.add(m.group());
}
}
syllables += tokens.size();
I've tested it and all test cases pass.
You are using the split method incorrectly. It takes the separator as its argument. You need to write something like this:
String[] words = getText().toLowerCase().split(" ");
But if you only need a rough estimate of the number of syllables, you can count the number of vowels:
String input = "text";
Set<Character> vowel = new HashSet<>();
vowel.add('a');
vowel.add('e');
vowel.add('i');
vowel.add('o');
vowel.add('u');
int count = 0;
for (char c : input.toLowerCase().toCharArray()) {
if (vowel.contains(c)){
count++;
}
}
System.out.println("count = " + count);

Check amount of particular words in string

If I want to compare the amount one word is used to the other, how would I do that?
It wouldn't be str.contains("cat") > str.contains("dog")
So for example:
if(str.contains("cat") == str.contains("dog")){
System.out.println("true");
}
else
System.out.println("false");
This would hypothetically print true if cat and dog appear the same number of times. But obviously it doesn't. What would I have to do to get it to check that?
To count the number of occurrences of a String in another String, create a function (extracted from here):
The "split and count" method:
// requires: import java.util.regex.Pattern;
public static int countSubstring(String subStr, String str){
// split() returns one more element than the number of delimiter occurrences;
// the "-1" second argument makes it keep trailing empty strings
return str.split(Pattern.quote(subStr), -1).length - 1;
}
The "remove and count the difference" method:
public static int countSubstring(String subStr, String str){
return (str.length() - str.replace(subStr, "").length()) / subStr.length();
}
Then you just have to compare:
return countSubstring("dog", phrase) > countSubstring("cat", phrase);
ADDITIONAL INFORMATION
To compare strings use String::equals, or String::equalsIgnoreCase if you don't care about uppercase and lowercase.
string1.equals(string2);
string1.equalsIgnoreCase(string2);
To find the position of a string in another string (and, by calling it repeatedly, to count occurrences) use indexOf
string1.indexOf(string2, index);
To see if a string contains another string use contains
string1.contains(string2);
String#contains() returns true as soon as the searched string is found at least once; it does not tell you how often it occurs. Thus str.contains("cat") == str.contains("dog") is true whenever both cat and dog are found (or both are missing), independent of how often they occur.
What you could do is use 2 regular expressions and check the number of matches:
int countWords(String input, String word ) {
Pattern p = Pattern.compile( "\\b" + Pattern.quote( word ) + "\\b" ); // quote so regex metacharacters in word are taken literally
int count = 0;
Matcher m = p.matcher( input );
while( m.find() ) {
count++;
}
return count;
}
Usage:
String str = "dog eats dog but cat eats hotdog";
System.out.println("dogs: " + countWords( str, "dog"));
System.out.println("cats: " + countWords( str, "cat"));
Output:
dogs: 2
cats: 1
There are many possible solutions for your problem. I won't show you a full solution, but will try to guide you. Here are some:
split the String according to space(s), iterate over the resulted array and increment the counter when you match the String you're looking for.
use a regex that matches exactly the word you're looking for, there are many useful methods in the Matcher and Pattern classes, go through them.
Java 8 Stream tools has countless methods, you can get the result in one line.
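As an illustration of the stream-based option in the list above, here is a minimal sketch. It assumes that words are separated by non-word characters and that the comparison should ignore case; the class and method names are hypothetical.

```java
import java.util.Arrays;

class StreamWordCount {
    // Counts how often `word` occurs as a whole word in `text`, ignoring case.
    static long countWord(String text, String word) {
        return Arrays.stream(text.split("\\W+"))      // split into words
                     .filter(w -> w.equalsIgnoreCase(word))
                     .count();
    }

    public static void main(String[] args) {
        String str = "dog eats dog but cat eats hotdog";
        System.out.println(countWord(str, "dog")); // prints 2
        System.out.println(countWord(str, "cat")); // prints 1
        System.out.println(countWord(str, "dog") == countWord(str, "cat")); // prints false
    }
}
```

Because the text is split into whole words first, hotdog is not counted as an occurrence of dog.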
You could get the count for dog in the following way:
int index = -1;
int dogcount = 0;
do {
index = str.indexOf("dog", index + 1);
if (index > -1)
dogcount++;
} while (index != -1);
Get the cat count similarly.
Use Apache Commons StringUtils.countMatches, which counts the number of occurrences of one String in another:
https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringUtils.html#countMatches%28java.lang.String,%20java.lang.String%29

Java: Finding the number of word matches in a given string

I am trying to find the number of word matches for a given string and keyword combination, like this:
public int matches(String keyword, String text){
// ...
}
Example:
Given the following calls:
System.out.println(matches("t", "Today is really great, isn't that GREAT?"));
System.out.println(matches("great", "Today is really great, isn't that GREAT?"));
The result should be:
0
2
So far I found this: Find a complete word in a string java
This only returns if the given keyword exists but not how many occurrences. Also, I am not sure if it ignores case sensitivity (which is important for me).
Remember that substrings should be ignored! I only want full words to be found.
UPDATE
I forgot to mention that I also want keywords that are separated via whitespace to match.
E.g.
matches("today is", "Today is really great, isn't that GREAT?")
should return 1
Use a regular expression with word boundaries. It's by far the easiest choice.
int matches = 0;
Matcher matcher = Pattern.compile("\\bgreat\\b", Pattern.CASE_INSENSITIVE).matcher(text);
while (matcher.find()) matches++;
Your mileage may vary with some other languages, though.
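A plain \b boundary treats the apostrophe as a word break, so searching for "t" would match inside "isn't", while the question expects 0. One way to get the expected results is sketched below. It rests on assumptions that are not part of the answer above: an apostrophe is treated as part of a word, whitespace inside the keyword matches any run of whitespace, and the class name is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordMatcher {
    // Counts case-insensitive whole-word matches of keyword in text.
    static int matches(String keyword, String text) {
        // Quote each whitespace-separated part and join the parts with \s+
        String[] parts = keyword.trim().split("\\s+");
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                body.append("\\s+");
            }
            body.append(Pattern.quote(parts[i]));
        }
        // Lookarounds: no word character or apostrophe directly before or after the match
        Pattern p = Pattern.compile("(?<![\\w'])" + body + "(?![\\w'])",
                                    Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Today is really great, isn't that GREAT?";
        System.out.println(matches("t", text));        // prints 0
        System.out.println(matches("great", text));    // prints 2
        System.out.println(matches("today is", text)); // prints 1
    }
}
```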
How about taking advantage of indexOf ?
s1 = s1.toLowerCase(Locale.US);
s2 = s2.toLowerCase(Locale.US);
int count = 0;
int x;
int y = s2.length();
while((x=s1.indexOf(s2)) != -1){
count++;
s1 = s1.substring(x + y); // continue searching after this match
}
return count;
Efficient version
int count = 0;
int y = s2.length();
for (int i = 0; i <= s1.length() - y; i++) {
int j = 0;
// compare s2 with the part of s1 that starts at position i
while (j < y && s1.charAt(i + j) == s2.charAt(j)) {
j++;
}
if (j == y) count++;
}
return count;
For a more efficient solution, you will have to modify the KMP algorithm a little. Just google it, it's simple.
Well, you can use split to separate the words and then check whether any word matches exactly.
Hope that helps!
One option would be a regex. Basically it sounds like you are looking to match a word with any punctuation on the left or right, so:
" great."
" great!"
" great "
" great,"
"Great"
would all match, but
"greatest"
wouldn't
