Java: Finding the number of word matches in a given string

Java: Finding the number of word matches in a given string - java

I am trying to find the number of word matches for a given string and keyword combination, like this:
public int matches(String keyword, String text){
// ...
}
Example:
Given the following calls:
System.out.println(matches("t", "Today is really great, isn't that GREAT?"));
System.out.println(matches("great", "Today is really great, isn't that GREAT?"));
The result should be:
0
2
So far I found this: Find a complete word in a string java
This only returns if the given keyword exists but not how many occurrences. Also, I am not sure if it ignores case sensitivity (which is important for me).
Remember that substrings should be ignored! I only want full words to be found.
UPDATE
I forgot to mention that I also want keywords that are separated via whitespace to match.
E.g.
matches("today is", "Today is really great, isn't that GREAT?")
should return 1

Use a regular expression with word boundaries. It's by far the easiest choice.
int matches = 0;
Matcher matcher = Pattern.compile("\\bgreat\\b", Pattern.CASE_INSENSITIVE).matcher(text);
while (matcher.find()) matches++;
Your milage may vary on some foreign languages though.

How about taking advantage of indexOf ?
s1 = s1.toLowerCase(Locale.US);
s2 = s2.toLowerCase(Locale.US);
int count = 0;
int x;
int y = s2.length();
while((x=s1.indexOf(s2)) != -1){
count++;
s1 = s1.substr(x,x+y);
}
return count;
Efficient version
int count = 0;
int y = s2.length();
for(int i=0; i<=s1.length()-y; i++){
int lettersMatched = 0;
int j=0;
while(s1[i]==s2[j]){
j++;
i++;
lettersMatched++;
}
if(lettersMatched == y) count++;
}
return count;
For more efficient solution, you will have to modify KMP algorithm a little. Just google it, its simple.

well,you can use "split" to separate the words and find if there exists a word matches exactly.
hope that helps!

one option would be RegEx. Basically it sounds like you are looking to match a word with any punctuation on the left or right. so:
" great."
" great!"
" great "
" great,"
"Great"
would all match, but
"greatest"
wouldn't

Related

String manipulation of function names

For this Kata, i am given random function names in the PEP8 format and i am to convert them to camelCase.
(input)get_speed == (output)getSpeed ....
(input)set_distance == (output)setDistance
I have a understanding on one way of doing this written in pseudo-code:
loop through the word,
if the letter is an underscore
then delete the underscore
then get the next letter and change to a uppercase
endIf
endLoop
return the resultant word
But im unsure the best way of doing this, would it be more efficient to create a char array and loop through the element and then when it comes to finding an underscore delete that element and get the next index and change to uppercase.
Or would it be better to use recursion:
function camelCase takes a string
if the length of the string is 0,
then return the string
endIf
if the character is a underscore
then change to nothing,
then find next character and change to uppercase
return the string taking away the character
endIf
finally return the function taking the first character away
Any thoughts please, looking for a good efficient way of handing this problem. Thanks :)

I would go with this:
divide given String by underscore to array
from second word until end take first letter and convert it to uppercase
join to one word
This will work in O(n) (go through all names 3 time). For first case, use this function:
str.split("_");
for uppercase use this:
String newName = substring(0, 1).toUpperCase() + stre.substring(1);
But make sure you check size of the string first...
Edited - added implementation
It would look like this:
public String camelCase(String str) {
if (str == null ||str.trim().length() == 0) return str;
String[] split = str.split("_");
String newStr = split[0];
for (int i = 1; i < split.length; i++) {
newStr += split[i].substring(0, 1).toUpperCase() + split[i].substring(1);
}
return newStr;
}
for inputs:
"test"
"test_me"
"test_me_twice"
it returns:
"test"
"testMe"
"testMeTwice"

It would be simpler to iterate over the string instead of recursing.
String pep8 = "do_it_again";
StringBuilder camelCase = new StringBuilder();
for(int i = 0, l = pep8.length(); i < l; ++i) {
if(pep8.charAt(i) == '_' && (i + 1) < l) {
camelCase.append(Character.toUpperCase(pep8.charAt(++i)));
} else {
camelCase.append(pep8.charAt(i));
}
}
System.out.println(camelCase.toString()); // prints doItAgain

The question you pose is whether to use an iterative or a recursive approach. For this case I'd go for the recursive approach because it's straightforward, easy to understand doesn't require much resources (only one array, no new stackframe etc), though that doesn't really matter for this example.
Recursion is good for divide-and-conquer problems, but I don't see that fitting the case well, although it's possible.
An iterative implementation of the algorithm you described could look like the following:
StringBuilder buf = new StringBuilder(input);
for(int i = 0; i < buf.length(); i++){
if(buf.charAt(i) == '_'){
buf.deleteCharAt(i);
if(i != buf.length()){ //check fo EOL
buf.setCharAt(i, Character.toUpperCase(buf.charAt(i)));
}
}
}
return buf.toString();
The check for the EOL is not part of the given algorithm and could be ommitted, if the input string never ends with '_'

How to calculate syllables in text with regex and Java

I have text as a String and need to calculate number of syllables in each word. I've tried to split all text into array of words and than processed each word separately. I used regular expressions for that. But pattern for syllables doesn't work as it should. Please advice how to change it to calculate correct number of syllables. My initial code.
public int getNumSyllables()
{
String[] words = getText().toLowerCase().split("[a-zA-Z]+");
int count=0;
List <String> tokens = new ArrayList<String>();
for(String word: words){
tokens = Arrays.asList(word.split("[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*"));
count+= tokens.size();
}
return count;
}

This question is from a Java Course of UCSD, am I right?
I think you should provide enough information for this question, so that it won't confused people who want to offer some help. And here I have my own solution, which already been tested by the test case from the local program, also the OJ from UCSD.
You missed some important information about the definition of syllable in this question. Actually I think the key point of this problem is how should you deal with the e. For example, let's say there is a combination of te. And if you put te in the middle of a word, of course it should be counted as a syllable; However if it's at the end of a word, the e should be thought as a silent e in English, so it should not be thought as a syllable.
That's it. And I would like to write down my thought with some pseudo code:
if(last character is e) {
if(it is silent e at the end of this word) {
remove the silent e;
count the rest part as regular;
} else {
count++;
} else {
count it as regular;
}
}
You may find that I am not only using regex to deal with this problem. Actually I have thought about it: can this question really be done only using regex? My answer is: nope, I don't think so. At least now, with the knowledge UCSD gives us, it's too difficult to do that. Regex is a powerful tool, it can map the desired characters very fast. However regex is missing some functionality. Take the te as example again, regex won't be able to think twice when it is facing the word like teate (I made up this word just for example). If our regex pattern would count the first te as syllable, then why the last te not?
Meanwhile, UCSD actually have talked about it on the assignment paper:
If you find yourself doing mental gymnastics to come up with a single regex to count syllables directly, that's usually an indication that there's a simpler solution (hint: consider a loop over characters--see the next hint below). Just because a piece of code (e.g. a regex) is shorter does not mean it is always better.
The hint here is that, you should think this problem together with some loop, combining with regex.
OK, I should finally show my code now:
protected int countSyllables(String word)
{
// TODO: Implement this method so that you can call it from the
// getNumSyllables method in BasicDocument (module 1) and
// EfficientDocument (module 2).
int count = 0;
word = word.toLowerCase();
if (word.charAt(word.length()-1) == 'e') {
if (silente(word)){
String newword = word.substring(0, word.length()-1);
count = count + countit(newword);
} else {
count++;
}
} else {
count = count + countit(word);
}
return count;
}
private int countit(String word) {
int count = 0;
Pattern splitter = Pattern.compile("[^aeiouy]*[aeiouy]+");
Matcher m = splitter.matcher(word);
while (m.find()) {
count++;
}
return count;
}
private boolean silente(String word) {
word = word.substring(0, word.length()-1);
Pattern yup = Pattern.compile("[aeiouy]");
Matcher m = yup.matcher(word);
if (m.find()) {
return true;
} else
return false;
}
You may find that besides from the given method countSyllables, I also create two additional methods countit and silente. countit is for counting the syllables inside the word, silente is trying to figure it out that is this word end with a silent e. And it should also be noticed that the definition of not silent e. For example, the should be consider not silent e, while ate is considered silent e.
And here is the status my code has already passed the test, from both local test case and OJ from UCSD:
And from OJ the test result:
P.S: It should be fine to use something like [^aeiouy] directly, because the word is parsed before we call this method. Also change to lowercase is necessary, that would save a lot of work dealing with the uppercase. What we want is only the number of syllables.
Talking about number, an elegant way is to define count as static, so the private method could directly use count++ inside. But now it's fine.
Feel free to contact me if you still don't get the method of this question :)

Using the concept of user5500105, I have developed the following method to calculate the number of Syllables in a word. The rules are:
consecutive vowels are counted as 1 syllable. eg. "ae" "ou" are 1 syllable
Y is considered as a vowel
e at the end is counted as syllable if e is the only vowel: eg: "the" is one syllable, since "e" at the end is the only vowel while "there" is also 1 syllable because "e" is at the end and there is another vowel in the word.
public int countSyllables(String word) {
ArrayList<String> tokens = new ArrayList<String>();
String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(word.toLowerCase());
while (m.find()) {
tokens.add(m.group());
}
//check if e is at last and e is not the only vowel or not
if( tokens.size() > 1 && tokens.get(tokens.size()-1).equals("e") )
return tokens.size()-1; // e is at last and not the only vowel so total syllable -1
return tokens.size();
}

This gives you a number of syllables vowels in a word:
public int getNumVowels(String word) {
String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(word.toLowerCase());
int count = 0;
while (m.find()) {
count++;
}
return count;
}
You can call it on every word in your string array:
String[] words = getText().split("\\s+");
for (String word : words ) {
System.out.println("Word: " + word + ", vowels: " + getNumVowels(word));
}
Update: as freerunner noted, calculating the number of syllables is more complicated than just counting vowels. One need to take into account combinations like ou, ui, oo, the final silent e and possibly something else. As I am not a native English speaker, I am not sure what the correct algorithm would be.

This is how I do it. This is about as simple an algorithm I could come up with.
public static int syllables(String s) {
final Pattern p = Pattern.compile("([ayeiou]+)");
final String lowerCase = s.toLowerCase();
final Matcher m = p.matcher(lowerCase);
int count = 0;
while (m.find())
count++;
if (lowerCase.endsWith("e"))
count--;
return count < 0 ? 1 : count;
}
I use this in combination with a soundex function to determine if words sound alike. The syllable count improves accuracy of my soundex function.
Note: This is strictly for counting the syllables in a word. I assume you can parse your input for words using something like java.util.StringTokenizer.

Your line
String[] words = getText().toLowerCase().split("[a-zA-Z]+");
is splitting ON words, and returning only the space between words! You want to split on the space between words, as follows:
String[] words = getText().toLowerCase().split("\\s+");

you can do it as the following :
public int getNumSyllables()
{
return getSyllables(getTokens("[a-zA-Z]+"));
}
protected List<String> getWordTokens(String word,String pattern)
{
ArrayList<String> tokens = new ArrayList<String>();
Pattern tokSplitter = Pattern.compile(pattern);
Matcher m = tokSplitter.matcher(word);
while (m.find()) {
tokens.add(m.group());
}
return tokens;
}
private int getSyllables(List<String> tokens)
{
int count=0;
for(String word : tokens)
if(word.toLowerCase().endsWith("e") && getWordTokens(word.toLowerCase().substring(0, word.length()-1), "[aeiouy]+").size() > 0)
count+=getWordTokens(word.toLowerCase().substring(0, word.length()-1), "[aeiouy]+").size();
else
count+=getWordTokens(word.toLowerCase(), "[aeiouy]+").size();
return count;
}

I count the the separately, then split the text based on words which are ended with e.
Then counting the syllables, here is my implementation:
int syllables = 0;
word = word.toLowerCase();
if(word.contains("the ")){
syllables ++;
}
String[] split = word.split("e!$|e[?]$|e,|e |e[),]|e$");
ArrayList<String> tokens = new ArrayList<String>();
Pattern tokSplitter = Pattern.compile("[aeiouy]+");
for (int i = 0; i < split.length; i++) {
String s = split[i];
Matcher m = tokSplitter.matcher(s);
while (m.find()) {
tokens.add(m.group());
}
}
syllables += tokens.size();
I've testesd an all test cases are passed.

You are using method split incorrectly. This method recieve separator. Need write something like this:
String[] words = getText().toLowerCase().split(" ");
But if you want to count the number of syllables, it is enough to count the number of vowels:
String input = "text";
Set<Character> vowel = new HashSet<>();
vowel.add('a');
vowel.add('e');
vowel.add('i');
vowel.add('o');
vowel.add('u');
int count = 0;
for (char c : input.toLowerCase().toCharArray()) {
if (vowel.contains(c)){
count++;
}
}
System.out.println("count = " + count);

java regex replace any double letter in word with single

I've been searching for hours but can't find an answer, I apologize if this has been answered before.
I'm trying to check each word in a message for any double letters and remove the extra letter, words like wall or doll for example would become wal or dol. the purpose is for a fake language translation for a game, so far I've gottan as far as identifying the double letters but don't know how to replace them.
here's my code so far:
public String[] removeDouble(String[] words){
Pattern pattern = Pattern.compile("(\\w)\\1+");
for (int i = 0; i < words.length; i++){
Matcher matcher = pattern.matcher(words[i]);
if (matcher.find()){
words[i].replaceAll("what to replace with?");
}
}
return words;
}

You can do the whole replacement operation in one statement if you use back references:
for (int i = 0; i < words.length; i++)
words[i] = words[i].replaceAll("(.)\\1", "$1");
Note that you must assign the value returned from string methods that (appear to) change strings, because they return new strings rather than mutate the string.

String.replaceAll does not modify the string in-place. (Java String is immutable) You need assign the returned value back.
And the String.replaceAll accepts two parameters.
Replace following line:
words[i].replaceAll("what to replace with?");
with:
words[i] = "what to replace with?";

Efficient Regular Expression for big data, if a String contains a word

I have a code that works but is extremely slow. This code determines whether a string contains a keyword. The requirements I have need to be efficient for hundreds of keywords that I will search for in thousands of documents.
What can I do to make finding the keywords (without falsely returning a word that contains the keyword) efficiently?
For example:
String keyword="ac";
String document"..." //few page long file
If i use :
if(document.contains(keyword) ){
//do something
}
It will also return true if document contains a word like "account";
so I tried to use regular expression as follows:
String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
//do something
}
Summary:
This is the summary: Hopefully it will be useful to some one else:
My regular expression would work but extremely impractical while
working with big data. (it didn't terminate)
#anubhava perfected the regular expression. it was easy to
understand and implement. It managed to terminate which is a big
thing. but it was still a bit slow. (Roughly about 240 seconds)
#Tomalak solution is abit complex to implement and understand but it
was the fastest solution. so hats off mate.(18 seconds)
so #Tomalak solution was ~15 times faster than #anubhava.

Don't think you need to have .* in your regex.
Try this regex:
String pattern = "\\b"+ Pattern.quote(keyword) + "\\b";
Here \\b is used for word boundary. If the keyword can contain special characters, make sure they are not at the start or end of the word, or the word boundaries will fail to match.
Also you must be using Pattern.quote if your keyword contains special regex characters.
EDIT: You might use this regex if your keywords are separated by space.
String pattern = "(?<=\\s|^)"+ Pattern.quote(keyword) + "(?=\\s|$)";

The fastest-possible way to find substrings in Java is to use String.indexOf().
To achieve "entire-word-only" matches, you would need to add a little bit of logic to check the characters before and after a possible match to make sure they are non-word characters:
public class IndexOfWordSample {
public static void main(String[] args) {
String input = "There are longer strings than this not very long one.";
String search = "long";
int index = indexOfWord(input, search);
if (index > -1) {
System.out.println("Hit for \"" + search + "\" at position " + index + ".");
} else {
System.out.println("No hit for \"" + search + "\".");
}
}
public static int indexOfWord(String input, String word) {
String nonWord = "^\\W?$", before, after;
int index, before_i, after_i = 0;
while (true) {
index = input.indexOf(word, after_i);
if (index == -1 || word.isEmpty()) break;
before_i = index - 1;
after_i = index + word.length();
before = "" + (before_i > -1 ? input.charAt(before_i) : "");
after = "" + (after_i < input.length() ? input.charAt(after_i) : "");
if (before.matches(nonWord) && after.matches(nonWord)) {
return index;
}
}
return -1;
}
}
This would print:
Hit for "long" at position 44.
This should perform better than a pure regular expressions approach.
Think if ^\W?$ already matches your expectation of a "non-word" character. The regular expression is a compromise here and may cost performance if your input string contains many "almost"-matches.
For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.
I've created a Gist with an alternative implementation that does that.

Count no. of words using Regular expressions in java

How to count the number of times each word appear in a String in Java using Regular Expression?

I don't think a regex can solve your problem completely.
You want to
split a string into words, a regular expression can do this for a very simple definition of word, "parts of a string seperated by whitespace or punctuation", which is not a very good definition even if you just stick to English text
Count the number of occurances of each word derived from step 1. To do that you must store some kind of Mapping, and regexes neither store nor count.
A workable approach could be to
split the inputstring (by either regex or other means) into an array of word-strings
iterate over the array, and building a Map to keep count of each word
iterate over the map to output a list of words and the number of occurances.
If your input is limited to English you still have to consider how you want your algorithm to behave in case of things like they're <->they are etc and compound words. Add other languages to the mix for additional kinds of headaches (different ways of writing the same word, words split into parts, difference in writing depending on where in a sentence the word occurs, etc)

I would split your task into a) identify words and b) count number of each unique word in text.
a) could be solved with splitting the text with a regex.
b) could be solved by building a map with the result from a).
String text = "I like good mules. Mules are good :)";
String[] words = text.split("([\\W\\s]+)");
Map<String, Integer> counts = new HashMap<String, Integer>();
for (String word: words) {
if (counts.containsKey(word)) {
counts.put(word, counts.get(word) + 1);
} else {
counts.put(word, 1);
}
}
result: {Mules=1, are=1, good=2, mules=1, like=1, I=1}

Pattern p = Pattern.compile("\\babba\\b");
Matcher m = p.matcher("abba is abba with abbabba and abba doing abba");
int count = 0;
while(m.find()){
count++;
}
System.out.println(count); //4

Using Guava, this is a one-liner:
Multiset<String> countOfEachWord =
HashMultiset.create(Splitter.on(" ").omitEmptyStrings().split(myString));
then to get the count of "dog" for example you would say:
countOfEachWord.count("dog")

Must you use a regex? If not this might help:
public static int count(final String string, final String substring)
{
int count = 0;
int idx = 0;
while ((idx = string.indexOf(substring, idx)) != -1)
{
idx++;
count++;
}
return count;
}

int CountWords(String t){
return t.split("([[a-z][A-Z][0-9][\\Q-\\E]]+)",-1).length+(t.replaceAll("([[a-z][A-Z][0-9][\\W]]*)", "")).length()-1;
}
English Words(chemical names)+Chinese words

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java: Finding the number of word matches in a given string - java

Use a regular expression with word boundaries. It's by far the easiest choice. int matches = 0; Matcher matcher = Pattern.compile("\\bgreat\\b", Pattern.CASE_INSENSITIVE).matcher(text); while (matcher.find()) matches++; Your milage may vary on some foreign languages though.

well,you can use "split" to separate the words and find if there exists a word matches exactly. hope that helps!

one option would be RegEx. Basically it sounds like you are looking to match a word with any punctuation on the left or right. so: " great." " great!" " great " " great," "Great" would all match, but "greatest" wouldn't

Related

String manipulation of function names

How to calculate syllables in text with regex and Java

java regex replace any double letter in word with single

Efficient Regular Expression for big data, if a String contains a word

Count no. of words using Regular expressions in java

Categories

Resources