Matcher not finding overlapping words? - java

I'm trying to take a string:
String s = "This is a String!";
And return all 2-word pairs within that string. Namely:
{"This is", "is a", "a String"}
But right now, all I can get it to do is return:
{"This is", "a String"}
How can I define my while loop such that I can account for this lack of overlapping words? My code is as follows: (Really, I'd be happy with it just returning an int representing how many string subsets it found...)
int count = 0;
while(matcher.find()) {
count += 1;
}
Thanks all.

I like the two answers already posted, counting words and subtracting one, but if you just need a regex to find overlapping matches:
Pattern pattern = Pattern.compile("\\S+ \\S+");
Matcher matcher = pattern.matcher(inputString);
int matchCount = 0;
boolean found = matcher.find();
while (found) {
matchCount += 1;
// search starting after the last match began
found = matcher.find(matcher.start() + 1);
}
In reality, you'll need to be a little more clever than simply adding 1, since trying this on "the force" will match "he force" and then "e force". Of course, this is overkill for counting words, but this may prove useful if the regex is more complicated than that.
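One way to be "more clever" than adding 1 is to consume only the first word and peek at the second with a lookahead, so the next search resumes at the following word on its own. This is a minimal sketch, assuming words are separated by single spaces; the names OverlappingPairs and wordPairs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingPairs {
    // Consumes one word, then looks ahead (without consuming) at a space
    // and the next word, which is captured in group 1.
    private static final Pattern PAIR = Pattern.compile("\\S+(?= (\\S+))");

    static List<String> wordPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            // group() is the consumed word, group(1) the word seen by the lookahead
            pairs.add(m.group() + " " + m.group(1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(wordPairs("This is a String!")); // [This is, is a, a String!]
    }
}
```

Because only the first word of each pair is consumed, there is no need to restart the matcher by hand, and "the force" yields a single pair.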

Run a for loop from i = 0 to the number of words - 2, then the words i and i+1 will make up a single 2-word string.
String[] splitString = string.split(" ");
for(int i = 0; i < splitString.length - 1; i++) {
System.out.println(splitString[i] + " " + splitString[i+1]);
}
The number of 2-word strings within a sentence is simply the number of words minus one.
int numOfPairs = string.split(" ").length - 1;

Total pair count = Total number of words - 1
And you already know how to count total number of words.
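As a rough sketch of that formula, assuming words are separated by runs of whitespace (PairCount and countPairs are hypothetical names, not from the question):

```java
class PairCount {
    // Number of adjacent 2-word pairs = number of words - 1 (never below 0).
    static int countPairs(String s) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return 0; // no words, so no pairs
        }
        int words = trimmed.split("\\s+").length;
        return words - 1;
    }

    public static void main(String[] args) {
        System.out.println(countPairs("This is a String!")); // 3
    }
}
```

Trimming first avoids the empty leading element that split would otherwise produce for input starting with whitespace.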

I tried it with capture groups in the pattern.
String s = "this is a String";
Pattern pat = Pattern.compile("([^ ]+)( )([^ ]+)");
Matcher mat = pat.matcher(s);
boolean check = mat.find();
while(check){
System.out.println(mat.group());
check = mat.find(mat.start(3)); // restart at the second word so pairs overlap
}
The pattern ([^ ]+)( )([^ ]+) breaks down into these groups:
group(0): the whole match, ([^ ]+)( )([^ ]+)
group(1): the first word, ([^ ]+)
group(2): the single space, ( )
group(3): the second word, ([^ ]+)

Related

Fast way of counting number of occurrences of a word in a string using Java

I want to find number of times a word appears in a string in a fast and efficient way using Java.
The words are separated by space and I am looking for complete words.
Example:
string: "the colored port should be black or white or brown"
word: "or"
output: 2
for the above example, "colored" and "port" are not counted, but "or" is counted.
I considered using substring() and contains() and iterating over the string. But then we need to check for the surrounding spaces which I suppose is not efficient. Also StringUtils.countMatches() is not efficient.
The best way I tried is splitting the string over space and iterating over the words, and then matching them against the given word:
String string = "the colored port should be black or white or brown";
String[] words = string.split(" ");
String word = "or";
int occurrences = 0;
for (int i=0; i<words.length; i++)
if (words[i].equals(word))
occurrences++;
System.out.println(occurrences);
But I am expecting some efficient way using Matcher and regex.
So I tested the following code:
String string1 = "the colored port should be black or white or brown or";
//String string2 = "the color port should be black or white or brown or";
String word = "or";
Pattern pattern = Pattern.compile("\\s(" + word + ")|\\s(" + word + ")|(" + word + ")\\s");
Matcher matcher = pattern.matcher(string1);
//Matcher matcher = pattern.matcher(string2);
int count = 0;
while (matcher.find()){
String match = matcher.group();
count++;
}
System.out.println("The word \"" + word + "\" is mentioned " + count + " times.");
It is supposed to be fast enough, and it gives me the right answer for string1, but not for string2 (commented out). The regex seems to need a small change.
Any ideas?
I experimented with and evaluated three answers: split-based and Matcher-based (as mentioned in the question), and Collections.frequency()-based (as mentioned in a comment above by #4castle). For each one I measured the total time of a loop repeated 10 million times. As a result, the split-based answer tends to be the most efficient:
String string = "the colored port should be black or white or brown";
String[] words = string.split(" ");
String word = "or";
int occurrences = 0;
for (int i=0; i<words.length; i++)
if (words[i].equals(word))
occurrences++;
System.out.println(occurrences);
Next is the Collections.frequency()-based answer, with a slightly longer running time (~5% slower):
String string = "the colored port should be black or white or brown or";
String word = "or";
int count = Collections.frequency(Arrays.asList(string.split(" ")), word);
System.out.println("The word \"" + word + "\" is mentioned " + count + " times.");
The Matcher-based solution (mentioned in the question) is a lot slower (roughly 5 times the running time).
public class Test {
public static void main(String[] args) {
String str= "the colored port should be black or white or brown";
Pattern pattern = Pattern.compile(" or ");
Matcher matcher = pattern.matcher(str);
int count = 0;
while (matcher.find())
count++;
System.out.println(count);
}
}
How about this? It assumes the word won't contain spaces.
string.split("\\s"+word+"\\s").length - 1;
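If a Matcher is still preferred, one possible fix for the regex in the question is to replace the explicit \s alternatives with lookarounds, so that a word at the very start or end of the string also counts. The sketch below assumes words are delimited only by whitespace; WholeWordCount and count are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordCount {
    static int count(String text, String word) {
        // (?<!\S): not preceded by a non-space character (start of text or whitespace)
        // (?!\S):  not followed by a non-space character (end of text or whitespace)
        // Pattern.quote keeps regex metacharacters in the word literal.
        Pattern p = Pattern.compile("(?<!\\S)" + Pattern.quote(word) + "(?!\\S)");
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String string2 = "the color port should be black or white or brown or";
        System.out.println(count(string2, "or")); // 3
    }
}
```

Since the lookarounds consume no characters, adjacent occurrences such as "or or" are both counted. This says nothing about speed; per the timings above, the split-based loop may still be faster.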

How do I make my program count the number of sentences starting with a capital letter in a string?

My program currently only counts the number of capital letters in the whole string, not the ones after a period mark.
Desired output:
Enter essay:
I like Cats. Hey.
Sentences starting with capital letter: 2
Current output:
Enter essay:
I like Cats. Hey.
Sentences starting with capital letter: 3
Here's my code so far:
static int sentencesChecker(String shortEssay) {
int count = 0;
for (int i = 0; i < shortEssay.length(); i++) {
if (Character.isUpperCase(shortEssay.charAt(i))) {
count++;
}
}
System.out.println("Sentences starting with capital letter: " + count);
return count;
}
public static void main(String[] args) {
Scanner input = new Scanner (System.in);
System.out.println("Enter essay: ");
String essay = input.nextLine();
sentencesChecker(essay);
}
An easier way than iterating over the characters of the String would probably be to use String#split:
public static void main(String[] args) {
Scanner input = new Scanner(System.in);
System.out.println("Enter essay: ");
String essay = input.nextLine();
// Split on a period, optional whitespace, and an uppercase letter
String[] uppercaseSentences = essay.split("\\.\\s*[A-Z]");
int count = uppercaseSentences.length;
// The split cannot see how the first sentence starts, so check it separately
if (!uppercaseSentences[0].matches("^\\s*[A-Z].*")) {
count--;
}
System.out.println("You have got " + count + " sentences starting uppercase");
}
Output:
Enter essay:
Sentence 1. sentence 2. Sentence 3. Sentence 4. sentence 5. Sentence 6
You have got 4 sentences starting uppercase
What is happening here: the regex splits the String at every occurrence of a dot followed by zero or more whitespace characters and then an uppercase letter. The length of the resulting array should then equal the number of sentences that start with an uppercase letter.
Edit: the split ignores the first sentence, so it would produce 2 for the input "sentence 2. Sentence 2.". The code now checks whether the first array element starts with an uppercase letter and subtracts 1 if it does not.
You can use a regular expression:
A regular expression, regex or regexp (sometimes called a rational expression)
is, in theoretical computer science and formal language theory, a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings.
The regular expression to use in this context is:
^[A-Z]|\. *[A-Z]
That regular expression breaks down as follows:
^[A-Z]       an uppercase letter from A to Z at the start of the line
|            or
\. *[A-Z]    the character . followed by any number of spaces, followed by an uppercase letter from A to Z
This can be used as follows in Java:
public int countSentencesStartingWithUppercase(String line) {
// Backslashes are doubled because \ is an escape character in Java string literals
String regex = "^[A-Z]|\\. *[A-Z]";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(line);
int count = 0;
while (matcher.find()) {
count++;
}
return count;
}
Here is a link to the tutorial on regular expressions in Java.
Here is the code.
String str ="I like Cats. Hey.";
//Scanner s = new Scanner(System.in);
//System.out.println("Would you like to play again?");
String[] strs = str.split("[.][ ]");
int count =0;
for(String string : strs){
if(Character.isUpperCase( string.charAt(0) )){ // check each sentence, not the whole input
count++;
}
}
System.out.println("the count :"+count);
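A loop-based variant that stays close to the code in the question could track whether the scan is currently at the start of a sentence. This is only a sketch under the assumption that a period is the sole sentence terminator (so a decimal like 3.5 would be misread); SentenceStarts and countCapitalizedSentences are made-up names.

```java
class SentenceStarts {
    static int countCapitalizedSentences(String text) {
        int count = 0;
        boolean atSentenceStart = true; // the first non-space character begins a sentence
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.') {
                atSentenceStart = true; // next non-space character begins a new sentence
            } else if (!Character.isWhitespace(c)) {
                if (atSentenceStart && Character.isUpperCase(c)) {
                    count++;
                }
                atSentenceStart = false; // inside a sentence until the next period
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countCapitalizedSentences("I like Cats. Hey.")); // 2
    }
}
```

Capital letters in the middle of a sentence, such as the C in "Cats", are skipped because the flag is cleared after the first non-space character.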

Split by last space including tab space in java

I have a string as follows
Esophageal have not 45.3
The end is nigh 23
Maybe (just) (maybe>32) 45.2
Every line ends in a number (both with and without decimal points)
I want to split the line before the last number
I have tried this regex:
String[] myarray = null;
myarray=match.split("/\\s+(?=\\S*+$)/");
but it doesn't split it
You can use this
String Str = "Maybe (just) (maybe>32) 45.2";
for (String retval: Str.split("\\s(?=\\b(\\d+(?:\\.\\d+)?)$)")){
System.out.println(retval);
}
Just split the string and get the last value of array:
String[] array = "Maybe (just) (maybe>32) 45.2".split(" ");
String firstPart = array[0];
String lastPart = array[array.length-1];
for (int i = 1; i < array.length - 1; i++) {
firstPart += " " + array[i];
}
System.out.println(firstPart);
System.out.println(lastPart);
match.split("[ \\t](?=[.\\d]+([\\n\\r]+|$))");
Results should look like:
Esophageal have not
45.3 The end is nigh
23 Maybe (just) (maybe>32)
45.2
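For what it's worth, the regex in the question appears to fail mainly because of the leading and trailing slashes: Java regex strings have no / delimiters, so the slashes are matched literally. A sketch of the same regex without them, assuming each line is split on its own and has no trailing whitespace (LastFieldSplit and splitBeforeLastToken are illustrative names):

```java
import java.util.Arrays;

class LastFieldSplit {
    // Splits at the last run of whitespace (spaces or tabs) in the line:
    // the lookahead requires that only non-space characters remain until the end.
    static String[] splitBeforeLastToken(String line) {
        return line.split("\\s+(?=\\S*+$)");
    }

    public static void main(String[] args) {
        String text = "Esophageal have not 45.3\n"
                + "The end is nigh\t23\n"
                + "Maybe (just) (maybe>32) 45.2";
        // Handle each line separately so $ refers to the end of that line
        for (String line : text.split("\\r?\\n")) {
            System.out.println(Arrays.toString(splitBeforeLastToken(line)));
        }
    }
}
```

Splitting per line keeps each number paired with its own text instead of running into the next line.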

Find the number of characters till nth word in Java?

What is the easiest way to find the number of characters in a sentence before the nth word?
For example:
String sentence = "Mary Jane and Peter Parker are friends.";
int trigger = 5; //"Parker"
Output would be
Character count = 20
Any help will be appreciated.
Thanks.
Easiest way would just be to loop around the characters in the String and count the number of white-spaces.
The following increments a length variable for every character. When a white-space is encountered, we decrement the number of remaining white-spaces to read, and when that number reaches 1, it means we hit the wanted word, so we break out of the loop.
public static void main(String[] args) {
String sentence = "Mary Jane and Peter Parker are friends.";
int trigger = 5; //"Parker"
int length = 0;
for (char c : sentence.toCharArray()) {
if (trigger == 1) break;
if (c == ' ') trigger--;
length++;
}
System.out.println(length); // prints 20
}
public int getNumberOfCharacters(int nthWord){
String sentence = "Mary Jane and Peter Parker are friends.";
String[] wordArray = sentence.split(" ");
int count = 0;
for(int i=0; i<=nthWord-2 ; i++){
count = count + wordArray[i].length();
}
return count + (nthWord-1);
}
Try this; it should work.
With a regex, it can be done like this:
public static void main(String[] args) {
String sentence = "Mary Jane and Peter Parker are friends.";
int trigger = 5;
Pattern pattern = Pattern.compile(String.format("(?:\\S+\\s+){%d}(\\S+)", trigger - 1));
Matcher matcher = pattern.matcher(sentence);
matcher.find();
System.out.println(matcher.group().lastIndexOf(" ") + 1);
}
I am going through all the trouble of finding the exact word, instead of simply using indexOf("Parker"), because of possible duplicates.
The regex matches the first N-1 words without capturing them and then captures the Nth word. In your case it matches all the words before the one you want and captures that word.
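Since the wanted word is captured in group 1, its offset could also be read directly from the matcher rather than derived with lastIndexOf. A small sketch along those lines, where NthWordOffset and offsetOfWord are hypothetical names and -1 is an assumed convention for "no such word":

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NthWordOffset {
    // Returns the number of characters before the nth word (1-based),
    // or -1 if the sentence has fewer than n words.
    static int offsetOfWord(String sentence, int n) {
        // Skip n-1 words (each followed by whitespace), then capture the nth word
        Pattern p = Pattern.compile(
                String.format("^\\s*(?:\\S+\\s+){%d}(\\S+)", n - 1));
        Matcher m = p.matcher(sentence);
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        String sentence = "Mary Jane and Peter Parker are friends.";
        System.out.println(offsetOfWord(sentence, 5)); // 20
    }
}
```

Checking the result of find() avoids the IllegalStateException that group() would throw when the sentence is too short.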

Java Regular expression with variable string

I want to count the occurrences in a String of every word from an ArrayList. So far I can do it with a for loop, where I store all the possible combinations and run each one through a Matcher, i.e.
for(String temp_keywords: keywords){
final_keywords_list.add(" "+ temp_keywords+ " ");
final_keywords_list.add(" "+ temp_keywords+".");
final_keywords_list.add(" "+ temp_keywords+ ",");
final_keywords_list.add(" "+ temp_keywords+ "!");
final_keywords_list.add(" "+ temp_keywords+ "/");
final_keywords_list.add(" "+ temp_keywords+ "?");
}
for (String temp_keywords : final_keywords_list) {
String add_space = temp_keywords.toLowerCase();
p = Pattern.compile(add_space);
m = p.matcher(handler_string);
int count = 0;
while (m.find()) {
count += 1;
}
}
However, I want to remove the manual addition for the combinations and do a regex. I've seen examples of words with regex but how do I add a variable string to the regex? Sorry, I am a beginner java learner.
Is this what you need?
String inputString = ....
String[] keywords = ....
StringBuilder sb = new StringBuilder();
for(String keyword: keywords)
sb.append("(?<= )").append(Pattern.quote(keyword)).append("(?=[ .,!/?])").append("|");
sb.setLength(sb.length() - 1); //Removes trailing "|". Assumes keywords.length > 0.
Pattern p = Pattern.compile(sb.toString());
Matcher m = p.matcher(inputString);
int count = 0;
while (m.find())
count++;
It creates a single regex, compiles it, and then counts the matches.
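The same idea can be written as a self-contained method. The sketch below assumes keywords may contain regex metacharacters (hence Pattern.quote) and that matching should ignore case, as the toLowerCase() call in the question suggests; KeywordCounter and countKeywords are illustrative names.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class KeywordCounter {
    static int countKeywords(String text, List<String> keywords) {
        if (keywords.isEmpty()) {
            return 0; // an empty alternation would match at every position
        }
        // Quote each keyword so characters like + or . are taken literally
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Preceded by a space, followed by a space or one of . , ! / ?
        Pattern p = Pattern.compile(
                "(?<= )(?:" + alternation + ")(?=[ .,!/?])",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

For example, countKeywords("I like java, and Java! is c++ fun? yes c++.", Arrays.asList("java", "c++")) counts four matches, while "javascript" does not count as "java" because the lookahead requires a delimiter after the keyword.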
