Regular expression for counting words in a sentence - java

public static int getWordCount(String sentence) {
return sentence.split("(([a-zA-Z0-9]([-][_])*[a-zA-Z0-9])+)", -1).length
+ sentence.replaceAll("([[a-z][A-Z][0-9][\\W][-][_]]*)", "").length() - 1;
}
My intention is to count the number of words in a sentence. The input to this function is a lengthy sentence. It may have 255 words.
A word may contain hyphens or underscores in between.
The function should count only valid words, meaning runs of special characters should not be counted, e.g. &&&& or #### should not count as a word.
The above regular expression works fine, but when a hyphen or underscore appears inside a word, e.g. co-operation, the count comes back as 2 when it should be 1. Can anyone please help?

Instead of using .split and .replaceAll which are quite expensive operations, please use an approach with constant memory usage.
Based on your specifications, you seem to look for the following regex:
[\w-]+
Next you can use this approach to count the number of matches:
public static int getWordCount(String sentence) {
Pattern pattern = Pattern.compile("[\\w-]+");
Matcher matcher = pattern.matcher(sentence);
int count = 0;
while (matcher.find())
count++;
return count;
}
This approach works in (more) constant memory: when splitting, the program constructs an array, which is basically useless, since you never inspect the content of the array.
If you don't want words to start or end with hyphens, you can use the following regex:
\w+([-]\w+)*

This part ([-][_])* is wrong. The notation [xyz] means "any single one of the characters inside the brackets" (see http://www.regular-expressions.info/charclass.html). So effectively, you allow exactly the character - followed by exactly the character _, in that order. A lone hyphen, as in co-operation, never matches the group, so the word is split in two.
Fixing your group makes it work:
[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*
and it can be further simplified using \w to
\w+(-\w+)*
because \w matches 0..9, A..Z, a..z and _ (http://www.regular-expressions.info/shorthand.html) and so you only need to add -.
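To see the difference between the original group and the corrected one, here is a minimal sketch. The class name WordCountDemo and the helper count are made up for illustration; only the two patterns come from the posts above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordCountDemo {
    // Pattern from the question: the group only accepts "-" immediately followed by "_".
    static final Pattern BROKEN =
            Pattern.compile("(([a-zA-Z0-9]([-][_])*[a-zA-Z0-9])+)");
    // Corrected pattern: a single "-" or "_" may join two alphanumeric runs.
    static final Pattern FIXED =
            Pattern.compile("[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*");

    // Counts non-overlapping matches of the pattern in the sentence.
    static int count(Pattern pattern, String sentence) {
        Matcher matcher = pattern.matcher(sentence);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count(BROKEN, "co-operation"));
        System.out.println(count(FIXED, "co-operation"));
        System.out.println(count(FIXED, "&&&& ####"));
    }
}
```

With the corrected pattern, co-operation is one match and runs of special characters produce none.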

If you can use Java 8:
long wordCount = Arrays.stream(sentence.split(" ")) //split the sentence into words
.filter(s -> s.matches("[\\w-]+")) //filter only matching words
.count();

With Java 9 or later (Matcher.results() was added in Java 9):
public static int getColumnCount(String row) {
return (int) Pattern.compile("[\\w-]+")
.matcher(row)
.results()
.count();
}

Related

How can I remove words with more vowels than consonants using regular expressions

I am working on an app that removes from the text words which contain more vowels than consonants. For example:
StringBuilder text = new StringBuilder("I quite hate regular expressions");
I have to write code that will return text without the words "quite" and "I", because these words contain more vowels than consonants. Also it should work with other text samples.
I am quite bad at Java regular expressions, so I hope you guys will help me. I have tried
public String removeWordsWithMoreVowels(final StringBuilder text) {
Pattern pattern = Pattern.compile("regular expression goes here");
Matcher matcher = pattern.matcher(text);
System.out.println(matcher.replaceAll(""));
return matcher.replaceAll("");
}
How can I achieve that? All hints and advice are welcome. Thanks in advance.
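That cannot really be done with regex alone. Comparing the vowel count with the consonant count does not suit a stateless formalism like regular expressions. Using the regex together with a lambda (Matcher.replaceAll accepting a function, available since Java 9),
one can do it with a bit of code.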
public String removeWordsWithMoreVowels(final StringBuilder text) {
Pattern pattern = Pattern.compile("(?i)\\b[a-z]+\\b");
Matcher matcher = pattern.matcher(text);
return matcher.replaceAll(mr -> {
int vowels = mr.group().replaceAll("(?i)[^aeiou]", "").length();
return vowels > mr.group().length() - vowels ? "" : mr.group();
});
}
The above is a slight simplification, as it does not remove the whitespace left behind by the deletion.
(?i) case insensitive
[^aeiou] - not a vowel; consonant (about y: maybe one should remove them first)
Here is one solution. It uses a single regular expression for the vowels.
Remove all the vowels from the word. Let the new length be NC, the number of consonants.
Subtract NC from the original word length. That is the number of vowels, VC.
If VC <= NC, keep the word. Note that this keeps words where the number of vowels equals the number of consonants.
String[] words = { "radar", "hello", "saygoodbyeeee","coolbeans" };
// or
String[] words = "I quite hate regular expressions".split("\\s+");
Then use this
List<String> keep = new ArrayList<>();
for (String word : words) {
int nocons = word.replaceAll("(?i:[aeiou])","").length();
if (word.length()-nocons <= nocons) {
keep.add(word);
}
}
System.out.println(keep);
It's impossible in a general case: finite-state machines, which regexes are, cannot count n matches to a possibly infinite limit.
You can do what you want up to a finite number of consonants c and a finite number of vowels v, but you cannot create a general regex to express all matches of c < v such that c -> infinity and v -> infinity.
Your problem can be expressed by a context-sensitive matcher (which is a linear-bounded automaton).
You'd best be served by manually counting the number of vowels vs. consonants per word and then using a comparison to filter out the words -- use a lambda expression.
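The manual counting suggested above could look like the following sketch. The class and method names are hypothetical, and it assumes that only the ASCII letters a, e, i, o, u count as vowels, that other ASCII letters count as consonants, and that words are separated by whitespace (so punctuation stays attached to its word and is ignored by the count).

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

final class VowelFilterDemo {
    // True when the word contains strictly more vowels than consonants.
    static boolean hasMoreVowelsThanConsonants(String word) {
        int vowels = 0;
        int consonants = 0;
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            if ("aeiou".indexOf(c) >= 0) {
                vowels++;
            } else if (c >= 'a' && c <= 'z') {
                consonants++;
            }
        }
        return vowels > consonants;
    }

    // Splits on whitespace, drops the vowel-heavy words, and rejoins with single spaces.
    static String removeWordsWithMoreVowels(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(word -> !hasMoreVowelsThanConsonants(word))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(removeWordsWithMoreVowels("I quite hate regular expressions"));
    }
}
```

Because the words are rejoined with single spaces, this version also avoids the leftover whitespace that a plain in-place deletion would leave behind.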

Replacing consecutive repeated characters in java

I am working on Twitter data normalization. Twitter users frequently use terms like "I looooooove it" in order to emphasize the word love. I want to reduce such repeated characters to a proper English word by removing repeated characters until I get a proper meaningful word (I am aware that I cannot differentiate between good and god by this mechanism).
My strategy would be:
1. Identify the existence of such repeated strings. I would look for more than 2 identical characters in a row, since there is probably no English word with more than two consecutive identical characters.
String[] strings = { "stoooooopppppppppppppppppp","looooooove", "good","OK", "boolean", "mee", "claaap" };
String regex = "([a-z])\\1{2,}";
Pattern pattern = Pattern.compile(regex);
for (String string : strings) {
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(string+" TRUE ");
}
}
2. Search for such words in a lexicon like WordNet.
3. Replace all but two such repeated characters and check in the lexicon.
4. If the word is not in the lexicon, remove one more repeated character (otherwise treat it as a misspelling).
Due to my poor Java knowledge I am unable to manage 3 and 4. The problem is that I cannot replace all but two repeated consecutive characters.
The following code snippet replaces all but one of the repeated characters: System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
Help is required to find out:
A. How to replace all but 2 consecutive repeat characters
B. How to remove one more consecutive character from the output of A
[I think B can be managed by the following code snippet]
System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));
Edit: Solution provided by Wiktor Stribiżew works perfectly in Java. I was wondering what changes are required to get the same result in python.
Python uses re.sub.
Your regex ([a-z])\\1{2,} matches and captures a lowercase ASCII letter into Group 1 and then matches 2 or more further occurrences of that same value. So all you need is to replace the match with a backreference, $1, that holds the captured value. If you use a single $1, aaaaa will be replaced with a single a, and if you use $1$1, it will be replaced with aa.
String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");
If you need to make your regex case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".
BONUS: In a general case, to replace any amount of any repeated consecutive chars use
text = text.replaceAll("(?s)(.)\\1+", "$1"); // any chars
text = text.replaceAll("(.)\\1+", "$1"); // any chars but line breaks
text = text.replaceAll("(\\p{L})\\1+", "$1"); // any letters
text = text.replaceAll("(\\w)\\1+", "$1"); // any ASCII alnum + _ chars
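Steps 3 and 4 of the question's strategy can be combined with the two replacements above roughly as follows. This is only a sketch: LEXICON is a hypothetical stand-in for a real lexicon such as WordNet, the class and method names are made up, input is assumed to be lowercase, and Set.of requires Java 9 or later.

```java
import java.util.Set;

final class ElongationDemo {
    // Hypothetical stand-in for a real lexicon lookup (e.g. WordNet).
    static final Set<String> LEXICON = Set.of("love", "stop", "good", "clap", "me");

    static String normalize(String word) {
        // Step 3: shrink every run of 3+ identical letters to exactly two.
        String twoLeft = word.replaceAll("(\\p{Alpha})\\1{2,}", "$1$1");
        if (LEXICON.contains(twoLeft)) {
            return twoLeft;
        }
        // Step 4: shrink the remaining doubled letters to a single one.
        String oneLeft = twoLeft.replaceAll("(\\p{Alpha})\\1+", "$1");
        // If neither candidate is known, leave the word untouched.
        return LEXICON.contains(oneLeft) ? oneLeft : word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"looooooove", "goood", "claaap", "xyzzzz"}) {
            System.out.println(w + " -> " + normalize(w));
        }
    }
}
```

Note that step 4 here collapses every doubled letter at once; a word with several elongated runs may need a finer strategy that tries each run separately.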
/*This code checks a character in a given string repeated consecutively 3 times
if you want to check for 4 consecutive times change count==2--->count==3 OR
if you want to check for 2 consecutive times change count==2--->count==1*/
public class Test1 {
static char ch;
public static void main(String[] args) {
String str="aabbbbccc";
char[] charArray = str.toCharArray();
int count=0;
for(int i=0;i<charArray.length;i++){
if(i!=0 ){
if(charArray[i]==ch)continue;//ddddee
if(charArray[i]==charArray[i-1]) {
count++;
if(count==2){
System.out.println(charArray[i]);
count=0;
ch=charArray[i];
}
}
else{
count=0;//aabb
}
}
}
}
}

Use regex to replace sequences in a string with modified characters

I am trying to solve a codingbat problem using regular expressions whether it works on the website or not.
So far, I have the following code which does not add a * between the two consecutive equal characters. Instead, it just bulldozes over them and replaces them with a set string.
public String pairStar(String str) {
Pattern pattern = Pattern.compile("([a-z])\\1", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
if(matcher.find())
matcher.replaceAll(str);//this is where I don't know what to do
return str;
}
I want to know how I could keep using regex and replace the whole string. If needed, I think a recursive system could help.
This works:
while(str.matches(".*(.)\\1.*")) {
str = str.replaceAll("(.)\\1", "$1*$1");
}
return str;
Explanation of the regex:
The search regex (.)\\1:
(.) means "any character" (the .) and the brackets create a group - group 1 (the first left bracket)
\\1, which in regex is \1 (a java literal String must escape a backslash with another backslash) means "the first group" - this kind of term is called a "back reference"
So together (.)\1 means "any repeated character"
The replacement string $1*$1:
The $1 term means "the content captured as group 1"
Recursive solution:
Technically, the solution called for on that site is a recursive solution, so here is recursive implementation:
public String pairStar(String str) {
if (!str.matches(".*(.)\\1.*")) return str;
return pairStar(str.replaceAll("(.)\\1", "$1*$1"));
}
FWIW, here's a non-recursive solution:
public String pairStar(String str) {
int len = str.length();
StringBuilder sb = new StringBuilder(len*2);
char last = '\0';
for (int i=0; i < len; ++i) {
char c = str.charAt(i);
if (c == last) sb.append('*');
sb.append(c);
last = c;
}
return sb.toString();
}
I don't know Java, but I believe there is a replace function for strings in Java that works with regular expressions. Your match string would be
([a-z])\\1
And the replace string would be
$1*$1
After some searching I think you are looking for this,
str.replaceAll("([a-z])\\1", "$1*$1").replaceAll("([a-z])\\1", "$1*$1");
These are my own solutions.
Recursive solution (which is probably more or less the solution that the problem is designed for)
public String pairStar(String str) {
if (str.length() <= 1) return str;
else return str.charAt(0) +
(str.charAt(0) == str.charAt(1) ? "*" : "") +
pairStar(str.substring(1));
}
If you want to complain about substring, then you can write a helper function pairStar(String str, int index) which does the actual recursion work.
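A helper of that shape might look like this sketch. The signature pairStar(String str, int index) comes from the sentence above; the body and the class name are assumptions. It avoids substring, though it still builds the result by string concatenation.

```java
final class PairStarIndexDemo {
    static String pairStar(String str) {
        return pairStar(str, 0);
    }

    // Processes the character at index, then recurses on index + 1.
    private static String pairStar(String str, int index) {
        if (index >= str.length()) {
            return "";
        }
        boolean sameAsNext = index + 1 < str.length()
                && str.charAt(index) == str.charAt(index + 1);
        return str.charAt(index) + (sameAsNext ? "*" : "") + pairStar(str, index + 1);
    }

    public static void main(String[] args) {
        System.out.println(pairStar("hello"));
        System.out.println(pairStar("aaaa"));
    }
}
```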
Regex one-liner one-function-call solution
public String pairStar(String str) {
return str.replaceAll("(.)(?=\\1)", "$1*");
}
Both solutions have the same spirit. They both check whether the current character is the same as the next character. If they are the same, a * is inserted between the 2 identical characters. Then we move on to check the next character. This produces the expected output a*a*a*a from input aaaa.
The normal regex solution of "(.)\\1" has a problem: it consumes 2 characters per match. As a result, it fails to compare the 2nd character with the character after it. The look-ahead resolves this problem: it compares with the next character without consuming it.
This is similar to the recursive solution, where we compare the next character with str.charAt(0) == str.charAt(1), while calling the function recursively on the substring with only the current character removed, pairStar(str.substring(1)).
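The difference between the consuming pattern and the look-ahead is easiest to see side by side. The following sketch (class and method names are made up) applies each replacement exactly once:

```java
final class LookaheadDemo {
    // Consumes both characters of a pair, so overlapping pairs are skipped in one pass.
    static String consuming(String str) {
        return str.replaceAll("(.)\\1", "$1*$1");
    }

    // Consumes only the first character; the second is checked but left for the next match.
    static String lookahead(String str) {
        return str.replaceAll("(.)(?=\\1)", "$1*");
    }

    public static void main(String[] args) {
        System.out.println(consuming("aaaa")); // a*aa*a
        System.out.println(lookahead("aaaa")); // a*a*a*a
    }
}
```

This is why the consuming variant needs a loop or recursion, while the look-ahead variant finishes in a single call.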

Java regex matching

I need to parse through a file (built from a string) searching for occurrences of a single-line or multiline text. Will this solution always work? If not, how should I change it?
private int parseString(String s){
Pattern p = Pattern.compile(searchableText);
Matcher m = p.matcher(s);
int count = 0;
while(m.find()) {
count++;
}
return count;
}
Consider Pattern.quote if text can contain regex metacharacters.
Consider java.util.Scanner and findWithinHorizon
For multiline strings you need to adjust the newline codes to what is in the file, or (probably better) use a regular expression that accepts all the various combinations of \n and \r.
Consider searching for a\s+b\s+c\s+ where a, b, and c are the sequence of words (or characters if you wish) you are searching for. It may be (over) greedy, but hopefully it should give an insight. You can also perform a normalization of the text first.
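The Pattern.quote and \s+ suggestions above can be sketched as two small helpers. The names are hypothetical, and the second helper assumes a non-empty search text and that any run of whitespace (including \r\n) should match any other run of whitespace.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class LiteralSearchDemo {
    // Counts literal occurrences; Pattern.quote disables regex metacharacters in needle.
    static int countLiteral(String text, String needle) {
        return countMatches(Pattern.compile(Pattern.quote(needle)), text);
    }

    // Counts occurrences where the words of needle are separated by any whitespace.
    static int countIgnoringWhitespace(String text, String needle) {
        String regex = Arrays.stream(needle.trim().split("\\s+"))
                .map(Pattern::quote)
                .collect(Collectors.joining("\\s+"));
        return countMatches(Pattern.compile(regex), text);
    }

    private static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countLiteral("a.b a.b axb", "a.b"));
        System.out.println(countIgnoringWhitespace("foo\r\nbar foo bar", "foo\nbar"));
    }
}
```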

codingbat wordEnds using regex

I'm trying to solve wordEnds from codingbat.com using regex.
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
wordEnds("XYXY", "XY") → "XY"
This is as simple as I can make it with my current knowledge of regex:
public String wordEnds(String str, String word) {
return str.replaceAll(
".*?(?=word)(?<=(.|^))word(?=(.|$))|.+"
.replace("word", java.util.regex.Pattern.quote(word)),
"$1$2"
);
}
replace is used to place in the actual word string into the pattern for readability. Pattern.quote isn't necessary to pass their tests, but I think it's required for a proper regex-based solution.
The regex has two major parts:
If after matching as few characters as possible ".*?", word can still be found "(?=word)", then lookbehind to capture any character immediately preceding it "(?<=(.|^))", match "word", and lookforward to capture any character following it "(?=(.|$))".
The initial "if" test ensures that the atomic lookbehind captures only if there's a word
Using lookahead to capture the following character doesn't consume it, so it can be used as part of further matching
Otherwise match what's left "|.+"
Groups 1 and 2 would capture empty strings
I think this works in all cases, but it's obviously quite complex. I'm just wondering if others can suggest a simpler regex to do this.
Note: I'm not looking for a solution using indexOf and a loop. I want a regex-based replaceAll solution. I also need a working regex that passes all codingbat tests.
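For anyone who wants to try candidate patterns, here is a small harness that runs the pattern from the question against the four examples listed above. The class name is made up, and the pattern is assembled by concatenation rather than by the replace trick, which should be equivalent.

```java
import java.util.regex.Pattern;

final class WordEndsCheck {
    static String wordEnds(String str, String word) {
        String quoted = Pattern.quote(word);
        // Same structure as the question's pattern, with the quoted word spliced in.
        String regex = ".*?(?=" + quoted + ")(?<=(.|^))" + quoted + "(?=(.|$))|.+";
        return str.replaceAll(regex, "$1$2");
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"abcXY123XYijk", "c13i"},
            {"XY123XY", "13"},
            {"XY1XY", "11"},
            {"XYXY", "XY"},
        };
        for (String[] c : cases) {
            String actual = wordEnds(c[0], "XY");
            System.out.println(c[0] + " -> " + actual + " (expected " + c[1] + ")");
        }
    }
}
```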
I managed to reduce the occurrence of word within the pattern to just one.
".+?(?<=(^|.)word)(?=(.?))|.+"
I'm still looking if it's possible to simplify this further, but I also have another question:
With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?
Based on your solution I managed to simplify the code a little bit:
public String wordEnds(String str, String word) {
return str.replaceAll(".*?(?="+word+")(?<=(.|^))"+word+"(?=(.|$))|.+","$1$2");
}
Another way of writing it would be:
public String wordEnds(String str, String word) {
return str.replaceAll(
String.format(".*?(?=%1$s)(?<=(.|^))%1$s(?=(.|$))|.+", word),
"$1$2");
}
With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?
In Oracle's implementation, the behavior of look-behind is as follows:
By "studying" the regex (with study() method in each node), it knows the maximum length and minimum length of the pattern in look-behind group. (The study() method is what allows for obvious look-behind length)
It verifies the look-behind by starting a match at every position from index (current - min_length) to position (current - max_length) and exits early if the condition is satisfied.
Effectively, it will try to verify the look-behind on the shortest string first.
The implementation multiplies the matching complexity by O(k) factor.
This explains why changing ^|. to .? doesn't work: due to the starting position, it effectively checks for word before .word. The quantifier doesn't have a say here, since the ordering is imposed by the match range.
You can check the code of match method in Pattern.Behind and Pattern.NotBehind inner classes to verify what I said above.
In .NET's flavor, look-behind is likely implemented by the reverse matching feature, which means that no extra factor is incurred on the matching complexity.
My suspicion comes from the fact that the capturing group in (?<=(a+))b matches all a's in aaaaaaaaaaaaaab. The quantifier is shown to have free rein in the look-behind group.
I have tested that ^|. can be simplified to .? in .NET and the regex works correctly.
I am working in .NET's regex but I was able to change your pattern to:
.+?(?<=(\w?)word)(?=(\w?))|.+
with positive results. You know it's a word-type (alphanumeric) character, so why not give the parser a valid hint of that fact: instead of any character, it's an optional alphanumeric character.
It may answer why you don't need to specify the anchors of ^ and $, for what exactly is $ - is it \r or \n or other? (.NET has issues with $, and maybe you are not exactly capturing a Null of $, but the null of \r or \n which allowed you to change to .? for $)
Another solution to look at...
public String wordEnds(String str, String word) {
if(str.equals(word)) return "";
int i = 0;
String result = "";
int stringLen = str.length();
int wordLen = word.length();
int diffLen = stringLen - wordLen;
while(i<=diffLen){
if(i==0 && str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i+wordLen);
}else if(i==diffLen && str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i-1);
}else if(str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i-1) + str.charAt(i+wordLen) ;
}
i++;
}
if(result.length()==1) result = result + result;
return result;
}
Another possible solution:
public String wordEnds(String str, String word) {
String result = "";
if (str.contains(word)) {
for (int i = 0; i < str.length(); i++) {
if (str.startsWith(word, i)) {
if (i > 0) {
result += str.charAt(i - 1);
}
if ((i + word.length()) < str.length()) {
result += str.charAt(i + word.length());
}
}
}
}
return result;
}
