Replace repeated letters in a word with exception

I would like a regex (in Java) that replaces every repeated consonant with a single letter, with one exception: the "nn" of an initial "inn" must be kept.
Some examples make this clearer:
asso > aso
assso > aso
assocco > asoco
innasso > innaso
I found a way to replace all repeated letters:
Pattern.compile("([^aeiou])+\\1").matcher(text).replaceAll("$1")
and a way to recognize whether a word does not start with "inn":
Pattern.compile("^(?!inn).+").matcher(text).matches()
but I don't know how to merge them, i.e. degeminate all geminate consonants except the initial 'nn' when the word starts with 'inn'.
Can anyone help me? (I would like to solve this with a regex, in order to apply replaceAll.)
Thank you

I'm not sure why you must do this all with a single regexp, but if you must... try using negative lookbehind:
Pattern.compile("((?<!^i(?=nn))[^aeiou])+\\1")
This gobbledygook broken down:
(?=X) is a positive lookahead: it doesn't consume anything, it just checks whether X occurs here. If not, it's not a match.
(?<!X) is a negative lookbehind: it doesn't consume any characters either, but it fails to match if X occurs immediately before this exact spot. With X being ^i, that means: if the cursor sits right after the first character of the text, and that character is an 'i', it's a failure no matter what.
(?<!^i(?=nn)) does not consume anything, but it fails for any position where the following holds: immediately before the 'cursor' there is an i, and before that, the start of the string; after the 'cursor' there are 2 n's. If that all holds, fail. Otherwise do nothing (continue processing).
The rest is then just what you wrote already.
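A minimal sketch of how this could be wired into replaceAll, using the four sample words from the question. The class and method names (DegeminateSketch, degeminate) are made up for illustration. One deliberate change from the pattern above: the + sits on the backreference (\\1+) instead of on the group. With the + on the group, a run such as "stt" in a made-up word like "astto" would be matched as a whole and replaced by "t", dropping the "s".

```java
import java.util.regex.Pattern;

final class DegeminateSketch {
    // A consonant that is not the first 'n' of a leading "inn", followed by
    // one or more copies of itself. Variation of the answer's pattern:
    // the quantifier is on the backreference, not on the group.
    private static final Pattern GEMINATE =
            Pattern.compile("((?<!^i(?=nn))[^aeiou])\\1+");

    // Replaces every run of a repeated consonant with a single copy ($1).
    static String degeminate(String word) {
        return GEMINATE.matcher(word).replaceAll("$1");
    }

    public static void main(String[] args) {
        for (String w : new String[] {"asso", "assso", "assocco", "innasso"}) {
            System.out.println(w + " > " + degeminate(w));
        }
    }
}
```

Note that [^aeiou] matches anything that is not a lowercase vowel, so digits, spaces and uppercase letters are collapsed too.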

One option is to match inn at the start of a word, using the negative lookbehind (?<!\S), and capture it in group 1. In the other alternative, capture a character matching [^aeiou\r\n] in group 2 and repeat the backreference to that group 1 or more times.
(?<!\S)(inn)|([^aeiou\r\n])\2+
Explanation
(?<!\S) Negative lookbehind, assert what is on the left is not a non whitespace char
(inn) Capture group 1, match inn
| Or
( Capture group 2
[^aeiou\r\n] Match any char except the listed
)\2+ Close group and repeat 1+ times what was captured in group 2
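In the replacement use both capturing groups, $1$2. Only one of them takes part in any given match, and in Java a group that did not take part is replaced with an empty string.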
For example
final String regex = "(?<!\\S)(inn)|([^aeiou\\r\\n])\\2+";
final String string = "asso\n"
+ "assso\n"
+ "assocco\n"
+ "innasso";
final String subst = "$1$2";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
final String result = matcher.replaceAll(subst);
System.out.println(result);
Output
aso
aso
asoco
innaso

Related

How to match all combinations of numbers in a string that do not start with an English letter in regular matching in Java

I have a String like
String str = "305556710S or 100596269C OR CN111111111";
I want to match only the tokens in this string that consist of digits, or that start with digits and end with English letters,
and then prefix each matched token with "??".
I wrote a pattern like this:
Pattern pattern = Pattern.compile("^[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
int start = matcher.start();
int end = matcher.end();
String matchStr = matcher.group();
System.err.println(matchStr);
}
But it only matches the first token, "305556710S".
If I change the pattern to
Pattern pattern = Pattern.compile("[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
it matches "305556710S", "100596269C" and "111111111". But "111111111" is preceded by the English letters "CN", so it should not match.
I only want to match "305556710S" and "100596269C" and put "??" in front of each match. Can somebody help me?

First, you should avoid the ^ in this particular regexp. As you noticed, it can't return more than one result, because "^" means "match at the beginning of the string".
Using \b can be a solution, but you may get unwanted results. For example, in
305556710S or -100596269C OR CN111111111
the regexp "\\b[0-9]{1,10}[A-Z]{0,}\\b" will match 100596269C (because the hyphen is not a word character, so there is a word boundary between - and 1).
The following regexp matches exactly what you want: all numbers, that may be followed by some English chars, either at the beginning of the string or after a space, and either followed by a space or at the end of the string.
(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)
Explanations:
(?<=^| ) is a lookbehind. It makes sure that there is either ^ (string start) or a space behind the current position. Note that lookbehinds don't add matching chars to the result: the space won't be part of the result.
[0-9]{1,10}[A-Z]* matches digits (at least one, up to ten), then zero or more letters.
(?= |$) is a lookahead. It makes sure that there will be either a space or $ (end of string) after this match. As with lookbehinds, the chars aren't added to the result and the position remains the same: the space read here, for example, can also be read by the lookbehind of the next match.
Examples: 305556710S or 100596269C OR CN111111111 matches at index 0 [305556710S] and at index 14 [100596269C]; 100596269C123 does not match.
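A minimal sketch of how the lookaround regex could be used to add the "??" prefix the question asks for. The class and method names are made up for illustration, and the CASE_INSENSITIVE flag is carried over from the question's code as an assumption. In the replacement string, $0 stands for the whole match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefixNumbersSketch {
    // 1 to 10 digits with optional trailing letters, delimited by
    // string start or a space on the left and a space or string end on the right.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)", Pattern.CASE_INSENSITIVE);

    // Returns "start:text" for every match, to show where each one begins.
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    // Puts "??" in front of every match; '?' is not special in a replacement string.
    static String prefix(String input) {
        return TOKEN.matcher(input).replaceAll("??$0");
    }

    public static void main(String[] args) {
        String str = "305556710S or 100596269C OR CN111111111";
        System.out.println(matches(str));
        System.out.println(prefix(str));
    }
}
```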

I think you need to use word boundaries \b. Try this changed pattern:
"\\b[0-9]{1,10}[A-Z]{0,1}\\b"
This prints out:
305556710S
100596269C
Why it works:
The difference here is that only character sequences enclosed by a pair of word boundaries are checked. With the earlier pattern, a character sequence from the middle of a word could also be matched, which is why 11111... from CN1111... was matched.
A word boundary also matches at the end of the input string. So even if a candidate word appears at the end of the line, it will get picked up.
If more than one English letter can come at the end, remove the maximum occurrence count (1 in this case):
"\\b[0-9]{1,10}[A-Z]{0,}\\b"

positive lookbehind not behaving correctly

The code snippet for positive lookbehind is below
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PositiveLookBehind {
    public static void main(String[] args) {
        String regex = "[a-z](?<=9)";
        String input = "a9es m9x us9s w9es";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);
        System.out.println("===starting====");
        while (matcher.find()) {
            System.out.println("found:" + matcher.group()
                    + " start index:" + matcher.start()
                    + " end index is " + matcher.end());
        }
        System.out.println("===ending=====");
    }
}
I was expecting 4 matches, but to my surprise the output shows no match.
Can anyone point out my mistake?
As far as I understand, the regex here means "a letter preceded by the digit 9", which is satisfied in 4 locations.

Problem
Notice that (?<=9) is placed after [a-z]. What does that mean?
Let's consider data like "a9c".
At the start, the regex engine places its "cursor" at the beginning of the string it iterates over, here:
|a9c
^-regex cursor is here
Then the regex engine tries to match each part of the pattern from left to right. So for [a-z](?<=9) it first tries to find a match for [a-z], and only after finding one does it move on to evaluating the (?<=9) part.
So match for [a-z] will happen here:
a9c
*<-- match for `[a-z]`
After that match, the regex engine moves the cursor here:
a|9c
*^--- regex-engine cursor
^---- match for [a-z]
So now (?<=9) is evaluated (notice the position of the cursor |). (?<=subregex) checks whether the text immediately before the cursor can be matched by subregex. Here the cursor is directly after a, so the (?<=9) look-behind "sees" that a as the text its subexpression has to test. Since a can't be matched by 9, the evaluation fails.
Solution(s)
You probably wanted to check whether 9 is placed before an acceptable letter. To achieve that you can modify your regex in several ways:
with [a-z](?<=9.) you make the look-behind test the two previous characters
a9c|
^^
9. - `9` matches 9, `.` matches any character (one directly before cursor)
or, simpler, (?<=9)[a-z], which first checks for 9 behind the cursor and then looks for [a-z]. This lets the regex match c when the cursor is at 9|c (the 9 is checked but is not part of the match).
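The three variants can be compared side by side on the input from the question. This is only a sketch with a hypothetical helper (findAll) that records the start index and text of each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookBehindPositionSketch {
    // Collects "start:text" for every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "a9es m9x us9s w9es";
        // Look-behind runs after the letter, so it tests the letter itself against 9.
        System.out.println(findAll("[a-z](?<=9)", input));
        // Look-behind tests the two previous characters: 9, then the letter.
        System.out.println(findAll("[a-z](?<=9.)", input));
        // Look-behind runs before the letter is consumed.
        System.out.println(findAll("(?<=9)[a-z]", input));
    }
}
```

The first pattern finds nothing, while the other two report the same four letters.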

Your current pattern, [a-z](?<=9), means: match a lowercase letter and assert that the position right after the letter is preceded by 9, which is a contradiction.
If you want to match a letter preceded by 9, use (?<=9)[a-z], which means: assert that what precedes is 9 and, if so, match a lowercase letter.

Regex in Java: Capture last {n} words

Hi, I am trying to write a regex in Java that captures the last {n} words. (There may be a variable number of whitespace characters between words.) The requirement is that it has to be done with a regex.
So e.g. in
The man is very tall.
For n = 2, I need to capture
very tall.
So I tried
(\S*\s*){2}$
But this does not match in java because the initial words have to be consumed first. So I tried
^(.*)(\S*\s*){2}$
But .* consumes everything, and the last 2 words are ignored.
I have also tried
^\S?\s?(\S*\s*){2}$
Anyone know a way around this please?

You almost had it in your first attempt.
Just change + to *.
The plus sign means at least one character, and because there is no whitespace after the last word, the match failed.
The asterisk, on the other hand, means zero or more, so it will work.
The resulting regex: (?:\S*\s*){2}$
Using the replaceAll method, you could try this regex with $1 as the replacement: ((?:\\S*\\s*){2}$)|.
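Both variants can be sketched in Java as follows. The class and method names are illustrative, and passing n as a parameter that is spliced into the {n} quantifier is an assumption about how the "last n words" requirement would be generalized.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastWordsSketch {
    // find() scans forward until the rest of the text consists of at most
    // n runs of non-whitespace, each followed by optional whitespace.
    static String lastWords(String text, int n) {
        Matcher m = Pattern.compile("(?:\\S*\\s*){" + n + "}$").matcher(text);
        return m.find() ? m.group() : "";
    }

    // replaceAll variant for n = 2: every character that is not part of the
    // final two words is matched by '.' and replaced with the empty group 1.
    static String lastTwoViaReplace(String text) {
        return text.replaceAll("((?:\\S*\\s*){2}$)|.", "$1");
    }

    public static void main(String[] args) {
        String s = "The man is very tall.";
        System.out.println(lastWords(s, 2));
        System.out.println(lastWords(s, 3));
        System.out.println(lastTwoViaReplace(s));
    }
}
```

The '.' in the replaceAll variant does not match line terminators by default, so this sketch assumes single-line input.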

Your regex contains, as you already mention, a greedy subpattern that eats up the whole string, and since (\S*\s*){2} can match an empty string, it matches an empty location at the end of the input string.
Lazy dot matching (changing .* to .*?) won't do the whole job, since the capturing group is quantified and will therefore hold only its last repetition: the last run of non-whitespace characters with optional whitespace. You need to put the capturing group around the quantified group.
Since you are most likely using Matcher#matches, you can use
String str = "The man is very tall.";
Pattern ptrn = Pattern.compile("(.*?)((?:\\S*\\s*){2})"); // no need for `^`/`$` with matches()
Matcher matcher = ptrn.matcher(str);
if (matcher.matches()) { // Group 2 contains the last 2 "words"
System.out.println(matcher.group(2)); // => very tall.
}

finding repeated characters in a row (3 times or more) in a string

Here is the code for finding repeated characters, like A in AAbbbc:
String stringToMatch = "abccdef";
Pattern p = Pattern.compile("((\\w)\\2+)+");
Matcher m = p.matcher(stringToMatch);
while (m.find())
{
System.out.println("Duplicate character " + m.group(0));
}
Now the problem is that I want to find only the characters that are repeated 3 times or more in a row.
When I change 2 to 3 in the above code it does not work.
Can anyone help?

You shouldn't change 2 to 3, because it's the number of the capture group, not a repetition count. You can use two group references here:
"((\\w)\\2\\2)+"
But your regex still doesn't match whole strings like your example, since it only matches the repeated characters. For that you can use the following regex:
"((\\w)\\2+\\2)+.*"

You may use the repetition quantifier.
Pattern p = Pattern.compile("(\\w)\\1{2,}");
Matcher m = p.matcher(stringToMatch);
while (m.find())
{
System.out.println("Duplicate character " + m.group(1));
}
Now the duplicate character is captured by index 1, not index 0, which refers to the whole match. Just change the number inside the repetition quantifier to change the minimum run length; for example, "(\\w)\\1{5,}" matches a char followed by 5 or more copies of itself.
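A self-contained sketch of the same idea, with the minimum run length as a parameter. The names (RepeatedRunsSketch, runs, minRun) are hypothetical. Since the group already matches one occurrence, the backreference only needs minRun - 1 further copies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedRunsSketch {
    // Returns every run in which the same word character occurs
    // minRun or more times in a row (minRun is assumed to be at least 2).
    static List<String> runs(String input, int minRun) {
        Pattern p = Pattern.compile("(\\w)\\1{" + (minRun - 1) + ",}");
        Matcher m = p.matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group()); // whole run; m.group(1) is the repeated char
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(runs("AAbbbc", 3));
        System.out.println(runs("aabbbcdddde", 3));
    }
}
```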

That original regex is flawed. It only finds "word" characters (alpha, numeric, underscore). The requirement is "find characters that repeat 3 or more times in a row." The dot is the any-character metacharacter.
(?=(.)\1{3})(\1+)
So, that will find a character that occurs 4 or more consecutive times (i.e., meets your requirement of a character that "repeats" three or more times). If you really meant "occurs," change the 3 to 2. Anyway, it does a non-consuming "zero-length assertion" before capturing any data, so should be more efficient. It will only consume and capture data once you've found your minimum requirement (a single character that repeats at least 3 times). You can then consume it with the one-or-more '+' quantifier because you know it's a match you want; further quantification is redundant--your positive lookahead has already assured (asserted) that. Your results are in capture group 2 "(\1+)" and you can refer to it as \2.
Note: I tested that with perl command-line utility, so that's the raw regex. It looks like you may need to escape certain characters prior to using it in the programming language you're using.
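In Java the backslashes have to be doubled inside a string literal. A minimal sketch, assuming "3 times or more in a row" means three occurrences, so the 3 is changed to 2 as suggested above for that reading. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadRunsSketch {
    // (?=(.)\1{2}) asserts that the next three characters are identical
    // without consuming them; (\1+) then consumes the whole run.
    private static final Pattern RUN = Pattern.compile("(?=(.)\\1{2})(\\1+)");

    // Returns every run of 3 or more identical characters (any character, not only \w).
    static List<String> runs(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = RUN.matcher(input);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(runs("aa!!!bccccd"));
    }
}
```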

IP and hostname detection

I know the basics of Java but I am not too experienced with regex or patterns, so please excuse me if I'm asking something super simple.
I'm writing a method that detects IP addresses and hostnames. I used the regex from another answer. The problem I am encountering, though, is that sentences without symbols are counted as host names.
Here's my code:
Pattern validHostname = Pattern.compile("^(([a-z]|[a-z][a-z0-9-]*[a-z0-9]).)*([a-z]|[a-z][a-z0-9-]*[a-z0-9])$",Pattern.CASE_INSENSITIVE);
Pattern validIpAddress = Pattern.compile("^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])([:]\\d\\d*\\d*\\d*\\d*)*$",Pattern.CASE_INSENSITIVE);
String msg = c.getMessage();
boolean found=false;
//Randomly picks from a list to replace the detected ip/hostname
int rand=(int)(Math.random()*whitelisted.size());
String replace=whitelisted.get(rand);
Matcher matchIP = validIpAddress.matcher(msg);
Matcher matchHost = validHostname.matcher(msg);
while(matchIP.find()){
if(adreplace)
msg=msg.replace(matchIP.group(),replace);
else
msg=msg.replace(matchIP.group(),"");
found=true;
c.setMessage(msg);
}
while(matchHost.find()){
if(adreplace)
msg=msg.replace(matchHost.group(),replace);
else
msg=msg.replace(matchHost.group(),"");
found=true;
c.setMessage(msg);
}
return c;

Description
Without sample text and desired output, I'll try my best to answer your question.
I would rewrite your host name expression like this:
A: ^(?:[a-z][a-z0-9-]*[a-z0-9](?=\.[a-z]|$)\.?)+$ will allow single-word names like abcdefg
B: ^(?=(?:.*?\.){2})(?:[a-z][a-z0-9-]*[a-z0-9](?=\.[a-z]|$)\.?)+$ requires the string to contain at least two periods, like abc.defg.com. This will not allow a period to appear at the beginning or end, nor sequential periods. The number inside the lookahead, {2}, is the minimum number of dots which must appear. You can change this number as you see fit.
^ match the start of the string anchor
(?: start a non-capture group, which performs slightly better than a capture group
[a-z][a-z0-9-]*[a-z0-9] match text, taken from your original expression
(?=\.[a-z]|$) look ahead to see if the next character is a dot followed by an a-z character, or the end of the string
\.? consume a single dot if it exists
) close the non-capture group
+ require the contents of the non-capture group to exist 1 or more times
$ match the end of the string anchor
Host names:
A Allows a host name without dots
B Requires the host name to contain dots (at least two with {2})
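A small sketch for trying both expressions from Java. The class and method names are made up, and CASE_INSENSITIVE is assumed because the question's code compiles its patterns with that flag.

```java
import java.util.regex.Pattern;

final class HostnameSketch {
    // Expression A: dots are optional.
    private static final Pattern A = Pattern.compile(
            "^(?:[a-z][a-z0-9-]*[a-z0-9](?=\\.[a-z]|$)\\.?)+$",
            Pattern.CASE_INSENSITIVE);

    // Expression B: the leading lookahead demands at least two dots.
    private static final Pattern B = Pattern.compile(
            "^(?=(?:.*?\\.){2})(?:[a-z][a-z0-9-]*[a-z0-9](?=\\.[a-z]|$)\\.?)+$",
            Pattern.CASE_INSENSITIVE);

    static boolean matchesA(String s) {
        return A.matcher(s).find();
    }

    static boolean matchesB(String s) {
        return B.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {"abcdefg", "abc.defg.com", "abc..com", "hello world"};
        for (String s : samples) {
            System.out.println(s + " A=" + matchesA(s) + " B=" + matchesB(s));
        }
    }
}
```

A sentence such as "hello world" is rejected by both, because a space is not in any of the character classes and the dot is now escaped.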
I would also rewrite the IP expression
^(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?::\d*)?$
The major differences here are that I:
removed the multiple \d* from the end, because the expression \d*\d*\d*\d*\d*\d* is equivalent to \d*
changed the character class [:] to a single character :
turned the capture groups (...) into non-capture groups (?:...), which perform a little better.
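The rewritten IP expression can be tried from Java like this. It is a sketch only, and the class and method names are hypothetical. One thing to be aware of: because the port part is written as \d*, a trailing colon with no digits is accepted as well.

```java
import java.util.regex.Pattern;

final class IpAddressSketch {
    // Four octets in the range 0-255, optionally followed by ':' and digits.
    private static final Pattern IP = Pattern.compile(
            "^(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}"
            + "(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?::\\d*)?$");

    static boolean isIp(String s) {
        return IP.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {"192.168.1.255", "10.0.0.1:8080", "256.1.1.1", "1.2.3", "1.2.3.4:"};
        for (String s : samples) {
            System.out.println(s + " -> " + isIp(s));
        }
    }
}
```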
