Regex to match/group repeating characters in a string

Regex to match/group repeating characters in a string - java

I need a regular expression that will match groups of characters in a string. Here's an example string:
qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT
It should match
(match group) "result"
(1) "q"
(2) "wwwwwwwww"
(3) "eeeee"
(4) "rr"
(5) "t"
(6) "yyyyy"
(7) "qqqq"
(8) "w"
(9) "EE"
(10) "r"
(11) "TTT"
after doing some research, this is the best I could come up with
/(.)(\1*)/g
The problem I'm having is that the only way to use the \1 back-reference is to capture the character first. If I could reference the result of a non capturing group I could solve this problem but after researching I don't think it's possible.

How about /((.)(\2*))/g? That way, you match the group as a whole (I'm assuming that that's what you want, and that's what's lacking from the solution you found).

Looks like you need to use a Matcher in a loop:
Pattern p = Pattern.compile("((.)\\2*)");
Matcher m = p.matcher("qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT");
while (m.find()) {
System.out.println(m.group(1));
}
Outputs:
q
wwwwwwwww
eeeee
rr
t
yyyyy
qqqq
w
EE
r
TTT

Assuming what #cruncher said as a premise is true: "we want to catch repeating letter groups without knowing beforehand which letter should be repeating" then:
/((a*?+)|(b*?+)|(c*?+)|(d*?+)|(e*?+)|(f*?+)|(g*?+)|(h*?+))/
The above RegEx should allow the capture of repeating letter groups without hardcoding a particular order in which they would occur.
The ?+ is a reluctant possesive quantifier which helps us not waste RAM space by not saving previously valid backtracking cases if the current case is valid.

Since you did tag java, I'll give an alternative non-regex solution(I believe in requirements being the end product, not the method by which you get there).
String repeat = "";
char c = '';
for(int i = 0 ; i < s.length() ; i++) {
if(s.charAt(i) == c) {
repeat += c;
} else {
if(!repeat.isEmpty())
doSomething(repeat); //add to an array if you want
c = s.charAt(i);
repeat = "" + c;
}
}
if(!repeat.isEmpty())
doSomething(repeat);

Related

How to split a string and save the 2 characters that I split with?

I am trying to split a given string using the java split method while the string should be devided by two different characters (+ and -) and I am willing to save the characters inside the array aswell in the same index the string has been saven.
for example :
input : String s = "4x^2+3x-2"
output :
arr[0] = 4x^2
arr[1] = +3x
arr[2] = -2
I know how to get the + or - characters in a different index between the numbers but it is not helping me,
any suggestions please?

You can face this problem in many ways. I´m sure there are clever and fancy ways to split this expression. I will show you the simplest problem-solving process that can help you.
State the problem you need to solve, the input and output
Problem: Split a math expression into subexpressions at + and - signals
Input: 4x^2+3x-2
Output: 4x^2,+3x,-2
Create a pseudo code with some logic you might think works
Given an expression string
Create an empty list of expressions
Create a subExpression string
For each character in the expression
Check if the character is + ou - then
add the subExpression in the list and create a new empty subexpression
otherwise, append the character in the subExpression
In the end, add the left subexpression in the list
Implement the pseudo-code in the programming language of your choice
String expression = "4x^2+3x-2";
List<String> expressions = new ArrayList();
StringBuilder subExpression = new StringBuilder();
for (int i = 0; i < expression.length(); i++) {
char character = expression.charAt(i);
if (character == '-' || character == '+') {
expressions.add(subExpression.toString());
subExpression = new StringBuilder(String.valueOf(character));
} else {
subExpression.append(String.valueOf(character));
}
}
expressions.add(subExpression.toString());
System.out.println(expressions);
Output
[4x^2, +3x, -2]
You will end with one algorithm that works for your problem. You can start to improve it.

Try this code:
String s = "4x^2+3x-2";
s = s.replace("+", "#+");
s = s.replace("-", "#-");
String[] ss = s.split("#");
for (int i = 0; i < ss.length; i++) {
Log.e("XOP",ss[i]);
}
This code replaces + and - with #+ and #- respectively and then splits the string with #. That way the + and - operators are not lost in the result.
If you require # as input character then you can use any other Unicode character instead of #.

Try this one:
String s = "4x^2+3x-2";
String[] arr = s.split("[\\+-]");
for(int i=0;i<arr.length;i++){
System.out.println(arr[i]);
}

Personally I like it better to have positive matches of patterns, especially if the split pattern itself is empty.
So for instance you could use a Pattern and Matcher like this:
Pattern p = Pattern.compile("(^|[+-])([^+-]*)");
Matcher m = p.matcher("4x^2+3x-2");
while (m.find()) {
System.out.printf("%s or %s %s%n", m.group(), m.group(1), m.group(2));
}
This matches the start of the string or a plus or minus: ^|[+-], followed by any amount of characters that are not a plus or minus: [^+-]*.
Do note that the ^ first matches the start of the string, and is then used to negate a character class when used between brackets. Regular expressions are tricky like that.
Bonus: you can also use the two groups (within the parenthesis in the pattern) to match the operators - if any.
All this is presuming that you want to use/test regular expressions; generally things like this require a parser rather than a regular expression.
A one-liner for persons thinking that this is too complex:
var expressions = Pattern.compile("^|[+-][^+-]*")
.matcher("4x^2+3x-2")
.results()
.map(r -> r.group())
.collect(Collectors.toList());

How to retain all occurrences of X when using the greedy quantifier X* in a java regexp?

I have a regular expression that I use to find matches of a list of coma-separated words between <> inside a string, like "Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd" in the example
I want to use capturing groups to retain each word between the braces:
Here is my expression: < (\w+) (?: ,(\w+) )* > (spaces are added for readability but not a part of the pattern)
Parenthesis are for creating capturing groups, (?: ) is for creating a non capturing group, because I don't want to retain the coma.
Here is my test code:
#Test
public void test() {
String patternString = "<(\\w+)(?:,(\\w+))*>";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher("Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd");
while(matcher.find()) {
System.out.println("== Match ==");
MatchResult matchResult = matcher.toMatchResult();
for(int i = 0; i < matchResult.groupCount(); i++) {
System.out.println(" " + matchResult.group(i + 1));
}
}
}
This is the output produced:
== Match ==
a1
null
== Match ==
b1
b2
== Match ==
c1
c3
And here is what I wanted:
== Match ==
a1
== Match ==
b1
b2
== Match ==
c1
c2
c3
From this I understand that there is exactly as many groups as the number of capturing groups in my expression, but this is not what I want, because I need all the substring that were recognized as the \w+
Is there any chance to get what I want with a single RegExp, or should I finish the job with split(","), trim(), etc...

As far as I know .NET has the only regex engine out there, that can return multiple captures for a single capturing group. So what you are asking for is not possible in Java (at least not the way you asked for).
In your case this problem can however be solved to a certain extent. If you can be sure that there will never be an unmatched closing >, you can make the stuff you want to capture the full match, and require the correct position through a lookahead:
"\\w+(?=(?:,\\w+)*>)"
This can never match "words" outside of <...>, because they cannot get past the opening < to match the closing >. Of course that makes it hard to distinguish between elements from different sets of <...>.
Alternatively (and I suppose that is even better, because it's safer, and more readable), go for a two-step algorithm. First match
"<([\\w,]*)>"
Then split every result's first capture at ,.

Java regex and pattern matching: finding "blanks" in pattern which do not include them?

So, I need to write a compiler scanner for a homework, and thought it'd be "elegant" to use regex. Fact is, I seldomly used them before, and it was a long time ago. So I forgot most of the stuff about them and needed to have a look around. I used them successfully for the identifiers (or at least I think so, I still need to do some further tests but for now they all look ok), but I have a problem with the numbers-recognition.
The function nextCh() reads the next character on the input (lookahead char). What I'd like to do here is to check if this char matches the regex [0-9]*. I append every matching char in the str field of my current token, then I read the int value of this field. It recognizes a single number input such as "123", but the problem I have is that for the input "123 456", the final str will be "123 456" while I should get 2 separate tokens with fields "123" and "456". Why is the " " being matched?
private void readNumber(Token t) {
t.str = "" + ch; // force conversion char --> String
final Pattern pattern = Pattern.compile("[0-9]*");
nextCh(); // get next char and check if it is a digit
Matcher match = pattern.matcher("" + ch);
while (match.find() && ch != EOF) {
t.str += ch;
nextCh();
match = pattern.matcher("" + ch);
}
t.kind = Kind.number;
try {
int value = Integer.parseInt(t.str);
t.val = value;
} catch(NumberFormatException e) {
error(t, Message.BIG_NUM, t.str);
}
Thank you!
PS: I did solve my problem using the code below. Nevertheless, I'd like to understand where the flaw is in my regex expression.
t.str = "" + ch;
nextCh(); // get next char and check if it is a number
while (ch>='0' && ch<='9') {
t.str += ch;
nextCh();
}
t.kind = Kind.number;
try {
int value = Integer.parseInt(t.str);
t.val = value;
} catch(NumberFormatException e) {
error(t, Message.BIG_NUM, t.str);
}
EDIT: turns out my regex also doesn't work for the identifiers recognition (again, includes blanks), so I had to switch to a system similar to my "solution" (while with a lot of conditions). Guess I'll need to study the regex again :O

I'm not 100% sure whether this is relevant in your case, but this:
Pattern.compile("[0-9]*");
matches zero or more numbers anywhere in the string, because of the asterisk. I think the space gets matched because it is a match for 'zero numbers'. If you wanted to make sure the char was a number, you would have to match one or more, using the plus sign:
Pattern.compile("[0-9]+");
or, since you are only comparing a single char at a time, just match one number:
Pattern.compile("^[0-9]$");

You should be using the matches method rather than the find method. From the documentation:
The matches method attempts to match the entire input sequence against the pattern
The find method scans the input sequence looking for the next subsequence that matches the pattern.
So in other words, by using find, if the string contains a digit anywhere at all, you'll get a match, but if you use matches the entire string must match the pattern.
For example, try this:
Pattern p = Pattern.compile("[0-9]*");
Matcher m123abc = p.matcher("123 abc");
System.out.println(m123abc.matches()); // prints false
System.out.println(m123abc.find()); // prints true

Use a simpler regex like
/\d+/
Where
\d means a digit
+ means one or more
In code:
final Pattern pattern = Pattern.compile("\\d+");

Author and time matching regex

I would to use a regex in my Java program to recognize some feature of my strings.
I've this type of string:
`-Author- has wrote (-hh-:-mm-)
So, for example, I've a string with:
Cecco has wrote (15:12)
and i've to extract author, hh and mm fields. Obviously I've some restriction to consider:
hh and mm must be numbers
author hasn't any restrictions
I've to consider space between "has wrote" and (
How can I can use regex?
EDIT: I attach my snippet:
String mRegex = "(\\s)+ has wrote \\((\\d\\d):(\\d\\d)\\)";
Pattern mPattern = Pattern.compile(mRegex);
String[] str = {
"Cecco CQ has wrote (14:55)", //OK (matched)
"yesterday you has wrote that I'm crazy", //NO (different text)
"Simon has wrote (yesterday)", // NO (yesterday isn't numbers)
"John has wrote (22:32)", //OK
"James has wrote(22:11)", //NO (missed space between has wrote and ()
"Tommy has wrote (xx:ss)" //NO (xx and ss aren't numbers)
};
for(String s : str) {
Matcher mMatcher = mPattern.matcher(s);
while (mMatcher.find()) {
System.out.println(mMatcher.group());
}
}

homework?
Something like:
(.+) has wrote \((\d\d):(\d\d)\)
Should do the trick
() - mark groups to capture (there are three in the above)
.+ - any chars (you said no restrictions)
\d - any digit
\(\) escape the parens as literals instead of a capturing group
use:
Pattern p = Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
To cope with an optional (HH:mm) at the end you need to start to use some dark regex voodoo:
Pattern p = Pattern.compile("(.+) has wrote\\s?(?:\\((\\d\\d):(\\d\\d)\\))?");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
m = p.matcher("Gareth has wrote");
if( m.matches()){
System.out.println(m.group(1));
// m.group(2) == null since it didn't match anything
}
The new unescaped pattern:
(.+) has wrote\s?(?:\((\d\d):(\d\d)\))?
\s? optionally match a space (there might not be a space at the end if there isn't a (HH:mm) group
(?: ... ) is a none capturing group, i.e. allows use to put ? after it to make is optional
I think #codinghorror has something to say about regex

The easiest way to figure out regular expressions is to use a testing tool before coding.
I use an eclipse plugin from http://www.brosinski.com/regex/
Using this I came up with the following result:
([a-zA-Z]*) has wrote \((\d\d):(\d\d)\)
Cecco has wrote (15:12)
Found 1 match(es):
start=0, end=23
Group(0) = Cecco has wrote (15:12)
Group(1) = Cecco
Group(2) = 15
Group(3) = 12
An excellent turorial on regular expression syntax can be found at http://www.regular-expressions.info/tutorial.html

Well, just in case you didn't know, Matcher has a nice function that can draw out specific groups, or parts of the pattern enclosed by (), Matcher.group(int). Like if I wanted to match for a number between two semicolons like:
:22:
I could use the regex ":(\\d+):" to match one or more digits between two semicolons, and then I can fetch specifically the digits with:
Matcher.group(1)
And then its just a matter of parsing the String into an int. As a note, group numbering starts at 1. group(0) is the whole match, so Matcher.group(0) for the previous example would return :22:
For your case, I think the regex bits you need to consider are
"[A-Za-z]" for alphabet characters (you could probably also safely use "\\w", which matchers alphabet characters, as well as numbers and _).
"\\d" for digits (1,2,3...)
"+" for indicating you want one or more of the previous character or group.

How can I find repeated characters with a regex in Java?

Can anyone give me a Java regex to identify repeated characters in a string? I am only looking for characters that are repeated immediately and they can be letters or digits.
Example:
abccde <- looking for this (immediately repeating c's)
abcdce <- not this (c's seperated by another character)

Try "(\\w)\\1+"
The \\w matches any word character (letter, digit, or underscore) and the \\1+ matches whatever was in the first set of parentheses, one or more times. So you wind up matching any occurrence of a word character, followed immediately by one or more of the same word character again.
(Note that I gave the regex as a Java string, i.e. with the backslashes already doubled for you)

String stringToMatch = "abccdef";
Pattern p = Pattern.compile("(\\w)\\1+");
Matcher m = p.matcher(stringToMatch);
if (m.find())
{
System.out.println("Duplicate character " + m.group(1));
}

Regular Expressions are expensive. You would probably be better off just storing the last character and checking to see if the next one is the same.
Something along the lines of:
String s;
char c1, c2;
c1 = s.charAt(0);
for(int i=1;i<s.length(); i++){
char c2 = s.charAt(i);
// Check if they are equal here
c1=c2;
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Regex to match/group repeating characters in a string - java

How about /((.)(\2*))/g? That way, you match the group as a whole (I'm assuming that that's what you want, and that's what's lacking from the solution you found).

Looks like you need to use a Matcher in a loop: Pattern p = Pattern.compile("((.)\\2*)"); Matcher m = p.matcher("qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT"); while (m.find()) { System.out.println(m.group(1)); } Outputs: q wwwwwwwww eeeee rr t yyyyy qqqq w EE r TTT

Related

How to split a string and save the 2 characters that I split with?

How to retain all occurrences of X when using the greedy quantifier X* in a java regexp?

Java regex and pattern matching: finding "blanks" in pattern which do not include them?

Author and time matching regex

How can I find repeated characters with a regex in Java?

Categories

Resources