How to match the middle character in a string with regex? - java

In an odd number length string, how could you match (or capture) the middle character?
Is this possible with PCRE, plain Perl or Java regex flavors?
With .NET regex you could use balancing groups to solve it easily (that could be a good example). By plain Perl regex I mean not using any code constructs like (??{ ... }), with which you could run any code and of course do anything.
The string could be of any odd number length.
For example in the string 12345 you would want to get the 3, the character at the center of the string.
This is a question about the possibilities of modern regex flavors and not about the best algorithm to do that in some other way.

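With PCRE and Perl you could use the following (Java's java.util.regex has no conditional construct (?(1)...), so this exact pattern will not compile there):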
^(?:.(?=.*?(?(1)(?=.\1$))(.\1?$)))*(.)
which would capture the middle character of odd length strings in the 2nd capturing group.
Explained:
^ # beginning of the string
(?: # loop
. # match a single character
(?=
# lookahead that scans non-greedily towards the end of the string
.*?
# if we already have captured the end of the string (skip the first iteration)
(?(1)
# make sure we do not go past the correct position
(?= .\1$ )
)
# capture the end of the string +1 character, adding to \1 every iteration
( .\1?$ )
)
)* # repeat
# the middle character follows, capture it
(.)

Hmm, maybe someone can come up with a pure regex solution, but if not, you could always build the regex dynamically like this:
import java.util.regex.Pattern;

public class MiddleCharCheck {
    public static void main(String[] args) {
        String s = "12345";
        int half = s.length() / 2;
        // Builds ".{2}3.{2}", which tests whether the middle character is '3'
        String regex = String.format(".{%d}3.{%d}", half, half);
        Pattern p = Pattern.compile(regex);
        System.out.println(p.matcher(s).matches()); // prints true
    }
}
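If the goal is to extract the middle character rather than to test for a known one, the same dynamic-build idea can use a capturing group. The following is only a sketch: the class MiddleCharDemo and the helper middleChar are hypothetical names, not part of the original answer, and length is counted in UTF-16 chars.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MiddleCharDemo {
    // Hypothetical helper: returns the middle character of an odd-length string.
    static Optional<String> middleChar(String s) {
        if (s.length() % 2 == 0) {
            return Optional.empty(); // even length (or empty): no single middle
        }
        int half = s.length() / 2;
        // For length 5 this builds ^.{2}(.).{2}$
        Pattern p = Pattern.compile("^.{" + half + "}(.).{" + half + "}$", Pattern.DOTALL);
        Matcher m = p.matcher(s);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(middleChar("12345").orElse("(none)")); // 3
        System.out.println(middleChar("1234").orElse("(none)"));  // (none)
    }
}
```

The pattern has to be recompiled for every distinct length, which is the price of not having a length-independent regex.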


Replacing repeatedly occurring groups of an anchored regex in java

Using Java 7 and the default regex implementation in java.util.regex.Pattern, given a regex like this:
^start (m[aei]ddel[0-9] ?)+ tail$
And a string like this:
start maddel1 meddel2 middel3 tail
Is it possible to get an output like this using the anchored regex:
start <match> <match> <match> tail.
I can get every group without anchors like this:
Regex: m[aei]ddel[0-9]
StringBuffer sb = new StringBuffer();
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
matcher.appendReplacement(sb, Matcher.quoteReplacement("<middle>"));
}
The problem is that I'm working on a quite big dataset and being able to anchor the patterns would be a huge performance win.
However, when I add the anchors, the only API that I can find requires a whole match and gives access only to the last occurrence of the group. In my case I need to verify that the regex actually matches (i.e. a whole match), but in the replacement step I need to be able to access every group on its own.
Edit: I'd like to avoid workarounds like looking for the anchors in a separate step, because that would require bigger changes to the code, and wrapping it all up in regexes feels more elegant.
You can use \G for this:
final String regex = "(^start |(?<!^)\\G)m[aei]ddel[0-9] (?=.*\\btail$)";
final String str = "start maddel1 meddel2 middel3 tail";
String repl = str.replaceAll(regex, "$1<match> ");
//=> start <match> <match> <match> tail
\G asserts position at the end of the previous match or the start of the string for the first match.
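A small illustration of that behavior, independent of the question's data: with \G in front, find() only returns matches that form an unbroken chain from the start of the input. The class and method names here are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContiguousMatchDemo {
    // Collects the digits that form an unbroken run from the start of the input.
    static List<String> leadingDigits(String input) {
        Matcher m = Pattern.compile("\\G\\d").matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The 'a' breaks the chain, so 4 and 5 are never reached.
        System.out.println(leadingDigits("123a45")); // [1, 2, 3]
    }
}
```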
To do it in one step, you need to use a \G based regex that will do the anchoring. However, you also need a positive lookahead to check if the string ends with the desired pattern.
Here is a regex that should work:
(^start|(?!\A)\G)\s+m[aei]ddel[0-9](?=(?:\s+m[aei]ddel[0-9])*\s+tail$)
String s = "start maddel1 meddel2 middel3 tail";
String pat = "(^start|(?!\\A)\\G)\\s+(m[aei]ddel[0-9])(?=(?:\\s+m[aei]ddel[0-9])*\\s+tail$)";
System.out.println(s.replaceAll(pat, "$1 <middle>" ));
Explanation:
(^start|(?!\A)\G) - match start at the beginning of the string, or the end of the previous successful match
\s+ - 1 or more whitespaces
m[aei]ddel[0-9] - m, then either a, e, i, then ddel, then 1 digit
(?=(?:\s+m[aei]ddel[0-9])*\s+tail$) - only if followed with:
(?:\s+m[aei]ddel[0-9])* - zero or more sequences of 1+ whitespaces and middelN pattern
\s+ - 1 or more whitespaces
tail$ - the tail substring followed by the end of the string.
With the \G anchor, for the find method, you can write it this way:
pat = "\\G(?:(?!\\A) |\\Astart (?=(?:m[aei]ddel[0-9] )+tail\\z))(m\\S+)";
details:
\\G # position after the previous match or at the start of the string
# factoring it out makes the pattern fail more quickly after the last match
(?:
(?!\\A) [ ] # a space not at the start of the string
# this branch comes first because it is more likely to succeed
|
\\A start [ ] # "start " at the beginning of the string
(?=(?:m[aei]ddel[0-9] )+tail\\z) # check the string format once and for all
# since this branch will succeed only once
)
( # capture group 1
m\\S+ # the shortest and simplest pattern that matches "m[aei]ddel[0-9]"
# and excludes "tail" (adapt it to your need but keep the same idea)
)
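One possible way to use this pattern for the replacement is to loop with find() and substitute only the span of capture group 1, so that "start " and the separating spaces are copied through unchanged. This is a sketch with hypothetical names (GAnchorReplaceDemo, replaceMiddles); the "<match>" token is taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GAnchorReplaceDemo {
    static final Pattern PAT = Pattern.compile(
        "\\G(?:(?!\\A) |\\Astart (?=(?:m[aei]ddel[0-9] )+tail\\z))(m\\S+)");

    // Replaces every group-1 span with "<match>"; input that fails the
    // anchored format check produces no matches and is returned unchanged.
    static String replaceMiddles(String input) {
        Matcher m = PAT.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start(1)).append("<match>");
            last = m.end(1);
        }
        return out.append(input, last, input.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceMiddles("start maddel1 meddel2 middel3 tail"));
        // start <match> <match> <match> tail
    }
}
```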

Regex match certain IDs

I need help with some tricky regex to solve (for me!) and hope I can learn something to write some myself in the future.
I need to match all of the following IDs:
#1
#12
#123
#1234
#5069
#316&
#316.
#316;
and I do not want to match numbers with leading zeros, numbers that are followed by ] or [, or numbers between ():
#0155
#0000155
#1123]
#1123[
(#1125)
I have come up with something like this: (#[1-9]\d{0,}), but it matches all of the above. So I tried different things, like:
"(#[1-9]\\d{0,})([\\s,<\\.:&;\\)])"
"(#[1-9]+)([\\s,<\\.])"
"(?m)(#[1-9]+)(.,\(,\))"
But what I really want to do is (#[1-9]\d{0,}) to match all numbers BUT NOT FOLLOWING [ OR ] OR ( OR ).
How can I express something like this in a regex?
P.S.: The Regex needs to be used in Java.
Maybe someone can help solve this, and even better explain how they arrived at the solution, so I can learn something new and help others who struggle with the same problem.
Kind regards!
You can use the following solution:
#[1-9]\d*(?![\[\])])\b[&.;]?
REGEX:
# - Matches # literally
[1-9] - 1 digit from 1 to 9
\d* - 0 or more digits
(?![\[\])]) - A negative lookahead making sure there is no [, ] or ) after the digits
\b - A word boundary
[&.;]? - An optional (?) character group matching &, . or ; literally.
Sample code:
String str = "#1\n#12\n#123\n#1234\n#5069\n#316&\n#316.\n#316;\nand not matches (leading zeros) and numbers that end with ] or [ or are between ().\n\n#0155\n#0000155\n#1123]\n#1123[\n(#1125)";
String rx = "#[1-9]\\d*(?![\\[\\])])\\b[&.;]?";
Pattern ptrn = Pattern.compile(rx);
Matcher m = ptrn.matcher(str);
while (m.find()) {
System.out.println(m.group(0));
}
UPDATE
You can achieve the expected results with atomic grouping that prevents the regex engine from backtracking into it.
String rx = "#(?>[1-9]\\d*)(?![\\[\\])])[^\\w&&[^\n]]?";
In plain words, the check for brackets is performed only after the last digit has been matched.
The [^\\w&&[^\n]]? pattern optionally matches any non-alphanumeric character but a newline. The newline is excluded from the character class using a character class intersection technique.
You may use a possessive quantifier.
"#[1-9]\\d*+(?![\\[\\])])"
\\d*+ greedily matches zero or more digits, and the + after the * makes the quantifier possessive, so the regex engine cannot backtrack into the digits it has already matched.
Add an optional \\W, if you want to match also the following non-word character.
"#[1-9]\\d*+(?![\\[\\])])\\W?"
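To see why the possessive quantifier matters, it helps to compare it with the plain greedy version on one of the IDs that should be rejected. The sketch below uses hypothetical names (PossessiveDemo, firstMatch); the two patterns differ only in \\d* versus \\d*+.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PossessiveDemo {
    static final Pattern GREEDY = Pattern.compile("#[1-9]\\d*(?![\\[\\])])");
    static final Pattern POSSESSIVE = Pattern.compile("#[1-9]\\d*+(?![\\[\\])])");

    // Returns the first match of p in input, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy \d* gives back the final digit so the lookahead sees '3' instead of ']'.
        System.out.println(firstMatch(GREEDY, "#1123]"));     // #112
        // Possessive \d*+ keeps all digits, so the lookahead fails on ']'.
        System.out.println(firstMatch(POSSESSIVE, "#1123]")); // null
        System.out.println(firstMatch(POSSESSIVE, "#316&"));  // #316
    }
}
```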
I am not able to test this in Java at the moment, but how about
"^#[1-9][0-9]*[&.;]?$"
(Any string starting with a '#', then a character from 1-9, then zero or more characters from 0-9, then a '&', '.' or ';' or nothing, end string)
EDIT: This only works if every ID to check is in its own string; otherwise you'd need one of the examples from the other answers.
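For the one-ID-per-string case, the anchored pattern can be checked against the question's sample IDs roughly as follows. This is a sketch; the class and method names are invented for the example, and matches() is used so the whole string must fit the pattern.

```java
import java.util.regex.Pattern;

final class AnchoredIdDemo {
    static final Pattern ID = Pattern.compile("^#[1-9][0-9]*[&.;]?$");

    // True when the entire string is a valid ID, optionally followed by &, . or ;
    static boolean isId(String candidate) {
        return ID.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"#1", "#5069", "#316;", "#0155", "#1123]", "(#1125)"};
        for (String s : samples) {
            System.out.println(s + " -> " + isId(s));
        }
    }
}
```

The first three samples are accepted and the last three are rejected.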

Simple regex to match strings containing <n> chars

I'm writing this regexp because I need a way to find strings that do not have n dots.
I thought that a negative lookahead would be the best choice; so far my regexp is:
"^(?!\\.{3})$"
The way I read this is: between the start and the end of the string, there can be more or fewer than 3 dots, but not exactly 3.
To my surprise, this does not match hello.here.im.greetings
which I would expect to match.
I'm writing in Java, so it's a Perl-like flavor; I'm not escaping the curly braces, as that's not needed in Java.
Any advice?
You're on the right track:
"^(?!(?:[^.]*\\.){3}[^.]*$)"
will work as expected.
Your regex means
^ # Match the start of the string
(?!\\.{3}) # Make sure that there aren't three dots at the current position
$ # Match the end of the string
so it could only ever match the empty string.
My regex means:
^ # Match the start of the string
(?! # Make sure it's impossible to match...
(?: # the following:
[^.]* # any number of characters except dots
\\. # followed by a dot
){3} # exactly three times.
[^.]* # Now match only non-dot characters
$ # until the end of the string.
) # End of lookahead
Use it as follows:
Pattern regex = Pattern.compile("^(?!(?:[^.]*\\.){3}[^.]*$)");
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();
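The snippet above can be generalized to an arbitrary dot count by building the pattern from n. The following is a sketch under that assumption; DotCountDemo and lacksExactly are hypothetical names.

```java
import java.util.regex.Pattern;

final class DotCountDemo {
    // True when s does NOT contain exactly n dots (n >= 0).
    static boolean lacksExactly(int n, String s) {
        Pattern p = Pattern.compile("^(?!(?:[^.]*\\.){" + n + "}[^.]*$)");
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(lacksExactly(3, "hello.here.im.greetings")); // false: exactly 3 dots
        System.out.println(lacksExactly(3, "a.b"));                     // true: 1 dot
        System.out.println(lacksExactly(3, "a.b.c.d.e"));               // true: 4 dots
    }
}
```

Because [^.]* cannot cross a dot, the lookahead can only succeed when the string contains exactly n dots in total, which is what distinguishes it from a check for "n or more".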
Your regular expression only checks that there are not three consecutive dots at the current position. Your example suggests you want to reject three dots anywhere in the string.
Try this: ^(?!(?:.*\\.){3})
Demo+explanation: http://regex101.com/r/bS0qW1
Check out Tim's answer instead.

Java regex: Negative lookahead

I'm trying to craft two regular expressions that will match URIs. These URIs are of the format: /foo/someVariableData and /foo/someVariableData/bar/someOtherVariableData
I need two regexes. Each needs to match one but not the other.
The regexes I originally came up with are:
/foo/.+ and /foo/.+/bar/.+ respectively.
I think the second regex is fine. It will only match the second string. The first regex, however, matches both. So, I started playing around (for the first time) with negative lookahead. I designed the regex /foo/.+(?!bar) and set up the following code to test it
public static void main(String[] args) {
String shouldWork = "/foo/abc123doremi";
String shouldntWork = "/foo/abc123doremi/bar/def456fasola";
String regex = "/foo/.+(?!bar)";
System.out.println("ShouldWork: " + shouldWork.matches(regex));
System.out.println("ShouldntWork: " + shouldntWork.matches(regex));
}
And, of course, both of them resolve to true.
Anybody know what I'm doing wrong? I don't need to use Negative lookahead necessarily, I just need to solve the problem, and I think that negative lookahead might be one way to do it.
Thanks,
Try
String regex = "/foo/(?!.*bar).+";
or possibly
String regex = "/foo/(?!.*\\bbar\\b).+";
to avoid failures on paths like /foo/baz/crowbars which I assume you do want that regex to match.
Explanation: (without the double backslashes required by Java strings)
/foo/ # Match "/foo/"
(?! # Assert that it's impossible to match the following regex here:
.* # any number of characters
\b # followed by a word boundary
bar # followed by "bar"
\b # followed by a word boundary.
) # End of lookahead assertion
.+ # Match one or more characters
\b, the "word boundary anchor", matches the empty space between an alphanumeric character and a non-alphanumeric character (or between the start/end of the string and an alnum character). Therefore, it matches before the b or after the r in "bar", but it fails to match between w and b in "crowbar".
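The difference between the question's attempt and the suggested pattern can be checked directly with String.matches. In the original regex, .+ consumes the rest of the string and the lookahead is then tested at the very end, where nothing follows, so it always succeeds. The class name below is made up for this sketch.

```java
final class UriLookaheadDemo {
    static final String ORIGINAL = "/foo/.+(?!bar)";
    static final String FIXED = "/foo/(?!.*\\bbar\\b).+";

    public static void main(String[] args) {
        String shortUri = "/foo/abc123doremi";
        String longUri = "/foo/abc123doremi/bar/def456fasola";

        System.out.println(longUri.matches(ORIGINAL));          // true (the problem)
        System.out.println(shortUri.matches(FIXED));            // true
        System.out.println(longUri.matches(FIXED));             // false
        System.out.println("/foo/baz/crowbars".matches(FIXED)); // true, thanks to \b
    }
}
```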
Protip: Take a look at http://www.regular-expressions.info - a great regex tutorial.

regex and java : Generating a regex for a String [duplicate]

I have a String like "Bangalore,India=Karnataka". From this String I would like to extract only the substring "Bangalore". In this case the regex can be (.+),.*=.*. But the problem is that the String can sometimes be just "Bangalore", and in that case the above regex won't work. What regex will extract the substring "Bangalore" whatever the String is?
Try this one:
^(.+?)(?:,.*?)?=.*$
Explanation:
^ # Beginning of the string
( # beginning of capture group 1
.+? # one or more any char non-greedy
) # end of group 1
(?: # beginning of NON capture group
, # a comma
.*? # 0 or more any char non-greedy
)? # end of non capture group, optional
= # equal sign
.* # 0 or more any char
$ # end of string
Updated:
I thought the OP had to match Bangalore,India=Karnataka or Bangalore=Karnataka, but as far as I understand it is Bangalore,India=Karnataka or Bangalore, so the regex is much simpler:
^([^,]+)
This will match, at the beginning of the string, one or more non-comma characters and capture them in group 1.
matcher.matches()
tries to match against the entire input string. Look at the javadoc for java.util.regex.Matcher. You need to use:
matcher.find()
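As an illustration of the difference, here is the simple ^([^,]+) pattern from the earlier answer used with both calls. This is a sketch; CityDemo and city are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CityDemo {
    static final Pattern CITY = Pattern.compile("^([^,]+)");

    // Returns everything before the first comma, or null if the input starts with one.
    static String city(String input) {
        Matcher m = CITY.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(city("Bangalore,India=Karnataka")); // Bangalore
        System.out.println(city("Bangalore"));                 // Bangalore
        // matches() requires the whole input to fit the pattern, so it fails here:
        System.out.println(CITY.matcher("Bangalore,India=Karnataka").matches()); // false
    }
}
```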
Are you somehow forced to solve this using one regexp and nothing else? (Stupid interview question? Extremely inflexible external API?) In general, don't try to make regexes do what plain old programming constructs do better. Just use the obvious regex, and if it doesn't match, return the entire string instead.
Try this regex. It will grab any run of characters at the start that is followed by a comma, but not the comma itself.
^.*(?=,)
If you are only interested to check that "Bangalore" is contained in the string then you don't need a regexp for this.
Python:
In [1]: s = 'Bangalorejkdjiefjiojhdu'
In [2]: 'Bangalore' in s
Out[2]: True
