How to use regex groups in Java


I need to replace the string 'name' with 'fullName' in strings like the following:
software : (publisher:abc and name:oracle)
This needs to become:
software : (publisher:abc and fullName:xyz)
The "name:xyz" part can appear anywhere inside the parentheses, e.g.
software:(name:xyz)
I am trying to use groups, and the regex I built looks like this:
(\bsoftware\s*?:\s*?\()((.*?)(\s*?(and|or)\s*?))(\bname:.*?\)\s|:.*?\)$)

You may use
\b(software\s*:\s*\([^()]*)\bname:\w+
and replace with $1fullName:xyz.
Details
\b - word boundary
(software\s*:\s*\([^()]*) - Capturing group 1 ($1 in the replacement pattern is a placeholder for the value captured in this group):
software - a word
\s*:\s* - a : enclosed with 0+ whitespaces
\( - a ( char
[^()]* - 0 or more chars other than ( and )
\bname - whole word name
: - colon
\w+ - 1 or more letters, digits or underscores.
Java sample code:
String result = s.replaceAll("\\b(software\\s*:\\s*\\([^()]*)\\bname:\\w+", "$1fullName:xyz");
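A minimal, self-contained sketch of the replacement above, run against the two sample strings from the question. The class name RenameFieldDemo and the helper renameField are made up for illustration, and the replacement value xyz is hard-coded as in the answer.

```java
import java.util.regex.Pattern;

class RenameFieldDemo {
    // Group 1 keeps everything from "software : (" up to the word "name".
    private static final Pattern NAME_FIELD =
            Pattern.compile("\\b(software\\s*:\\s*\\([^()]*)\\bname:\\w+");

    // Hypothetical helper: swaps "name:<word>" for "fullName:xyz" inside software:(...).
    static String renameField(String s) {
        return NAME_FIELD.matcher(s).replaceAll("$1fullName:xyz");
    }

    public static void main(String[] args) {
        System.out.println(renameField("software : (publisher:abc and name:oracle)"));
        // prints: software : (publisher:abc and fullName:xyz)
        System.out.println(renameField("software:(name:xyz)"));
        // prints: software:(fullName:xyz)
    }
}
```

Because [^()]* is greedy, only the last name:... inside a given pair of parentheses is rewritten. That is enough for the question's examples, which contain a single name field each.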

Related

REGEX greediness or just wrong syntax

I am trying to delete all the [.!?] characters from the quotes in a text. To do so, I first want to match every quote that contains [.!?] with a regex, and then delete those characters.
My regex doesn't work, maybe because it's greedy. It matches from my first "«" (the character at index 569) to the last character, which is another "»" (the character at index 2730).
My regex was:
Pattern full = Pattern.compile("«.*[.!?].*?»");
Matcher mFull = full.matcher(result);
while (mFull.find()) {
    System.out.println(mFull.start() + " " + mFull.end());
}
So I got:
569 2731
I have the same greediness problem when matching sentences (beginning with any [A-Z] and ending with any [.!?]).

You may use
s = s.replaceAll("(\\G(?!^)|«)([^«».!?]*)[.!?](?=[^«»]*»)", "$1$2");
Details
(\G(?!^)|«) - Group 1 (whose value is referred to with $1 from the replacement pattern): either the end of the previous match or «
([^«».!?]*) - Group 2 ($2): any 0+ chars other than «, », !, . and ?
[.!?] - any of the three symbols
(?=[^«»]*») - there must be a » after 0 or more chars other than « and » immediately to the right of the current location.
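As a quick check of the pattern above, here is a small sketch that applies it to a made-up sentence. The class name, the helper stripInsideQuotes and the sample text are assumptions for illustration; only the regex and the replacement come from the answer.

```java
class QuotePunctuationDemo {
    // Hypothetical helper wrapping the replaceAll call from the answer.
    // Each match removes one [.!?] that is followed by a closing » with no
    // other guillemet in between; \G chains further matches inside the same quote.
    static String stripInsideQuotes(String s) {
        return s.replaceAll("(\\G(?!^)|«)([^«».!?]*)[.!?](?=[^«»]*»)", "$1$2");
    }

    public static void main(String[] args) {
        String text = "Hi. «Hello! How are you?» Bye.";
        System.out.println(stripInsideQuotes(text));
        // prints: Hi. «Hello How are you» Bye.
    }
}
```

Punctuation outside the guillemets is left alone because a match can only start at « or directly after a previous match.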

Using regular expression to validate colon separated inputs

I'm reading in a file for a Java application which has data separated by colons in the format:
test : test : 0 : 0
Where the first two segments are names of something and the last two are digits.
The match should fail if the input is not formatted in that exact way above (aside from the data being different)
test : test : 0 : 0 -----> pass
: test: 0 : 0 -----> fail
0 : test : 0 : test -----> fail
test test : 0 : 0 -----> fail
So the match will fail if there are any segments omitted, if the digits and words do not appear where they should, i.e. word : word : digit : digit, and there has to be 3 colons and 4 segments no more no less as above.
This is where I have gotten so far but it's not quite right:
^\D+(?:\s\:\s\w+)*$

You may use a regex like
^[a-zA-Z]+\s*:\s*[a-zA-Z]+(?:\s*:\s*\d+){2}$
Details
^ - start of string (implicit in String#matches)
[a-zA-Z]+ - 1+ ASCII letters
\s*:\s* - a : enclosed with 0+ whitespaces
[a-zA-Z]+ - 1+ ASCII letters
(?:\s*:\s*\d+){2} - two occurrences of : enclosed with 0+ whitespaces and then 1+ digits
$ - end of string (implicit in String#matches)
NOTE: If there must be an obligatory single space between the items, you need to replace \s* with \s. To match 1 or more whitespaces, \s* must be turned into \s+.
In Java, you may write it as
s.matches("[a-zA-Z]+\\s*:\\s*[a-zA-Z]+(?:\\s*:\\s*\\d+){2}")
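The following sketch runs that pattern against the four sample lines from the question. The class name ColonLineValidator and the method isValid are illustrative names, not part of the original answer.

```java
import java.util.regex.Pattern;

class ColonLineValidator {
    // word : word : digits : digits, with optional whitespace around each colon.
    // Matcher.matches() requires the whole line to match, so ^ and $ are omitted.
    private static final Pattern LINE =
            Pattern.compile("[a-zA-Z]+\\s*:\\s*[a-zA-Z]+(?:\\s*:\\s*\\d+){2}");

    static boolean isValid(String line) {
        return LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "test : test : 0 : 0",   // pass
            ": test: 0 : 0",         // fail: first name missing
            "0 : test : 0 : test",   // fail: digits and words in the wrong places
            "test test : 0 : 0"      // fail: only two colons
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + (isValid(sample) ? "pass" : "fail"));
        }
    }
}
```

Compiling the pattern once is worthwhile when every line of a file is checked, since String#matches recompiles the regex on each call.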

Here you go:
[a-zA-Z]+\s+:\s+[a-zA-Z]+\s+:\s+\d+\s+:\s+\d+
Explanation:
[a-zA-Z]+ stands for 1 or more letters (+ is the quantifier that matches the preceding element at least once)
\s+ stands for 1 or more whitespace characters
: is the : character, literally
\d+ stands for at least one digit (remove the + to match one digit exactly)
Finally, compose those parts according to your needs. You might want to make the regex stricter by replacing each \s+ with a single space.
Validate the string using the method String::matches (don't forget to double the backslashes, \\, in a Java string literal):
boolean isValid = string.matches("[a-zA-Z]+\\s+:\\s+[a-zA-Z]+\\s+:\\s+\\d+\\s+:\\s+\\d+");

I would just use String#matches on each line, with the following pattern:
[a-z]+ : [a-z]+ : [0-9]+ : [0-9]+
For example:
String line = "test : test : 0 : 0";
if (line.matches("[a-z]+ : [a-z]+ : [0-9]+ : [0-9]+")) {
    System.out.println("Found a match");
}

Java regex pattern group capture

I'm trying to split the string below into 3 groups, but it doesn't seem to be working as expected with the pattern that I'm using. Namely, when I invoke matcher.group(3), I'm getting a null value instead of *;+g.3gpp.cs-voice;require. What's wrong with the pattern?
String: "*;+g.oma.sip-im;explicit,*;+g.3gpp.cs-voice;require"
Pattern: (\\*;.*)?(\\*;.*?\\+g.oma.sip-im.*?)(,\\*;.*)?
Expected:
Group 1: null,
Group 2: *;+g.oma.sip-im;explicit,
Group 3: ,*;+g.3gpp.cs-voice;require
Actual:
Group 1: null,
Group 2: *;+g.oma.sip-im,
Group 3: null

The result you get does actually match your pattern in a non-greedy way. Group 2 is expanded to the shortest possible result
*;+g.oma.sip-im
and then the last group is left out because of the question mark at the very end. It appears to me that you are building a far too complicated regex for your purpose.

The thing is that (,\*;.*)? does not match because the text you expect it to capture is located further along in the string. You need to make the third group obligatory by removing the ? at the end, but wrap the whole .*? + Group 3 within an optional non-capturing group:
String pat = "(\\*;.*)?(\\*;.*?\\+g\\.oma\\.sip-im)(?:.*?(,\\*;.*))?";
Note that literal dots should be escaped in the regex pattern.
Details:
(\\*;.*)? - Group 1 (optional) capturing
\\*; - a *; string
.* - any zero or more chars other than linebreak symbols, as many as possible
(\\*;.*?\\+g\\.oma\\.sip-im) - Group 2 (obligatory) capturing
\\*; - a *; string
.*? - any zero or more chars other than linebreak symbols, as few as possible
\\+g\\.oma\\.sip-im - a literal string +g.oma.sip-im
(?:.*?(,\\*;.*))? - non-capturing group (optional) matching
.*? - any zero or more chars other than linebreak symbols, as few as possible
(,\\*;.*) - Group 3 (obligatory) capturing a comma followed by the same pattern as in Group 1.
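To see what the three groups hold with this pattern, here is a short sketch run against the string from the question. The class name OptionalGroupDemo and the helper groups are illustrative only. Note that Group 2 stops right after +g.oma.sip-im, so ;explicit is consumed by the non-capturing part and is not captured.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    static final Pattern PAT = Pattern.compile(
            "(\\*;.*)?(\\*;.*?\\+g\\.oma\\.sip-im)(?:.*?(,\\*;.*))?");

    // Returns groups 1..3 of the first match, or null if nothing matches.
    static String[] groups(String input) {
        Matcher m = PAT.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] g = groups("*;+g.oma.sip-im;explicit,*;+g.3gpp.cs-voice;require");
        System.out.println("Group 1: " + g[0]); // Group 1: null
        System.out.println("Group 2: " + g[1]); // Group 2: *;+g.oma.sip-im
        System.out.println("Group 3: " + g[2]); // Group 3: ,*;+g.3gpp.cs-voice;require
    }
}
```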

How to regex a string representing a city or its postal code with accent?

I'm trying to write Java code that shows a list of cities based on the name of the city or its postal code.
I wrote many expressions but they didn't work 100%.
This is my last expression:
([A-Z_]+)(:)([0-9]+)
The expression should match a city name (for example, Lonéy' ed) or its postal code (57000).
Does anyone have an idea how to improve my expression?
Thanx.

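Since Java 7 you can do the following:
Pattern.compile("([\\p{Alpha} '_-]+):(\\d{5})", Pattern.UNICODE_CHARACTER_CLASS)
Keep adding connecting characters (here [ '_-]) to cater for all your needs. Keep the hyphen last in the class so that it is treated as a literal hyphen rather than as a range operator.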
The pattern doesn't make any assumptions about the case of the name of a place as in some non-Latin scripts there are no cases.
EDIT: added 5 digits postal code detection and a SPACE for name detection

I suggest using
"(?U)(\\p{Lu}[\\p{L}\\p{M}\\s'-]*):(\\d{5})\\b"
It means:
(?U) - a Pattern.UNICODE_CHARACTER_CLASS inline flag that makes the \b word boundary and the \d digit character class Unicode-aware in the pattern
(\\p{Lu}[\\p{L}\\p{M}\\s'-]*) - Group 1 capturing:
\\p{Lu} - an uppercase Unicode letter
[\\p{L}\\p{M}\\s'-]* - 0 or more characters that are either Unicode letters (\\p{L}), diacritics (\\p{M}), whitespace (\\s), ' or - (NOTE that the hyphen must be at the end of the character class so that it could be treated as a literal hyphen)
: - a literal colon
(\\d{5}) - (Group 2) five digits
\\b - a word boundary so that we only match 5 digits not followed by a word char (and not the first 5 digits of a longer digit run); it can be replaced with "(?!\\d)"
Java demo:
String s = "Lonéy' ed:57000";
Pattern pattern = Pattern.compile("(?U)(\\p{Lu}[\\p{L}\\p{M}\\s'-]*):(\\d{5})\\b");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
}
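The note about hyphen placement can be checked directly. In the sketch below (class and method names are made up for illustration), a class written with the hyphen in the middle, [ '-_], is read as a range from ' to _, which covers digits, uppercase letters and the colon. With the hyphen last, [ '_-], only the listed characters are accepted.

```java
class HyphenInClassDemo {
    // True if the single-character string ch is accepted by the given character class.
    static boolean accepts(String charClass, String ch) {
        return ch.matches(charClass);
    }

    public static void main(String[] args) {
        String hyphenInMiddle = "[ '-_]"; // '-_ is a range: U+0027 through U+005F
        String hyphenLast = "[ '_-]";     // space, apostrophe, underscore, hyphen

        System.out.println(accepts(hyphenInMiddle, "5")); // true: digits fall inside the range
        System.out.println(accepts(hyphenInMiddle, ":")); // true: so does the colon
        System.out.println(accepts(hyphenLast, "5"));     // false
        System.out.println(accepts(hyphenLast, "-"));     // true: literal hyphen
    }
}
```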

Regular Expressions \w character class and equals sign

I am creating a regular expression to match the string
#servername:port:databasename
and through https://regex101.com/ I came up with
\#(((\w+.*-*)+)?\w+)(:\d+)(:\w+)
which matches
e.g. #CORA-PC:1111:databasename or #111.111.1.111:111:databasename
However when I use this regular expression to pattern match in my java code the String #CORA-PC:1111:database=name is also matched.
Why is \w matching the = sign? I also tried [0-9a-zA-Z], but it matched the = sign as well.
Can anyone help me with this?
Thanks!

The .* is a greedy dot-matching subpattern that matches the whole line and then backtracks to accommodate the subsequent subpatterns. That is why the pattern can match a = symbol.
Your pattern is rather fragile, as the first part contains nested quantifiers with optional subpatterns, which slow down regex execution and cause other issues. You need to make it more linear.
#(\w+(?:[-.]\w+)*)?(:\d+)(:\w+)
The regex will match
# - # symbol
(\w+(?:[-.]\w+)*)? - an optional group matching
\w+ - 1+ word chars
(?:[-.]\w+)* - 0+ sequences of a - or . ([-.]) followed with 1+ word chars
(:\d+) - a : symbol followed with 1+ digits
(:\w+) - a : symbol followed with 1+ word chars
If you need to avoid partial matching, use String#matches().
NOTE: In Java, backslashes must be doubled.
Code example (Java):
String s = "#CORA-PC:1111:databasename";
String rx = "#(?:\\w+(?:[-.]\\w+)*)?:\\d+:\\w+";
System.out.println(s.matches(rx));
Code example (JS):
var str = '#CORA-PC:1111:databasename';
alert(/^#(?:\w+(?:[-.]\w+)*)?:\d+:\w+$/.test(str));
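The difference between a whole-string match and a partial match can be sketched in Java as follows, using the input with the = sign from the question. The class and method names are illustrative assumptions; the regex is the one from the Java example above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FullVsPartialMatchDemo {
    static final String RX = "#(?:\\w+(?:[-.]\\w+)*)?:\\d+:\\w+";

    // Whole-string check: every character must be consumed by the pattern.
    static boolean fullMatch(String s) {
        return s.matches(RX);
    }

    // Partial check: returns the first matching substring, or null if there is none.
    static String firstPartialMatch(String s) {
        Matcher m = Pattern.compile(RX).matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String withEquals = "#CORA-PC:1111:database=name";
        System.out.println(fullMatch("#CORA-PC:1111:databasename"));      // true
        System.out.println(fullMatch("#111.111.1.111:111:databasename")); // true
        System.out.println(fullMatch(withEquals));                        // false
        System.out.println(firstPartialMatch(withEquals)); // #CORA-PC:1111:database
    }
}
```

With Matcher#find the input containing = still yields a match, but only up to the character before the = sign, which is why a whole-string check is the safer choice for validation.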
