REGEX greediness or just wrong syntax

I am trying to delete every [.!?] character that appears inside quotes in a text. To do that, I first want to match all my quotes that contain [.!?] with a regex, and then delete those characters.
My regex doesn't work, maybe because it's greedy. It matches from my first "«" (the character at index 569) to the last character, which is another "»" (the character at index 2730).
My regex was:
Pattern full = Pattern.compile("«.*[.!?].*?»");
Matcher mFull = full.matcher(result);
while (mFull.find()) {
    System.out.println(mFull.start() + " " + mFull.end());
}
So I got:
569 2731
I also have the same greediness problem when matching sentences (beginning with any [A-Z] and ending with any [.!?]).

You may use
s = s.replaceAll("(\\G(?!^)|«)([^«».!?]*)[.!?](?=[^«»]*»)", "$1$2");
Details
(\G(?!^)|«) - Group 1 (whose value is referred to with $1 from the replacement pattern): either the end of the previous match or «
([^«».!?]*) - Group 2 ($2): any 0+ chars other than «, », !, . and ?
[.!?] - any of the three symbols
(?=[^«»]*») - there must be a » after 0 or more chars other than « and » immediately to the right of the current location.
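A minimal, self-contained sketch of the replacement above. Only the regex comes from the answer; the class name QuotePunctuation, the method name stripInsideQuotes and the sample text are assumptions made for illustration. The guillemets are written as \u00AB and \u00BB so the source stays ASCII-only.

```java
import java.util.regex.Pattern;

class QuotePunctuation {
    // Regex from the answer, with \u00AB (left guillemet) and \u00BB (right guillemet).
    // Each match removes one of . ! ? and only while a closing guillemet is still
    // ahead without another guillemet in between (the lookahead at the end).
    private static final Pattern IN_QUOTE_PUNCT = Pattern.compile(
            "(\\G(?!^)|\u00AB)([^\u00AB\u00BB.!?]*)[.!?](?=[^\u00AB\u00BB]*\u00BB)");

    // Hypothetical helper: returns the text with . ! ? removed inside quotes only.
    static String stripInsideQuotes(String text) {
        // $1 keeps the opening guillemet (or nothing when continuing a previous
        // match), $2 keeps the text before the removed punctuation mark.
        return IN_QUOTE_PUNCT.matcher(text).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String sample = "A. \u00ABHi! How? Ok.\u00BB B.";
        System.out.println(stripInsideQuotes(sample));
    }
}
```

For the sample text, the punctuation inside the quote is dropped (giving «Hi How Ok») while the periods after A and B, which are outside the quote, are left alone.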

Related

Regex to validate custom format

I have this format: xx:xx:xx or xx:xx:xx-y, where x can be 0-9 a-f A-F and y can be only 0 or 1.
I came up with this regex: ([0-9A-Fa-f]{2}[:][0-9A-Fa-f]{2}[:][0-9A-Fa-f]{2}|[-][0-1]{1})
(See regexr).
But this matches 0a:0b:0c-3 too, which is not expected.
Is there any way to remove these cases from result?

[:] means a character from the list that contains only :. It is the same as :. The same goes for [-], which has the same result as -.
Also, {1} means "the previous piece exactly one time". It does not have any effect, you can remove it altogether.
To match xx:xx:xx or xx:xx:xx-y, the part that matches -y must be optional. The quantifier ? placed after that part marks it as optional.
All in all, your regex should be like this:
[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}(-[01])?
If the regex engine you use can be told to ignore the character case then you can get rid of A-F (or a-f) from all character classes and the regex becomes:
[0-9a-f]{2}:[0-9a-f]{2}:[0-9a-f]{2}(-[01])?
How it works, piece by piece:
[0-9a-f] # any digit or letter from (and including) 'a' to 'f'
{2} # the previous piece exactly 2 times
: # the character ':'
[0-9a-f]
{2}
:
[0-9a-f]
{2}
( # start a group; it does not match anything
- # the character '-'
[01] # any character from the class (i.e. '0' or '1')
) # end of group; the group is needed for the next quantifier
? # the previous piece (i.e. the group) is optional
# it can appear zero or one times
See it in action: https://regexr.com/4rfvr
Update
As @the-fourth-bird mentions in a comment, if the regex must match the entire string then you need to anchor its ends:
^[0-9a-f]{2}:[0-9a-f]{2}:[0-9a-f]{2}(-[01])?$
^ as the first character of a regex matches the beginning of the string, $ as the last character matches the end of the string. This way the regex matches the entire string only (when there aren't other characters before or after the xx:xx:xx or xx:xx:xx-y part).
If you use the regex to find xx:xx:xx or xx:xx:xx-y in a larger string then you don't need to add ^ and $. Of course, you can add only ^ or $ to let the regex match only at the beginning or at the end of the string.
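A short Java sketch of the anchored, case-insensitive variant. The regex is the one from this answer; the class name HexTripletValidator and the method isValid are hypothetical names chosen for the example.

```java
import java.util.regex.Pattern;

class HexTripletValidator {
    // Anchored regex from the answer. CASE_INSENSITIVE lets [0-9a-f] accept A-F too.
    private static final Pattern FORMAT = Pattern.compile(
            "^[0-9a-f]{2}:[0-9a-f]{2}:[0-9a-f]{2}(-[01])?$",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true only when the whole string has the expected format.
    static boolean isValid(String s) {
        // matches() already requires the entire input to match, so the anchors
        // are redundant here; they are kept to mirror the regex in the text.
        return FORMAT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"0a:0b:0c", "0A:0B:0C-1", "0a:0b:0c-3", "0a:0b:0cd"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

The first two samples are accepted and the last two are rejected, since -3 is not an allowed suffix and 0cd has one hex digit too many.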

You want xx:xx:xx, or, if it is followed by a -, then that - must be followed by a 0 or 1, and the match must end there (word boundary).
So you don't want any of these
0a:0b:0c-123
0a:0b:0cd
10a:0b:0c
either.
Then you want a "negative lookahead": if you match the first part, you don't want it to be followed by a - (the first pattern) and it should end there (word boundary); and if it is followed by a -, then the next character must be a 0 or 1, followed by a word boundary:
/\b([0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}(?!-)\b|\b[0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}-[01]\b)/i
To prevent any digit in front, a word boundary is added to the front as well.
Example: https://regexr.com/4rg42
The following almost worked:
/\b([0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}\b[^-]|\b[0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}-[01]\b)/i
but if it is the end of file and it is 3a:2b:11, then the [^-] will try to match a non - character and it won't match.
Example: https://regexr.com/4rg4q
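To see how the word-boundary version behaves when searching inside a larger text, here is a small Java sketch. The regex is the lookahead variant from this answer (the /i flag becomes Pattern.CASE_INSENSITIVE); the class name HexTripletFinder, the method findAll and the sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexTripletFinder {
    // Lookahead-based regex from the answer, split over two lines for readability.
    private static final Pattern TOKEN = Pattern.compile(
            "\\b([0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}(?!-)\\b"
            + "|\\b[0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}-[01]\\b)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collects every well-formed token found in the text.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "ok 0a:0b:0c, ok 0A:0B:0C-1, bad 0a:0b:0c-3 0a:0b:0c-123 "
                + "0a:0b:0cd 10a:0b:0c, last 3a:2b:11";
        System.out.println(findAll(text));
    }
}
```

On this sample only 0a:0b:0c, 0A:0B:0C-1 and the trailing 3a:2b:11 are collected. The last one is the end-of-input case that the [^-] variant misses.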

How to use regex groups in Java

I need to replace string 'name' with fullName in the following kind of strings:
software : (publisher:abc and name:oracle)
This needs to be replaced as:
software : (publisher:abc and fullName:xyz)
Basically, the "name:xyz" part can appear anywhere inside the parentheses, e.g.
software:(name:xyz)
I am trying to use groups, and the regex I built looks like this:
(\bsoftware\s*?:\s*?\()((.*?)(\s*?(and|or)\s*?))(\bname:.*?\)\s|:.*?\)$)

You may use
\b(software\s*:\s*\([^()]*)\bname:\w+
and replace with $1fullName:xyz.
Details
\b - word boundary
(software\s*:\s*\([^()]*) - Capturing group 1 ($1 in the replacement pattern is a placeholder for the value captured in this group):
software - a word
\s*:\s* - a : enclosed with 0+ whitespaces
\( - a ( char
[^()]* - 0 or more chars other than ( and )
\bname - whole word name
: - colon
\w+ - 1 or more letters, digits or underscores.
Java sample code:
String result = s.replaceAll("\\b(software\\s*:\\s*\\([^()]*)\\bname:\\w+", "$1fullName:xyz");
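The one-liner above can be wrapped in a small runnable sketch to check it against the inputs from the question. The class name SoftwareNameRewriter and the method rewrite are hypothetical; the regex and the replacement string are the ones from this answer.

```java
class SoftwareNameRewriter {
    // Hypothetical helper around the replaceAll call from the answer.
    static String rewrite(String s) {
        // Group 1 keeps everything from "software" up to the name: key,
        // so only the name:value pair itself is replaced.
        return s.replaceAll(
                "\\b(software\\s*:\\s*\\([^()]*)\\bname:\\w+",
                "$1fullName:xyz");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("software : (publisher:abc and name:oracle)"));
        System.out.println(rewrite("software:(name:xyz)"));
        // Not a "software" entry, so it is left untouched.
        System.out.println(rewrite("hardware:(name:xyz)"));
    }
}
```

Because [^()]* cannot cross a parenthesis, a name: key that appears after the closing ) of the software group is not touched.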

Java regex. Match any "value" that is not preceded by given string

I need some help with a Java regexp.
I'm working with a file in a JSON-like format:
['zul.wgt.Label','f6DQof',{sclass:'class',style:'font-weight: bold;',prolog:' ',value:'xxxx'},{},[]],
['zul.wgt.Label','f6DQpf',{sclass:'class notranslate',style:'font-weight: bold;',prolog:' ',value:'xxxx'},
['zul.wgt.Label','f6DQof',{sclass:'class',style:'font-weight: bold;',prolog:' ',label:'xxxx'},{},[]]
['zul.wgt.Label','f6DQof',{sclass:'class',style:'font-weight: bold;',prolog:' ',label:'xxxx'},{},[]]
I need to match any label or value data that is not preceded by a "notranslate" value on the sclass property.
I've been working on a regex that almost works, but I need a final push to make it match only what I described above:
((?!.*?notranslate)sclass:'[\w\s]+'.*?)((value|label):'(.*?)')
Right now it matches anything from sclass that is not followed by 'notranslate'.
Thanks for your help

The values of your current regex are in the 4th capturing group
You could also use 1 capturing group instead of 4:
^(?!.*\bsclass:'[^']*\bnotranslate\b[^']*').*\b(?:label|value):'([^']+)'
That would match:
^ Assert start of the string
(?! Negative lookahead, assert that what is on the right is not
.*\bsclass: Match any character 0+ times followed by sclass:
'[^']*\bnotranslate\b[^']*' Match notranslate between single quotes and word boundaries
) Close the negative lookahead
.* match any character 0+ times
\b(?:label|value): Match either label or value followed by :
'([^']+)' Match ', capture in a group matching not ' 1+ times and match '
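A minimal Java sketch that applies this pattern line by line. The regex is the one from this answer; the class name TranslatableValues, the method extract and the shortened sample lines (with the made-up values alpha, beta and gamma) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TranslatableValues {
    // Regex from the answer: reject lines whose sclass contains notranslate,
    // otherwise capture the text of label:'...' or value:'...' in group 1.
    private static final Pattern TRANSLATABLE = Pattern.compile(
            "^(?!.*\\bsclass:'[^']*\\bnotranslate\\b[^']*').*\\b(?:label|value):'([^']+)'");

    // Hypothetical helper: returns the captured value of every line that qualifies.
    static List<String> extract(List<String> lines) {
        List<String> values = new ArrayList<>();
        for (String line : lines) {
            Matcher m = TRANSLATABLE.matcher(line);
            if (m.find()) {
                values.add(m.group(1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "['zul.wgt.Label','a1',{sclass:'class',style:'font-weight: bold;',value:'alpha'},{},[]],",
                "['zul.wgt.Label','a2',{sclass:'class notranslate',style:'font-weight: bold;',value:'beta'},",
                "['zul.wgt.Label','a3',{sclass:'class',style:'font-weight: bold;',label:'gamma'},{},[]]");
        System.out.println(extract(lines));
    }
}
```

Matching one line at a time keeps ^ tied to the start of each record. To run the pattern over the whole file as a single string, Pattern.MULTILINE would be needed so that ^ also matches after line breaks.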

Java regex shortest match

I have the following string, (a.1) (b.2) (c.3) (d.4). I want to change it to (1) (2) (3) (4). I use the following method.
str.replaceAll("\(.*[.](.*)\)","($1)"). And I only get (4). What is the correct method?
Thanks

Couple things here. First, your escapes for the parentheses are incorrect. In Java string literals, backslash itself is an escape character, meaning you need to use \\( to represent \( in regex.
I think your question is how to do non-greedy matches in regex. Use ? to specify non-greedy matching; e.g. *? means "zero or more times, but as few times as possible".
This doesn't negate other answers, but they depend on your test input being as simple as it is in your question. This gives me the correct output without changing the spirit of your original regex (that only the parentheses and dot delimiter are known to be present):
String test = "(a.1) (b.2) (c.3) (d.4)";
String replaced = test.replaceAll("\\(.*?[.](.*?)\\)", "($1)");
System.out.println(replaced); // "(1) (2) (3) (4)"

Root cause
You want to match ()-delimited substrings, but are using .* greedy dot pattern that can match any 0 or more chars (other than line break chars). The \(.*[.](.*)\) pattern will match the first ( in (a.1) (b.2) (c.3) (d.4), then .* will grab the whole string, and backtracking will start trying to accommodate text for the subsequent obligatory subpatterns. [.] will find the last . in the string, the one before the last digit, 4. Then, (.*) will again grab all the rest of the string, but since the ) is required right after, due to backtracking the last (.*) will only capture 4.
Why is lazy / reluctant .*? not a solution?
Even if you use \(.*?[.](.*?)\), if there are (xxx) like substrings inside the string, they will get matched together with expected matches, as . matches any char but line break chars.
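The difference can be checked with a small sketch that runs the lazy pattern and the negated-character-class pattern from this answer on an input containing a (xxxx) group. The class name LazyVsNegatedClass and its method names are hypothetical.

```java
class LazyVsNegatedClass {
    // Lazy dot version: . can still run across ")" and "(" to reach a later dot.
    static String lazy(String s) {
        return s.replaceAll("\\(.*?[.](.*?)\\)", "($1)");
    }

    // Negated character class version: a match can never leave its own parentheses.
    static String negated(String s) {
        return s.replaceAll("\\([^()]*\\.([^()]*)\\)", "($1)");
    }

    public static void main(String[] args) {
        String input = "(a.1) (xxxx) (b.2)";
        System.out.println(lazy(input));    // prints "(1) (2)": the (xxxx) group is lost
        System.out.println(negated(input)); // prints "(1) (xxxx) (2)"
    }
}
```

With the lazy pattern, the match that starts at (xxxx) keeps expanding until it reaches the dot in b.2, so the whole stretch "(xxxx) (b.2)" is replaced by "(2)".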
Solution
.replaceAll("\\([^()]*\\.([^()]*)\\)", "($1)")
The [^()] matches any char except ( and ).
Details
\( - a ( char
[^()]* - a negated character class matching 0 or more chars other than ( and )
\. - a dot
([^()]*) - Group 1 (its value is later referred to with $1 from the replacement pattern): any 0+ chars other than ( and )
\) - a ) char.
Java demo:
List<String> strs = Arrays.asList("(a.1) (b.2) (c.3) (d.4)", "(a.1) (xxxx) (b.2) (c.3) (d.4)");
for (String str : strs) {
    System.out.println("\"" + str.replaceAll("\\([^()]*\\.([^()]*)\\)", "($1)") + "\"");
}
Output:
"(1) (2) (3) (4)"
"(1) (xxxx) (2) (3) (4)"

Try this one: it matches any letter, . and " and replaces them all with the empty string ""
str.replaceAll("[a-zA-Z\\.\"]", "")
Edit:
You can also use [^\\d)(\\s] to match all characters that are not a digit, whitespace, ) or ( and replace them all with the empty string ""
String str = "(a.1) (b.2) (c.3) (d.4)";
System.out.println(str.replaceAll("[^\\d)(\\s]",""));

Try this
str.replaceAll("[A-Za-z0-9]+\\.","");
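[A-Za-z0-9] matches upper-case letters, lower-case letters and digits. Note that the dot must be escaped as \\. inside a Java string literal. If you need to match other characters before the dot, widen the character class (for example [^().]+) instead of using .+ or .*, which are greedy and would also swallow the parentheses and the earlier groups.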

Match first occurrence of semicolon in string, only if not preceded by '--'

I'm trying to write a regular expression for Java that matches if there is a semicolon that does not have two (or more) leading '-' characters.
I'm only able to get the opposite working: A semicolon that has at least two leading '-' characters.
([\-]{2,}.*?;.*)
But I need something like
([^([\-]{2,})])*?;.*
I'm somehow not able to express 'not at least two - characters'.
Here are some examples I need to evaluate with the expression:
; -- a : should match
-- a ; : should not match
-- ; : should not match
--; : should not match
-;- : should match
---; : should not match
-- semicolon ; : should not match
bla ; bla : should match
bla : should not match (; is mandatory)
-;--; : should match (the first occurring semicolon must not have two or more consecutive leading '-')

It seems that this regex matches what you want
String regex = "[^-]*(-[^-]+)*-?;.*";
Explanation: matches will accept a string that:
[^-]* can start with non-dash characters
(-[^-]+)*-?; is a bit tricky, because before we match ; we need to make sure that no - has another - right after it, so:
(-[^-]+)* each - has at least one non-dash character after it
-? or a single - is placed right before ;
;.* if the earlier conditions were fulfilled, we can accept ; and any characters .* after it.
A more readable version, but probably a little slower:
((?!--)[^;])*;.*
Explanation:
To make sure that there is ; in string we can use .*;.* in matches.
But we need to add some conditions to characters before first ;.
So to make sure that matched ; will be first one we can write such regex as
[^;]*;.*
which means:
[^;]* zero or more non semicolon characters
; first semicolon
.* zero or more of any characters (actually . can't match line separators like \n or \r)
So now all we need to do is make sure that character matched by [^;] is not part of --. To do so we can use look-around mechanisms for instance:
(?!--)[^;] - before matching [^;], (?!--) checks that the next two characters are not --; in other words, the character matched by [^;] can't be the first - in a -- pair
[^;](?<!--) - after matching [^;], (?<!--) checks that the two characters just consumed are not --; in other words, [^;] can't be the last character of a -- pair.
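Both regexes from this answer can be run against the examples listed in the question with a small harness like the following. The class name FirstSemicolonCheck, the constant names and the accepts helper are hypothetical; the inputs and expected results are taken from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FirstSemicolonCheck {
    // The two regexes from the answer.
    static final String DASH_AWARE = "[^-]*(-[^-]+)*-?;.*";
    static final String LOOKAHEAD = "((?!--)[^;])*;.*";

    // Hypothetical helper: String.matches requires the whole input to match.
    static boolean accepts(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Inputs and expected results as listed in the question.
        Map<String, Boolean> cases = new LinkedHashMap<>();
        cases.put("; -- a", true);
        cases.put("-- a ;", false);
        cases.put("-- ;", false);
        cases.put("--;", false);
        cases.put("-;-", true);
        cases.put("---;", false);
        cases.put("-- semicolon ;", false);
        cases.put("bla ; bla", true);
        cases.put("bla", false);
        cases.put("-;--;", true);

        for (Map.Entry<String, Boolean> e : cases.entrySet()) {
            boolean a = accepts(DASH_AWARE, e.getKey());
            boolean b = accepts(LOOKAHEAD, e.getKey());
            System.out.println("'" + e.getKey() + "' expected=" + e.getValue()
                    + " dashAware=" + a + " lookahead=" + b);
        }
    }
}
```

Both patterns agree with the expected column for all ten inputs.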

How about just splitting the string along -- and if there are two or more sub strings, checking if the last one contains a semicolon?

How about using this regex in Java:
[^;]*;(?<!--[^;]{0,999};).*
Only caveat is that it works with up to 999 character length between -- and ;

I think this is what you're looking for:
^(?:(?!--).)*;.*$
In other words, match from the start of the string (^), zero or more characters (.*) followed by a semicolon. But replacing the dot with (?:(?!--).) causes it to match any character unless it's the beginning of a two-hyphen sequence (--).
If performance is an issue, you can exclude the semicolon as well, so it never has to backtrack:
^(?:(?!--|;).)*;.*$
EDIT: I just noticed your comment that the regex should work with the matches() method, so I padded it out with .*. The anchors aren't really necessary, but they do no harm.
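The remark about matches() can be illustrated with a short sketch that compares the padded pattern with an unpadded variant. The padded regex is the second one from this answer; the unpadded variant, the class name MatchesVersusFind and the method names are assumptions made for the comparison.

```java
import java.util.regex.Pattern;

class MatchesVersusFind {
    // Regex from the answer, padded with .* so it can cover the whole input.
    private static final Pattern PADDED =
            Pattern.compile("^(?:(?!--|;).)*;.*$");
    // Hypothetical unpadded variant that stops right after the first semicolon.
    private static final Pattern UNPADDED =
            Pattern.compile("^(?:(?!--|;).)*;");

    static boolean paddedMatches(String s) {
        return PADDED.matcher(s).matches();
    }

    static boolean unpaddedMatches(String s) {
        // matches() needs the pattern to consume the entire input.
        return UNPADDED.matcher(s).matches();
    }

    static boolean unpaddedFind(String s) {
        // find() is satisfied by a match on part of the input.
        return UNPADDED.matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "bla ; bla";
        System.out.println(paddedMatches(s));   // true
        System.out.println(unpaddedMatches(s)); // false: " bla" is left over
        System.out.println(unpaddedFind(s));    // true
    }
}
```

So the trailing .* matters only when the caller uses matches(); with find() the shorter pattern gives the same yes/no answer.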

You need a negative lookahead!
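This regex rejects any string that starts with your original match pattern:
(?!-{2,}.*?;.*).*?;.*
It matches a string that contains a semicolon, unless the string begins with two or more dashes followed later by a semicolon. Note that the lookahead is only checked at the position where the match starts, so a string such as bla -- ; is still accepted.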
