Why is this regex not matching URLs? - java

I have the following regex:
^(?=\w+)(-\w+)(?!\.)
Which I'm attempting to match against the following text:
www-test1.examples.com
The regex should match only the -test1 part of the string, and only if it comes before the first . and after the start of the expression. www can be any string, but it should not be matched.
My pattern is not matching the -test1 part. What am I missing?

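Java is one of the few languages that support non-fixed-length look-behinds, as long as the maximum length is bounded. An unbounded quantifier such as + inside a look-behind is rejected by some Java versions (Java 8 reports that the look-behind group does not have an obvious maximum length), so whether the following compiles depends on your Java version: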
(?<=^\w+)(-\w+)
This will match -test1 without capturing the preceding text. However, non-fixed-length look-behinds are generally not advisable: they have limitations, they are not very efficient, and they are not portable across other languages. Having said that, this is a simple pattern, so if you don't care about portability, go for it.
The better solution though is to group what you want to capture, and reference the captured group (in this case, group 1):
^\w+(-\w+)
p.s. - \w will not match a dot, so no need to look ahead for it.
p.p.s. - to answer your question about why your original pattern ^(?=\w+)(-\w+)(?!\.) doesn't match. There are 2 reasons:
1) you start out with a start of string assertion, and then use a lookahead to see if what follows is one or more word chars. But lookaheads are zero-width assertions, meaning no characters are actually consumed in the match, so the pointer doesn't move forward to the next chars after the match. So it sees that "www" matches it, and moves on to the next part of the pattern, but the actual pointer hasn't moved past the start of string. So, it next tries to match your (-\w+) part. Well your string doesn't start with "-" so the pattern fails.
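2) (?!\.) is a negative lookahead, but your example string has a dot as the very next thing after the "-test1" part. So even if #1 didn't fail it, this would stop the pattern from matching all of "-test1": the engine would backtrack and settle for "-test", which is followed by "1" rather than a dot.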
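To see both failure reasons in action, here is a minimal sketch that runs the patterns against the sample string. The class name LookaheadDemo and the helper firstGroup are made up for illustration; only java.util.regex is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "www-test1.examples.com";

        // Original pattern: the lookahead consumes nothing, so "-" is tried at index 0.
        System.out.println(firstGroup("^(?=\\w+)(-\\w+)(?!\\.)", input));

        // Consume the prefix with \w+ and read the wanted part from group 1.
        System.out.println(firstGroup("^\\w+(-\\w+)", input));

        // Keeping (?!\.) makes \w+ give back one character to avoid the dot.
        System.out.println(firstGroup("^\\w+(-\\w+)(?!\\.)", input));
    }
}
```

The first call finds nothing, the second yields -test1, and the third yields only -test, which is why the negative lookahead for a dot should simply be dropped.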

The problem you're having is the lookahead. In this case, it's inappropriate if you want to capture what's between the - and the first dot. The pattern you want is something like this:
(-\w+)(?=\.)
In this case, the contents of capture group 1 will contain the text you want.

Try this:
(?<=www)\-\w+(?=\.)
Demo: https://regex101.com/r/xEpno7/1

Related

Regex with at least one of two options but in sequence

I am trying to write a regex (for use in a Java Pattern) that will match strings that possibly have a letter that is possibly followed by a space then number, but must have at least one of them. For example, the following strings should be matched:
"a 5"
"b 9"
" 8"
However, it should not match an empty string ("").
Furthermore, I would like to make each of the components a named capture group.
The following works, but allows the empty string.
"(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?"
To ensure that there is at least one of them, you can use lookahead (?=\\p{Alpha}| \\p{Digit}) at the beginning:
"(?=\\p{Alpha}| \\p{Digit})(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?"
In general, to avoid empty strings you can use (?=.).
You can use a negative lookahead to avoid empty input and keep your regex as:
^(?!$)(?<let>\p{L})?(?:\h+(?<num>\p{N}))?$
(?!$) is negative lookahead to fail the match for empty strings.
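As a rough sketch of how the (?!$) variant behaves with named groups in Java, the following runs it over the sample inputs. The class name LetterNumberDemo and the parse helper are assumptions for illustration, not part of the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LetterNumberDemo {
    // Optional letter, optional "whitespace + digit", but never the empty string.
    private static final Pattern P =
        Pattern.compile("^(?!$)(?<let>\\p{L})?(?:\\h+(?<num>\\p{N}))?$");

    // Hypothetical helper: "let=<x>,num=<y>" for a match, null when the input is rejected.
    // A named group that did not take part in the match is reported as null.
    static String parse(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return "let=" + m.group("let") + ",num=" + m.group("num");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a 5", "b 9", " 8", ""}) {
            System.out.println("\"" + s + "\" -> " + parse(s));
        }
    }
}
```

Note that m.group("let") returns null for " 8", so callers need to handle a missing component explicitly.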
You can solve problem with:
([a-z]? \d)|([a-z] \d?)
You can see this code covering your test cases in the linked demo. This is basic regular expression knowledge, so it is worth learning more about regular expressions; there are plenty of good tutorials on the web.
You can use | for or, then simply repeat "any pattern" to match everything like this.
((?<let>[A-Za-z])|(?<num>\d)\s*)+
That lets you match any number of named patterns in any order.

Password Validation with Regex Java

I am trying to figure out a regex to match a password that contains
one upper case letter.
one number
one special character.
and at least 4 characters of length
the regex that I wrote is
^((?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9])){4,}
however it is not working, and I couldn't figure out why.
So please can someone tell me why this code is not working, where did I mess up, and how to correct this code.
Your regex can be rewritten as
^(
(?=.*[0-9])
(?=.*[A-Z])
(?=.*[^A-Za-z0-9])
){4,}
As you see {4,} applies to group which doesn't let you match any character since look-around is zero-width, which effectively means "4 or more of nothing".
You need to add . before {4,} to let your regex handle "and at least 4 characters of length" point (rest is handled by look-around).
You can remove that capturing group since you don't really need it.
So try with something like
^(?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9]).{4,}
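A small sketch comparing the two patterns with Matcher.matches(), which requires the whole password to be consumed. The class name PasswordDemo and the isValid helper are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

final class PasswordDemo {
    // Original: {4,} repeats a group that only contains zero-width lookaheads.
    static final Pattern BROKEN =
        Pattern.compile("^((?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9])){4,}");

    // Fixed: the lookaheads check the rules, then .{4,} consumes the characters.
    static final Pattern FIXED =
        Pattern.compile("^(?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9]).{4,}");

    // Hypothetical helper: true only if the pattern matches the entire password.
    static boolean isValid(Pattern p, String password) {
        return p.matcher(password).matches();
    }

    public static void main(String[] args) {
        for (String pw : new String[] {"Test123!", "test", "T1!", "Test1234"}) {
            System.out.println(pw + " broken=" + isValid(BROKEN, pw)
                + " fixed=" + isValid(FIXED, pw));
        }
    }
}
```

With matches(), the broken pattern rejects even Test123! because it never consumes a character, while the fixed one accepts it and still rejects the other three.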
You could come up with something like:
^(?=.*[A-Z])(?=.*\d)(?=.*[!"§$%&/()=?`]).{4,}$
In multiline mode, see a demo on regex101.com.
This approach specifies the special characters directly (which could be extended, obviously).
From the following list, only the entries marked (match) would satisfy these criteria:
test
Test123! (match)
StrongPassword34? (match)
weakone
Tabaluga"12??? (match)
You can still improve this expression by being more specific and applying the principle of contrast. As a reminder, the dot-star (.*) runs to the end of the line first and then backtracks. This will almost always take more steps than looking directly for the character you need with a negated character class.
Consider the following expression:
^ # bind the expression to the beginning of the string
(?=[^A-Z\n\r]*[A-Z]) # look ahead for sth. that is not A-Z, or newline and require one of A-Z
(?=[^\d\n\r]*\d) # same construct for digits
(?=\w*[^\w\n\r]) # same construct for special chars (\w = _A-Za-z0-9)
.{4,}
$
You'll see a significant reduction in steps, as the regex engine does not have to backtrack every time.

regex with lookbehind weird behavior

I have been trying to resolve this for the past 2 days...
Please help me in understanding why this is happening. My intention is to just select the <HDR> that has a <DTL1 val="92">.....</HDR>
This is my regular expression
(?<=<HDR>).*?<DTL1\sval="3".*?</HDR>
And the input string is:
<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR><HDR><DTL1 val="3"><DTL2 val="4"></HDR>
But this regular expression selects
abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR>
Can anyone please help me?
A regex engine will always give you the leftmost match in a string (even if you use a non-greedy quantifier). This is exactly what you obtain.
So, a solution is to forbid the presence of another <HDR> in the parts described by .*?, which is too permissive.
You have two techniques to do that. You can replace the .*? with:
(?>[^<]+|<(?!/HDR))*
or with:
(?:(?!</HDR).)*+
Most of the time, the first technique performs better, but if your string contains a high density of <, the second way can give good results too.
The use of a possessive quantifier or an atomic group can reduce the number of steps to obtain a result in particular when the subpattern fails.
Example:
With the first way:
(?<=<HDR>)(?>[^<]+|<(?!/HDR))*<DTL1\sval="3"(?>[^<]+|<(?!/HDR))*</HDR>
or this variant:
(?<=<HDR>)(?:[^<]+|<(?!/HDR|DTL1))*+<DTL1\sval="3"(?:[^<]+|<(?!/HDR))*+</HDR>
With the second way:
(?<=<HDR>)(?:(?!</HDR).)*<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
or this variant:
(?<=<HDR>)(?:(?!</HDR|<DTL1).)*+<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
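For comparison, here is a minimal sketch that runs the original lazy pattern and the first "second way" pattern against the sample input from the question. The class name HdrDemo and the helper firstMatch are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HdrDemo {
    static final String INPUT =
        "<HDR>abc<DTL1 val=\"1\"><DTL2 val=\"2\"></HDR>"
        + "<HDR><DTL1 val=\"92\"><DTL2 val=\"55\"></HDR>"
        + "<HDR><DTL1 val=\"3\"><DTL2 val=\"4\"></HDR>";

    // Hypothetical helper: first match of the regex in INPUT, or null if there is none.
    static String firstMatch(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Lazy dot: the match starts right after the first <HDR> and crosses block boundaries.
        System.out.println(firstMatch("(?<=<HDR>).*?<DTL1\\sval=\"3\".*?</HDR>"));

        // Tempered dot: no consumed character may start "</HDR", so the match stays in one block.
        System.out.println(firstMatch(
            "(?<=<HDR>)(?:(?!</HDR).)*<DTL1\\sval=\"3\"(?:(?!</HDR).)*+</HDR>"));
    }
}
```

The first quantifier in the tempered pattern is deliberately not possessive: it has to give characters back so that <DTL1\sval="3" can still match inside the block.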
Casimir et Hippolyte already gave you a couple of good solutions. I want to elaborate on a few things.
First, why your regex fails to do what you want: (?<=<HDR>).*? tells it to match any number of characters starting with the first character preceded by <HDR>, until it encounters what follows the non-greedy quantifier (<DTL1...). Well, the first character that's preceded by <HDR> is the first a, so it matches everything starting from there until the fixed string <DTL1\sval="3" is encountered.
Casimir et Hippolyte's solutions are for the generalized case, where the contents of the <HDR> tags can be anything other than nested <HDR>'s. You could also do it with a negative look-ahead:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
However, if the string is guaranteed to be in the structure shown, where the <HDR> tags only contain one or more <DTL1 val="##"> tags, so you know there won't be any closing tags within, you could do it more efficiently by replacing the first .*? with [^/]*:
(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>
A negated character class is more efficient than a zero-width assertion, and if you're using a negated character class, a greedy quantifier becomes more efficient than a lazy one.
Note also that by using a lookbehind to match the opening <HDR>, you're excluding it from the match, but you're including the closing </HDR>. Are you sure that's what you want? You're matching this...
<DTL1 val="3"><DTL2 val="4"></HDR>
...when presumably you want this...
<HDR><DTL1 val="3"><DTL2 val="4"></HDR>
...or this...
<DTL1 val="3"><DTL2 val="4">
So, in the first case, don't use a lookbehind for the opening tag:
<HDR>(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
<HDR>[^/]*<DTL1\sval="3".*?</HDR>
In the second case, use a look-ahead for the closing tag:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?(?=</HDR>)
(?<=<HDR>)[^/]*<DTL1\sval="3".*?(?=</HDR>)
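The difference between the two cases can be checked with a short sketch using the negated character class variants. HdrBoundsDemo and firstMatch are hypothetical names, and the input is the sample string from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HdrBoundsDemo {
    static final String INPUT =
        "<HDR>abc<DTL1 val=\"1\"><DTL2 val=\"2\"></HDR>"
        + "<HDR><DTL1 val=\"92\"><DTL2 val=\"55\"></HDR>"
        + "<HDR><DTL1 val=\"3\"><DTL2 val=\"4\"></HDR>";

    // Hypothetical helper: first match of the regex in INPUT, or null if there is none.
    static String firstMatch(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // First case: both tags are consumed, so both appear in the match.
        System.out.println(firstMatch("<HDR>[^/]*<DTL1\\sval=\"3\".*?</HDR>"));

        // Second case: both tags are only asserted, so neither appears in the match.
        System.out.println(firstMatch("(?<=<HDR>)[^/]*<DTL1\\sval=\"3\".*?(?=</HDR>)"));
    }
}
```

This relies on the stated assumption that no "/" appears inside a block before the DTL1 tag you are looking for.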

Regular expression non-greedy but still

I have some larger text which in essence looks like this:
abc12..manycharshere...hi - abc23...manyothercharshere...jk
Obviously there are two items, each starting with "abc", the numbers (12 and 23) are interesting as well as the "hi" and "jk" at the end.
I would like to create a regular expression which allows me to parse out the numbers, but only if the two characters at the end match, i.e. I am looking for the number related to "jk", but the following regular expression matches the whole string and thus returns "12", not "23" even when non-greedy matching the area with the following:
abc([0-9]+).*?jk
Is there a way to construct a regular expression which matches text like the one above, i.e. retrieving "23" for items ending in "jk"?
Basically I would need something like "match abc followed by a number, but only if there is "jk" at the end before another instance of "abc followed by a number" appears".
Note: the texts/matches are an abstraction here. The actual text is more complicated, especially the things that can appear as "manyothercharactershere"; I simplified it to show the underlying problem more clearly.
Use a regex like this. .*abc([0-9]+).*?jk
I think you want something like this,
abc([0-9]+)(?=(?:(?!jk|abc[0-9]).)*jk)
You need to use negative lookahead here to make it work:
abc(?!.*?abc)([0-9]+).*?jk
Here (?!.*?abc) is a negative lookahead that makes sure to match an abc that is NOT followed by another abc, so that the closest string between abc and jk is matched.
Being non-greedy does not change the rule, that the first match is returned. So abc([0-9]+).*?jk will find the first jk after “abcnumber” rather than the last one, but still match the first “abcnumber”.
One way to solve this is to tell that the dot should not match abc([0-9]+):
abc([0-9]+)((?!abc([0-9]+)).)*jk
If it is not important to have the entire pattern being an exact match you can do it simpler:
.*(abc([0-9]+).*?jk)
In this case, it’s group 1 which contains your intended match. The pattern uses a greedy matchall to ensure that the last possible “abcnumber” is matched within the group.
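The three behaviours described above can be compared with a small sketch. The class name NearestAbcDemo and the capture helper are made up for this example; the input is the simplified text from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NearestAbcDemo {
    static final String INPUT =
        "abc12..manycharshere...hi - abc23...manyothercharshere...jk";

    // Hypothetical helper: the given group of the first match in INPUT, or null.
    static String capture(String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Lazy quantifier: still anchored to the leftmost "abc<number>".
        System.out.println(capture("abc([0-9]+).*?jk", 1));

        // The dot is not allowed to step onto another "abc<number>".
        System.out.println(capture("abc([0-9]+)((?!abc([0-9]+)).)*jk", 1));

        // Greedy prefix: the number now sits in group 2, inside group 1.
        System.out.println(capture(".*(abc([0-9]+).*?jk)", 2));
    }
}
```

The first call returns 12, while the other two return 23. With the greedy prefix the whole match also includes everything before the item, so only the groups are useful.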
Assuming that hyphen separates "items", this regex will capture the numbers from the target item:
abc([0-9]+)[^-]*?jk

Java Regular Expression Pattern Matching

I want to create a regular expression, in Java, that will match the following:
*A*B
where A and B are ANY character except asterisk, and there can be any number of A characters and B characters. A(s) is/are preceded by asterisk, and B(s) is/are preceded by asterisk.
Will the following work? Seems to work when I run it, but I want to be absolutely sure.
Pattern.matches("\\A\\*([^\\*]{1,})\\*([^\\*]{1,})\\Z", someString)
It will work, however you can rewrite it as this (unquoted):
\A\*([^*]+)\*([^*]+)\Z
there is no need to quote the star in a character class;
{1,} and + are the same quantifier (once or more).
Note 1: you use .matches() which automatically anchors the regex at the beginning and end; you may therefore do without \A and \Z.
Note 2: I have retained the capturing groups -- do you actually need them?
Note 3: it is unclear whether you want the same character repeated between the stars; the example above assumes not. If you want the same, then use this:
\A\*(([^*])\2*)\*(([^*])\4*)\Z
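A quick sketch of both variants using Pattern.matches(), which anchors at both ends so \A and \Z are left out. StarPatternDemo, anyChars and sameChar are hypothetical names for illustration.

```java
import java.util.regex.Pattern;

final class StarPatternDemo {
    // Any non-asterisk characters after each asterisk.
    static boolean anyChars(String s) {
        return Pattern.matches("\\*([^*]+)\\*([^*]+)", s);
    }

    // Back-references \2 and \4 force each run to repeat one single character.
    static boolean sameChar(String s) {
        return Pattern.matches("\\*(([^*])\\2*)\\*(([^*])\\4*)", s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"*AAA*BBB", "*ABC*DEF", "AAA*BBB", "*A*B*C"}) {
            System.out.println(s + " any=" + anyChars(s) + " same=" + sameChar(s));
        }
    }
}
```

Both methods accept *AAA*BBB; only anyChars accepts *ABC*DEF, and neither accepts input with a missing or extra asterisk.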
If I understood correctly, it can be as simple as
^\\*((?!\\*).)+\\*((?!\\*).)+
If you want a match on *AAA*BBB but not on *ABC*DEF use
^\*([a-zA-Z])\1*\*([a-zA-Z])\2*$
This won't match on this either
*A_$-123*B<>+-321
