Simple Regex phantom matches? [duplicate] - java

I am trying to extract the error number from strings like "Wrong parameters - Error 1356":
Pattern p = Pattern.compile("(\\d*)");
Matcher m = p.matcher(myString);
m.find();
System.out.println(m.group(1));
This prints nothing, which seemed strange to me, since the Wikipedia description of * is "Matches the preceding element zero or more times".
I also tested the expression \d* on www.regexr.com and regex101.com and the result was the same: nothing.
Then I started testing some variations (all tests were made on the sites mentioned above):
(\d)* doesn't work
\d{0,} doesn't work
[\d]* doesn't work
[0-9]* doesn't work
\d{4} works
\d+ works
(\d+) works
[0-9]+ works
So I started searching the web for an explanation. The best I could find was here, in the Quantifier section, which states:
\d? Optional digit (one or none).
\d* Eat as many digits as possible (but none if necessary)
\d+ Eat as many digits as possible, but at least one.
\d*? Eat as few digits as necessary (possibly none) to return a match.
\d+? Eat as few digits as necessary (but at least one) to return a match.
The question
As English is not my primary language, I'm having trouble understanding the difference (mainly the "but none if necessary" part). Could the regex experts here explain this in simple words, please?
The closest question I found here on SO was this one: Regex: possessive quantifier for the star repetition operator, i.e. \d**, but it does not explain the difference.

The * quantifier matches zero or more occurrences.
In practice, this means that
\d*
will match every possible input, including the empty string. So your regex matches at the start of the input string and returns the empty string.
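To see this in Java, here is a minimal sketch using the string from the question. The class name StarVersusPlus and the helper firstMatch are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarVersusPlus {
    // Returns the first match of regex in input, or null if there is no match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "Wrong parameters - Error 1356";
        // \d* succeeds right away at index 0 with an empty match, because 'W' is not a digit.
        System.out.println("[" + firstMatch("\\d*", s) + "]"); // prints []
        // \d+ needs at least one digit, so the search moves on until it reaches 1356.
        System.out.println("[" + firstMatch("\\d+", s) + "]"); // prints [1356]
    }
}
```

The brackets are only there to make the empty match visible in the output.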

but none if necessary means that it will not break the regex pattern if there is no match. So \d* means it will match zero or more occurrences of digits.
For eg.
\d*[a-z]*
will match
abcdef
but \d+[a-z]*
will not match
abcdef
because \d+ implies that at least one digit is required.
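The same comparison can be checked in Java with a small sketch. The class OptionalDigits and the helper matchesWhole are hypothetical names; the check uses Matcher.matches(), which requires the whole input to match.

```java
import java.util.regex.Pattern;

class OptionalDigits {
    // True if the entire input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // \d* may match zero digits, so a letters-only string is accepted.
        System.out.println(matchesWhole("\\d*[a-z]*", "abcdef"));    // prints true
        // \d+ insists on at least one digit, so the same string is rejected.
        System.out.println(matchesWhole("\\d+[a-z]*", "abcdef"));    // prints false
        // With leading digits, both patterns accept the input.
        System.out.println(matchesWhole("\\d+[a-z]*", "123abcdef")); // prints true
    }
}
```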

\d* Eat as many digits as possible (but none if necessary)
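\d* means it matches a digit zero or more times. It is greedy, but the search starts at the beginning of your input, where the next character is not a digit, so it matches zero digits there and succeeds with an empty string. That is why nothing is printed.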
\d+
It matches a digit one or more times. So it should find and match a digit or a digit followed by more digits.

With the pattern \d+, at least one digit must be matched, and the match then extends over all subsequent digits until a non-digit character is reached.
\d* will match all the empty strings (zero digits) as well as the run of digits. The .NET regex engine will return all of these empty matches in its set of matches.

Simply:
\d* implies zero or more times
\d+ means one or more times

Related

Regular expression to determine if the String consists of more than 4 numbers

I want to extract URL strings from a log which looks like below:
<13>Mar 27 11:22:38 144.0.116.31 AgentDevice=WindowsDNS AgentLogFile=DNS.log PluginVersion=X.X.X.X Date=3/27/2019 Time=11:22:34 AM Thread ID=11BC Context=PACKET Message= Internal packet identifier=0000007A4843E100 UDP/TCP indicator=UDP Send/Receive indicator=Snd Remote IP=X.X.X.X Xid (hex)=9b01 Query/Response=R Opcode=Q Flags (hex)=8081 Flags (char codes)=DR ResponseCode=NOERROR Question Type=A Question Name=outlook.office365.com
I am looking to extract the Name text when it contains 5 or more digits.
One suggested way is (\d.*?){5,}, but it does not seem to work. Kindly suggest another way to get the field.
Example of string match:
outlook12.office345.com
outlook.office12345.com
You can look for the following expression:
Name=([^ ]*\d{5,}[^ ]*)
Explanation:
Name= look for the literal text "Name=", then capture:
[^ ]* any number of characters that are not a space
\d{5,} then 5 or more digits in a row
[^ ]* then any remaining characters up to the next space
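As a rough Java sketch of how this pattern could be applied to a log line (the class NameExtractor and the method extract are illustrative names, and the shortened input strings are assumptions, not real log data):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameExtractor {
    // Captures the value after "Name=" only if it contains 5 or more consecutive digits.
    private static final Pattern NAME = Pattern.compile("Name=([^ ]*\\d{5,}[^ ]*)");

    // Returns the captured name, or null when the line has no qualifying name.
    static String extract(String line) {
        Matcher m = NAME.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("Question Type=A Question Name=outlook.office12345.com"));
        System.out.println(extract("Question Type=A Question Name=outlook.office365.com"));
    }
}
```

Note that this version needs the five digits to be adjacent, so a name like outlook12.office345.com is not captured by it.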
This regular expression:
(?<=Name=).*\d{5,}.*?(?=\s|$)
would extract strings like outlook.office365666.com (with 5 or more consecutive digits) from your example input.
Demo: https://regex101.com/r/YQ5l2w/1
Try this pattern: (?=\b.*(?:\d[^\d\s]*){5,})\S*
Explanation:
(?=...) - positive lookahead, assures that pattern inside it is matched somewhere ahead :)
\b - word boundary
(?:...) - non-capturing group
\d[^\d\s]* - match digit \d, then match zero or more of any characters other than whitespace \s or digit \d
{5,} - match preceding pattern 5 or more times
\S* - match zero or more of any characters other than space to match the string if assertion is true, but I think you just need assertion :)
Demo
If you want only consecutive numbers use simplified pattern (?=\b.*\d{5,})\S*.
Another demo
Of course, you have to add a positive lookbehind, (?<=Name=), to assert that the string Name= precedes the match.
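Putting the lookbehind and the lookahead together in Java might look like the sketch below. One assumption to flag: the lookahead here uses \S* instead of the \b.* from the answer, so that it only counts digits inside the name itself and does not scan past whitespace. The class DigitCountExtractor and the method extract are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitCountExtractor {
    // (?<=Name=)               the match must start right after "Name="
    // (?=\S*(?:\d[^\d\s]*){5,}) the name must contain 5 or more digits, adjacent or not
    // \S+                      the name itself, up to the next whitespace
    private static final Pattern NAME =
            Pattern.compile("(?<=Name=)(?=\\S*(?:\\d[^\\d\\s]*){5,})\\S+");

    // Returns the name if it has at least 5 digits in total, otherwise null.
    static String extract(String line) {
        Matcher m = NAME.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("Question Name=outlook12.office345.com"));
        System.out.println(extract("Question Name=outlook.office365.com"));
    }
}
```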
Try this regex
([a-z0-9]{5,}.[a-z0-9]{5,})+.com
https://regex101.com/r/OzsChv/3
It Groups,
outlook.office365.com
outlook12.office345.com
also all url strings

Extract exactly n digits in a sentence using REGEX

Example
The no.s 1234 65
Input: n
For n=4, the output should be 1234
For n=2, the output should be : 65 (not 12)
I tried \d{n}, which gives 12 for n=2, and \d{n,}, which gives 1234, but I want only the run with exactly n digits.
Pattern p = Pattern.compile("\\d{" + n + ",}");
You need negative lookaround assertions: (?<!...) is a negative lookbehind and (?!...) is a negative lookahead (see regex101):
(?<!\d)\d{4}(?!\d)
However, not all regex engines support them. A workaround is to also match the preceding and following characters (unlike lookarounds, which are zero-width matches) and capture only the digits; \D matches anything except a digit:
(?:^|\D)(\d{4})(?:\D|$)
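In Java, which supports lookarounds, the pattern can be built from n at runtime. This is a minimal sketch; the class ExactDigits and the method find are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExactDigits {
    // Returns the first run of exactly n digits in text, or null if there is none.
    static String find(String text, int n) {
        // (?<!\d) no digit directly before, (?!\d) no digit directly after.
        Pattern p = Pattern.compile("(?<!\\d)\\d{" + n + "}(?!\\d)");
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "The no.s 1234 65";
        System.out.println(find(s, 4)); // prints 1234
        System.out.println(find(s, 2)); // prints 65
    }
}
```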
I think what you meant is the \b character.
Hence, the regex you're looking for would be (for n=2):
\b\d{2}\b
From what I understand, you're looking for a regex that will match a number with exactly n digits in a string, taking into account the spacing between the numbers. If that's the case, you're looking for something like this:
\b\d{4}\b
The \b ensures the match is constrained to the start/end of a 'word'. A word boundary is the position between a character matched by \w (which includes digits) and a character matched by its opposite, \W (which includes spaces).
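One consequence worth checking: because letters also count as \w, \b behaves differently from the digit lookarounds when the number touches a letter. A small Java sketch (the class BoundaryVsLookaround, the helper first, and the input "id1234" are assumptions for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryVsLookaround {
    // Returns the first match of regex in text, or null if there is none.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // With spaces around the numbers, \b works as intended.
        System.out.println(first("\\b\\d{2}\\b", "The no.s 1234 65"));    // prints 65
        // No word boundary exists between 'd' and '1', so \b finds nothing here.
        System.out.println(first("\\b\\d{4}\\b", "id1234"));              // prints null
        // The lookaround version only cares about neighbouring digits.
        System.out.println(first("(?<!\\d)\\d{4}(?!\\d)", "id1234"));     // prints 1234
    }
}
```

Which behaviour is wanted depends on whether a number glued to letters should count as a match.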
I don't code in java but I can try to answer this using regex in general.
If your number is in the format d1d2d3d4 d5d6 and you want to extract the digits d5d6, create 3 groups as ([0-9]+)(\s)([0-9]+), where each set of parentheses () represents one group. Then extract only the third group as your required output.

Java Regex to match String password

I have recently encountered this question in the text book:
I am supposed to write a method to check if a string has:
at least ten characters
only letters and digits
at least three digits
I am trying to solve it with a regex rather than iterating through every character; this is what I have so far:
String regx = "[a-z0-9]{10,}";
But this only matches the first two conditions. How should I go about the 3rd condition?
You could use a positive lookahead for 3rd condition, like this:
^(?=(?:.*\d){3,})[a-z0-9]{10,}$
^ indicates start of string.
(?= ... ) is the positive lookahead, which will search the whole string to match whatever is between (?= and ).
(?:.*\d){3,} matches at least 3 digits anywhere in the string.
.*\d matches a digit preceded by any number of characters, possibly none (if .* were omitted, only consecutive digits would match).
{3,} matches three or more of .*\d.
(?: ... ) is a non-capturing group.
$ indicates end of string.
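A minimal Java sketch of a method using this pattern follows. The class PasswordCheck and the method isValid are hypothetical names, and the character class is kept as [a-z0-9] exactly as in the question, so uppercase letters are rejected.

```java
import java.util.regex.Pattern;

class PasswordCheck {
    // Lookahead: at least three digits anywhere; body: ten or more lowercase letters or digits.
    private static final Pattern VALID =
            Pattern.compile("^(?=(?:.*\\d){3,})[a-z0-9]{10,}$");

    static boolean isValid(String password) {
        return VALID.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("a1b2c3defg")); // prints true: 10 chars, 3 digits
        System.out.println(isValid("abcdefghi1")); // prints false: only 1 digit
        System.out.println(isValid("abc123"));     // prints false: too short
        System.out.println(isValid("abcdef123!")); // prints false: '!' is not allowed
    }
}
```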

Regex: Match wildcard followed by variable length of digits

I'm trying to extract the personal number from a string like Personal number: 123456 with the following regex:
(Personal number|Personalnummer).*(\d{2,10})
When trying to get the second group, it will only contain the last 2 digits of the personal number. If I change the digit range to {3,10} it will match the last 3 digits of the personal number.
Now I cannot just add the whitespaces as additional group, because I cannot be sure that there will be always whitespaces - there might be none or some other characters, but the personal number will be always at the end.
Is there any way I could instruct the parser to get the whole digit string?
.* is a greedy quantifier. It ends up consuming every character it can, except the last 2 digits, which it has to give back so that \d{2,10} can still match.
You have to make it reluctant by applying ?. Like below
(Personal number|Personalnummer).*?(\d{2,10})
Now it should work perfectly.
You can also convert the first group into a non capturing group, then you'll get only the number that you want in the answer like below.
(?:Personal number|Personalnummer).*?(\d{2,10})
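To compare the greedy and reluctant versions side by side in Java, here is a short sketch (the class GreedyVsReluctant and the method number are made-up names; the input is the example string from the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns capture group 1 of regex applied to text, or null if there is no match.
    static String number(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "Personal number: 123456";
        // Greedy .* swallows as much as it can and leaves only the minimum 2 digits.
        System.out.println(number("(?:Personal number|Personalnummer).*(\\d{2,10})", s));  // prints 56
        // Reluctant .*? stops as early as possible, so the group gets the whole number.
        System.out.println(number("(?:Personal number|Personalnummer).*?(\\d{2,10})", s)); // prints 123456
    }
}
```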
Use a reluctant quantifier on the wildcard match (eg *?). For instance .*? will result in the full numeric expression:
Pattern p = Pattern.compile("(Personal number|Personalnummer).*?(\\d{2,10})"); // note the ?
Matcher m = p.matcher("Personal number: 123456");
if (m.find()) {
    System.out.println(m.group(2));
}

Help with regex

I'm constructing a regex which will accept at least 1 alpha numerical character and any number of spaces.
Right now I've got...[A-Za-z0-9]+[ \t\r\n]* which I understand to be at least 1 alphanumeric OR at least 1 space. How would I fix this?
EDIT: To answer the comments below, I want it to accept strings which contain AT LEAST 1 alphanumeric AND any number of (including no) spaces. Right now it will accept JUST a whitespace.
EDIT2: To clarify, I don't want the any number of whitespace (including 0) to be accepted unless there is at least 1 alphanumeric character
\s*\p{Alnum}[\p{Alnum}\s]*
Your regex, [A-Za-z0-9]+[ \t\r\n]*, requires the string to start with a letter or digit (or, more accurately, it doesn't start matching until it sees one). Adding \s* allows the match to start with whitespace, but you still won't match any alphanumerics after the first whitespace character that follows an alphanumeric (for example, it won't match the xyz in abc xyz). Changing the trailing \s* to [\p{Alnum}\s]* fixes that problem.
On a side note, \p{Alnum} is exactly equivalent to [A-Za-z0-9] in Java, which is not the case in all regex flavors. I used \p{Alnum}, not just because it's shorter, but because it gives more protection from typos like [A-z] (which is syntactically valid, but almost certainly not what the author really meant).
EDIT: Performance should be considered, too. I originally included a + after the first \p{Alnum}, but I realized that wasn't a good idea. If this were part of a longer regex, and the regex didn't match right away, it could end up wasting a lot of time trying to match the same groups of characters with \p{Alnum}+ or [\p{Alnum}\s]*. The leading \s* is okay, though, because \s doesn't match any of the characters that \p{Alnum} matches.
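Here is a small Java sketch that checks this pattern against the cases discussed in the question. The class AlnumWithSpaces and the method accepts are illustrative names, and Matcher.matches() is used on the assumption that the whole string should be validated.

```java
import java.util.regex.Pattern;

class AlnumWithSpaces {
    // Optional leading whitespace, one required alphanumeric, then any mix of both.
    private static final Pattern P = Pattern.compile("\\s*\\p{Alnum}[\\p{Alnum}\\s]*");

    static boolean accepts(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("abc xyz")); // prints true
        System.out.println(accepts(" a "));     // prints true
        System.out.println(accepts("   "));     // prints false: whitespace only
        System.out.println(accepts(""));        // prints false: empty string
    }
}
```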
Any one or more word char zero or more whitespace
\w+\s*
Hey, try this: ([^\s]+\s*). [^\s] means catch everything that is not whitespace, while \s* means that whitespace is optional (if you really want at least one whitespace character, put + instead of *).
Edit: sorry, mine catches everything, not only alphanumerics (use ([a-zA-Z0-9]+\s*) for alphanumerics only).
This should do the trick:
\s*\p{Alnum}+\s*
\p{Alnum} is an alphanumeric character: [\p{Alpha}\p{Digit}]
* says "zero or more times"
+ says "at least one" (not "or" as you seem to believe, or is written |)
| means "or"
\s is a whitespace character: [ \t\n\x0B\f\r]
EDIT: To answer the comments below I want it to accept strings which contain AT LEAST 1 alphanumeric AND any number of (including no) spaces.
The pattern I suggested requires at least one alpha numeric character.
EDIT2: To clarify, I don't want the any number of whitespace (including 0) to be accepted unless there is at least 1 alphanumeric character
The pattern I suggested will not accept a string made of whitespace characters only.
