Can someone please explain this java Regex to me?
^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|in|aero|jobs|museum)$
This regex is used to validate an email address.
Validating email addresses is now considered bad practice (stop validating email addresses with regex), especially with such expression as in your question. For example here's a more complete expression.
As for this expression let's break it in parts:
Beginning of the matched string
^
Matches at least one character from the list
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
Non-capturing (see backreference) group which can be repeated 0..n times, that matches a . and then at least one character from the list.
(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
Just this character
#
Non-capturing group matching one character in this list [a-z0-9] and then possibly more characters from the following lists. Matched string must start and end with [a-z0-9] and inside it can have [a-z0-9-].
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+
Non-capturing group that matches 2 uppercase letters or one of the words.
(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|in|aero|jobs|museum)
End of the string.
$
^ # Beginning of the line
[a-z0-9!#$%&'*+/=?^_`{|}~-]+ # One or more (+) characters from the
bracket expression, i.e., letters [a-z],
numbers [0-9], !, $, %, et cetera
(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)* # Zero or more (*) of the above
expression, preceded by a dot \\.
# # Literal #
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+ # A digit or a letter, followed by
optional digits, letters, or dashes,
followed by a a dot
(?:[A-Z]{2}|com|org|net...) # Country code ([A-Z]{2}), or a top level
domain, such as com, org, net.
$ # End of the line
Using a concrete example, john#foo.com. The first part of the e-mail, john, will be matched by ^[a-z0-9!#$%&'*+/=?^_{|}~-]+. The # will be matched by, well, #. The domain foo, as well as the dot, is matched by (?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+. Finally, the TLD com is matched by the alternation (?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|in|aero|jobs|museum).
Related
I'm new to Regex in Java and I wanted to know how can I build one that only takes a string that consists of one or two comma-separated lists of uppercase letters, separated by a single whitespace.
I would need to filter out strings that start with a comma, that end with a comma or strings that have multiple consecutive commas.
All these would be invalid:
"D,, D"
"D D,,"
"D, ,D"
"D, ,,D"
"D,, ,D"
"D,,"
",,A"
",A"
"A,"
All these would be valid:
"D,D T,F"
"D,D T"
"A,A"
"A"
I used (\s?("[\w\s]*"|\d*)\s?(,,|$)) for consecutive commas but it doesn't do the trick when the comma is at the end or beggining of one of the whitespace separated substring like "D, ,D"
Should I aim to split by whitespace and look for a simpler regex for each of the substrings?
That would be something like this:
^[A-Z](,[A-Z])*( [A-Z](,[A-Z])*)*$
What happens here, is the following:
We expect a letter, optionally followed by one or more times a comma-immediately-followed-by-another-letter.
Then we optionally accept a space, and then the abovementioned pattern. And this is repeated.
Test: https://regex101.com/r/kzLhtw/1
You could, of course, slightly optimize the regex by making all capturing groups non-capturing: just put ?: immediately behind the (, that is, (?:.
You might use
^[A-Z](?: [A-Z])*(?:,[A-Z](?: [A-Z])*){0,2}$
^ Start of string
[A-Z] Match a single char A-Z
(?: [A-Z])* Optionally repeat a space and and a single char A-Z
(?: Non capture group
,[A-Z](?: [A-Z])* Match a comma, char A-Z followed by optionally repeat matching a space and a char A-Z
){0,2} Close the group and repeat 0-2 times
$ End of string
Regex demo
"a string that consists of one or two comma-separated lists of uppercase letters, separated by a single whitespace"
Not sure how to exactly interpretate the above, but my reading is: One or two comma-seperated lists where each list may only consist of uppercase characters. In the case of two lists, the two lists are seperated by a single space.
You could try:
^(?!.* .* )[A-Z](?:[ ,][A-Z])*$
See the online demo
^ - Start string anchor.
(?!.* .* ) - Negative lookahead to prevent two spaces present.
[A-Z] - A single uppercase alpha-char.
(?: - Open non-capture group:
[ ,] - A comma or space.
[A-Z] - A single uppercase alpha-char.
)* - Close non-capture group and match 0+ times upt to;
$ - End string anchor.
I am trying to mask email address in the following different ways.
Mask all characters except first three and the ones follows the # symbol.
This expression works fine.
(?<=.{3}).(?=[^#]*?#)
abcdefgh#gmail.com -> abc*****#gmail.com
Mask all characters except last three before # symbol.
Example : abcdefgh#gmail.com -> *****fgh#gmail.com
I am not sure how to check for # and do reverse match.
Can someone throw pointers on this?
Maybe you could do a positive lookahead:
.(?=.*...#)
See the online Demo
. - Any character other than newline.
(?=.*...#) - Positive lookahead for zero or more characters other than newline followed by three characters other than newline and #.
You could use a negated character class [^\s#] matching a non whitespace char except an #. Then assert what is on the right is that negated character class 3 times followed by matching the # sign.
In the replacement use *
[^\s#](?=[^#\s]*[^#\s]{3}#)
[^\s#] Negated character class, match a non whitespace char except #
(?= Positive lookahead, assert what is on the right is
[^#\s]* Match 0+ times a non whitespace char except #
[^#\s]{3} Match 3 times a non whitespace char except #
# Match the #
) Close lookahead
Regex demo
If there can be only a single # in the email address, you could for example make use of a finite quantifier in the positive lookbehind:
(?<=(?<!\S)[^\s#]{0,1000})[^\s#](?=[^#\s]*[^#\s]{3}#[^\s#]+\.[a-z]{2,}(?!\S))
Regex demo
I have this format: xx:xx:xx or xx:xx:xx-y, where x can be 0-9 a-f A-F and y can be only 0 or 1.
I come up with this regex: ([0-9A-Fa-f]{2}[:][0-9A-Fa-f]{2}[:][0-9A-Fa-f]{2}|[-][0-1]{1})
(See regexr).
But this matches 0a:0b:0c-3 too, which is not expected.
Is there any way to remove these cases from result?
[:] means a character from the list that contains only :. It is the same as
:. The same for [-] which has the same result as -.
Also, {1} means "the previous piece exactly one time". It does not have any effect, you can remove it altogether.
To match xx:xx:xx or xx:xx:xx-y, the part that matches -y must be optional. The quantifier ? after the optional part mark it as optional.
All in all, your regex should be like this:
[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}(-[01])?
If the regex engine you use can be told to ignore the character case then you can get rid of A-F (or a-f) from all character classes and the regex becomes:
[0-9a-f]{2}:[0-9a-f]{2}:[0-9a-f]{2}(-[01])?
How it works, piece by piece:
[0-9a-f] # any digit or letter from (and including) 'a' to 'f'
{2} # the previous piece exactly 2 times
: # the character ':'
[0-9a-f]
{2}
:
[0-9a-f]
{2}
( # start a group; it does not match anything
- # the character '-'
[01] # any character from the class (i.e. '0' or '1')
) # end of group; the group is needed for the next quantifier
? # the previous piece (i.e. the group) is optional
# it can appear zero or one times
See it in action: https://regexr.com/4rfvr
Update
As #the-fourth-bird mentions in a comment, if the regex must match the entire string then you need to anchor its ends:
^[0-9a-f]{2}:[0-9a-f]{2}:[0-9a-f]{2}(-[01])?$
^ as the first character of a regex matches the beginning of the string, $ as the last character matches the end of the string. This way the regex matches the entire string only (when there aren't other characters before or after the xx:xx:xx or xx:xx:xx-y part).
If you use the regex to find xx:xx:xx or xx:xx:xx-y in a larger string then you don't need to add ^ and $. Of course, you can add only ^ or $ to let the regex match only at the beginning or at the end of the string.
You want
xx:xx:xx or if it is followed by a -, then it must be a 0 or 1 and then it is the end (word boundry).
So you don't want any of these
0a:0b:0c-123
0a:0b:0cd
10a:0b:0c
either.
Then you want "negative lookingahead", so if you match the first part, you don't want it to be followed by a - (the first pattern) and it should end there (word boundary), and if it is followed by a -, then it must be a 0 or 1, and then a word boundary:
/\b([0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}(?!-)\b|\b[0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}-[01]\b)/i
To prevent any digit in front, a word boundary is added to the front as well.
Example: https://regexr.com/4rg42
The following almost worked:
/\b([0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}\b[^-]|\b[0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}-[01]\b)/i
but if it is the end of file and it is 3a:2b:11, then the [^-] will try to match a non - character and it won't match.
Example: https://regexr.com/4rg4q
I am trying to replace 'eed' and 'eedly' with 'ee' from words where there is a vowel before either term ('eed' or 'eedly') appears.
So for example, the word indeed would become indee because there is a vowel ('i') that happens before the 'eed'. On the other hand the word 'feed' would not change because there is no vowel before the suffix 'eed'.
I have this regex: (?i)([aeiou]([aeiou])*[e{2}][d]|[dly]\\b)
You can see what is happening with this here.
As you can see, this is correctly identifying words that end with 'eed', but it is not correctly identifying 'eedly'.
Also, when it does the replace, it is replacing all words that end with 'eed' , even words like feed which it should not remove the eed
What should I be considering here in order to make it correctly identify the words based on the rules I specified?
You can use:
str = str.replaceAll("(?i)\\b(\\w*?[aeiou]\\w*)eed(?:ly)?", "$1ee");
Updated RegEx Demo
\\b(\\w*?[aeiou]\\w*) before eed or eedly makes sure there is at least one vowel in the same word before this.
To expedite this regex you can use negated expression regex:
\\b([^\\Waeiou]*[aeiou]\\w*)eed(?:ly)?
RegEx Breakup:
\\b # word boundary
( # start captured group #`
[^\\Waeiou]* # match 0 or more of non-vowel and non-word characters
[aeiou] # match one vowel
\\w* # followed by 0 or more word characters
) # end captured group #`
eed # followed by literal "eed"
(?: # start non-capturing group
ly # match literal "ly"
)? # end non-capturing group, ? makes it optional
Replacement is:
"$1ee" which means back reference to captured group #1 followed by "ee"
find dly before finding d. otherwise your regex evaluation stops after finding eed.
(?i)([aeiou]([aeiou])*[e{2}](dly|d))
So I'm trying to separate the following two groups formatted as:
FIRST - GrouP second.group.txt
The first group can contain any character
The second group is a dot(.) delimited string.
I'm using the following regex to separate these two groups:
([A-Z].+).*?([a-z]+\.[a-z]+)
However, it gives a wrong result:
1: FIRST - GrouP second.grou
2: p.txt
I don't understand because I'm using "nongreedy" separater (.*?) instead of the greedy one (. *)
What am I doing wrong here?
Thanks
You can this regex to match both groups:
\b([A-Z].+?)\s*\b([a-z]+(?:\.[a-z]+)+)\b
RegEx Demo
Breakup:
\b # word boundary
([A-Z].+?) # match [A-Z] followed by 1 or more chars (lazy)
\s* # match 0 or more spaces
\b # word boundary
([a-z]+ # match 1 or more of [a-z] chars
(?:\.[a-z]+)+) # match a group of dot followed by 1 or more [a-z] chars
\b # word boundary
PS: (?:..) is used for non-capturing group.
This is one possible solution that should be pretty compact:
(.*?-\s*\S+)|(\S+\.?)+
https://regex101.com/r/iW8mE5/1
It is looking for anything followed by a dash, zero or more spaces, and then non-whitespace characters. And if it doesn't find that, it looks for non-whitespace followed by an optional decimal.