So I'm trying to separate the following two groups formatted as:
FIRST - GrouP second.group.txt
The first group can contain any character
The second group is a dot(.) delimited string.
I'm using the following regex to separate these two groups:
([A-Z].+).*?([a-z]+\.[a-z]+)
However, it gives a wrong result:
1: FIRST - GrouP second.grou
2: p.txt
I don't understand why, because I'm using the "nongreedy" separator (.*?) instead of the greedy one (.*).
What am I doing wrong here?
Thanks
You can use this regex to match both groups:
\b([A-Z].+?)\s*\b([a-z]+(?:\.[a-z]+)+)\b
RegEx Demo
Breakup:
\b # word boundary
([A-Z].+?) # match [A-Z] followed by 1 or more chars (lazy)
\s* # match 0 or more spaces
\b # word boundary
([a-z]+ # match 1 or more of [a-z] chars
(?:\.[a-z]+)+) # match a group of dot followed by 1 or more [a-z] chars
\b # word boundary
PS: (?:...) is a non-capturing group.
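To see the difference in practice, here is a minimal Java sketch that runs both the pattern from the question and the pattern above against the sample input. The class and method names (`SplitGroupsDemo`, `split`) are made up for illustration. The lazy `.*?` in the original pattern does not help because the greedy `.+` inside group 1 comes first and only gives back as many characters as group 2 strictly needs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitGroupsDemo {
    // Pattern from the question: the greedy .+ in group 1 grabs as much as it can.
    static final Pattern GREEDY =
            Pattern.compile("([A-Z].+).*?([a-z]+\\.[a-z]+)");

    // Pattern from the answer: lazy group 1, word boundaries around group 2.
    static final Pattern LAZY =
            Pattern.compile("\\b([A-Z].+?)\\s*\\b([a-z]+(?:\\.[a-z]+)+)\\b");

    // Returns {group 1, group 2} of the first match, or null if nothing matches.
    static String[] split(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String input = "FIRST - GrouP second.group.txt";
        for (Pattern p : new Pattern[] {GREEDY, LAZY}) {
            String[] groups = split(p, input);
            System.out.println(p.pattern());
            System.out.println("  1: " + groups[0]);
            System.out.println("  2: " + groups[1]);
        }
    }
}
```

With the sample input, the first pattern reproduces the split from the question (`FIRST - GrouP second.grou` / `p.txt`), while the second yields `FIRST - GrouP` and `second.group.txt`.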
This is one possible solution that should be pretty compact:
(.*?-\s*\S+)|(\S+\.?)+
https://regex101.com/r/iW8mE5/1
It looks for anything followed by a dash, zero or more spaces, and then non-whitespace characters. If it doesn't find that, it looks for runs of non-whitespace characters, each optionally followed by a dot.
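Since this pattern is an alternation, each part of the input comes back as a separate match rather than as two groups of one match. A rough Java sketch of how that could be consumed (the names `AlternationDemo` and `parts` are only for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    static final Pattern PARTS = Pattern.compile("(.*?-\\s*\\S+)|(\\S+\\.?)+");

    // Collects every successive match of the alternation, in input order.
    static List<String> parts(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = PARTS.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // First match comes from the left alternative, second from the right one.
        for (String part : parts("FIRST - GrouP second.group.txt")) {
            System.out.println(part);
        }
    }
}
```

For the sample input this gives `FIRST - GrouP` followed by `second.group.txt`. Note that the first alternative depends on the dash being present in the first group.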
I am trying to mask email address in the following different ways.
Mask all characters except the first three and the ones that follow the # symbol.
This expression works fine.
(?<=.{3}).(?=[^#]*?#)
abcdefgh#gmail.com -> abc*****#gmail.com
Mask all characters except the last three before the # symbol.
Example : abcdefgh#gmail.com -> *****fgh#gmail.com
I am not sure how to check for # and do reverse match.
Can someone throw pointers on this?
Maybe you could do a positive lookahead:
.(?=.*...#)
See the online Demo
. - Any character other than newline.
(?=.*...#) - Positive lookahead for zero or more characters other than newline followed by three characters other than newline and #.
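As a quick check, here is a small Java sketch that applies this pattern with `replaceAll`, using `*` as the replacement. The class and method names are placeholders.

```java
class MaskBeforeLastThree {
    // Replaces every character that still has at least three more
    // characters between it and the '#' with an asterisk.
    static String mask(String email) {
        return email.replaceAll(".(?=.*...#)", "*");
    }

    public static void main(String[] args) {
        System.out.println(mask("abcdefgh#gmail.com")); // *****fgh#gmail.com
        System.out.println(mask("ab#x.org"));           // ab#x.org (nothing to mask)
    }
}
```

Because `.` also matches spaces, this version is best suited to a string that contains only the address itself.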
You could use a negated character class [^\s#] matching a non whitespace char except an #. Then assert what is on the right is that negated character class 3 times followed by matching the # sign.
In the replacement use *
[^\s#](?=[^#\s]*[^#\s]{3}#)
[^\s#] Negated character class, match a non whitespace char except #
(?= Positive lookahead, assert what is on the right is
[^#\s]* Match 0+ times a non whitespace char except #
[^#\s]{3} Match 3 times a non whitespace char except #
# Match the #
) Close lookahead
Regex demo
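A short Java sketch of the negated character class version, assuming the address may sit inside a longer line of text (the names `MaskWithNegatedClass` and `mask` are illustrative only):

```java
class MaskWithNegatedClass {
    // The lookahead cannot cross whitespace or '#', so only characters
    // inside the token that contains the '#' are candidates for masking.
    static String mask(String text) {
        return text.replaceAll("[^\\s#](?=[^#\\s]*[^#\\s]{3}#)", "*");
    }

    public static void main(String[] args) {
        System.out.println(mask("abcdefgh#gmail.com"));
        System.out.println(mask("mail abcdefgh#gmail.com now"));
    }
}
```

Here the surrounding words `mail` and `now` are left alone, since the lookahead has to reach the `#` without passing over a space.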
If there can be only a single # in the email address, you could for example make use of a finite quantifier in the positive lookbehind:
(?<=(?<!\S)[^\s#]{0,1000})[^\s#](?=[^#\s]*[^#\s]{3}#[^\s#]+\.[a-z]{2,}(?!\S))
Regex demo
I'm writing a syntax checker (in Java) for a file that has the keywords and comma (separation)/semicolon (EOL) separated values. The amount of spaces between two complete constructions is unspecified.
What is required:
Find any duplicate words (consecutive and non-consecutive) in the multiline file.
// Example_1 (duplicate 'test'):
item1 , test, item3 ;
item4,item5;
test , item6;
// Example_2 (duplicate 'test'):
item1 , test, test ;
item2,item3;
I've tried to apply the (\w+)(s*\W\s*\w*)*\1 pattern, which doesn't catch duplicates properly.
You may use this regex with mode DOTALL (single line):
(?s)(\b\w+\b)(?=.*\b\1\b)
RegEx Demo
RegEx Details:
(?s): Enable DOTALL mode
(\b\w+\b): Match a complete word and capture it in group #1
(?=.*\b\1\b): Lookahead to assert that we have back-reference \1 present somewhere ahead. \b is used to make sure we match exact same word again.
Additionally:
Based on the comments, if the intent is to not match consecutive word repeats like item1 item1, then the following regex may be used:
(?s)(\b\w+\b)(?!\W+\1\b)(?=.*\b\1\b)
RegEx Demo 2
There is one extra negative lookahead assertion here to make sure we don't match consecutive repeats.
(?!\W+\1\b): Negative lookahead to fail the match for consecutive repeats.
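Here is a minimal Java sketch that runs both patterns over the two examples from the question. The class name `DuplicateWords` and the helper `duplicates` are made up for this illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateWords {
    // Any word that occurs again somewhere later in the text.
    static final Pattern ANY_REPEAT =
            Pattern.compile("(?s)(\\b\\w+\\b)(?=.*\\b\\1\\b)");

    // Same, but skips a word whose very next word is the same word.
    static final Pattern NON_CONSECUTIVE =
            Pattern.compile("(?s)(\\b\\w+\\b)(?!\\W+\\1\\b)(?=.*\\b\\1\\b)");

    // Returns the distinct words reported by the given pattern, in input order.
    static Set<String> duplicates(Pattern pattern, String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String example1 = "item1 , test, item3 ;\nitem4,item5;\ntest , item6;";
        String example2 = "item1 , test, test ;\nitem2,item3;";
        System.out.println(duplicates(ANY_REPEAT, example1));
        System.out.println(duplicates(ANY_REPEAT, example2));
        System.out.println(duplicates(NON_CONSECUTIVE, example2));
    }
}
```

The first pattern reports `test` for both examples. The second one reports nothing for Example_2, because there the two occurrences are adjacent.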
You may use
\b(\w+)\b(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b
See the regex demo
Details
\b(\w+)\b - Group 1: one or more word chars as a whole word
(?:\s*[^\w\s]\s*\w+)+ - 1 or more occurrences of:
\s* - 0+ whitespaces
[^\w\s] - 1 char other than a word and whitespace char
\s* - 0+ whitespaces
\w+ - 1+ word chars
\s* - 0+ whitespaces
[^\w\s] - 1 char other than a word and whitespace char
\s* - 0+ whitespaces
\b\1\b - the same value as in Group 1 as whole word.
To only match the word, put the second part of the regex into a positive lookahead:
\b(\w+)\b(?=(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b)
         ^^^                                        ^
See this regex demo.
Java regex variable declaration:
String regex = "\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";
To make it fully Unicode aware add (?U):
String regex = "(?U)\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";
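For completeness, a small Java sketch using the lookahead form of this pattern, so that only the repeated word itself is returned. The names `SeparatedDuplicate` and `find` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatedDuplicate {
    // Lookahead form: matches only the first occurrence of the repeated word.
    static final Pattern REPEATED = Pattern.compile(
            "\\b(\\w+)\\b(?=(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b)");

    // Returns every word that the pattern reports as repeated later on.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = REPEATED.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String example1 = "item1 , test, item3 ;\nitem4,item5;\ntest , item6;";
        System.out.println(find(example1)); // [test]
    }
}
```

One thing to be aware of: because the `(?:...)+` part requires at least one other word between the two occurrences, an adjacent pair such as `test, test` in Example_2 is not reported by this pattern as written.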
I have a regex:
"(\\d+\\.\\,?)+"
And the value:
3.053,500
But my regex pattern does not match it.
I want to have a pattern which validates numbers, dots and commas.
For example, values which are valid:
1
12
1,2
1.2
1,23,456
1,23.456
1.234,567
etc.
Your (\d+\.\,?)+ regex matches 1 or more repetitions of 1+ digits, a dot, and an optional ,. This means the string must end with a dot or a comma that follows a dot. 3.053,500 ends with digits, so it does not match.
You may use
s.matches("\\d+(?:[.,]\\d+)*")
See the regex demo
Note that the ^ and $ anchors are not necessary in Java's .matches() method as the match is anchored to the start/end of the string automatically. At regex101.com, the anchors are meant to match start/end of the line (since the demo is run against a multiline string).
Pattern details
\d+ - 1+ digits
(?: - start of a non-capturing group:
[.,] - a dot or ,
\d+ - 1+ digits
)* - 0 or more repetitions.
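A brief Java sketch that checks the sample values from the question against this pattern (the class and method names are only illustrative):

```java
class NumberFormatCheck {
    // matches() anchors the pattern to the whole string, so no ^ or $ is needed.
    static boolean isValid(String s) {
        return s.matches("\\d+(?:[.,]\\d+)*");
    }

    public static void main(String[] args) {
        String[] samples = {"1", "12", "1,2", "1.2", "1,23,456",
                            "1,23.456", "1.234,567", "3.053,500", "1.", "1..2"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

All values listed in the question are accepted, while strings such as `1.` or `1..2` are rejected because every separator must be followed by at least one digit.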
I am trying to replace 'eed' and 'eedly' with 'ee' from words where there is a vowel before either term ('eed' or 'eedly') appears.
So for example, the word indeed would become indee because there is a vowel ('i') that happens before the 'eed'. On the other hand the word 'feed' would not change because there is no vowel before the suffix 'eed'.
I have this regex: (?i)([aeiou]([aeiou])*[e{2}][d]|[dly]\\b)
You can see what is happening with this here.
As you can see, this is correctly identifying words that end with 'eed', but it is not correctly identifying 'eedly'.
Also, when it does the replacement, it changes all words that end with 'eed', even words like feed, from which the 'eed' should not be removed.
What should I be considering here in order to make it correctly identify the words based on the rules I specified?
You can use:
str = str.replaceAll("(?i)\\b(\\w*?[aeiou]\\w*)eed(?:ly)?", "$1ee");
Updated RegEx Demo
\\b(\\w*?[aeiou]\\w*) before eed or eedly makes sure there is at least one vowel in the same word before this.
To speed this regex up you can use a negated character class:
\\b([^\\Waeiou]*[aeiou]\\w*)eed(?:ly)?
RegEx Breakup:
\\b # word boundary
( # start captured group #1
[^\\Waeiou]* # match 0 or more word characters that are not vowels
[aeiou] # match one vowel
\\w* # followed by 0 or more word characters
) # end captured group #1
eed # followed by literal "eed"
(?: # start non-capturing group
ly # match literal "ly"
)? # end non-capturing group, ? makes it optional
Replacement is:
"$1ee" which means back reference to captured group #1 followed by "ee"
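Putting both variants into a small Java sketch makes it easy to try them on a few words. The class name `EedSuffix` and the method names are hypothetical, and the `(?i)` flag on the second pattern is added here on the assumption that it should behave like the first.

```java
class EedSuffix {
    // Lazy version: any word characters, then a vowel, before "eed"/"eedly".
    static String stem(String str) {
        return str.replaceAll("(?i)\\b(\\w*?[aeiou]\\w*)eed(?:ly)?", "$1ee");
    }

    // Negated-class version: skip leading non-vowel word characters directly.
    static String stemNegated(String str) {
        return str.replaceAll("(?i)\\b([^\\Waeiou]*[aeiou]\\w*)eed(?:ly)?", "$1ee");
    }

    public static void main(String[] args) {
        String text = "indeed feed agreedly";
        System.out.println(stem(text));        // indee feed agree
        System.out.println(stemNegated(text)); // indee feed agree
    }
}
```

In `feed` the only vowels are the two `e` characters that belong to the `eed` suffix itself, so there is no earlier vowel and the word stays unchanged.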
Put dly before d in the alternation. Otherwise the regex stops matching as soon as it has found eed and never tries eedly.
(?i)([aeiou]([aeiou])*[e{2}](dly|d))
I want to write a regex in Java to check if a string ends in double consonant.
My regex is not working.
\\w+[^aeiou]\\1$
Appreciate your help
Thanks a ton.
It doesn't work since \1 references a non-existent subpattern. You need to assign a capturing group. Capturing groups could be used later on in the regular expression as a backreference to what was matched in that captured group.
\\w+([^aeiou])\\1$
Based off the comment above about your regular expression not only matching double consonants, I would consider combining an intersection with negation to make sure the grouped character is an actual letter character.
(?i)\\w+([a-z&&[^aeiou]])\\1$
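A minimal Java sketch comparing the two patterns above. It uses `find()` rather than `matches()`, and the class and method names are placeholders.

```java
import java.util.regex.Pattern;

class DoubleConsonant {
    // Any non-vowel character repeated at the end (this includes digits and '_').
    static final Pattern ANY_CHAR = Pattern.compile("\\w+([^aeiou])\\1$");

    // Only ASCII letters that are not vowels, case-insensitive.
    static final Pattern LETTERS_ONLY =
            Pattern.compile("(?i)\\w+([a-z&&[^aeiou]])\\1$");

    static boolean endsInDouble(Pattern pattern, String s) {
        return pattern.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"hiss", "hello", "a11", "BALL"}) {
            System.out.println(s + ": " + endsInDouble(ANY_CHAR, s)
                    + " / " + endsInDouble(LETTERS_ONLY, s));
        }
    }
}
```

The input `a11` shows the difference: the first pattern accepts it because `1` is simply "not a vowel", while the second one rejects it. Note that both patterns need at least one word character before the doubled letter because of the leading `\w+`.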
This might work.
# "(?i)\\w+(?:(?![aeiou])[a-z]){2}$"
(?i) # Case independent
\w+
(?:
(?! [aeiou] ) # Not a vowel ahead
[a-z] # Consonant only
){2}
$
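A quick Java sketch for trying this variant out (the names are made up for the example). One caveat worth checking against your requirements: there is no backreference here, so the two trailing consonants do not have to be the same letter.

```java
import java.util.regex.Pattern;

class TwoConsonants {
    static final Pattern TWO_CONSONANTS =
            Pattern.compile("(?i)\\w+(?:(?![aeiou])[a-z]){2}$");

    // True if the string ends in two consonant letters (not necessarily equal).
    static boolean endsInTwoConsonants(String s) {
        return TWO_CONSONANTS.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"hiss", "hand", "hello", "a11"}) {
            System.out.println(s + ": " + endsInTwoConsonants(s));
        }
    }
}
```

Both `hiss` and `hand` are accepted, whereas `hello` and `a11` are not. If "double consonant" means the same letter twice, a pattern with a capturing group and `\\1` would be the closer fit.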