Regex doesn't stop after sign - Java

Hi, I have a regex like this:
(.*(?=\sI+)*) (.*)
But it doesn't capture the groups the way I need.
For this example data :
Vladimir Goth
Langraab II Landgraab
Léa Magdalena III Rouault Something
Anna Maria Teodora
Léa Maria Teodora II
Only examples 1 and 2 are captured correctly.
So what I need is:
If there is no I+, split at the first space.
If other words follow I+, the first group should contain everything up to and including I+. So group 1 for the 3rd example should be Léa Magdalena III.
If no other words follow I+, as in example 5, group 1 should capture only up to the first space.
#Edit
I+ should be replaced by Roman numerals.

If you want to support any Roman numbers you can use
^(\S+(?:.*\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)
If you need to support Roman numbers up to XX (exclusive):
^(\S+(?:.*\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)
See the regex demo #1 and demo #2. Replace spaces with \h or \s in the Java code and double backslashes in the Java string literal.
Details:
^ - start of string
( - Group 1 start:
\S+ - one or more non-whitespaces
(?: - a non-capturing group:
.* - any zero or more chars other than line break chars as many as possible
\b - a word boundary
(?=[MDCLXVI]) - require at least one Roman digit immediately to the right
M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}) - a Roman number pattern
\b - a word boundary
(?= +\S) - a positive lookahead that requires one or more spaces and then one non-whitespace right after the current position
)? - end of the non-capturing group, repeat one or zero times (it is optional)
) - end of the first group
+ - one or more spaces
(.*) - Group 2: the rest of the line.
In Java:
String regex = "^(\\S+(?:.*\\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\\b(?=\\h+\\S))?)\\h+(.*)";
// Or
String regex = "^(\\S+(?:.*\\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\\b(?=\\s+\\S))?)\\s+(.*)";
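As a quick check of the second pattern against the sample names from the question, a minimal sketch could look like the following. The class name RomanNameSplitter and the split helper are made up for illustration; only the regex comes from the answer. The é is written as \u00e9 so the file compiles regardless of source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class RomanNameSplitter {
    // The "Roman numbers below XX" pattern, with every backslash doubled for Java.
    private static final Pattern NAME = Pattern.compile(
            "^(\\S+(?:.*\\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\\b(?=\\s+\\S))?)\\s+(.*)");

    // Returns {group 1, group 2}, or null when the pattern does not match.
    static String[] split(String line) {
        Matcher m = NAME.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Vladimir Goth",
            "Langraab II Landgraab",
            "L\u00e9a Magdalena III Rouault Something",
            "Anna Maria Teodora",
            "L\u00e9a Maria Teodora II"
        };
        for (String s : samples) {
            String[] parts = split(s);
            System.out.println("[" + parts[0] + "] [" + parts[1] + "]");
        }
    }
}
```

For the five samples, group 1 comes out as Vladimir, Langraab II, Léa Magdalena III, Anna and Léa, which matches the three rules stated in the question.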

Related

Regular expression to match pattern and text afterwards until that pattern occurs again, then repeat

I'm trying to write a regex for my Kotlin/JVM program that satisfies:
Given this line of text {#FF00FF}test1{#112233}{placeholder} test2
It should match:
Match 1: #FF00FF as group 1 and test1 as group 2
Match 2: #112233 as group 1 and {placeholder} test2 as group 2
#FF00FF can be any valid 6 character hex color code.
The thing I'm struggling with is to match the text after the color pattern until another color pattern comes up.
Current regex I came up with is \{(#[a-zA-Z0-9]{6})\}((?!\{#[a-zA-Z0-9]{6}\}).*)
You can use
\{(#[a-zA-Z0-9]{6})\}(.*?)(?=\{#[a-zA-Z0-9]{6}\}|$)
See the regex demo. Details:
\{ - a { char
(#[a-zA-Z0-9]{6}) - Group 1: a # char and six alphanumerics
\} - a } char
(.*?) - Group 2: any zero or more chars other than line break chars as few as possible
(?=\{#[a-zA-Z0-9]{6}\}|$) - a position immediately followed with a {, #, six alphanumerics and a } char, or end of string.
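The question targets Kotlin/JVM, which uses the same java.util.regex engine, so the pattern can be tried from plain Java. The sketch below is only an illustration: the class name ColorSegments and the parse helper are assumptions, not something from the question or the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class used only to exercise the pattern.
class ColorSegments {
    private static final Pattern SEGMENT = Pattern.compile(
            "\\{(#[a-zA-Z0-9]{6})\\}(.*?)(?=\\{#[a-zA-Z0-9]{6}\\}|$)");

    // Returns one {color, text} pair per match, in order of appearance.
    static List<String[]> parse(String input) {
        List<String[]> out = new ArrayList<String[]>();
        Matcher m = SEGMENT.matcher(input);
        while (m.find()) {
            out.add(new String[] { m.group(1), m.group(2) });
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] seg : parse("{#FF00FF}test1{#112233}{placeholder} test2")) {
            System.out.println(seg[0] + " -> " + seg[1]);
        }
    }
}
```

For the line from the question this yields two pairs: #FF00FF with test1, and #112233 with {placeholder} test2. The lazy (.*?) stops at the next color tag because the lookahead only checks for it without consuming it, so the following find() call can start there.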

Regex to identify consecutive and non-consecutive duplicate words in multiline text

I'm writing a syntax checker (in Java) for a file that has the keywords and comma (separation)/semicolon (EOL) separated values. The amount of spaces between two complete constructions is unspecified.
What is required:
Find any duplicate words (consecutive and non-consecutive) in the multiline file.
// Example_1 (duplicate 'test'):
item1 , test, item3 ;
item4,item5;
test , item6;
// Example_2 (duplicate 'test'):
item1 , test, test ;
item2,item3;
I've tried to apply the (\w+)(s*\W\s*\w*)*\1 pattern, which doesn't catch duplicate properly.
You may use this regex with mode DOTALL (single line):
(?s)(\b\w+\b)(?=.*\b\1\b)
RegEx Demo
RegEx Details:
(?s): Enable DOTALL mode
(\b\w+\b): Match a complete word and capture it in group #1
(?=.*\b\1\b): Lookahead to assert that we have back-reference \1 present somewhere ahead. \b is used to make sure we match exact same word again.
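A minimal sketch of how this pattern might be used from Java to collect the duplicated words. The class name DuplicateWords and the find helper are hypothetical; a LinkedHashSet is used here because a word that occurs three or more times would otherwise be reported more than once.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, shown only to illustrate the lookahead pattern.
class DuplicateWords {
    // (?s) lets the dot inside the lookahead cross line breaks.
    private static final Pattern DUP =
            Pattern.compile("(?s)(\\b\\w+\\b)(?=.*\\b\\1\\b)");

    // Returns every word that occurs again later in the text, in order of first appearance.
    static Set<String> find(String text) {
        Set<String> dups = new LinkedHashSet<String>();
        Matcher m = DUP.matcher(text);
        while (m.find()) {
            dups.add(m.group(1));
        }
        return dups;
    }

    public static void main(String[] args) {
        String example1 = "item1 , test, item3 ;\nitem4,item5;\ntest , item6;";
        String example2 = "item1 , test, test ;\nitem2,item3;";
        System.out.println(find(example1));
        System.out.println(find(example2));
    }
}
```

Both examples from the question yield a set containing only test.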
Additionally:
If the intent is not to match consecutive word repeats like item1 item1, the following regex may be used:
(?s)(\b\w+\b)(?!\W+\1\b)(?=.*\b\1\b)
RegEx Demo 2
There is one extra negative lookahead assertion here to make sure we don't match consecutive repeats.
(?!\W+\1\b): Negative lookahead to fail the match for consecutive repeats.
You may use
\b(\w+)\b(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b
See the regex demo
Details
\b(\w+)\b - Group 1: one or more word chars as a whole word
(?:\s*[^\w\s]\s*\w+)+ - 1 or more occurrences of:
\s* - 0+ whitespaces
[^\w\s] - 1 char other than a word and whitespace char
\s* - 0+ whitespaces
\w+ - 1+ word chars
\s* - 0+ whitespaces
[^\w\s] - 1 char other than a word and whitespace char
\s* - 0+ whitespaces
\b\1\b - the same value as in Group 1 as whole word.
To only match the word, wrap the second part of the regex in a positive lookahead, i.e. add (?= right after the leading \b(\w+)\b and a closing ) at the very end:
\b(\w+)\b(?=(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b)
See this regex demo.
Java regex variable declaration:
String regex = "\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";
To make it fully Unicode aware add (?U):
String regex = "(?U)\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";
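As a rough illustration, the lookahead variant of this pattern could be applied as follows. The class name SeparatedDuplicates and the find helper are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for trying the separator-aware pattern.
class SeparatedDuplicates {
    // Lookahead variant: only the first occurrence of the word is consumed.
    private static final Pattern DUP = Pattern.compile(
            "\\b(\\w+)\\b(?=(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b)");

    // Returns each matched word, in order of appearance.
    static List<String> find(String text) {
        List<String> words = new ArrayList<String>();
        Matcher m = DUP.matcher(text);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(find("item1 , test, item3 ;\nitem4,item5;\ntest , item6;"));
    }
}
```

For Example_1 this reports test. Because the intermediate group is quantified with +, at least one other word has to sit between the two occurrences, so the directly adjacent pair in Example_2 is not reported by this particular pattern.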

How to avoid a hyphen from splitting a regex?

I'm writing a simple android app for saving your favorite games in a list.
In the first screen a user has to enter his gamertag (as a String). The gamertag should only contain letters from a-z (uppercase and lowercase), numbers (0-9) and underscores/hyphens (_ and -).
I can get it to work with an underscore in every position or a hyphen at the beginning. But if the String contains a hyphen in the middle it gets "split" into two pieces and if the hyphen is at the end, it stands alone.
I came up with this regex:
[a-zA-Z0-9_\-]\w+
In Java it looks a little different because the \ needs to be escaped:
[a-zA-Z0-9_\\-]\\w+
Gamertags that should validate:
- GamerTag
- Gamer_Tag
- _GamerTag
- GamerTag_
- -GamerTag
- Gamer-Tag
- GamerTag-
Gamertags that shouldn't validate:
- !GamerTag
- Gamer%Tag
- Gamer Tag
Gamertags that should validate, but my regex fails:
- Gamer-Tag
- GamerTag-
Your pattern [a-zA-Z0-9_\-]\w+ matches 1 character from the character class, followed by 1 or more word characters (\w), and \w does not match a -.
You could instead repeat the character class itself 1 or more times, since it already contains the hyphen. If the hyphen is at the end of the character class, you don't have to escape it.
[a-zA-Z0-9_-]+
With your original pattern, Gamer-Tag does not get split; it produces 2 matches. The character class matches G and the \w+ matches amer. Then in the next match the character class matches - and \w+ matches Tag.
If those are the only values allowed, you could use anchors ^ to assert the start and $ to assert the end of the string.
^[a-zA-Z0-9_-]+$
Regex demo
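A small sketch of the anchored pattern applied to the gamertags listed in the question. The class name GamertagValidator and the isValid method are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical validator class built around the anchored character class.
class GamertagValidator {
    // Hyphen placed last in the class, so it needs no escaping.
    private static final Pattern VALID = Pattern.compile("^[a-zA-Z0-9_-]+$");

    // True only when the whole tag consists of letters, digits, _ or -.
    static boolean isValid(String tag) {
        return VALID.matcher(tag).matches();
    }

    public static void main(String[] args) {
        String[] tags = {
            "GamerTag", "Gamer_Tag", "_GamerTag", "GamerTag_",
            "-GamerTag", "Gamer-Tag", "GamerTag-",
            "!GamerTag", "Gamer%Tag", "Gamer Tag"
        };
        for (String tag : tags) {
            System.out.println(tag + " : " + isValid(tag));
        }
    }
}
```

The first seven tags are accepted and the last three are rejected. Since matches() already requires the whole input to match, the anchors are redundant in this sketch; they are kept so the pattern behaves the same way with find().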

value which should match only numbers, dots and commas

I have a regex:
"(\\d+\\.\\,?)+"
And the value:
3.053,500
But my regex pattern does not match it.
I want to have a pattern which validates numbers, dots and commas.
For example, values which are valid:
1
12
1,2
1.2
1,23,456
1,23.456
1.234,567
etc.
Your (\d+\.\,?)+ regex matches 1 or more repetitions of 1+ digits, a dot, and an optional ,. This means every digit run must be followed by a dot, so the string has to end with a dot or with a dot and a comma. 3.053,500 ends with digits, so it does not match.
You may use
s.matches("\\d+(?:[.,]\\d+)*")
See the regex demo
Note that the ^ and $ anchors are not necessary in Java's .matches() method as the match is anchored to the start/end of the string automatically. At regex101.com, the anchors are meant to match start/end of the line (since the demo is run against a multiline string).
Pattern details
\d+ - 1+ digits
(?: - start of a non-capturing group:
[.,] - a dot or ,
\d+ - 1+ digits
)* - 0 or more repetitions.
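To see the pattern at work on the values from the question, a minimal sketch might look like this. The class name NumberFormatCheck and the isValid method are invented for illustration.

```java
// Hypothetical helper class wrapping the String.matches() call.
class NumberFormatCheck {
    // Digits, optionally followed by groups of "separator + digits".
    static boolean isValid(String s) {
        return s.matches("\\d+(?:[.,]\\d+)*");
    }

    public static void main(String[] args) {
        String[] values = {
            "1", "12", "1,2", "1.2", "1,23,456", "1,23.456", "1.234,567",
            "3.053,500", "1.", ",5", "1,,2"
        };
        for (String v : values) {
            System.out.println(v + " : " + isValid(v));
        }
    }
}
```

All values listed as valid in the question, and 3.053,500, are accepted. Strings that start or end with a separator, or contain two separators in a row, are rejected because each separator must be followed by at least one digit.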

Splitting string only at the first matching regex position

I have this string:
"123456 - A, Bcd, 789101 - E, Fgh"
I want it split into: "123456 - A, Bcd" and "789101 - E, Fgh".
How can I achieve this? What regex and split expressions should I use?
I see I can find the comma after "Bcd" using .matches(".*[a-z],\\s[0-9].*")
but how do I split the strings ONLY at that comma? .split(",\\s") splits at all occurring comma followed by space...
I work with JAVA 1.6.
You may split on the comma that is followed with 0+ whitespaces, 6 digits, space and a hyphen:
String[] result = s.split(",\\s*(?=\\d{6} -)");
See the regex demo.
Pattern details
, - a comma
\s* - 0+ whitespace chars
(?=\d{6} -) - a positive lookahead (a non-consuming pattern, what it matches won't be part of the result) that requires 6 digits followed with a space and - immediately to the right of the current location.
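The split can be tried on the string from the question with a sketch like the one below. The class name FirstRecordSplit and the splitRecords method are hypothetical; the code sticks to String.split and avoids newer language features, so it should also be usable on an old JDK such as the one mentioned in the question.

```java
import java.util.Arrays;

// Hypothetical helper class around the lookahead-based split.
class FirstRecordSplit {
    // Splits only at commas that are followed by a 6-digit number, a space and a hyphen.
    static String[] splitRecords(String s) {
        return s.split(",\\s*(?=\\d{6} -)");
    }

    public static void main(String[] args) {
        String s = "123456 - A, Bcd, 789101 - E, Fgh";
        System.out.println(Arrays.toString(splitRecords(s)));
    }
}
```

This produces the two parts 123456 - A, Bcd and 789101 - E, Fgh. The commas after A and E are left alone because the text after them does not start with six digits.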
