Splitting string only at the first matching regex position

I have this string:
"123456 - A, Bcd, 789101 - E, Fgh"
I want it split into: "123456 - A, Bcd" and "789101 - E, Fgh".
How can I achieve this? What regex and split expressions should I use?
I see I can find the comma after "Bcd" using .matches(".*[a-z],\\s[0-9].*"),
but how do I split the string ONLY at that comma? .split(",\\s") splits at every comma followed by a space.
I work with Java 1.6.

You may split on the comma that is followed with 0+ whitespaces, 6 digits, space and a hyphen:
String[] result = s.split(",\\s*(?=\\d{6} -)");
See the regex demo.
Pattern details
, - a comma
\s* - 0+ whitespace chars
(?=\d{6} -) - a positive lookahead (a non-consuming pattern: what it matches is not part of the match, so split() does not remove it) that requires 6 digits followed by a space and - immediately to the right of the current location.
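As a minimal sketch of the split above, the following self-contained program applies the same pattern to the string from the question. The class and method names (SplitAtLookahead, splitRecords) are made up for illustration only.

```java
class SplitAtLookahead {
    // Splits only at a comma (plus optional whitespace) that is followed by
    // six digits, a space and a hyphen. The lookahead is not consumed, so the
    // digits stay at the start of the next part.
    static String[] splitRecords(String s) {
        return s.split(",\\s*(?=\\d{6} -)");
    }

    public static void main(String[] args) {
        String s = "123456 - A, Bcd, 789101 - E, Fgh";
        for (String part : splitRecords(s)) {
            System.out.println(part);
        }
        // prints:
        // 123456 - A, Bcd
        // 789101 - E, Fgh
    }
}
```

The commas after "A" and "E" are left alone because no six-digit number follows them.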

Related

Regex to match comma separated values

I'm new to Regex in Java and I wanted to know how can I build one that only takes a string that consists of one or two comma-separated lists of uppercase letters, separated by a single whitespace.
I would need to filter out strings that start with a comma, that end with a comma or strings that have multiple consecutive commas.
All these would be invalid:
"D,, D"
"D D,,"
"D, ,D"
"D, ,,D"
"D,, ,D"
"D,,"
",,A"
",A"
"A,"
All these would be valid:
"D,D T,F"
"D,D T"
"A,A"
"A"
I used (\s?("[\w\s]*"|\d*)\s?(,,|$)) for consecutive commas, but it doesn't do the trick when the comma is at the beginning or end of one of the whitespace-separated substrings, as in "D, ,D".
Should I aim to split by whitespace and look for a simpler regex for each of the substrings?

That would be something like this:
^[A-Z](,[A-Z])*( [A-Z](,[A-Z])*)*$
What happens here, is the following:
We expect a letter, optionally followed by one or more repetitions of a comma immediately followed by another letter.
Then we optionally accept a space followed by the same pattern, and this part can repeat.
Test: https://regex101.com/r/kzLhtw/1
You could, of course, slightly optimize the regex by making all capturing groups non-capturing: just put ?: immediately after the (, that is, (?:.
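A small sketch of how this pattern could be checked against the samples from the question, using the non-capturing variant. The class name CommaListValidator and the method isValid are hypothetical.

```java
import java.util.regex.Pattern;

class CommaListValidator {
    // Same pattern as above, with the groups made non-capturing.
    static final Pattern LISTS =
            Pattern.compile("^[A-Z](?:,[A-Z])*(?: [A-Z](?:,[A-Z])*)*$");

    static boolean isValid(String s) {
        return LISTS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"D,D T,F", "D,D T", "A,A", "A", "D,, D", "D, ,D", ",A", "A,"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValid(s));
        }
    }
}
```

Note that, as written, the pattern accepts any number of space-separated lists (for example "A B C"). If at most two lists are wanted, replacing the final * with ? should restrict it.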

You might use
^[A-Z](?: [A-Z])*(?:,[A-Z](?: [A-Z])*){0,2}$
^ Start of string
[A-Z] Match a single char A-Z
(?: [A-Z])* Optionally repeat a space and a single char A-Z
(?: Non capture group
,[A-Z](?: [A-Z])* Match a comma and a char A-Z, optionally followed by repetitions of a space and a char A-Z
){0,2} Close the group and repeat 0-2 times
$ End of string
Regex demo

"a string that consists of one or two comma-separated lists of uppercase letters, separated by a single whitespace"
Not sure exactly how to interpret the above, but my reading is: one or two comma-separated lists where each list may only consist of uppercase characters. In the case of two lists, the two lists are separated by a single space.
You could try:
^(?!.* .* )[A-Z](?:[ ,][A-Z])*$
See the online demo
^ - Start string anchor.
(?!.* .* ) - Negative lookahead to prevent two spaces present.
[A-Z] - A single uppercase alpha-char.
(?: - Open non-capture group:
[ ,] - A comma or space.
[A-Z] - A single uppercase alpha-char.
)* - Close non-capture group and match it 0+ times, up to:
$ - End string anchor.
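To illustrate how the negative lookahead caps the number of spaces (and therefore the number of lists), here is a minimal sketch. The class name AtMostTwoLists and the method isValid are illustrative, not from the answer.

```java
import java.util.regex.Pattern;

class AtMostTwoLists {
    // (?!.* .* ) rejects any input that contains two or more spaces,
    // so at most one space, and therefore at most two lists, can appear.
    static final Pattern P = Pattern.compile("^(?!.* .* )[A-Z](?:[ ,][A-Z])*$");

    static boolean isValid(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("D,D T,F")); // true: one space, two lists
        System.out.println(isValid("A B C"));   // false: two spaces
        System.out.println(isValid("D, ,D"));   // false: comma not followed by a letter
    }
}
```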

Regex to identify consecutive and non-consecutive duplicate words in multiline text

I'm writing a syntax checker (in Java) for a file that has the keywords and comma (separation)/semicolon (EOL) separated values. The amount of spaces between two complete constructions is unspecified.
What is required:
Find any duplicate words (consecutive and non-consecutive) in the multiline file.
// Example_1 (duplicate 'test'):
item1 , test, item3 ;
item4,item5;
test , item6;
// Example_2 (duplicate 'test'):
item1 , test, test ;
item2,item3;
I've tried to apply the (\w+)(s*\W\s*\w*)*\1 pattern, which doesn't catch the duplicates properly.

You may use this regex with mode DOTALL (single line):
(?s)(\b\w+\b)(?=.*\b\1\b)
RegEx Demo
RegEx Details:
(?s): Enable DOTALL mode
(\b\w+\b): Match a complete word and capture it in group #1
(?=.*\b\1\b): Lookahead to assert that we have back-reference \1 present somewhere ahead. \b is used to make sure we match exact same word again.
Additionally:
Based on the earlier comments: if the intent is to not match consecutive word repeats like item1 item1, then the following regex may be used:
(?s)(\b\w+\b)(?!\W+\1\b)(?=.*\b\1\b)
RegEx Demo 2
There is one extra negative lookahead assertion here to make sure we don't match consecutive repeats.
(?!\W+\1\b): Negative lookahead to fail the match for consecutive repeats.
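The two patterns above can be compared side by side with a short sketch like the one below, run on the example inputs from the question. The class name DuplicateWords, the constant names and the helper find are assumptions made for this illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateWords {
    // Any word that occurs again somewhere later in the text.
    static final Pattern ANY_REPEAT =
            Pattern.compile("(?s)(\\b\\w+\\b)(?=.*\\b\\1\\b)");
    // Same, but the match fails when the very next word is the repeat.
    static final Pattern NON_ADJACENT =
            Pattern.compile("(?s)(\\b\\w+\\b)(?!\\W+\\1\\b)(?=.*\\b\\1\\b)");

    // Collects every word captured in group 1, in order of first appearance.
    static Set<String> find(Pattern p, String text) {
        Set<String> words = new LinkedHashSet<String>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        String example1 = "item1 , test, item3 ;\nitem4,item5;\ntest , item6;";
        String example2 = "item1 , test, test ;\nitem2,item3;";
        System.out.println(find(ANY_REPEAT, example1));   // [test]
        System.out.println(find(ANY_REPEAT, example2));   // [test]
        System.out.println(find(NON_ADJACENT, example2)); // []
    }
}
```

The (?s) flag matters for the first example: without it, .* would stop at the line break and the second "test" would not be seen.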

You may use
\b(\w+)\b(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b
See the regex demo
Details
\b(\w+)\b - Group 1: one or more word chars as a whole word
(?:\s*[^\w\s]\s*\w+)+ - 1 or more occurrences of:
\s* - 0+ whitespaces
[^\w\s] - 1 char other than a word and whitespace char
\s* - 0+ whitespaces
\w+ - 1+ word chars
\s* - 0+ whitespaces
[^\w\s] - 1 char other than a word and whitespace char
\s* - 0+ whitespaces
\b\1\b - the same value as in Group 1 as whole word.
To only match the word, put the second part of the regex into a positive lookahead, that is, wrap everything after \b(\w+)\b in (?= and ):
\b(\w+)\b(?=(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b)
See this regex demo.
Java regex variable declaration:
String regex = "\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";
To make it fully Unicode aware add (?U):
String regex = "(?U)\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";

Java regex match anything except a single expression

I am trying to replace everything except a specific expression including digits in java using only the replaceAll() method and a single regex.
Given the String P=32 N=5 M=2 I want to extract each variable independently.
I can match the expression N=5 with the regex N=\d, but I can't seem to find an inverse expression that will match anything but N=x, where x may be any digit.
I do not want to use Pattern or Matcher but solve this using regex only. So for x, y, z being any digit, I want to be able to replace everything but the expression N=y in a String P=x N=y M=z:
String input = "P=32 N=5 M=2";
output = input.replaceAll(regex, "");
System.out.println(output);
// expected "N=5"

You may use
s = s.replaceAll("\\s*\\b(?!N=\\d)\\w+=\\d+", "").trim();
See the Java demo and the regex demo.
Details
\s* - 0+ whitespaces
\b - a word boundary
(?!N=\d) - immediately to the right, there should be no N= and any digit
\w+ - 1+ letters/digits/_
= - an = sign
\d+ - 1+ digits.
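A minimal runnable sketch of this replacement, applied to the input from the question. The class name KeepOnlyN and the method keepN are hypothetical.

```java
class KeepOnlyN {
    // Removes every key=digits pair except the one starting with N=<digit>,
    // then trims the whitespace left in front of the surviving pair.
    static String keepN(String input) {
        return input.replaceAll("\\s*\\b(?!N=\\d)\\w+=\\d+", "").trim();
    }

    public static void main(String[] args) {
        String input = "P=32 N=5 M=2";
        System.out.println(keepN(input)); // prints N=5
    }
}
```

The trim() call is needed because the space before N=5 is not part of any match and therefore survives the replacement.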

Regex to allow a space instead of following numbers after first two letters

I need a regex that allows a single space after two letters, i.e. AB123 should not be allowed but AB 123 should be allowed.

Here is the regex [a-zA-Z]{2}\s\S*
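[a-zA-Z] means any letter from a to z or A to Z
{2} means the preceding character class is matched exactly twice
\s means a whitespace character
\S means a non-whitespace character
* means the preceding token is repeated 0 or more times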
https://regex101.com/r/uWYci4/1

This pattern will do the work: ^[a-zA-Z]{2} \d+$
Explanation:
^ - match beginning of a string
[a-zA-Z]{2} - match two letters (upper- or lowercase),
(a single space) - match a space
\d+ - match one or more digits
$ - match end of a string
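As a quick sketch, the anchored pattern can be tried in Java with String.matches. The class name LetterSpaceDigits and the method isValid are illustrative only.

```java
class LetterSpaceDigits {
    // Two letters, exactly one space, then one or more digits.
    // String.matches() already requires a full match; the anchors are kept
    // so the pattern reads the same as in the explanation above.
    static boolean isValid(String s) {
        return s.matches("^[a-zA-Z]{2} \\d+$");
    }

    public static void main(String[] args) {
        System.out.println(isValid("AB 123")); // true
        System.out.println(isValid("AB123"));  // false
    }
}
```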
Demo

Regex match numbers with spaces but not without spaces

Trying to match a string of numbers with spaces in between while ignoring other strings of numbers without spaces in between them. I'd like to match 16 characters.
eg. Would like to match 12345 67890 1234 but NOT 1234567890123456
I have tried this:
[0-9 ]{16}
But this matches both sets of strings.

I used and corrected @Wiktor Stribiżew's regex, because the original regex would also match a space at the beginning and the end of the number.
Regex: \b(?![0-9]{16})\d[0-9 ]{14}\d\b
Details:
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?!) Negative Lookahead
[] Match a single character present in the list 0-9
{n} Matches exactly n times
\d matches a digit (equal to [0-9])
RegEx demo
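For illustration, the pattern could be exercised in Java roughly as follows, using the two sample numbers from the question. The class name SpacedDigits and the method findSpaced are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpacedDigits {
    // 16 characters of digits and spaces that start and end with a digit,
    // rejected up front when all 16 characters are digits.
    static final Pattern P =
            Pattern.compile("\\b(?![0-9]{16})\\d[0-9 ]{14}\\d\\b");

    static List<String> findSpaced(String text) {
        List<String> hits = new ArrayList<String>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findSpaced("12345 67890 1234 but NOT 1234567890123456"));
        // prints [12345 67890 1234]
    }
}
```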

You can use this regex to enforce at least one space between numbers:
\d+(?:\h+\d+)+
RegEx Demo
\d+: Match 1+ digits
(?:\h+\d+)+: Match 1+ groups of 1+ horizontal whitespace chars and 1+ digits
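A short sketch of this pattern in Java, assuming Java 8 or later, where java.util.regex supports \h for horizontal whitespace. The class name HorizontalSpaceDigits and the method findGroups are hypothetical. Note that this pattern does not enforce the 16-character length mentioned in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HorizontalSpaceDigits {
    // Digit runs joined by at least one horizontal whitespace character.
    // A single unbroken digit run cannot match because the group must
    // occur at least once.
    static final Pattern P = Pattern.compile("\\d+(?:\\h+\\d+)+");

    static List<String> findGroups(String text) {
        List<String> hits = new ArrayList<String>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findGroups("12345 67890 1234 but NOT 1234567890123456"));
        // prints [12345 67890 1234]
    }
}
```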
