Regular Expression: match string containing only non-repeating words


I have this situation (Java code):
1) a string such as "A wild adventure" should match.
2) a string with adjacent repeated words, such as "A wild wild adventure", shouldn't match.
With this regular expression: .* \b(\w+)\b\s*\1\b.* I can match strings containing adjacent repeated words.
How do I reverse the situation, i.e. how do I match strings which do not contain adjacent repeated words?

Use a negative lookahead assertion, (?!pattern):
String[] tests = {
    "A wild adventure",      // true
    "A wild wild adventure"  // false
};
for (String test : tests) {
    System.out.println(test.matches("(?!.*\\b(\\w+)\\s\\1\\b).*"));
}
Explanation courtesy of Rick Measham's explain.pl:
REGEX: (?!.*\b(\w+)\s\1\b).*
NODE          EXPLANATION
--------------------------------------------------------------------------------
(?!           look ahead to see if there is not:
  .*            any character except \n (0 or more times (matching the most amount possible))
  \b            the boundary between a word char (\w) and something that is not a word char
  (             group and capture to \1:
    \w+           word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
  )             end of \1
  \s            whitespace (\n, \r, \t, \f, and " ")
  \1            what was matched by capture \1
  \b            the boundary between a word char (\w) and something that is not a word char
)             end of look-ahead
.*            any character except \n (0 or more times (matching the most amount possible))
See also
regular-expressions.info/Lookarounds
Related questions
using regular expression in Java
Uses negative lookahead to ensure a string doesn't have a character occurring more than once
Java split is eating my characters.
Many examples of using assertions
How do I convert CamelCase into human-readable names in Java?
Very instructive example of using lookarounds
Note
Negative assertions only make sense when there are also other patterns that you want to positively match (see examples above). Otherwise, you can just use the boolean complement operator ! to negate matches with whatever pattern you were using before:
String[] tests = {
    "A wild adventure",      // true
    "A wild wild adventure"  // false
};
for (String test : tests) {
    System.out.println(!test.matches(".*\\b(\\w+)\\s\\1\\b.*"));
}
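To illustrate the note above, here is a minimal sketch that combines the negative lookahead with a positive pattern. The extra rule (only letters and spaces are allowed) and the names NoAdjacentRepeats and isValid are assumptions made up for this example, not part of the question.

```java
import java.util.regex.Pattern;

class NoAdjacentRepeats {
    // The negative lookahead rejects any adjacent repeated word; the rest of the
    // pattern positively requires letters and spaces only (an assumed extra rule).
    private static final Pattern VALID =
            Pattern.compile("(?!.*\\b(\\w+)\\s\\1\\b)[A-Za-z ]+");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("A wild adventure"));      // true
        System.out.println(isValid("A wild wild adventure")); // false: repeated word
        System.out.println(isValid("A wild 2nd adventure"));  // false: digit not allowed
    }
}
```

Here a plain ! would not be enough, because the string also has to satisfy the positive part of the pattern.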

Related

How do I get a regex expression to contain only uppercase letters or numbers?

Regex expression: [A-Z]([^0-9]|[^A-Z])+[A-Z]
The requirements are that the string should start and end with a capital letter A-Z, and contain at least one number in between. It should not have anything else besides capital letters on the inside. However, it's accepting spaces and punctuation too.
My expression fails the following test case A65AJ3L 3F,D due to the comma and whitespace.
Why does this happen when I explicitly said only numbers and uppercase letters can be in the string?

Starting the character class with [^ makes it a negated character class.
Using ([^0-9]|[^A-Z])+ matches any char except a digit (but does match A-Z), or any char except A-Z (but does match a digit).
This way it can match any character.
If you would turn it into [A-Z]([0-9]|[A-Z])+[A-Z] it still does not make it mandatory to match at least a single digit on the inside due to the alternation | and it can still match AAA for example.
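The two observations above can be checked with a short sketch. The class and method names are hypothetical; the patterns are the ones quoted in the question and in this answer.

```java
class NegatedClassDemo {
    // Original pattern: ([^0-9]|[^A-Z]) accepts every character, because each
    // character is either a non-digit or a non-uppercase-letter (or both).
    static boolean original(String s) {
        return s.matches("[A-Z]([^0-9]|[^A-Z])+[A-Z]");
    }

    // Without the negation, spaces and commas are rejected, but a digit
    // is still not mandatory.
    static boolean withoutNegation(String s) {
        return s.matches("[A-Z]([0-9]|[A-Z])+[A-Z]");
    }

    public static void main(String[] args) {
        System.out.println(original("A65AJ3L 3F,D"));        // true (unwanted)
        System.out.println(withoutNegation("A65AJ3L 3F,D")); // false
        System.out.println(withoutNegation("AAA"));          // true (no digit required)
    }
}
```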
You might use:
^[A-Z]+[0-9][A-Z0-9]*[A-Z]$
The pattern matches:
^ Start of string
[A-Z]+ Match 1+ times A-Z
[0-9] Match a single digit
[A-Z0-9]* Optionally match either A-Z or 0-9
[A-Z] Match a single char A-Z
$ End of string
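As a quick check of the pattern broken down above, a sketch along these lines could be used in Java. The class name and the sample inputs other than the failing test case from the question are assumptions.

```java
import java.util.regex.Pattern;

class UppercaseWithDigit {
    private static final Pattern VALID =
            Pattern.compile("^[A-Z]+[0-9][A-Z0-9]*[A-Z]$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("A65AJ3L3FD"));   // true
        System.out.println(isValid("A65AJ3L 3F,D")); // false: space and comma
        System.out.println(isValid("AAA"));          // false: no digit
        System.out.println(isValid("A1"));           // false: must end with A-Z
    }
}
```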

Use
^(?=\D*\d\D*$)[A-Z][A-Z\d]*[A-Z]$
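(?=\D*\d\D*$) requires exactly one digit in the string, no more and no fewer. Note that this is stricter than the "at least one number" requirement stated in the question.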
EXPLANATION
--------------------------------------------------------------------------------
^             the beginning of the string
(?=           look ahead to see if there is:
  \D*           non-digits (all but 0-9) (0 or more times (matching the most amount possible))
  \d            digits (0-9)
  \D*           non-digits (all but 0-9) (0 or more times (matching the most amount possible))
  $             before an optional \n, and the end of the string
)             end of look-ahead
[A-Z]         any character of: 'A' to 'Z'
[A-Z\d]*      any character of: 'A' to 'Z', digits (0-9) (0 or more times (matching the most amount possible))
[A-Z]         any character of: 'A' to 'Z'
$             before an optional \n, and the end of the string
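The difference between this lookahead pattern and the pattern ^[A-Z]+[0-9][A-Z0-9]*[A-Z]$ suggested earlier can be seen in a small sketch. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class DigitCountDemo {
    // At least one digit somewhere on the inside.
    private static final Pattern AT_LEAST_ONE =
            Pattern.compile("^[A-Z]+[0-9][A-Z0-9]*[A-Z]$");
    // The lookahead \D*\d\D*$ tolerates exactly one digit in the whole string.
    private static final Pattern EXACTLY_ONE =
            Pattern.compile("^(?=\\D*\\d\\D*$)[A-Z][A-Z\\d]*[A-Z]$");

    static boolean atLeastOne(String s) {
        return AT_LEAST_ONE.matcher(s).matches();
    }

    static boolean exactlyOne(String s) {
        return EXACTLY_ONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(exactlyOne("AB1CD")); // true
        System.out.println(exactlyOne("A65A"));  // false: two digits
        System.out.println(atLeastOne("A65A"));  // true
        System.out.println(exactlyOne("AAA"));   // false: no digit
    }
}
```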

Java String match using the regex for optional String

I have the following Java strings that I need to match against a regex. They may contain optional values, and I need to do different things depending on whether those values are present:
String 1: With serial
uri = https://myid.com/123/1234567890128/456/1111
String 2: Without Serial
uri = https://myid.com/123/1234567890128
As we can see, the incoming string may or may not have the /456/1111 part. How can I write a single regex that checks whether it is present? I have written a regex, but it works only if /456/1111 is present:
uri.matches("(http|https)://.*/123/[0-9]{13}.*)")
I tried adding the optional values after looking at some of the answers here something like this:
uri.matches("(http|https)://.*/123/[0-9]{13}+([/456/[0-9]{1,20}]?)")
But for some reason, it does not work. Can someone please help me verify whether the /456/1111 part is present in the uri? I feel like I am missing some small thing.

Use
^https?:\/\/.*\/123\/[0-9]{1,13}(?:/456/[0-9]{1,20})?$
Explanation
--------------------------------------------------------------------------------
^             the beginning of the string
http          'http'
s?            's' (optional (matching the most amount possible))
:             ':'
\/            '/'
\/            '/'
.*            any character except \n (0 or more times (matching the most amount possible))
\/            '/'
123           '123'
\/            '/'
[0-9]{1,13}   any character of: '0' to '9' (between 1 and 13 times (matching the most amount possible))
(?:           group, but do not capture (optional (matching the most amount possible)):
  /456/         '/456/'
  [0-9]{1,20}   any character of: '0' to '9' (between 1 and 20 times (matching the most amount possible))
)?            end of grouping
$             before an optional \n, and the end of the string

regex101.com is your friend in this regard.
When looking at your regex on that site, you can see some errors like:
you have a lone ] at the end, which seems off
and finally, the ? at the end targets the wrong group; move it outside the parentheses.
Something like https?://[^/]+/123/\d{13}(?:/456/\d{1,20})? should work for you.
The good thing about regex101 is that on the right side you see a detailed explanation about your regex, and it highlights exactly which character does what.

The reason the pattern that you tried does not work, is because in the last part of your pattern you have ([/456/[0-9]{1,20}]?) which means:
( Capture group
[/456/[0-9]{1,20} Match 1-20 repetitions of either / or a digit 0-9 (as 0-9 also matches 456)
]? Match optional ]
) Close group
What you could do instead is make the last group optional as a whole, without a character class, and use https? to make the s optional:
^https?://.*/123/[0-9]{13}(?:/456/[0-9]{1,20})?$
Because matches() has to match the whole string, you can omit the anchors ^ and $:
String uri1 = "https://myid.com/123/1234567890128/456/1111";
String uri2 = "https://myid.com/123/1234567890128";
String uri3 = "https://myid.com/123/1234567890128/456/111122222222222222222";
String pattern = "https?://.*/123/[0-9]{13}(?:/456/[0-9]{1,20})?";
System.out.println(uri1.matches(pattern));
System.out.println(uri2.matches(pattern));
System.out.println(uri3.matches(pattern));
Output
true
true
false
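Since the question asks to act differently depending on whether the serial is present, one possible approach is to capture the optional part and test the group for null. This is only a sketch: the class name, the describe method and its return strings are made up for illustration, and the host part uses [^/]+ as in the regex101 answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UriSerialCheck {
    // Group 1: the 13-digit id. Group 2: the serial after /456/, if any.
    private static final Pattern URI = Pattern.compile(
            "https?://[^/]+/123/([0-9]{13})(?:/456/([0-9]{1,20}))?");

    static String describe(String uri) {
        Matcher m = URI.matcher(uri);
        if (!m.matches()) {
            return "invalid";
        }
        // group(2) is null when the optional group did not take part in the match.
        String serial = m.group(2);
        return serial == null
                ? "id=" + m.group(1) + ", no serial"
                : "id=" + m.group(1) + ", serial=" + serial;
    }

    public static void main(String[] args) {
        System.out.println(describe("https://myid.com/123/1234567890128/456/1111"));
        System.out.println(describe("https://myid.com/123/1234567890128"));
        System.out.println(describe("https://myid.com/123/12"));
    }
}
```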

Java/Groovy regex parse Key-Value pairs without delimiters

I have trouble fetching Key Value pairs with my regex
Code so far:
String raw = '''
MA1
 D. Mueller Gießer MA2 Peter Mustermann 2. Mann  MA3 Ulrike Mastorius Schmelzer MA4 Heiner Becker s 3.Mann
 MA5 Rudolf Peters Gießer '''
Map map = [:]
ArrayList<String> split = raw.findAll("(MA\\d)+(.*)") { full, name, value -> map[name] = value }
println map
Output is:
[MA1:, MA2: Peter, MA3: Ulrike Mastorius Schmelzer, MA4: Heiner Becker, MA5: Rudolf Peters]
In my case the keys are:
MA1, MA2, MA3, MA\d (so MA with any 1 digit Number)
The value is absolutely everything until the next key comes up (including line breaks, tab, spaces etc...)
Does anybody have a clue how to do this?
Thanks in advance,
Sebastian

You can capture in the second group everything that follows the key, plus all following lines that do not start with a key:
^(MA\d+)(.*(?:\R(?!MA\d).*)*)
The pattern matches
^ Start of string (or start of each line when multiline mode is enabled, which is needed to find every key)
(MA\d+) Capture group 1 matching MA and 1+ digits
( Capture group 2
.* Match the rest of the line
(?:\R(?!MA\d).*)* Match all lines that do not start with MA followed by a digit, where \R matches any unicode newline sequence
) Close group 2
In Java, with the backslashes doubled:
final String regex = "^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)";
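A minimal sketch of how that pattern might be used to fill a map in Java. It assumes that every key starts at the beginning of a line and that Pattern.MULTILINE is enabled so that ^ matches at each line start; the class name, the parse method and the sample input are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueParser {
    // MULTILINE lets ^ match at the start of every line, not only at the
    // start of the input, so find() can locate each key in turn.
    private static final Pattern ENTRY = Pattern.compile(
            "^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)", Pattern.MULTILINE);

    static Map<String, String> parse(String raw) {
        Map<String, String> map = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(raw);
        while (m.find()) {
            // Group 2 may span several lines; trim() drops surrounding whitespace.
            map.put(m.group(1), m.group(2).trim());
        }
        return map;
    }

    public static void main(String[] args) {
        String raw = "MA1\n D. Mueller\nMA2 Peter Mustermann 2. Mann\nMA3 Ulrike Mastorius\n Schmelzer\n";
        System.out.println(parse(raw).keySet()); // [MA1, MA2, MA3]
    }
}
```

A LinkedHashMap is used here only to keep the keys in the order in which they appear.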

Use
(?ms)^(MA\d+)(.*?)(?=\nMA\d|\z)
Explanation
--------------------------------------------------------------------------------
(?ms)         set flags for this block (with ^ and $ matching start and end of line) (with . matching \n) (case-sensitive) (matching whitespace and # normally)
^             the beginning of a "line"
(             group and capture to \1:
  MA            'MA'
  \d+           digits (0-9) (1 or more times (matching the most amount possible))
)             end of \1
(             group and capture to \2:
  .*?           any character (0 or more times (matching the least amount possible))
)             end of \2
(?=           look ahead to see if there is:
  \n            '\n' (newline)
  MA            'MA'
  \d            digits (0-9)
 |             OR
  \z            the end of the string
)             end of look-ahead

Minimum & Maximum character amounts Regex

I'm still new to regex, and I'm trying to create a regex for verifying ids for an app I am creating.
The id constraints are as follows -
Can only begin with either A-Z, a-z, ,, ', -.
Can contain all of the above and also ., just not at the beginning.
Must have at least two A-Z | a-z characters
And characters can only appear once. (,, shouldn't match, only ,)
EDIT: I was unclear about the fourth point, it should only disallow consecutive symbols, but not consecutive letters.
So far all I have is
^(A-Za-z',-)(A-Za-z',-\\.)+$ // I'm using java hence the reason for the `\\.`
I don't know how to match a specific amount of things within my regex. I would imagine it is something simple, but any help would be very useful.
I'm very new to regex and I'm really lost as to how to do this.
Edit: final regex is as follows
^(?=.*[A-Za-z].*[A-Za-z].*)(?!.*(,|'|\-|\.)\1.*)[A-Za-z,'\-][A-Za-z,'\-\.]*
Thanks to Ro Yo Mi and RebelWitoutAPulse!

Description
^(?!\.)(?=(?:.*?[A-Za-z]){2})(?:([a-zA-Z,'.-])(?!.*?\1))+$
This regular expression will do the following:
(?!\.)
validates the string does not start with a .
(?=(?:.*?[A-Za-z]){2})
validates the string has at least two A-Z | a-z characters
(?:([a-zA-Z,'.-])(?!.*?\1))+
allows the string to only contain a-z, A-Z, ,, ', ., -
Allows characters to only appear once. (,, shouldn't match, only ,)
Example
Live Demo
https://regex101.com/r/hO2mU1/1
Sample text
-abced
aabdefsa
abcdefs
.abded
ac.dC
ab
a.b
Sample Matches
-abced
abcdefs
ac.dC
ab
a.b
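The sample text above can be run through the expression in Java with a sketch like the following. The class and method names are hypothetical; the pattern is the one from the Description, with backslashes doubled for a Java string literal.

```java
import java.util.regex.Pattern;

class UniqueCharId {
    private static final Pattern ID = Pattern.compile(
            "^(?!\\.)(?=(?:.*?[A-Za-z]){2})(?:([a-zA-Z,'.-])(?!.*?\\1))+$");

    static boolean isValid(String s) {
        return ID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"-abced", "aabdefsa", "abcdefs", ".abded", "ac.dC", "ab", "a.b"};
        for (String s : samples) {
            // Each character is captured, then the lookahead checks that the
            // same character does not occur again later in the string.
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```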
Explanation
NODE              EXPLANATION
----------------------------------------------------------------------
^                 the beginning of the string
(?!               look ahead to see if there is not:
  \.                '.'
)                 end of look-ahead
(?=               look ahead to see if there is:
  (?:               group, but do not capture (2 times):
    .*?               any character (0 or more times (matching the least amount possible))
    [A-Za-z]          any character of: 'A' to 'Z', 'a' to 'z'
  ){2}              end of grouping
)                 end of look-ahead
(?:               group, but do not capture (1 or more times (matching the most amount possible)):
  (                 group and capture to \1:
    [a-zA-Z,'.-]      any character of: 'a' to 'z', 'A' to 'Z', ',', ''', '.', '-'
  )                 end of \1
  (?!               look ahead to see if there is not:
    .*?               any character (0 or more times (matching the least amount possible))
    \1                what was matched by capture \1
  )                 end of look-ahead
)+                end of grouping
$                 before an optional \n, and the end of the string

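You could use positive and negative lookaheads. Roughly, a lookahead is a zero-width assertion: when the regex engine reaches it, it tests whether the sub-pattern inside matches (or, for a negative lookahead, fails to match) at the current position without consuming any characters, and only then continues with the rest of the pattern.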
The regex might be:
^(?=.*[A-Za-z].*[A-Za-z].*)(?!.*(.)\1.*)[A-Za-z,'\-][A-Za-z,'\-\.]*
Explanation:
^ - beginning of the string
(?=.*[A-Za-z].*[A-Za-z].*) - continue matching only if the string has any amount of any characters, followed by a letter from A-Z or a-z, then again any amount of any characters, then another letter, then anything. This effectively covers point 3.
(?!.*(.)\1.*) - stop matching if there are duplicate consecutive characters in the string. It skips any amount of characters, remembers one character using a capture group, and checks whether the very next character is the same one. This covers point 4.
Note: if point 4 meant that every character in the string should be unique, then you may add .* between (.) and \1.
Note: if this matches - the regex processing "caret" is back at the beginning of the string.
[A-Za-z,'\-] - the "real" matching begins. Character class matches your requirement from point 1.
[A-Za-z,'\-\.]* - any amount of characters mentioned in point 1 and point 4
Not sure about java regex specifics - quick google search found that this might be possible. Synthetic test works:
Astring # match
,string # match
.string # does not match
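a.- # does not match: there are no two characters from [A-Za-z]
doesnotmatch # does not match: double non-consecutive occurrence of 't' (this applies to the variant with .* between (.) and \1; the regex as written above only rejects consecutive duplicates)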
P.S. The regex may be optimised quite a lot if one were to use the defined character classes instead of a . - but this would add quite a lot of visual clutter to the answer.
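On the Java side, a sketch like the one below could be used to try both readings of point 4. The class and method names are hypothetical; ADJACENT is the regex from this answer and UNIQUE is the variant from the first note, with .* added between (.) and \1. Note that doesnotmatch is only rejected by the UNIQUE variant, since none of its repeated letters are adjacent.

```java
import java.util.regex.Pattern;

class IdRegexVariants {
    // Rejects only consecutive duplicates, e.g. "aa" or ",,".
    private static final Pattern ADJACENT = Pattern.compile(
            "^(?=.*[A-Za-z].*[A-Za-z].*)(?!.*(.)\\1.*)[A-Za-z,'\\-][A-Za-z,'\\-\\.]*");
    // Rejects any character that occurs twice anywhere in the string.
    private static final Pattern UNIQUE = Pattern.compile(
            "^(?=.*[A-Za-z].*[A-Za-z].*)(?!.*(.).*\\1.*)[A-Za-z,'\\-][A-Za-z,'\\-\\.]*");

    static boolean adjacent(String s) {
        return ADJACENT.matcher(s).matches();
    }

    static boolean unique(String s) {
        return UNIQUE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(adjacent("Astring"));      // true
        System.out.println(adjacent(".string"));      // false: starts with '.'
        System.out.println(adjacent("a.-"));          // false: only one letter
        System.out.println(adjacent("doesnotmatch")); // true: no adjacent duplicates
        System.out.println(unique("doesnotmatch"));   // false: 'o' and 't' repeat
    }
}
```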

Regex expression in plain english

I'm working on a new Java project and therefore I'm reading the already existing code. In a very important part of the code I found the following regex expressions and I can't really tell what they are doing. Can anybody explain in plain English what they do?
1)
[^,]*|.+(,).+
2)
(\()?\d+(?(1)\))

Next time you need a regex explained, you can use the following explain.pl service from Rick Measham:
Regex: [^,]*|.+(,).+
NODE          EXPLANATION
--------------------------------------------------------------------------------
[^,]*         any character except: ',' (0 or more times (matching the most amount possible))
|             OR
.+            any character except \n (1 or more times (matching the most amount possible))
(             group and capture to \1:
  ,             ','
)             end of \1
.+            any character except \n (1 or more times (matching the most amount possible))
Regex: (\()?\d+(?(1)\))
NODE          EXPLANATION
--------------------------------------------------------------------------------
(             group and capture to \1 (optional (matching the most amount possible)):
  \(            '('
)?            end of \1 (NOTE: because you're using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
\d+           digits (0-9) (1 or more times (matching the most amount possible))
(?(1)         if back-reference \1 matched, then:
  \)            ')'
 |             else:
                succeed
)             end of conditional on \1
Links
http://rick.measham.id.au/paste/explain.pl
Note on conditionals
JAVA DOES NOT SUPPORT CONDITIONALS! An unconditionalized regex for the second pattern would be something like:
\d+|\(\d+\)
i.e. a non-zero repetition of digits, with or without surrounding parentheses.
Links
regular-expressions.info/If-then-else conditionals
Conditionals are supported by the JGsoft engine, Perl, PCRE and the .NET framework.
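The lack of conditional support can be observed directly: compiling the second pattern with java.util.regex fails with a PatternSyntaxException, while the rewritten alternation compiles. A small sketch, with a hypothetical class and helper name:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class ConditionalSupportCheck {
    // Returns true when java.util.regex accepts the given pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("(\\()?\\d+(?(1)\\))")); // false: conditional syntax
        System.out.println(compiles("\\d+|\\(\\d+\\)"));     // true
    }
}
```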
The patterns in depth
Here's a test harness for the first pattern
import java.util.regex.*;
//...
Pattern p = Pattern.compile("[^,]*|.+(,).+");
String[] tests = {
    "",      // [] is a match with no commas
    "abc",   // [abc] is a match with no commas
    ",abc",  // [,abc] is not a match
    "abc,",  // [abc,] is not a match
    "ab,c",  // [ab,c] is a match with separating comma
    "ab,c,", // [ab,c,] is a match with separating comma
    ",",     // [,] is not a match
    ",,",    // [,,] is not a match
    ",,,",   // [,,,] is a match with separating comma
};
for (String test : tests) {
    Matcher m = p.matcher(test);
    System.out.format("[%s] is %s %n", test,
        !m.matches() ? "not a match"
        : m.group(1) != null
            ? "a match with separating comma"
            : "a match with no commas"
    );
}
Conclusion
To match, the string must fall into one of these two cases:
Contains no comma (potentially an empty string)
Contains a comma that separates two non-empty strings
On a match, \1 can be used to distinguish between the two cases
And here's a similar test harness for the second pattern, rewritten without using conditionals (which aren't supported by Java):
Pattern p = Pattern.compile("\\d+|(\\()\\d+\\)");
String[] tests = {
    "",      // [] is not a match
    "0",     // [0] is a match without parenthesis
    "(0)",   // [(0)] is a match with surrounding parenthesis
    "007",   // [007] is a match without parenthesis
    "(007)", // [(007)] is a match with surrounding parenthesis
    "(007",  // [(007] is not a match
    "007)",  // [007)] is not a match
    "-1",    // [-1] is not a match
};
for (String test : tests) {
    Matcher m = p.matcher(test);
    System.out.format("[%s] is %s %n", test,
        !m.matches() ? "not a match"
        : m.group(1) != null
            ? "a match with surrounding parenthesis"
            : "a match without parenthesis"
    );
}
As previously said, this matches a non-zero number of digits, possibly surrounded by parentheses (and \1 distinguishes between the two).

1)
[^,]* means any number of characters that are not a comma
.+(,).+ means 1 or more characters followed by a comma followed by 1 or more characters
| means either the first one or the second one
2)
(\()? means zero or one '(' note* backslash is to escape '('
\d+ means 1 or more digits
(?(1)\)) means if back-reference \1 matched, then ')' note* no else is given
Also note that parentheses are used to capture certain parts of the regular expression, except, of course, if they are escaped with a backslash.

1) Any string that contains no comma at all, or any string that contains a comma with at least one character on each side of it.
2) One or more digits, optionally preceded by an opening parenthesis; if the opening parenthesis is present, a closing parenthesis must follow the digits.
