Java regex to capture any number of periods within a string

I am trying to match on any of the following:
$tag:parent.child$
$tag:grand.parent.child$
$tag:great.grand.parent.child$
I have tried a bunch of combos but not sure how to do this without an exp for each one: https://regex101.com/r/cMvx9I/1
\$tag:[a-z]*\.[a-z]*\$
I know this is wrong, but haven't been able to find the right method yet. Help is greatly appreciated.

Your regex was: \$tag:[a-z]*\.[a-z]*\$
You need a repeating group of .name, so use: \$tag:[a-z]+(?:\.[a-z]+)+\$
That assumes there has to be at least 2 names. If only one name is allowed, i.e. no period, then change last + to *.

You can use \$tag:(?:[a-z]+\.)*[a-z]+\$
\$ a literal $
tag: literal tag:
(?:...) a non-capturing group of:
[a-z]+ one or more lower-case letters and
\. a literal dot
* any number of the previous group (including zero of them)
[a-z]+ one or more lower-case letters
\$ a literal $
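As a rough sketch of how this pattern could be used from Java (the class and method names here are made up for illustration, and every backslash has to be doubled inside a Java string literal):

```java
import java.util.regex.Pattern;

final class TagMatcher {
    // Same regex as above, written as a Java string literal.
    private static final Pattern TAG =
            Pattern.compile("\\$tag:(?:[a-z]+\\.)*[a-z]+\\$");

    /** Returns true when the whole input is a single $tag:...$ token. */
    static boolean isTag(String input) {
        return TAG.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isTag("$tag:parent.child$"));
        System.out.println(isTag("$tag:great.grand.parent.child$"));
        System.out.println(isTag("$tag:parent..child$"));
    }
}
```

Because the group is repeated with *, a tag with a single name and no period, such as $tag:child$, is accepted too, while an empty name between two periods is rejected.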

The following pattern will match any periods within a string:
\.

Not sure if this is what you want, but you can make a non-capturing group out of a pattern and then find that a certain number of times:
\$tag:(?:[a-z]+?\.*){1,4}\$
\$tag: - Literal $tag:
(?:[a-z]+?\.*) - Non-capturing group of one or more lower-case letters (shortest match) followed by zero or more literal periods
{1,4} - The non-capturing group appears anywhere between 1-4 times (you can change this as needed, or use a simple + if it could be any number of times).
\$ - Literal $
I normally prefer \w instead of [a-z] as it is equivalent to [a-zA-Z0-9_], but using this depends on what you are trying to find.
Hope this helps.

Related

Regex pattern matching with multiple strings

Forgive me. I am not familiarized much with Regex patterns.
I have created a regex pattern as below.
String regex = Pattern.quote(value) + ", [NnoneOoff0-9\\-\\+\\/]+|[NnoneOoff0-9\\-\\+\\/]+, "
+ Pattern.quote(value);
This regex pattern is failing with 2 different set of strings.
value = "207e/160";
Use Case 1 -
When channelStr = "207e/160, 149/80"
Then channelStr.matches(regex), returns "true".
Use Case 2 -
When channelStr = "207e/160, 149/80, 11"
Then channelStr.matches(regex), returns "false".
I am not able to figure out why. As far as I can understand, it may be because of the multiple spaces involved when more than 2 strings are present, separated by commas.
I am not sure what the correct pattern should be for more than 2 strings.
Any help will be appreciated.
If you print your pattern, it is:
\Q207e/160\E, [NnoneOoff0-9\-\+\/]+|[NnoneOoff0-9\-\+\/]+, \Q207e/160\E
It consists of an alternation | in which each side requires exactly one comma: the quoted value followed by one more item on the left, or one item followed by the quoted value on the right.
matches() has to match the whole string, and that is the case for 207e/160, 149/80, so that is a match.
In the string 207e/160, 149/80, 11 there are 2 commas, so you do get a partial match for the first part of the string, but you don't match the whole string, so matches() returns false.
See the matches in this regex demo.
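The difference between a whole-string match and a partial match can be seen with a small sketch like the one below. It rebuilds the pattern from the question; the class and method names are only illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchesVsFind {
    // Rebuilds the regex from the question for a given value.
    static String buildRegex(String value) {
        return Pattern.quote(value) + ", [NnoneOoff0-9\\-\\+\\/]+|[NnoneOoff0-9\\-\\+\\/]+, "
                + Pattern.quote(value);
    }

    // String.matches() succeeds only if the regex covers the entire input.
    static boolean wholeMatch(String regex, String input) {
        return input.matches(regex);
    }

    // Matcher.find() is satisfied with any matching substring.
    static String firstPartialMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String regex = buildRegex("207e/160");
        System.out.println(wholeMatch(regex, "207e/160, 149/80"));
        System.out.println(wholeMatch(regex, "207e/160, 149/80, 11"));
        System.out.println(firstPartialMatch(regex, "207e/160, 149/80, 11"));
    }
}
```

With the three-item string, find() still locates the first two items, which is the partial match described above, while matches() rejects the string as a whole.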
To match all the values, you can use a repeating pattern:
^[NnoeOf0-9+/-]+(?:,\h*[NnoeOf0-9+/-]+)*$
^ Start of string
[NnoeOf0-9+/-]+ Match 1+ of the characters listed in the character class
(?: Non capture group
,\h* Match a comma and optional horizontal whitespace chars
[NnoeOf0-9+/-]+ Match 1+ of the characters listed in the character class
)* Close the non capture group and optionally repeat it (if there should be at least 1 comma, then the quantifier can be + instead of *)
$ End of string
Regex demo
Example using matches():
String channelStr1 = "207e/160, 149/80";
String channelStr2 = "207e/160, 149/80, 11";
String regex = "^[NnoeOf0-9+/-]+(?:,\\h*[NnoeOf0-9+/-]+)*$";
System.out.println(channelStr1.matches(regex));
System.out.println(channelStr2.matches(regex));
Output
true
true
Note that in the character class you can put - at the end so that it does not have to be escaped, and the + and / do not have to be escaped either.
You can use regex101 to test your regex. It has a description of everything that's going on to help with debugging, and a quick reference section at the bottom right that shows what is possible, with examples.
A few things, you can add literals with \, so \" for a literal double quote.
If you want the pattern to be one or more of something, you would use +. These are called quantifiers and can be applied to groups, tokens, etc. The token for a whitespace character is \s. So, one or more whitespace characters would be \s+.
It's difficult to tell exactly what you're trying to do, but hopefully pointing you to regex101 will help. If you want to provide examples of the current RegEx you have, what you want to match and then the strings you're using to test it I'll be happy to provide you with an example.
^(?:[NnoneOoff0-9\\-\\+\\/]+ *(?:, *(?!$)|$))+$
^ Start
(?: ... ) Non-capturing group that defines an item and its separator. After each item, except the last, the separator (,) must appear. Spaces (one, several, or none) can appear before and after the comma, which is specified with *. This group can appear one or more times to the end of the string, as specified by the + quantifier after the group's closing parenthesis.
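A minimal Java sketch of this pattern in use follows. The class name and the extra test strings are assumptions added for illustration, not part of the original answer.

```java
import java.util.regex.Pattern;

final class ChannelListValidator {
    // The pattern above as a Java string literal.
    private static final Pattern LIST = Pattern.compile(
            "^(?:[NnoneOoff0-9\\-\\+\\/]+ *(?:, *(?!$)|$))+$");

    /** True when the input is one or more items separated by single commas. */
    static boolean isValidList(String channelStr) {
        return LIST.matcher(channelStr).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidList("207e/160, 149/80"));
        System.out.println(isValidList("207e/160, 149/80, 11"));
        System.out.println(isValidList("207e/160,"));
    }
}
```

The (?!$) lookahead after the comma is what rejects a trailing separator: a comma may only be consumed when something still follows it.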
Regex101 Test

Having problems with java regex

I have the following regex:
/[-A-Z]{4}\d{2}/[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}.png
Basically I want to check for strings of the basic type
ABCD12/<here_is_a_random_uuid_as_a_string>.png
The UUID (which is in UPPER CASE) checking works fine, but now let's take a look at a special case. I want to accept strings like this
--CD12/...
AB--12/...
but NOT like this:
A--D12/...
But I cannot get the first part of the regex right. Basically I need to check for either two letters or two - in a row, and that twice.
For my understanding [-A-Z]{4} means "either - or something between A - Z with a length of 4". So why doesn't my pattern work?
EDIT:
This answer was posted within the comments and it works:
(?mi)^(?:--[A-Z]{2}|[A-Z]{2}(?:--|[A-Z]{2}))\d{2}/[0-9A-F]{8}(?:-[0-9A-F]{4}){3}-[0-9A-F]{12}\.png$
Can somebody explain to me what (?mi) and what (?:...) means? The normal ? means 0 or 1 time, but what is the : for?
EDIT 2:
Just for those who might have a similar problem and do not want to read all of those regexes ;)
I slightly modified an answer to also accept patterns like ----12. The end result:
"^/(?:--[A-Z]{2}|-{4}|[A-Z]{2}(?:--|[A-Z]{2}))\\d{2}/[0-9A-F]{8}(?:-[0-9A-F]{4}){3}-[0-9A-F]{12}\\.png$"
It works like a charm.
You may use this regex for your cases:
^(?:--[A-Z]{2}|[A-Z]{2}(?:--|[A-Z]{2}))\d{2}/[0-9A-F]{8}(?:-[0-9A-F]{4}){3}-[0-9A-F]{12}\.png$
RegEx Demo
Details about first part:
^: Start
(?:: Start non-capture group
--[A-Z]{2}: Match -- followed by 2 letters
|: OR
[A-Z]{2}: Match 2 letters
(?:--|[A-Z]{2}): Match -- OR 2 letters
): End non-capture group
btw (?:...) is a non-capturing group, and (?mi) switches on the MULTILINE (m) and CASE_INSENSITIVE (i) flags for the pattern.
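To see what the two constructs from the question do in Java, here is a small sketch; the helper names are hypothetical. A (?:...) group only groups and is not counted by groupCount(), while inline flags such as (?i) and (?m) behave like Pattern.CASE_INSENSITIVE and Pattern.MULTILINE.

```java
import java.util.regex.Pattern;

final class GroupAndFlagDemo {
    // Number of capturing groups in a regex; (?:...) groups are not counted.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Whole-string match.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Substring search.
    static boolean findsIn(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(?:--|[A-Z]{2})\\d{2}")); // non-capturing
        System.out.println(groupCount("(--|[A-Z]{2})\\d{2}"));   // capturing
        System.out.println(matchesWhole("(?i)[A-Z]{2}", "ab"));  // case-insensitive
        System.out.println(findsIn("(?m)^b$", "a\nb"));          // ^ and $ per line
    }
}
```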
Your [-A-Z]{4} matches any four occurrences of an uppercase ASCII letter or -, so it can also match ----, A---, ---B, -B--, etc.
You want to make sure that if there are hyphens, they come after or before two letters:
(?:[A-Z]{2}--|--[A-Z]{2}|[A-Z]{4})
It means:
(?: - start of a non-capturing group:
[A-Z]{2}-- - two uppercase ASCII letters and then --
| - or
--[A-Z]{2} - -- and then any two uppercase ASCII letters
| - or
[A-Z]{4} - any four uppercase ASCII letters
) - end of the non-capturing group.
The full pattern:
(?:[A-Z]{2}--|--[A-Z]{2}|[A-Z]{4})\d{2}/[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}\.png
To force the entire string match, add ^ (start of string) and $ (end of string) anchors:
^(?:[A-Z]{2}--|--[A-Z]{2}|[A-Z]{4})\d{2}/[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}\.png$
See the regex demo
Note the . matches any char, to match a literal dot, you should escape it.
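A possible way to wire the anchored pattern into Java is sketched below. The class name and the sample UUID are invented for the example.

```java
import java.util.regex.Pattern;

final class ImageNameValidator {
    // Anchored pattern from above; note the escaped dot before "png".
    private static final Pattern NAME = Pattern.compile(
            "^(?:[A-Z]{2}--|--[A-Z]{2}|[A-Z]{4})\\d{2}/"
            + "[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}\\.png$");

    static boolean isValid(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String uuid = "123E4567-E89B-12D3-A456-426614174000"; // sample value
        System.out.println(isValid("ABCD12/" + uuid + ".png"));
        System.out.println(isValid("--CD12/" + uuid + ".png"));
        System.out.println(isValid("AB--12/" + uuid + ".png"));
        System.out.println(isValid("A--D12/" + uuid + ".png"));
    }
}
```

Since the dot is escaped, a name ending in something like Xpng is rejected, which the unescaped version in the question would have accepted.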

Restrict consecutive characters using Java Regex

I need to allow alphanumeric characters, "?", ".", "/" and "-" in the given string. But I need to disallow consecutive - characters only.
For example:
www.google.com/flights-usa should be valid
www.google.com/flights--usa should be invalid
Currently I'm using ^[a-zA-Z0-9\\/\\.\\?\\_\\-]+$.
Please suggest how to disallow consecutive - only.
You may use grouping with quantifiers:
^[a-zA-Z0-9/.?_]+(?:-[a-zA-Z0-9/.?_]+)*$
See the regex demo
Details:
^ - start of string
[a-zA-Z0-9/.?_]+ - 1 or more characters from the set defined in the character class (can be replaced with [\w/.?]+)
(?:-[a-zA-Z0-9/.?_]+)* - zero or more sequences ((?:...)*) of:
- - hyphen
[a-zA-Z0-9/.?_]+ - see above
$ - end of string.
Or use a negative lookahead:
^(?!.*--)[a-zA-Z0-9/.?_-]+$
^^^^^^^^^
See the demo here
Details:
^ - start of string
(?!.*--) - a negative lookahead that will fail the match once the regex engine finds a -- substring after any 0+ chars other than a newline
[a-zA-Z0-9/.?_-]+ - 1 or more chars from the set defined in the character class
$ - end of string.
Note that [a-zA-Z0-9_] = \w if you do not use the Pattern.UNICODE_CHARACTER_CLASS flag. So, the first would look like "^[\\w/.?]+(?:-[\\w/.?]+)*$" and the second as "^(?!.*--)[\\w/.?-]+$".
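Both variants could be tried side by side with a sketch like this (class and method names are only for illustration):

```java
import java.util.regex.Pattern;

final class HyphenRules {
    // Variant 1: hyphens may only appear between runs of allowed characters.
    private static final Pattern GROUPED =
            Pattern.compile("^[\\w/.?]+(?:-[\\w/.?]+)*$");

    // Variant 2: any allowed characters, as long as "--" never occurs.
    private static final Pattern LOOKAHEAD =
            Pattern.compile("^(?!.*--)[\\w/.?-]+$");

    static boolean groupedValid(String s) {
        return GROUPED.matcher(s).matches();
    }

    static boolean lookaheadValid(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(groupedValid("www.google.com/flights-usa"));
        System.out.println(groupedValid("www.google.com/flights--usa"));
        System.out.println(lookaheadValid("www.google.com/flights-usa"));
        System.out.println(lookaheadValid("www.google.com/flights--usa"));
    }
}
```

The two are not fully equivalent: the grouped variant also rejects a hyphen at the very start or end of the string, whereas the lookahead variant only rejects doubled hyphens.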
One approach is to restrict multiple dashes with negative look-behind on a dash, like this:
^(?:[a-zA-Z0-9\/\.\?\_]|(?<!-)-)+$
The right side of the |, i.e. (?<!-)-, means "a dash, unless preceded by another dash".
Demo.
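In Java the look-behind version might look like the sketch below (names are hypothetical; the escapes inside the character class are dropped because they are not required there):

```java
import java.util.regex.Pattern;

final class NoDoubleDash {
    // Each character is either an allowed non-dash, or a dash not preceded by a dash.
    private static final Pattern P =
            Pattern.compile("^(?:[a-zA-Z0-9/.?_]|(?<!-)-)+$");

    static boolean isValid(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("www.google.com/flights-usa"));
        System.out.println(isValid("www.google.com/flights--usa"));
    }
}
```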
I'm not sure of the efficiency of this, but I believe this should work.
^([a-zA-Z0-9\/\.\?\_]|\-([^\-]|$))+$
For each character, this regex checks if it can match [a-zA-Z0-9\/\.\?\_], which is everything you included in your regex except the hyphen. If that does not match, it instead tries to match \-([^\-]|$), which matches a hyphen not followed by another hyphen, or a hyphen at the end of the string.
Here's a demo.

Regex to match and limit character classes

I'm not sure if this is possible using Regex but I'd like to be able to limit the number of underscores allowed based on a different character. This is to limit crazy wildcard queries to a search engine written in Java.
The starting characters would be alphanumeric. But I basically want a match if there are more underscores than preceding characters. So
BA_ would be fine but BA___ would match the regex and would get kicked out of the query parser.
Is that possible using Regex?
Yes, you can do it. This pattern will succeed only if there are fewer underscores than letters (you can adapt it to the characters you want):
^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?$
(As Pshemo notes, the anchors are not needed if you use the matches() method. I wrote them to show that this pattern must be bounded by some means, for example with lookarounds.)
negated version:
^(?:[A-Z](?=[A-Z]*(\\1?+_)))*\\1?_*$
The idea is to repeat a capture group that contains a backreference to itself plus an underscore. At each repetition, the capture group grows. ^(?:[A-Z](?=[A-Z]*+(\\1?+_)))*+ will match all letters that have a corresponding underscore. You only need to add [A-Z]+ to be sure that there are more letters, and to finish your pattern with \\1? that contains all the underscores (I make it optional, in case there is no underscore at all).
Note that if you replace [A-Z]+ with [A-Z]{n} in the first pattern, you can set exactly the number of characters difference between letters and underscores.
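A small Java sketch of the first pattern applied to the strings from the question is given below. The class and method names are assumptions, and the wrapper simply returns whether the pattern matches.

```java
import java.util.regex.Pattern;

final class UnderscoreLimit {
    // First pattern from above: succeeds only with fewer underscores than letters.
    private static final Pattern FEWER_UNDERSCORES =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?$");

    static boolean hasFewerUnderscoresThanLetters(String s) {
        return FEWER_UNDERSCORES.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasFewerUnderscoresThanLetters("BA_"));
        System.out.println(hasFewerUnderscoresThanLetters("BA___"));
    }
}
```

A query parser could keep inputs for which this returns true and reject the rest; using the negated pattern instead would flag the inputs to reject directly.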
To give a better idea, I will try to describe step by step how it works with the string ABC-- (since it's impossible to put underscores in bold, I use hyphens instead):
In the non-capturing group, the first letter is found
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
let's enter the lookahead (keep in mind that all in the lookahead is only
a check and not a part of the match result.)
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
the first capturing group is encountered for the first time and its content is not
defined. This is the reason why an optional quantifier is used, to avoid making
the lookahead fail. Consequence: \1?+ doesn't match anything new.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
the first hyphen is matched. Once the capture group closed, the first capture
group is now defined and contains one hyphen.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
The lookahead succeeds, let's repeat the non-capturing group.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
The second letter is found
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
We enter the lookahead
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
but now, things are different. The capture group was defined before and
contains a hyphen, this is why \1?+ will match the first hyphen.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
the literal hyphen matches the second hyphen in the string. And now the
capture group 1 contains the two hyphens. The lookahead succeeds.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
We repeat one more time the non capturing group.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
In the lookahead. There are no more letters, which is not a problem, since
the * quantifier is used.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
\1?+ now matches two hyphens.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
but there is no more hyphen in the string for the literal hyphen and the regex
engine cannot backtrack since \1?+ has a possessive quantifier.
The lookahead fails. Thus the third repetition of the non-capturing group fails too!
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
ensure that there is at least one more letter.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
We match the end of the string with the backreference to capture group 1 that
contains the two hyphens. Note that the fact that this backreference is optional
allows the string to not have hyphens at all.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
This is the end of the string. The pattern succeeds.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
Note: The use of the possessive quantifier for the non-capturing group is needed to avoid false results. (Where you can observe a strange behavior, that can be useful.)
Example:ABC--- and the pattern: ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$ (without the possessive quantifier)
The non-capturing group is repeated three times and `ABC` are matched:
ABC--- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
Note that at this step the first capturing group contains ---
But after the non capturing group, there is no more letter to match for [A-Z]+
and the regex engine must backtrack.
ABC--- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
Question: How many hyphens are in the capture group now?
Answer: Always three!
If the repeated non-capturing group gives a letter back, the capture group still contains three hyphens (its value from the last time the regex engine set it). This is counter-intuitive, but logical.
Then the letter C is found:
ABC--- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
And the three hyphens
ABC--- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
The pattern succeeds
ABC--- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
Robby Pond asked me in the comments how to find strings that have more underscores than letters (where a letter is anything that is not an underscore). The best way is obviously to count the number of underscores and to compare it with the string length. As for a pure regex solution, it is not possible to build a pattern for that with Java, since the pattern needs to use the recursion feature. For example you can do it with PHP:
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<neutral> (?: _ \g<neutral>?+ [A-Z] | [A-Z] \g<neutral>?+ _ )+ )
)
\A (?: \g<neutral> | _ )+ \z
~x
EOD;
var_dump(preg_match($pattern, '____ABC_DEF___'));
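For Java, the non-regex route mentioned above (count the underscores and compare with the string length) could be sketched like this; the class and method names are invented for the example:

```java
final class UnderscoreCount {
    /** True when the string contains more underscores than other characters. */
    static boolean hasMoreUnderscoresThanOthers(String s) {
        long underscores = s.chars().filter(c -> c == '_').count();
        long others = s.length() - underscores;
        return underscores > others;
    }

    public static void main(String[] args) {
        System.out.println(hasMoreUnderscoresThanOthers("____ABC_DEF___"));
        System.out.println(hasMoreUnderscoresThanOthers("BA_"));
    }
}
```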
It's not possible in a single regular expression.
i) Logic needs to be implemented to get the number of characters before the underscores (a regular expression should be written to capture the word before the underscores).
ii) Then validate the result: (number of characters - 1) = number of underscores that follow (a regular expression that returns the run of underscores following the characters).
Edit: Dang! I just noticed that you need this for java. Anyways...I leave it here if someone from the .Net world stumbles upon this post.
You can use Balancing Groups if you are using .Net:
^(?:(?<letter>[^_])|(?<-letter>_))*(?(letter)(?=)|(?!))$
The .NET regex engine has the ability to maintain all captured patterns in the captured groups. In other flavors the captured group always contains the last matched pattern, but in .NET all previous matches are kept in a capture collection for your use. The .NET engine can also push and pop the stack of a captured group using the ?<group-name> and ?<-group-name> constructs. These two handy constructs can be used to match pairs of parentheses, etc.
In the above regex, the engine starts from the start of the string and tries to match anything other than "_". This of course can be changed to whatever works for you (e.g. [A-Z][a-z]). The alternation basically means either match [^\_] or [\_], and in doing so either push to or pop from the captured group.
The latter part of the regex is a conditional (?(group-name)true|false). It basically says: if the group still exists (more pushes than pops), then do the true section, and if not, do the false section. The easiest way to make the pattern match is to use an empty positive lookahead, (?=), and the easiest way to make it fail is (?!), which is an empty negative lookahead.

Require Help for Regular Expression

I am checking that the value of a JTextField is in the format XX.YY.Z, for example:
10.01.5
No space is allowed at the beginning or at the end.
EDIT:
How can I specify that the last character is alphanumeric, i.e. Z can be a number or a letter?
\d matches a digit, and \. matches a dot.
\d\d\.\d\d\.\d
i.e. "\\d\\d\\.\\d\\d\\.\\d".
I don't have much experience in Java, but this is what I would do in PHP.
^\d\d\.\d\d\.\d$
\d represents one digit, \d\d represents two digits
\. a literal dot (an unescaped . would match any character)
^ a caret anchors the match at the start, so the value must begin with the digits (no spaces at the beginning)
$ a dollar sign ensures that there will be no spaces or other characters at the end.
You could use quantifiers
\d{2}\.\d{2}\.\d
That is the more idiomatic form, and your regex becomes easier to read and to change.
more on Quantifiers
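Putting the answers together, a Java check might look like the sketch below. It also covers the EDIT by allowing a letter or digit in the last position, under the assumption that "alphanumeric" means ASCII letters and digits. The class and method names are made up, and the string passed in would typically be the result of JTextField.getText().

```java
import java.util.regex.Pattern;

final class VersionFieldValidator {
    // Two digits, dot, two digits, dot, one digit.
    private static final Pattern DIGITS_ONLY =
            Pattern.compile("\\d{2}\\.\\d{2}\\.\\d");

    // Same, but the last character may be an ASCII letter or a digit (assumption).
    private static final Pattern ALNUM_LAST =
            Pattern.compile("\\d{2}\\.\\d{2}\\.[A-Za-z0-9]");

    static boolean isDigitsOnly(String text) {
        // matches() must cover the whole text, so surrounding spaces are rejected.
        return DIGITS_ONLY.matcher(text).matches();
    }

    static boolean isAlnumLast(String text) {
        return ALNUM_LAST.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDigitsOnly("10.01.5"));
        System.out.println(isDigitsOnly(" 10.01.5"));
        System.out.println(isAlnumLast("10.01.A"));
    }
}
```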
