two regex patterns, can they be one? - java

I have two regular expressions that I use to validate Colorado driver's license formats.
[0-9]{2}[-][0-9]{3}[-][0-9]{4}
and
[0-9]{9}
We have to allow for only 9 digits but the user is free to enter it in as 123456789 or 12-345-6789.
Is there a way I can combine these into one? Like a regex conditional statement of sorts? Right now I am simply enumerating through all the available formats and breaking out once one is matched. I could always strip the hyphens out before I do the compare and only use [0-9]{9}, but then I won't be learning anything new.

For a straight combine,
(?:[0-9]{2}[-][0-9]{3}[-][0-9]{4}|[0-9]{9})
or to merge the logic (allowing dashes in one position without the other, which may not be desired),
[0-9]{2}-?[0-9]{3}-?[0-9]{4}
(The brackets around the hyphens in your first regex aren't doing anything.)
Or merging the logic so that both hyphens are required if one is present,
(?:\d{2}-\d{3}-|\d{5})\d{4}
(Your [0-9]s can also be replaced with \ds.)

How about using a backreference to match the second hyphen only if the first is given:
\d{2}(-?)\d{3}\1\d{4}
Although I've never used regexes in Java so if it's supported the syntax might be different. I've just tried this out in Ruby.

A neat version which will allow either dashes or not dashes is:
\d{2}(-?)\d{3}\1\d{4}
The capture (the brackets) will capture either '-' or nothing. The \1 will match again whatever was captured.

I think this should work:
\d{2}-?\d{3}-?\d{4}

Related

validate special characters by negating unicode letters with regex pattern?

This regex: \p{L}+ matches these characters "ASKJKSDJKDSJÄÖÅüé" of the example string "ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé" which is great but is the exact opposite of what I want. Which leads me to negating regexes.
Goal:
I want to match any and all characters that are not a letter nor a number in multiple languages.
Could a negative regex be a natural direction for this?
I should mention one intended use for the regex I'd like to find is to validate passwords for the rule:
that it needs to contain at least one special character, which I
define as not being a number nor a letter.
It would seem defining ranges of special characters should be avoided if possible, because why limit the possibilities? Thus my definition. I assume there could be some problems with such a wide definition, but it is a first step.
If you have some suggestions for a better solution I'm giving below or just have some thoughts on the subject, I'm sure I'm not the only one that would like to learn about it. Thanks.
Note I'm using double \\ in the Java code. Platform is Java 11.
You can shove those \\p things in []. And thus, use the fact that you can negate chargroups. This is all you need:
Pattern p = Pattern.compile("[^\\p{L}]");
Matcher m = p.matcher("ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé");
while (m.find()) System.out.print(m.group(0));
That prints:
_-.;,!”#€%&/()=?`¨’<>
Which is exactly what you're looking for, no?
No need to mess with lookaheads here.
So after having read similar, though not identical questions and some equally great answers, I came up with this solution: (?=\P{L})(?=\P{N}) meaning match both not letters and not numbers. Even if I'm asserting numbers separately I need to negate both to meet the specification of special characters (See question).
This is making use of a non-consuming regular expression with the parentheses and the?=, first matching the expression in the first parenthesis and after that continue to match the whole in the second. Thanks to #Jason Cohen for this detail in the Regular Expressions: Is there an AND operator? discussion.
The upper case P in \P{L} and \P{N} expresses the "not belonging to a category" in Unicode Categories, where the uppercase P means "not", i e the opposite of a lowercase p.
It's not perfect for a real world solution, but works as a starting point at least. Note I'm using double \\ in the Java code. Platform is Java 11.

Regex with at least one of two options but in sequence

I am trying to write a regex (for use in a Java Pattern) that will match strings that possibly have a letter that is possibly followed by a space then number, but must have at least one of them. For example, the following strings should be matched:
"a 5"
"b 9"
" 8"
However, it should not match an empty string ("").
Furthermore, I would like to make each of the components part a named capture group.
The following works, but allows the empty string.
"(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?"
To ensure that there is at least one of them, you can use lookahead (?=\\p{Alpha}| \\p{Digit}) at the beginning:
"(?=\\p{Alpha}| \\p{Digit})(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?"
In general, to avoid empty strings you can use (?=.).
You can use a negative lookahead to avoid empty input and keep your regex as:
^(?!$)(?<let>\p{L})?(?:\h+(?<num>\p{N}))?$
RegEx Demo
(?!$) is negative lookahead to fail the match for empty strings.
You can solve problem with:
([a-z]? \d)|([a-z] \d?)
You can see this code that covers your test cases in demo here. You can see this code in demo here. This is very basic regular expression knowledge, you should definitely learn more about regular expressions, there are bunch of good tutorials on web (e.g this one).
You can use | for or, then simply repeat "any pattern" to match everything like this.
((?<let>[A-z])|(?<num>\d)\s*)+
That lets you match any number of named patterns in any order.

RegEx: First three characters unique + additional match

I am looking to achieve a password policy with RegEx.
The policy contains of these rules:
First three characters must be unique
Password must be at least 8 characters long
Password must contain at least one letter, one digit and one special-character (of white-list)
I found this pattern that matches 1):
^(.)((?!\1).)((?!\1)(?!\2).)((?!\1)(?!\2)(?!\3))
This pattern matches 2) and 3):
^(?=.*[a-zA-Z].*)(?=.*[0-9].*)(?=.*[$&+,:;=?##|'<>.^*()%!-].*)(.{8,})
Now I am stuck combining these two pattern into one. Can someone help here please? ;-)
Alternatively, don't combine them into one. Just check each of the 3 regexes one at a time. Combining them is going to be scary and incomprehensible if you ever need to add new rules, or change existing rules (especially since you're already having trouble combining them).
You can use this regex:
^(?=(.)(?!\1)(.)(?!\1)(?!\2))(?=.*[a-zA-Z])(?=\D*\d)(?=.*?[$&+,:;=?##|'<>.^*()%!-])(.{8,})$
RegEx Demo
(?=(.)(?!\1)(.)(?!\1)(?!\2)) # Makes sure first 3 characters are unique using lookaheads

Java Regexp pattern check

Pattern
^\\d{1}-\\d{10}|\\d{1,9}|^TWC([0-9){12})$
should validate any of these
1-23232445
1-232323
1-009121212
12
12222
TWC12222
TWC1222324
When i test for TWC pattern doesn't match, I have added "|" to consider OR condition and then to have numbers from 0-9 but limiting to 12 digits. What am i missing ?
TWC([0-9)
I think this is where it might be not working??
You need
TWC([0-9]{12})
Complete answer...
(\d{1}-\d{1,12})|^TWC(\d{1,12})$
even nicer answer ..
^(\\d-|TWC|)(\\d{1,12})$ // this syntax i believe will match your needs.
tested :)
^([0-9]-|TWC|)([0-9]{1,12})$ // or
^(\d-|TWC|)(\d{1,12})$
breakdown
^
this denotes the start of the string
\d or [0-9]
denotes one character of the numbers 0 through 9 (note \d might not work in some lanagues or require different syntax!)
|
is essentially an OR
{1,12}
will only accept a particular pattern 1-12 times for instance in my code the patternw ould be \d or [0-9]
$
is the end of the line
this essentially checks if the line contains a [0-9] with a - after,TWC, or just a nothing space to account for nothing being there at the start then reads up to 12 digits. Should work for all your cases.
testing
edit code.
all unit tests. click on "java" if you want to see them :0
more testing.
NOTE:
YOU NEED TO LOOK AT THE SYNTAX OF WHAT YOU ARE USING IN SOME CASES YOU MIGHT NEED TO \ SOME THINGS IN ORDER FOR THEM TO WORK.. IN C++/C its 2 // IN ORDER FOR THESE TO WORK PLEASE BE VERY WARY ABOUT PARTICULAR SYNTAXES.
Sorry for all the confusion, and also for lying a whole bunch apparently. The issue you're having is that you are using exact quantifiers in a couple of places you don't mean to, namely the {10} and {12}. This requires exactly ten or twelve digits in those spots. What you presumably want is for those to be {1,10} and {1,12} respectively.
What I would do is something like this, using parentheses and quantifiers to clean everything up and repeating yourself as little as possible, to avoid confusion. You've got three possible prefixes (a digit and a dash, or "TWC", or nothing). I'd put those possibilities all together, and then add the rest. This makes the regex much easier to look at.
^(\\d-|TWC){0,1}\\d{1,12}$
The breakdown:
^ is at the beginning, always.
(\\d-|TWC){0,1} Next comes either a single digit followed by a dash, or the string "TWC". This prefix occurs either zero times (for no prefix) or one time.
\\d{1,12}$ Finally, there is a string of one to twelve digits, followed by the end of the line/input (depending on your DOTALL settings of course).
Of course you won't be able to simplify it quite this much if the different prefixes can only allow certain numbers of digits, but this is the basic idea.
You've also got what looks like a typo; TWC([0-9){12}) should be TWC([0-9]{12}). I'm guessing this was just a typo when writing out the question though, since what you have right now would blow up at runtime when you tried to use it otherwise, and it sounds like it's working for some of your inputs.

How can I handle multiple parenthesis in a regex?

I have strings of this type:
text (more text)
What I would like to do is to have a regular expression that extracts the "more text" segment of the string. So far I have been using this regular expression:
"^.*\\((.*)\\)$"
Which although it works on many cases, it seems to fail if I have something of the sort:
text (more text (even more text))
What I get is: even more text)
What I would like to get instead is:
more text (even more text) (basically the content of the outermost pair of brackets.)
Besides lazy quantification, another way is:
"^[^(]*\\((.*)\\)$"
In both regexes, there is a explicitly specified left parenthesis ("\\(", with Java String escaping) immediately before the matching group. In the original, there was a .* before that, allowing anything (including other left parentheses). In mine, left parentheses are not allowed here (there is a negated character class), so the explicitly specified left parenthesis in the outermost.
I recommend this (double escaping of the backslash removed, since this is not part of the regex):
^[^(]*\((.*)\)
Matching with your version (^.*\((.*)\)$) occurs like this:
The star matches greedily, so your first .* goes right to the end of the string.
Then it backtracks just as much as necessary so the \( can match - that would be the last opening paren in the string.
Then the next .* goes right to the end of the string again.
Then it backtracks just as much so the \) can match, i.e. to the last closing paren.
When you use [^(]* instead of .*, it can't go past the first opening paren, so the first opening paren (the correct one) in the string will delimit your sub-match.
Try:
"^.*?\\((.*)\\)$"
That should make the first matching less greedy. Greedy means it swallows everything it possibly can while still getting an overall pattern match.
The other suggestion:
"^[^(]*\\((.*)\\)$"
Might be more along the line of what you're looking for though. For this simple example it doesn't really matter so much, but it could if you wanted to expand on the regex, for example by making the part inside the braces optional.
Try this:
"^.*?\\((.*)\\)$"
True regular expressions can't count parentheses; this requires a pushdown automaton. Some regex libraries have extensions to support this, but I don't think Java's does (could be wrong; Java isn't my forté).
BTW, the other answers I've seen so far will work with the example given, but will break with, e.g., text (more text (even more text)) (another bit of text). Changing greediness doesn't make up for the inability to count.
$str =~ /^.*?\((.*)\)/
I think the reason is because you second wildcard is picking up the closing parenthesis. You'll need to exclude it.

Categories

Resources