Java Regular Expression: what is " '- " - java

I came across a line of Java that uses a regular expression.
It validates a last name entered by the user:
return lastName.matches( "[a-zA-z]+([ '-][a-zA-Z]+)*" );
I would like to know what the [ '-] does.
Also, why do we need both a "+" and a "*", and why is [ '-][a-zA-Z]+ in parentheses?

Your RE is: [a-zA-z]+([ '-][a-zA-Z]+)*
I'll break it down into its component parts:
[a-zA-Z]+
The string must begin with any letter, a-z or A-Z, repeated one or more times (+).
([ '-][a-zA-Z]+)*
[ '-]
Any single character of <space>, ', or -.
[a-zA-Z]+
Again, any letter, a-z or A-Z, repeated one or more times.
This group (one character from [ '-] followed by letters a-zA-Z) may then be repeated zero or more times.
Why [ '-]? To allow for hyphenated names, such as Higgs-Boson, names with apostrophes, such as O'Reilly, or names with spaces, such as Van Dyke.
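To see the breakdown above in action, here is a minimal Java sketch. The class and method names (LastNameCheck, isValidLastName) are made up for illustration, and the first character class is written as [a-zA-Z], on the assumption that the A-z in the question was unintended.

```java
import java.util.regex.Pattern;

final class LastNameCheck {
    // Letters, then zero or more groups of (one separator + letters).
    // Assumption: the first class is [a-zA-Z], not the [a-zA-z] from the question.
    private static final Pattern LAST_NAME =
            Pattern.compile("[a-zA-Z]+([ '-][a-zA-Z]+)*");

    static boolean isValidLastName(String lastName) {
        // matches() requires the whole input to match, like String.matches
        return LAST_NAME.matcher(lastName).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"Smith", "O'Reilly", "Van Dyke", "Higgs-Boson", "Jones-"};
        for (String name : samples) {
            System.out.println(name + " -> " + isValidLastName(name));
        }
    }
}
```

Precompiling the Pattern is a design choice for repeated validation; String.matches compiles the expression on every call.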

The expression [ '-] means "one of space, ', or -". The position of the dash matters: it must come first or last in the class (or be escaped). Otherwise it would define a range; for example, [ -'] would accept every character with a code point between the space and the quote ', such as # or &.
+ means "one or more repetitions"; * means "zero or more repetitions". Each applies to the term of the regular expression that precedes it.
Overall, the expression matches groups of lowercase and uppercase letters separated by spaces, dashes, or single quotes.
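The effect of the dash position can be checked with a small sketch. The class name HyphenPosition and its helper methods are hypothetical; the two patterns differ only in where the dash sits inside the brackets.

```java
import java.util.regex.Pattern;

final class HyphenPosition {
    // Dash last: a literal dash, so the class is exactly {space, ', -}.
    static final Pattern LITERAL = Pattern.compile("[ '-]");
    // Dash in the middle: a range from space (U+0020) to ' (U+0027).
    static final Pattern RANGE = Pattern.compile("[ -']");

    static boolean matchesLiteral(char c) {
        return LITERAL.matcher(String.valueOf(c)).matches();
    }

    static boolean matchesRange(char c) {
        return RANGE.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        for (char c : new char[] {' ', '\'', '-', '#', '&'}) {
            System.out.println("'" + c + "' literal=" + matchesLiteral(c)
                    + " range=" + matchesRange(c));
        }
    }
}
```

With the range form, # and & are accepted while the dash itself is rejected, which is the opposite of what a name validator wants.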

It means it can be any one of the characters space, ' or - (space, quote, dash).
The - can also be written as \- because inside a character class it can otherwise denote a range, as in a-z.

This looks like it is a pattern to match double-barreled (space or hyphen) or I-don't-know-what-to-call-it names like O'Grady... for example:
It would match
counter-terrorism
De'ville
O'Grady
smith-jones
smith and wesson
But it will not match
jones-
O'Learys'
#hashtag
Bob & Sons

The idea is, after the first [A-Za-z]+ consumes all the letters it can, the match will end right there unless the next character is a space, an apostrophe, or a hyphen ([ '-]). If one of those characters is present, it must be followed by at least one more letter.
A lot of people have difficulty with this. They naively write something like [A-Za-z]+[ '-]?[A-Za-z]*, figuring both the separator and the extra chunks of letters are optional. But they're not independently optional; if there is a separator ([ '-]), it must be followed by at least one more letter. Otherwise it would treat strings like R'- j'-' as valid. Your regex doesn't have that problem.
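A short comparison, as a sketch only: the class name SeparatorGrouping and its methods are invented for illustration. It puts the naive pattern quoted above next to the grouped one and checks both against whole strings, as String.matches would.

```java
import java.util.regex.Pattern;

final class SeparatorGrouping {
    // Separator and trailing letters are independently optional.
    static final Pattern NAIVE = Pattern.compile("[A-Za-z]+[ '-]?[A-Za-z]*");
    // A separator, when present, must be followed by at least one letter.
    static final Pattern GROUPED = Pattern.compile("[A-Za-z]+([ '-][A-Za-z]+)*");

    static boolean naive(String s) {
        return NAIVE.matcher(s).matches();
    }

    static boolean grouped(String s) {
        return GROUPED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Smith-Jones", "Jones-", "Mary Ann Smith"}) {
            System.out.println(s + " naive=" + naive(s) + " grouped=" + grouped(s));
        }
    }
}
```

The naive form accepts a dangling separator such as Jones- and also rejects names with more than two parts, while the grouped form handles both cases as intended.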
By the way, you've got a typo in your regex: [a-zA-z]. You want to watch out for that, because [A-z] does match all the uppercase and lowercase letters, so it will seem to be working correctly as long as the inputs are valid. But it also matches several non-letter characters whose code points happen to lie between those of Z and a. And very few IDEs or regex tools will catch that error.

Related

Regular expression --- username valid expression [duplicate]

http://regexr.com/3ars8
^(?=.*[0-9])(?=.*[A-z])[0-9A-z-]{17}$
Should match "17 alphanumeric chars, hyphens allowed too, must include at least one letter and at least one number"
It'll correctly match:
ABCDF31U100027743
and correctly decline to match:
AB$DF31U100027743
(and almost any other non-alphanumeric char)
but will apparently allow:
AB^DF31U100027743
Because your character class [A-z] matches this symbol.
[A-z] matches [, \, ], ^, _, `, and the English letters.
Actually, it is a common mistake. You should use [a-zA-Z] instead to only allow English letters.
[Image: Expresso visualization of the characters covered by the range [A-z]]
So, this regex (with i option) won't capture your string.
^(?=.*[0-9])(?=.*[a-z])[0-9a-z-]{17}$
In my opinion, it is always safer to use Ignorecase option to avoid such an issue and shorten the regex.
Character ranges in a regex follow ASCII code point order; the printable characters run from the space (32) to the tilde (126).
The range [A-z] covers code points 65 through 122, which includes [, \, ], ^, _, and ` between Z and a. Likewise, [ -~] matches every printable ASCII character from SPACE to tilde.
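The extra characters can also be listed programmatically. This is a sketch with a hypothetical helper, AzRange.nonLettersIn, that tries every ASCII character against a character class and collects the matches that are not English letters.

```java
import java.util.regex.Pattern;

final class AzRange {
    // Returns the ASCII characters matched by charClass that are not A-Z or a-z.
    static String nonLettersIn(String charClass) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder extras = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            boolean isAsciiLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            if (!isAsciiLetter && p.matcher(String.valueOf(c)).matches()) {
                extras.append(c);
            }
        }
        return extras.toString();
    }

    public static void main(String[] args) {
        System.out.println("[A-z] extras: " + nonLettersIn("[A-z]"));
        System.out.println("[a-zA-Z] extras: " + nonLettersIn("[a-zA-Z]"));
    }
}
```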
You're allowing A-z (capital 'A' through lower 'z'). You don't say what regex package you're using, but it's not necessarily clear that A-Z and a-z are contiguous; there could be other characters in between. Try this instead:
^(?=.*[0-9])(?=.*[A-Za-z])[0-9A-Za-z-]{17}$
It seems to meet your criteria for me in regexpal.
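For a quick check in Java, the sketch below compares the original and corrected patterns on the strings from the question. The class name SerialCheck and its method names are assumptions made for the example.

```java
import java.util.regex.Pattern;

final class SerialCheck {
    // Original pattern from the question, with the [A-z] range.
    static final Pattern ORIGINAL =
            Pattern.compile("^(?=.*[0-9])(?=.*[A-z])[0-9A-z-]{17}$");
    // Corrected pattern: letters spelled out as A-Za-z.
    static final Pattern FIXED =
            Pattern.compile("^(?=.*[0-9])(?=.*[A-Za-z])[0-9A-Za-z-]{17}$");

    static boolean original(String s) {
        return ORIGINAL.matcher(s).matches();
    }

    static boolean fixed(String s) {
        return FIXED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"ABCDF31U100027743", "AB$DF31U100027743", "AB^DF31U100027743"};
        for (String s : samples) {
            System.out.println(s + " original=" + original(s) + " fixed=" + fixed(s));
        }
    }
}
```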

Unique regex for first name and last name

I have a single input where users should enter their first name and surname, and I need to validate it with a regex. The requirements are:
The name must start with a capital letter (not a space).
There can't be consecutive spaces.
It must support names and surnames with any number of parts (everyone should be able to enter their full name). Example:
John Smith
and
Armirat Bair Hossan
And the last character must not be a space.
Please help.
At the moment I have this regex:
^\\p{L}\\[p{L} ,.'-]+$
but it rejects ALL input, which is not good.
Thanks for helping me.
UPDATE:
CORRECT INPUT:
"John Smith"
"Alberto del Muerto"
INCORRECT
" John Smith "
" John Smith"
You can use
^[\p{Lu}\p{M}][\p{L}\p{M},.'-]+(?: [\p{L}\p{M},.'-]+)*$
or
^\p{Lu}\p{M}*+(?:\p{L}\p{M}*+|[,.'-])++(?: (?:\p{L}\p{M}*+|[,.'-])++)*+$
See the regex demo and demo 2
Java declaration:
if (str.matches("[\\p{Lu}\\p{M}][\\p{L}\\p{M},.'-]+(?: [\\p{L}\\p{M},.'-]+)*")) { ... }
// or if (str.matches("\\p{Lu}\\p{M}*+(?:\\p{L}\\p{M}*+|[,.'-])++(?: (?:\\p{L}\\p{M}*+|[,.'-])++)*+")) { ... }
The first regex breakdown:
^ - start of string (not necessary with matches() method)
[\p{Lu}\p{M}] - 1 Unicode letter (incl. precomposed ones as \p{M} matches diacritics and \p{Lu} matches any uppercase Unicode base letter)
[\p{L}\p{M},.'-]+ - matches 1 or more Unicode letters, a ,, ., ' or - (if 1 letter names are valid, replace + with * at the end here)
(?: [\p{L}\p{M},.'-]+)* - 0 or more sequences of
- a space
[\p{L}\p{M},.'-]+ - 1 or more characters that are either Unicode letters or commas, or periods, or apostrophes or -.
$ - end of string (not necessary with matches() method)
NOTE: Sometimes, names contain curly apostrophes, you can add them to the character classes ([‘’]).
The 2nd regex is less efficient but is more accurate as it will only match diacritics after base letters. See more about matching Unicode letters at regular-expressions.info:
To match a letter including any diacritics, use \p{L}\p{M}*+.
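The first pattern can be tried against the inputs from the question with a sketch like the following. The class and method names (FullNameCheck, isValidFullName) are hypothetical, and the anchors are left out because matches() already requires a full match.

```java
import java.util.regex.Pattern;

final class FullNameCheck {
    // First pattern from the answer, without ^ and $ since matches() anchors both ends.
    private static final Pattern FULL_NAME = Pattern.compile(
            "[\\p{Lu}\\p{M}][\\p{L}\\p{M},.'-]+(?: [\\p{L}\\p{M},.'-]+)*");

    static boolean isValidFullName(String input) {
        return FULL_NAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"John Smith", "Alberto del Muerto", " John Smith", "John Smith "};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValidFullName(s));
        }
    }
}
```

Because each space must be followed by at least one name character, leading, trailing, and doubled spaces are all rejected without any extra checks.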
Try this one
^[^- '](?=(?![A-Z]?[A-Z]))(?=(?![a-z]+[A-Z]))(?=(?!.*[A-Z][A-Z]))(?=(?!.*[- '][- ']))[A-Za-z- ']{2,}$
There is also an interactive Demo of this pattern available at an external website.
You made a typo: the second \\ should be in front of p.
However even then there is a check missing for a trailing space
"^\\p{L}[\\p{L} ,.'-]+$"
For a .matches the following would suffice
"\\p{L}[\\p{L} ,.'-]*[\\p{L}.]"
Names like "del Rey, Hidalgo" do not require an initial capital.
Also, I would advise simply calling .trim() on the input; imagine a user staring at input that was rejected because of a stray blank.
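A minimal sketch of the trim-then-match approach, using the pattern suggested above for .matches. The names TrimmedNameCheck and isValidName are made up for the example, and treating null as invalid is an added assumption.

```java
import java.util.regex.Pattern;

final class TrimmedNameCheck {
    // A letter, any mix of letters, spaces and , . ' -, then a final letter or period.
    private static final Pattern NAME =
            Pattern.compile("\\p{L}[\\p{L} ,.'-]*[\\p{L}.]");

    static boolean isValidName(String raw) {
        // Assumption: null counts as invalid; surrounding whitespace is forgiven.
        return raw != null && NAME.matcher(raw.trim()).matches();
    }

    public static void main(String[] args) {
        String[] samples = {" John Smith ", "del Rey, Hidalgo", "John-", "   "};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValidName(s));
        }
    }
}
```

Note that this pattern does not enforce an initial capital or forbid repeated spaces inside the name; it only guards the first and last characters.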
Try this
^[A-Z][a-z]+(([\s][A-Z])?[a-z]+){1,2}$
but use \\ instead of \ in a Java string literal

Help with regex

I'm constructing a regex which will accept at least 1 alpha numerical character and any number of spaces.
Right now I've got...[A-Za-z0-9]+[ \t\r\n]* which I understand to be at least 1 alphanumeric OR at least 1 space. How would I fix this?
EDIT: To answer the comments below I want it to accept strings which contain ATLEAST 1 alphanumeric AND any number of (including no) spaces. Right now it will accept JUST a whitespace.
EDIT2: To clarify, I don't want the any number of whitespace (including 0) to be accepted unless there is at least 1 alphanumeric character
\s*\p{Alnum}[\p{Alnum}\s]*
Your regex, [A-Za-z0-9]+[ \t\r\n]*, requires the string to start with a letter or digit (or, more accurately, it doesn't start matching until it sees one). Adding \s* allows the match to start with whitespace, but you still won't match any alphanumerics after the first whitespace character that follows an alphanumeric (for example, it won't match the xyz in abc xyz). Changing the trailing \s* to [\p{Alnum}\s]* fixes that problem.
On a side note, \p{Alnum} is exactly equivalent to [A-Za-z0-9] in Java, which is not the case in all regex flavors. I used \p{Alnum}, not just because it's shorter, but because it gives more protection from typos like [A-z] (which is syntactically valid, but almost certainly not what the author really meant).
EDIT: Performance should be considered, too. I originally included a + after the first \p{Alnum}, but I realized that wasn't a good idea. If this were part of a longer regex, and the regex didn't match right away, it could end up wasting a lot of time trying to match the same groups of characters with \p{Alnum}+ or [\p{Alnum}\s]*. The leading \s* is okay, though, because \s doesn't match any of the characters that \p{Alnum} matches.
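The difference between the two patterns shows up when they are applied to whole strings. The sketch below assumes full-string matching with matches(); the class name AlnumCheck and its methods are hypothetical.

```java
import java.util.regex.Pattern;

final class AlnumCheck {
    // Pattern from the question: alphanumerics first, then only whitespace.
    static final Pattern ORIGINAL = Pattern.compile("[A-Za-z0-9]+[ \\t\\r\\n]*");
    // Revised pattern: optional leading whitespace, one alphanumeric, then any mix.
    static final Pattern REVISED = Pattern.compile("\\s*\\p{Alnum}[\\p{Alnum}\\s]*");

    static boolean original(String s) {
        return ORIGINAL.matcher(s).matches();
    }

    static boolean revised(String s) {
        return REVISED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc xyz", "  abc", "   ", ""}) {
            System.out.println("\"" + s + "\" original=" + original(s)
                    + " revised=" + revised(s));
        }
    }
}
```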
Any one or more word char zero or more whitespace
\w+\s*
Hey, try this: ([^\s]+\s*). [^\s] means match anything that is not whitespace, while \s* means that the whitespace is optional (if you really want at least one whitespace character, put + instead of *).
Edit: sorry, mine matches everything, not only alphanumerics (use ([a-zA-Z0-9]+\s*) for alphanumerics).
This should do the trick:
\s*\p{Alnum}+\s*
\p{Alnum} is an alphanumeric character: [\p{Alpha}\p{Digit}]
* says "zero or more times"
+ says "at least one" (not "or" as you seem to believe, or is written |)
| means "or"
\s is a whitespace character: [ \t\n\x0B\f\r]
EDIT: To answer the comments below I want it to accept strings which contain AT LEAST 1 alphanumeric AND any number of (including no) spaces.
The pattern I suggested requires at least one alpha numeric character.
EDIT2: To clarify, I don't want the any number of whitespace (including 0) to be accepted unless there is at least 1 alphanumeric character
The pattern I suggested will not accept a string of only whitespace characters.

java.util.regex matching anything before expression

I am trying to tokenize the following snippets by type of number:
"(0-22) 222-33-44, 222-555-666, tel./.fax (111-222-333) 22-33-44 UK, TEL/faks: 000-333-444, fax: 333-444-555, tel: 555-666-888"
and
"tel: 555-666-888, tel./fax (111-222-333) 22-33-44 UK"
and
"fax (111-222-333) 22-33-44 UK, TEL/faks: 000-333-444, fax: 333-444-555"
and so on.
The idea is that the string can be any combination of a designation like "tel/faks" followed by tel/fax numbers, or just a tel/fax number at the beginning of the string.
I wrote this:
"(?:.(?!((tel|fax|faks)[ /:.]+)+))++"
on example 1, but after find() it returns (the '_' characters were added by me):
_(0-22) 222-33-44, 222-555-666,_
_TEL./_
_FAX (111-222-333) 22-33-44 UK,_
_TEL_
_FAKS: 000-333-444,_
_FAX: 333-444-555_
It seems that I am losing one character in every group, and combined types like "TEL/faks" are split. I also need to capture the type (if it exists; if not, the default type is tel) for further processing.
How can I fix this?
P.S. I am using case-insensitive matching.
Your regular expression means (roughly):
(?: Match a group consisting of:
. any character
(?! that is not followed by
((tel|fax|faks)[ /:.]+)+)) "tel" or "fax" or "faks", followed by at least one
punctuation character from [ /:.]
+ (multiple times)
That's why you get a missing character before "Tel", "Fax" etc - because your regular expression says never to match the character before "Tel", "Fax" etc.
That's also why "tel./.fax" gets split - because the last "." comes before "fax", so it doesn't get matched.
I would suggest constructing two regular expressions that match:
A - a telephone number (parens, digits, commas, spaces), with at least one digit
B - a telephone/fax designation ("fax", "faks", "tel", punctuation)
Then search for strings matching
B*A+
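One possible shape for the B*A+ idea is sketched below. Everything in it is an assumption rather than the answerer's code: the class name PhoneTokenizer, the exact sub-patterns for A and B, the comma as separator between numbers, and the choice to report a missing designation as "tel".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhoneTokenizer {
    // B: a designation ("tel", "fax" or "faks") plus any trailing punctuation.
    private static final String B = "(?:tel|faks|fax)[ /:.]*";
    // A: a number; assumed to start with an optional "(" and a digit, and end with a digit.
    private static final String A = "\\(?\\d[\\d() -]*\\d";
    // B*A+: zero or more designations, then one or more comma-separated numbers.
    private static final Pattern TOKEN = Pattern.compile(
            "((?:" + B + ")*)(" + A + "(?:, *" + A + ")*)",
            Pattern.CASE_INSENSITIVE);

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String type = m.group(1).trim();
            if (type.isEmpty()) {
                type = "tel"; // default type when no designation precedes the number
            }
            tokens.add(type + " | " + m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "tel: 555-666-888, tel./fax (111-222-333) 22-33-44 UK";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

Capturing the designation and the numbers in separate groups keeps combined types such as "tel./fax" together and avoids the lost character that the negative lookahead caused.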

Matching '_' and '-' in java regexes

I had this regex in java that matched either an alphanumeric character or the tilde (~)
^([a-z0-9])+|~$
Now I have to add also the characters - and _ I've tried a few combinations, neither of which work, for example:
^([a-zA-Z0-9_-])+|~$
^([a-zA-Z0-9]|-|_)+|~$
Sample input strings that must match:
woZOQNVddd
00000
ncnW0mL14-
dEowBO_Eu7
7MyG4XqFz-
A8ft-y6hDu
~
Any clues / suggestion?
- is a special character within square brackets: it indicates a range. If it's not at either end of the character class, it needs to be escaped by putting a \ before it.
It's worth pointing out a shortcut: \w is equivalent to [0-9a-zA-Z_] so I think this is more readable:
^([\w-]+|~)$
You need to escape the -, like \-, since it is a special character (the range operator). _ is ok.
So ^([a-z0-9_\-])+|~$.
Edit: your last input String will not match because the regular expression you are using matches a string of alphanumeric characters (plus - and _) OR a tilde (because of the pipe). But not both. If you want to allow an optional tilde on the end, change to:
^([a-z0-9_\-])+(~?)$
If you put the - first, it won't be interpreted as the range indicator.
^([-a-zA-Z0-9_])+|~$
This matches all of your examples except the last one using the following code:
String str = "A8ft-y6hDu ~";
System.out.println("Result: " + str.matches("^([-a-zA-Z0-9_])+|~$"));
That last example won't match because it doesn't fit your description. The regex will match any combination of alphanumerics, -, and _, OR a ~ character.
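The sample inputs can be run through a compact version of this pattern. In the sketch below the class name TokenCheck is hypothetical, and the anchors and capture group are dropped because matches() already tests the whole string.

```java
import java.util.regex.Pattern;

final class TokenCheck {
    // Dash first in the class, so it is a literal; the alternative is a lone tilde.
    private static final Pattern TOKEN = Pattern.compile("[-a-zA-Z0-9_]+|~");

    static boolean isValid(String s) {
        return TOKEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "woZOQNVddd", "00000", "ncnW0mL14-", "dEowBO_Eu7",
            "7MyG4XqFz-", "A8ft-y6hDu", "~", "A8ft-y6hDu ~"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```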
