java.util.regex matching anything before expression - java

I am trying to tokenize the following snippets by type of number:
"(0-22) 222-33-44, 222-555-666, tel./.fax (111-222-333) 22-33-44 UK, TEL/faks: 000-333-444, fax: 333-444-555, tel: 555-666-888"
and
"tel: 555-666-888, tel./fax (111-222-333) 22-33-44 UK"
and
"fax (111-222-333) 22-33-44 UK, TEL/faks: 000-333-444, fax: 333-444-555"
and so on.
The idea is that the string can be any combination of a designation such as "tel/faks" followed by its numbers, or it can simply start with a number that has no designation.
I came up with this:
"(?:.(?!((tel|fax|faks)[ /:.]+)+))++"
and ran it on example 1, but repeated find() calls return the following (I added the '_' characters to mark each match):
_(0-22) 222-33-44, 222-555-666,_
_TEL./_
_FAX (111-222-333) 22-33-44 UK,_
_TEL_
_FAKS: 000-333-444,_
_FAX: 333-444-555_
It seems that I am losing one character in every group, and combined types like "TEL/faks" are split. I also need to capture the type (if it exists; if not, the number defaults to tel) for later processing.
How can I fix this?
PS: I use the case-insensitive flag.

Your regular expression means (roughly):
(?: Match a group consisting of:
. any character
(?! that is not followed by
((tel|fax|faks)[ /:.]+)+)) "tel" or "fax" or "faks", followed by at least one punctuation character from [ /:.]
++ (one or more times, possessively)
That's why you get a missing character before "Tel", "Fax" etc.: your regular expression says never to match a character that is immediately followed by "Tel", "Fax" etc.
That's also why "tel./.fax" gets split: the last "." comes directly before "fax", so it doesn't get matched.
I would suggest constructing two regular expressions that match:
A - a telephone number (parens, digits, hyphens, commas, spaces), with at least one digit
B - a telephone/fax designation ("fax", "faks", "tel", punctuation)
Then search for strings matching
B*A+
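The B-then-A idea could be sketched in Java roughly as follows. The class name, the two sub-patterns and the "tel" default are assumptions made for illustration, not part of the answer above; each number is matched together with its optional designation, so a number without one falls back to "tel".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhoneTokenizer {
    // B: one or more designations, each followed by optional punctuation.
    static final String DESIGNATION = "(?:(?:tel|fax|faks)[ /:.]*)+";
    // A: starts with "(" or a digit, ends on a digit, and contains only
    // digits, parentheses, spaces and hyphens in between.
    static final String NUMBER = "[(\\d][\\d() -]*\\d";
    static final Pattern TOKEN = Pattern.compile(
            "(" + DESIGNATION + ")?(" + NUMBER + ")", Pattern.CASE_INSENSITIVE);

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Assumed default: a number without a designation counts as "tel".
            String type = m.group(1) == null
                    ? "tel"
                    : m.group(1).replaceAll("[ /:.]+$", "").toLowerCase();
            out.add(type + " -> " + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "tel: 555-666-888, tel./fax (111-222-333) 22-33-44 UK";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

Because the designation is consumed as a whole before the number starts, combined types such as "tel./fax" stay in one piece and no character is dropped in front of them.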

Related

Some more regex criteria in existing regex

I want to extend the regex below so that it also meets the following criteria:
^[\p{L}\d'][ \p{L}\d'-]*[\p{L}\d'-']$
Should start with a letter (A-Z or a-z) only.
Can also accept a single letter.
Accepts a hyphen (-), space, or dot (.) in the middle or at the end of the string (no other special characters).
Accepts numbers in the middle and at the end of the string.
I also want to keep the existing criteria that this regex already enforces.
E.g.
Expected -
t, T, test, test123, te12st, te-st, te.st, te st, éééééé, ṪỲɎɆḂɃɀȿȸȺȔȐȳɊÉâÇë, Επίθετο
Not Expected -
12test, 1, .test, -test, , tes*t (none of the special character except hyphen, dot & space),
To match the expected strings (including a single letter) and reject the unexpected ones, you could match \pL at the start of the string, then repeat any of the characters listed in [\d\pL .-] zero or more times, and then assert the end of the string.
Note that not all of your expected strings start with a-zA-Z.
^\pL[\d\pL .-]*$
In Java
String regex = "^\\pL[\\d\\pL .-]*$";
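A small check of that pattern against some of the sample strings from the question might look like this. The class and method names are made up for the sketch, and the non-ASCII samples are written as Unicode escapes only to keep the source file encoding-independent.

```java
import java.util.regex.Pattern;

class NameCheck {
    // A Unicode letter first, then any mix of digits, letters, space, dot or hyphen.
    static final Pattern NAME = Pattern.compile("^\\pL[\\d\\pL .-]*$");

    static boolean isValid(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "t", "test123", "te-st", "te.st", "te st", "\u00e9\u00e9\u00e9",
            "12test", "1", ".test", "-test", "", "tes*t"
        };
        for (String s : samples) {
            System.out.println("'" + s + "' -> " + isValid(s));
        }
    }
}
```

Note that \pL only covers code points in the letter category, so a letter typed as a base character plus a separate combining accent would need \pM added to the second character class.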
^[A-Za-z]+[\p{L}\d-.\s]*$
This is a possible solution: it 1) accepts one or more of A-Za-z, then 2) zero or more letters, numbers, hyphens, whitespace characters, and periods in any combination. However, these test cases conflict with your first requirement: éééééé, ṪỲɎɆḂɃɀȿȸȺȔȐȳɊÉâÇë, Επίθετο.
If you want it to also accept those three test cases, then this is a possible solution:
^[\p{L}]+[\p{L}\d-.\s]*$

Complicated regex and possible simple way to do it [duplicate]

I don't write many regular expressions so I'm going to need some help on this one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?
Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
POSIX allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma
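For readers using Java: java.util.regex does not support the POSIX bracket form [[:alnum:]], but it offers \p{Alnum} (ASCII letters and digits) instead. A minimal sketch under that assumption, with made-up class and method names:

```java
import java.util.regex.Pattern;

class AlnumList {
    // Items separated by a bare comma.
    static final Pattern STRICT =
            Pattern.compile("^\\p{Alnum}+(,\\p{Alnum}+)*$");
    // Same, but whitespace is tolerated around each comma.
    static final Pattern LENIENT =
            Pattern.compile("^\\p{Alnum}+(\\s*,\\s*\\p{Alnum}+)*$");

    static boolean isStrictList(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean isLenientList(String s) {
        return LENIENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLenientList("123, 4A67, GGG, 767"));
        System.out.println(isLenientList("12333, 78787&*, GH778"));
        System.out.println(isLenientList("fghkjhfdg8797<"));
    }
}
```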
Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 0 or more whitespace characters after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match 123 123 abc (no commas), which might be a problem.
This will also match 123, (ends with a comma), which might be a problem.
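The two caveats are easy to confirm with a quick Java check. This is only an illustrative sketch; the class and method names are invented for it.

```java
import java.util.regex.Pattern;

class OptionalCommaDemo {
    // The pattern from this answer: comma and whitespace are both optional.
    static final Pattern LOOSE = Pattern.compile("^([a-zA-Z0-9]+,?\\s*)+$");

    static boolean looseMatches(String s) {
        return LOOSE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "123, 4A67, GGG, 767",   // intended case
            "123 123 abc",           // accepted although it has no commas
            "123,",                  // accepted although it ends with a comma
            "12333, 78787&*, GH778"  // rejected because of the & and *
        };
        for (String s : samples) {
            System.out.println("'" + s + "' -> " + looseMatches(s));
        }
    }
}
```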
Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespace at the beginning and end of each item in the comma-separated list.
You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x = [ "123, 4A67, GGG, 767", "12333, 78787&*, GH778" ]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
...     print(re.match( r, s ))
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : match one or more (but not zero) characters from the listed ranges, or a space.
(?:[...]+,)* : In non-capturing parentheses, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Matching zero times allows for no comma.
[...]+ : match at least one of these. This does not include a comma. This is to ensure that it does not accept a trailing comma. If a trailing comma is acceptable, then the expression is easier: ^[a-zA-Z0-9 ,]+$
Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
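In Java, that ITEM(,ITEM)* shape can be assembled from the item pattern with plain string concatenation. The helper below is a hypothetical sketch (the name listOf is not from any library); it wraps the item in a non-capturing group so that alternations inside the item stay contained.

```java
import java.util.regex.Pattern;

class ListPattern {
    // Builds ^ITEM(,ITEM)*$ from a single item pattern.
    static Pattern listOf(String item) {
        String grouped = "(?:" + item + ")";
        return Pattern.compile("^" + grouped + "(?:," + grouped + ")*$");
    }

    public static void main(String[] args) {
        Pattern words = listOf("[a-zA-Z0-9]+");
        System.out.println(words.matcher("123,4A67,GGG").matches());
        System.out.println(words.matcher("123,").matches());

        Pattern pairs = listOf("\\w+=\\w+");
        System.out.println(pairs.matcher("a=b,c=d").matches());
    }
}
```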
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
import re

xend_fudge_item_re = r"""
    e[a-d]x=            #register of the call return value to fudge
    (
        0x[0-9A-F]+ |   #either hardcode the reply
        [10xks]{32}     #or edit the bitfield directly
    )
"""
xend_string_item_re = r"""
    (0x)?[0-9A-F]+:     #leafnum (the contents of EAX before the call)
    %s                  #one fudge
    (,%s)*              #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
    \[                  #a list of
    '%s'                #string elements
    (,'%s')*            #repeated multiple times
    \]
    $                   #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)
Try ^(?!,)((, *)?([a-zA-Z0-9]+)\b)*$
Step by step description:
Don't match a leading comma (good for the upcoming "loop").
Match an optional comma and spaces.
Match the characters you like.
The word boundary makes sure that a comma is necessary when more items follow in the string.
Please use - ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here, I have set the maximum word size to 45, as the longest word in English is 45 characters; this can be changed as required.

Java Regular Expression: what is " '- "

I came across a line in Java that uses a regular expression.
It validates user input for a last name:
return lastName.matches( "[a-zA-z]+([ '-][a-zA-Z]+)*" );
I would like to know what the [ '-] does.
Also, why do we need both a "+" and a "*" at the same time, and why is the [ '-][a-zA-Z] in parentheses?
Your RE is: [a-zA-z]+([ '-][a-zA-Z]+)*
I'll break it down into its component parts:
[a-zA-Z]+
The string must begin with any letter, a-z or A-Z, repeated one or more times (+).
([ '-][a-zA-Z]+)*
[ '-]
Any single character of <space>, ', or -.
[a-zA-Z]+
Again, any letter, a-z or A-Z, repeated one or more times.
This combination (one separator from [ '-] followed by letters a-zA-Z) may then be repeated zero or more times.
Why [ '-]? To allow for hyphenated names, such as Higgs-Boson or names with apostrophes, such as O'Reilly, or names with spaces such as Van Dyke.
The expression [ '-] means "one of space, ', or -". The position of the dash matters: it must come last (or first, or be escaped), otherwise it would be read as a range operator. For example, [ -'] would be read as the range from the space to the quote ' and would accept every character with a code point in between.
+ means "one or more repetitions"; * means "zero or more repetitions", referring to the term of the regular expression preceding the + or * modifier.
Overall, the expression matches groups of lowercase and uppercase letters separated by spaces, dashes, or single quotes.
It means it can be any of the characters space, ' or - (space, quote, dash).
The - can also be written as \- since it can otherwise denote a range, as in a-z.
This looks like it is a pattern to match double-barreled (space or hyphen) or I-don't-know-what-to-call-it names like O'Grady... for example:
It would match
counter-terrorism
De'ville
O'Grady
smith-jones
smith and wesson
But it will not match
jones-
O'Learys'
#hashtag
Bob & Sons
The idea is, after the first [A-Za-z]+ consumes all the letters it can, the match will end right there unless the next character is a space, an apostrophe, or a hyphen ([ '-]). If one of those characters is present, it must be followed by at least one more letter.
A lot of people have difficulty with this. They naively write something like [A-Za-z]+[ '-]?[A-Za-z]*, figuring both the separator and the extra chunks of letters are optional. But they're not independently optional; if there is a separator ([ '-]), it must be followed by at least one more letter. Otherwise it would treat strings with a dangling separator, like jones-, as valid. Your regex doesn't have that problem.
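The difference between the two shapes can be seen with a few calls to String.matches. This is a small sketch with invented class and method names; both patterns are written with the corrected [a-zA-Z] range.

```java
class SeparatorDemo {
    // Separator and trailing letters are independently optional.
    static final String NAIVE = "[A-Za-z]+[ '-]?[A-Za-z]*";
    // A separator must be followed by at least one letter.
    static final String GROUPED = "[a-zA-Z]+([ '-][a-zA-Z]+)*";

    static boolean naive(String s) {
        return s.matches(NAIVE);
    }

    static boolean grouped(String s) {
        return s.matches(GROUPED);
    }

    public static void main(String[] args) {
        String[] samples = {"O'Reilly", "jones-", "smith and wesson"};
        for (String s : samples) {
            System.out.println(s + ": naive=" + naive(s) + ", grouped=" + grouped(s));
        }
    }
}
```

The naive form accepts the dangling hyphen in jones- and, having only one optional separator, rejects smith and wesson; the grouped form does the opposite in both cases.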
By the way, you've got a typo in your regex: [a-zA-z]. You want to watch out for that, because [A-z] does match all the uppercase and lowercase letters, so it will seem to be working correctly as long as the inputs are valid. But it also matches several non-letter characters whose code points happen to lie between those of Z and a. And very few IDEs or regex tools will catch that error.
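To see the effect of the A-z typo, one can feed the class a character that sits between Z and a in the code table, such as the underscore. A minimal sketch (the class and method names are made up):

```java
class RangeTypoDemo {
    // The range A-z also spans the six characters [ \ ] ^ _ ` between Z and a.
    static boolean typoRange(String s) {
        return s.matches("[a-zA-z]+");
    }

    static boolean lettersOnly(String s) {
        return s.matches("[a-zA-Z]+");
    }

    public static void main(String[] args) {
        System.out.println(typoRange("Smith_Jones"));   // the underscore slips through
        System.out.println(lettersOnly("Smith_Jones")); // rejected as intended
    }
}
```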

Multiple Regular Expressions

I'm not used to regular expressions and am having trouble with Java's matches method.
I have two files: one is 111.123.399.555.xml, the other is Conf.xml.
Now I want to select only the first file using a regular expression.
string.matches("[1-9[xml[.]]]");
doesn't work.
How can I do this?
The use of string.matches("[1-9[xml[.]]]"); will not work because [] creates a character class, not a capturing group.
What this means is that, to Java, your expression says "match a single character that is any of: 1 to 9, or x, or m, or l, or a literal dot" (inside a character class the . needs no escaping and simply means a dot). Since matches() must match the entire string, a one-character pattern can never match a whole file name.
Important:
"\" is recognized by Java as the escape character in string literals, so for it to reach the matcher as the regex escape character (also "\"), it must itself be escaped. Thus, when you mean to use "\" in the pattern, you must actually write "\\".
This is a bit confusing when you are not used to it, but to sum it up: to make the matcher match an actual "\" character, you have to write "\\\\"! The first "\\" becomes "\" for the matcher, i.e. an escape character, and the second "\\", escaped by the first, becomes the literal "\" to match.
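The two levels of escaping can be tried out with a few one-liners. The sketch below uses invented method names; Pattern.quote is the standard library helper that escapes a whole literal for you.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Regex seen by the matcher: .*\.xml  (the dot before xml is literal)
    static boolean endsWithDotXml(String s) {
        return s.matches(".*\\.xml");
    }

    // Regex seen by the matcher: .*\\.*  (one literal backslash somewhere)
    static boolean containsBackslash(String s) {
        return s.matches(".*\\\\.*");
    }

    // Pattern.quote turns the argument into a literal, so no manual escaping.
    static boolean equalsLiterally(String s, String literal) {
        return s.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        System.out.println(endsWithDotXml("Conf.xml"));
        System.out.println(endsWithDotXml("Confxml"));
        System.out.println(containsBackslash("C:\\temp"));
        System.out.println(equalsLiterally("axb", "a.b"));
    }
}
```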
The correct pattern-string to match for a ###.###.###.###.xml pattern where the "#" are always numbers, is string.matches("(\\d{3}\\.){4}xml"), and how it works is as follows:
The \\d matches a single digit character. It is the same as using [0-9], just simpler.
The {3} specifies matching "exactly 3 times" for the previous \\d, thus matching ###.
The \\. matches a single dot character.
The () enclosing the previous code says "this is a capturing group" to the matcher. It is used by the next {4}, thus creating a "match this whole ###. group exactly 4 times", i.e. "match ###.###.###.###.".
And finally, the xml at the end of the pattern string will match exactly "xml", which, along with the previous items, makes the exact match for that pattern: "###.###.###.###.xml".
For further learning, read Java's Pattern docs.
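Applied to the two file names from the question, the pattern can be checked like this (a minimal sketch; the class and method names are invented):

```java
class XmlNameFilter {
    // Four groups of exactly three digits plus a dot, then "xml".
    static boolean isNumberedXml(String fileName) {
        return fileName.matches("(\\d{3}\\.){4}xml");
    }

    // The original attempt: a single character class, so it can only
    // ever match a string that is exactly one character long.
    static boolean originalAttempt(String fileName) {
        return fileName.matches("[1-9[xml[.]]]");
    }

    public static void main(String[] args) {
        String[] files = {"111.123.399.555.xml", "Conf.xml"};
        for (String f : files) {
            System.out.println(f + " -> " + isNumberedXml(f));
        }
    }
}
```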
string.matches("[1-9.]+\\.xml")
should do it.
[1-9.]+ matches one or more digits between 1 and 9 and/or periods. (+ means "one or more", * means "zero or more", ? means "zero or one").
\.xml matches .xml. Since . means "any character" in a regex, you need to escape it if you want it to mean a literal period: \. (and since this is in a Java string, the backslash itself needs to be escaped by doubling).
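This shorter pattern is looser than it may look, which a quick check makes visible. The sketch below uses a made-up class name, and the extra file names are invented test inputs.

```java
class LooseXmlNameFilter {
    static boolean matchesLoose(String fileName) {
        return fileName.matches("[1-9.]+\\.xml");
    }

    public static void main(String[] args) {
        String[] files = {
            "111.123.399.555.xml", // accepted
            "Conf.xml",            // rejected: letters are not in [1-9.]
            "100.200.300.400.xml", // rejected: 0 is outside the range 1-9
            "....xml"              // accepted: periods alone satisfy [1-9.]+
        };
        for (String f : files) {
            System.out.println(f + " -> " + matchesLoose(f));
        }
    }
}
```

If file names may contain zeros, [0-9.] or [\\d.] would be the natural adjustment.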

How can I express such requirement using Java regular expression?

I need to check that a file contains amounts that match a specific format:
between 1 and 15 characters (digits or ",")
may contain at most one "," as decimal separator
must have at least one digit before the separator
the amount is supposed to be in the middle of a string, bounded by alphabetical characters (but we have to exclude malformed files).
I currently have this:
\d{1,15}(,\d{1,14})?
But it does not meet the requirement, as it might match up to 30 characters.
Unfortunately, for some reasons that are too long to explain here, I cannot simply pick a substring or use any other java call. The match has to be in a single, java-compatible, regular expression.
^(?=.{1,15}$)\d+(,\d+)?$
^ start of the string
(?=.{1,15}$) positive lookahead to make sure that the total length of string is between 1 and 15
\d+ one or more digit(s)
(,\d+)? optionally followed by a comma and more digits
$ end of the string (still required when searching with find(): the lookahead only checks the total length, not that the digits run to the end).
You might have to escape backslashes for Java: ^(?=.{1,15}$)\\d+(,\\d+)?$
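A quick way to check the length lookahead is to run the whole-string pattern over a few boundary cases. This is a minimal sketch; the class name and the sample amounts are made up.

```java
import java.util.regex.Pattern;

class AmountCheck {
    // Lookahead limits the total length to 1..15; the rest enforces
    // digits, an optional single comma, and digits after the comma.
    static final Pattern AMOUNT =
            Pattern.compile("^(?=.{1,15}$)\\d+(,\\d+)?$");

    static boolean isAmount(String s) {
        return AMOUNT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "123,45",
            "123456789012345",  // 15 characters
            "1234567890123456", // 16 characters
            "1234567,1234567",  // 15 characters including the comma
            ",5",
            "1,2,3"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isAmount(s));
        }
    }
}
```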
update: If you're looking for this in the middle of another string, use word boundaries \b instead of string boundaries (^ and $).
\b(?=[\d,]{1,15}\b)\d+(,\d+)?\b
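Two things are worth checking before relying on the word-boundary version. First, \b does not fire between a letter and a digit, so an amount directly surrounded by letters is not found. Second, the lookahead can be satisfied by the part before the comma alone, so a longer amount can slip through. The sketch below compares it with a hypothetical variant that uses negative lookarounds on [0-9,] instead; that variant is an assumption of this example, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmbeddedAmounts {
    // The word-boundary pattern from the answer.
    static final Pattern WORD_BOUNDARY = Pattern.compile(
            "\\b(?=[0-9,]{1,15}\\b)[0-9]+(,[0-9]+)?\\b");

    // Hypothetical variant: the run of digits and commas must not be
    // preceded or followed by another digit or comma, and the whole run
    // must be 1..15 characters long.
    static final Pattern LOOKAROUND = Pattern.compile(
            "(?<![0-9,])(?=[0-9,]{1,15}(?![0-9,]))[0-9]+(?:,[0-9]+)?(?![0-9,])");

    static List<String> findAll(Pattern p, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String[] inputs = {"abc123,45def", "12345678,12345678", "a1,2,3b"};
        for (String in : inputs) {
            System.out.println(in + ": \\b -> " + findAll(WORD_BOUNDARY, in)
                    + ", lookaround -> " + findAll(LOOKAROUND, in));
        }
    }
}
```

Whether the lookaround variant fits depends on what counts as a malformed file; here any run of digits and commas that is too long, or that contains more than one comma, is skipped entirely.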
For java:
"\\b(?=[\\d,]{1,15}\\b)\\d+(,\\d+)?\\b"
More readable version:
"\\b(?=[0-9,]{1,15}\\b)[0-9]+(,[0-9]+)?\\b"
