Regular expression to extract sub string in groups - java

I want to extract names from the following input using regular expression.
Student Names:
Name1
Name2
Name3
Parent Names:
Name1
Name2
Name3
I am using the following method to match the data and I am not supposed to modify the method. I have to come up with regular expression that works with this method.
public void parseName(String patternRegX){
Pattern patternDomainStatus = Pattern.compile(patternRegX);
Matcher matcherName = patternName.matcher(inputString);
List<String> tmp=new ArrayList<String>();
while (matcherName.find()){
if (!matcherName.group(2).isEmpty())
tmp.add(matcherName.group(2));
}
}
I came up with a regular expression that could get me the desired result, but the problem I found was that grouping doesn't work inside square brackets([]).
private String studentRegX="(Student Names:\n[ +(\S+)\n]+\n)";
I am using the following regular expression now, but that is getting me only the last name in each set.
private String studentRegX="Student Names:\\n( +(\\S+)\\n)+\\n";
private String parentRegX="Parent Names:\\n( +(\\S+)\\n)+\\n";
Thank you in advance for the help.

First of all, I hope you can change the parseName method a little bit, because it doesn't compile. patternDomainStatus and patternName are probably supposed to refer to the same object:
Pattern pattern = Pattern.compile(patternRegX);
Matcher matcherName = pattern.matcher(inputString);
Secondly, you need to think about your regex a little differently.
Right now, your regexes are trying to match entire chunks with multiple names in them. But matcherName.find() finds "the next subsequence of the input sequence that matches the pattern" (per the javadoc).
So what you want is a regex that matches a single name. matcherName.find() will loop through each part of your string that matches that regex.

Because regex has little to do with algorithmic prowess, here is an answer:
On Windows the line break is unfortunately "\r\n".
I check that a newline preceded and that there is at least some white space before the name.
The name may have a space.
With a look-ahead I check that "Parent Names" follows.
Then
Pattern.compile("(?s)(?<=\n)[ \t]+([^\r\n]*)\r?\n(?=.*Parent Names)");
// ~~~~ '.' also matches newline
// ~~~~~~~ look-behind must be newline
// ~~~~~~ whitespace (spaces/tabs)
// ~~~~~~~~~~ group 1, name
// ~~~~~~~~~~~~~~~~~~~~ look-ahead
Needless to say, a slightly different algorithm would be more robust and understandable.
To make it group(2) instead of the above group(1), you could wrap the leading whitespace in an extra pair of parentheses: ([ \t]+)
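As a rough sketch of how the pattern above might be used, the following wraps it in a helper that mirrors the loop from the question's parseName. The class name StudentNameDemo, the method parseNames, the sample names, and the assumption that every name line is indented by at least one space are illustrative and not from the original post. The extra ([ \t]+) group is there only so that the name lands in group(2).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StudentNameDemo {
    // Indented lines that are followed, somewhere later, by "Parent Names".
    // Group 1 is the indentation, group 2 is the name.
    static final String STUDENT_REGEX =
            "(?s)(?<=\\n)([ \\t]+)([^\\r\\n]*)\\r?\\n(?=.*Parent Names)";

    // Hypothetical stand-in for the question's parseName: same loop, but returns the list.
    static List<String> parseNames(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> names = new ArrayList<>();
        while (matcher.find()) {
            if (!matcher.group(2).isEmpty()) {
                names.add(matcher.group(2));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String input = "Student Names:\n Alice\n Bob\nParent Names:\n Carol\n Dave\n";
        System.out.println(parseNames(STUDENT_REGEX, input));
    }
}
```

Because the look-ahead only succeeds while "Parent Names" is still ahead, the lines after that header are skipped.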

It can be done using the \G anchor all in a single regex.
This opens it up for a little regex algorithmic prowess.
Each match will be either:
Group 1 is not NULL/empty - New student group, group 3 will contain first student name.
Group 2 is not NULL/empty - New parent group, group 3 will contain first parent name.
Group 3 is never NULL/empty - The first/next student or parent name, depending on which of group 1 or 2 last matched.
In all cases, group 3 will contain a name that has been trimmed and ready to put into an array.
# "~(?mi-)(?:(?!\\A)\\G|^(?:(Student)|(Parent))[ ]Names:)\\s*^(?!(?:Student|Parent)[ ]Names:)[^\\S\\r\\n]*(.+?)[^\\S\\r\\n]*$~"
(?xmi-) # Inline 'Expanded, multiline, case insensitive' modifiers
(?:
(?! \A ) # Here, matched before, give Name a first chance
\G # to match again.
|
^ # BOL
(?:
( Student ) # (1), New 'Student' group
| ( Parent ) # (2), New 'Parent' group
)
[ ] Names:
)
# Name section
\s* # Consume all whitespace up until the start of a Name line
^ # BOL
(?!
(?: Student | Parent ) # Names only, Not the start of Student/Parent group here
[ ] Names:
)
[^\S\r\n]* # Trim leading whitespace ( can use \h if supported )
( .+? ) # (3), the Name
[^\S\r\n]* # Trim trailing whitespace ( can use \h if supported )
$ # EOL
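A minimal sketch of driving this \G pattern from Java follows. It uses its own loop rather than the question's parseName, since the name is in group 3 here. The class name GAnchorDemo, the method split, and the sample input are assumptions made for illustration; the modifiers are written as (?mi).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GAnchorDemo {
    static final Pattern NAMES = Pattern.compile(
            "(?mi)(?:(?!\\A)\\G|^(?:(Student)|(Parent))[ ]Names:)"
            + "\\s*^(?!(?:Student|Parent)[ ]Names:)"
            + "[^\\S\\r\\n]*(.+?)[^\\S\\r\\n]*$");

    // Returns two lists: index 0 holds the student names, index 1 the parent names.
    static List<List<String>> split(String input) {
        List<String> students = new ArrayList<>();
        List<String> parents = new ArrayList<>();
        List<String> current = students;
        Matcher m = NAMES.matcher(input);
        while (m.find()) {
            // In Java, a group that did not take part in the match is null.
            if (m.group(1) != null) current = students; // a "Student Names:" header was matched
            if (m.group(2) != null) current = parents;  // a "Parent Names:" header was matched
            current.add(m.group(3));                    // the trimmed name
        }
        List<List<String>> result = new ArrayList<>();
        result.add(students);
        result.add(parents);
        return result;
    }

    public static void main(String[] args) {
        String input = "Student Names:\n Alice\n Bob\nParent Names:\n Carol\n Dave\n";
        System.out.println(split(input));
    }
}
```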

If you're not already familiar with the difference between repeating a capturing group and capturing a repeating group, that's worth reading up on. One resource for that is http://www.regular-expressions.info/captureall.html, but others would be fine too.
If you already knew about that difference and were trying to capture a repeating group already with what you've written above, then please edit your post to explain what you're trying to do (a letter-by-letter explanation would be ideal, so we see what you understand and what you don't, so we can help you with whatever you're stuck on).
I see what I believe is the solution, but since this is clearly homework, I'm not willing to simply give it to you. But I'd be happy to help you figure it out.
--- Edit: ---
You're only getting one match because the regex requires "Student Names:" or "Parent Names:" to be in each match, so you can only match once. For your regex to match multiple times in a row (as required by the while (matcherName.find())), you need to get the "Student Names:" and "Parent Names:" out of the regex, so the regex can match repeatedly.
It's easy to get all of the names (both students and parents), with just a regex that looks for newlines followed by one or more spaces and then text. The challenge is to differentiate the student names (which come before the "Parent Names:" line) from the parent names (which come after the "Parent Names:" line). The key concept for differentiating between them is lookaheads, which can be positive or negative. Take a look at them and see if you can figure out how to implement this using lookaheads.
Also, you may find that group #2 isn't the group you really want to use. It's unfortunate that the group number is hard-coded, but since it is, you can tweak your regex to make groups non-capturing with (?:stuff) syntax. That will let you reduce the number of groups and ensure that the group you actually want is #2.
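For readers who want to check their own attempt, here is one possible shape of the lookahead idea, offered as a sketch rather than the intended homework answer. It assumes each name is a single token indented by spaces; the class name, method name, and sample data are made up for illustration. A positive lookahead keeps names that still have "Parent Names:" ahead of them, and a negative lookahead keeps the rest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadNameDemo {
    // Group 1 is the indentation (only there to push the name into group 2).
    static final String STUDENT_REGEX = "\\n( +)(\\S+)(?=[\\s\\S]*Parent Names:)";
    static final String PARENT_REGEX  = "\\n( +)(\\S+)(?![\\s\\S]*Parent Names:)";

    // Collects group 2 of every match, like the loop in the question.
    static List<String> extract(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> names = new ArrayList<>();
        while (m.find()) {
            names.add(m.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        String input = "Student Names:\n Alice\n Bob\nParent Names:\n Carol\n Dave\n";
        System.out.println(extract(STUDENT_REGEX, input));
        System.out.println(extract(PARENT_REGEX, input));
    }
}
```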

Related

Complicated regex and possible simple way to do it [duplicate]

I don't write many regular expressions so I'm going to need some help on the one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?
Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
Posix allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma
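A small Java sketch of the whitespace-tolerant variant is shown below. java.util.regex does not accept the POSIX bracket form [[:alnum:]], so the sketch uses \p{Alnum} instead; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CommaListValidator {
    // Alphanumeric items separated by commas, with optional whitespace around each comma.
    private static final Pattern LIST =
            Pattern.compile("^\\p{Alnum}+(?:\\s*,\\s*\\p{Alnum}+)*$");

    static boolean isValid(String s) {
        return LIST.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("123, 4A67, GGG, 767"));
        System.out.println(isValid("12333, 78787&*, GH778"));
        System.out.println(isValid("fghkjhfdg8797<"));
    }
}
```

With the examples from the question, only the first string is accepted; a trailing comma is rejected as well.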
Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 0 or more whitespace characters after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match
123 123 abc (no commas) which might be a problem
This will also match 123, (ends with a comma) which might be a problem.
Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespaces at the beginning and end of each item in the comma-separated list.
You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x = [ "123, 4A67, GGG, 767", "12333, 78787&*, GH778" ]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
... print re.match( r, s )
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : match one or more (but not zero) characters from the listed ranges, or a space.
(?:[...]+,)* : In non-capturing parentheses, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Matching zero times allows for no comma.
[...]+ : match at least one of these. This does not include a comma, which ensures that a trailing comma is not accepted. If a trailing comma is acceptable, then the expression is easier: ^[a-zA-Z0-9 ,]+$
Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
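The $LONGSTUFF(,$LONGSTUFF)* shape can also be generated so that the item pattern is written only once. A minimal Java sketch, with a hypothetical helper name and a made-up key=value item pattern:

```java
import java.util.regex.Pattern;

class ListPatternBuilder {
    // Builds ^ITEM(?:,ITEM)*$ from a single item sub-pattern.
    static Pattern commaSeparated(String item) {
        String group = "(?:" + item + ")"; // keep alternations inside ITEM contained
        return Pattern.compile("^" + group + "(?:," + group + ")*$");
    }

    public static void main(String[] args) {
        Pattern pairs = commaSeparated("[a-z]+=[a-z0-9]+");
        System.out.println(pairs.matcher("a=b,c=d").matches());
        System.out.println(pairs.matcher("a=b,").matches());
    }
}
```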
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
import re
xend_fudge_item_re = r"""
e[a-d]x= #register of the call return value to fudge
(
0x[0-9A-F]+ | #either hardcode the reply
[10xks]{32} #or edit the bitfield directly
)
"""
xend_string_item_re = r"""
(0x)?[0-9A-F]+: #leafnum (the contents of EAX before the call)
%s #one fudge
(,%s)* #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
\[ #a list of
'%s' #string elements
(,'%s')* #repeated multiple times
\]
$ #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)
Try ^(?!,)((, *)?([a-zA-Z0-9]+)\b)*$
Step by step description:
Don't match a beginning comma (good for the upcoming "loop").
Match optional comma and spaces.
Match characters you like.
Matching a word boundary makes sure that a comma is required whenever more items follow in the string.
Please use: ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here, I have set the maximum word size to 45, as the longest word in English is 45 characters; this can be changed as required.

Regex to match user and user#domain

A user can log in as "user" or as "user#domain". I only want to extract "user" in both cases. I am looking for a matcher expression to fit it, but I'm struggling.
final Pattern userIdPattern = Pattern.compile("(.*)[#]{0,1}.*");
final Matcher fieldMatcher = userIdPattern.matcher("user#test");
final String userId = fieldMatcher.group(1);
userId returns "user#test". I tried various expressions but it seems that nothing fits my requirement :-(
Any ideas?
If you use "(.*)[#]{0,1}.*" pattern with .matches(), the (.*) grabs the whole line first, then, when the regex index is still at the end of the line, the [#]{0,1} pattern triggers and matches at the end of the line because it can match 0 # chars, and then .* again matches at that very location as it matches any 0+ chars. Thus, the whole line lands in your Group 1.
You may use
String userId = s.replaceFirst("^([^#]+).*", "$1");
See the regex demo.
Details
^ - start of string
([^#]+) - Group 1 (referred to with $1 from the replacement pattern): any 1+ chars other than #
.* - the rest of the string.
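To make the difference concrete, this sketch runs both the original pattern and the replaceFirst variant. The class and method names are invented for the example; only the two patterns come from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UserIdDemo {
    // The pattern from the question: the greedy (.*) swallows the whole input.
    static String withGreedyGroup(String login) {
        Matcher m = Pattern.compile("(.*)[#]{0,1}.*").matcher(login);
        return m.matches() ? m.group(1) : null;
    }

    // The suggested fix: group 1 stops at the first '#'.
    static String extractUser(String login) {
        return login.replaceFirst("^([^#]+).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(withGreedyGroup("user#test"));
        System.out.println(extractUser("user#test"));
        System.out.println(extractUser("user"));
    }
}
```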
A little bit of googling came up with this:
(.*?)(?=#|$)
Will match everything before an optional #
I would suggest keeping it simple and not relying on regex in this case if you are using java and have a simple case like you provided.
You could simply do something like this:
String userId = "user#test";
if (userId.indexOf("#") != -1)
userId = userId.substring(0, userId.indexOf("#"));
// from here on userId will be "user".
This will always either strip out the "#test" or just skip stripping it out when it is not there.
Using regex in most cases makes the code less maintainable by another dev in the future because most devs are not very good with regular expressions, at least in my experience.
You included the # as optional, so the match tries to get the longest user name. As you didn't restrict usernames from containing #, it matched the longest possible string.
Just use:
[^#]*
as the matching subexpr for usernames (and use $0 to get the matched string)
Or you can use this one that can be used to find several matches (and to get both the user part and the domain part):
\b([^#\s]*)(#[^#\s]*)?\b
The \b anchors tie the match to word boundaries. The first group matches any run of characters that are neither whitespace nor # (it is better to use + instead of * there, as usernames must have at least one character), optionally followed by a # and another run of non-space, non-# characters. In this case, $0 matches the whole address, $1 matches the username part, and $2 the #domain part. You can capture only the domain part by adding another pair of parentheses, as in
\b([^#\s]*)(#([^#\s]*))?\b
See demo.

Regular expressions - capturing groups confusion

I am reading an Oracle tutorial on regular expressions and have reached the topic of capturing groups. The reference is excellent, but beyond the fact that parentheses represent a group, I am having difficulty understanding the topic. Here are my points of confusion.
What is the significance of counting groups in an expression?
What are non-capturing groups?
Elaborating with examples would be nice.
One usually doesn't count groups other than to know which group has which number. E.g. ([abc])([def](\d+)) has three groups, so I know to refer to them as \1, \2 and \3. Note that group 3 is inside 2. They are numbered from the left by where they begin.
When searching with regex to find something in a string, as opposed to matching when you make sure the whole string matches the subject, group 0 will give you just the matched string, but not the stuff that was before or after it. Imagine if you will a pair of brackets around your whole regex. It's not part of the total count because it's not really considered a group.
Groups can be used for other things than capturing. E.g. (foo|bar) will match "foo" or "bar". If you're not interested in the contents of a group, you can make it non-capturing (e.g: (?:foo|bar) (varies by dialect)), so as not to "use up" the numbers assigned to groups. But you don't have to, it's just convenient sometimes.
Say I want to find a word that begins and ends in the same letter: \b([a-z])[a-z]*\1\b The \1 will then be the same as whatever the first group captured. Of course it can be used for much more powerful stuff, but I think you'll get the idea.
(Coming up with relevant examples is certainly the hardest part.)
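A short Java sketch of the same-letter example, assuming lowercase input; the class name, method name, and sample sentence are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SameLetterWords {
    // \1 must repeat whatever letter group 1 captured at the start of the word.
    static final Pattern WORD = Pattern.compile("\\b([a-z])[a-z]*\\1\\b");

    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group()); // group() with no argument is group 0, the whole match
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(find("level stats hello radar"));
    }
}
```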
Edit: I answered when the questions were:
What is the significance of counting groups in an expression?
There is a special group, called as group-0, which means the entire expression. It is not reported by groupCount() method. Why is that?
I don't understand what are non-capturing groups?
Why we need back-references? What is the significance of back-references?
Say you have a string, abcabc, and you want to figure out whether the first part of the string matches the second part. You can do this with a single regex by using capturing groups and backreferences. Here is the regex I would use:
(.+)\1
The way this works is .+ matches any sequence of characters. Because it is in parentheses, it is caught in a group. \1 is a backreference to the 1st capturing group, so it is the equivalent of the text caught by the capturing group. After a bit of backtracking, the capturing group matches the first part of the string, abc. The backreference \1 is now the equivalent of abc, so it matches the second half of the string. The entire string is now matched, so it is confirmed that the first half of the string matches the second half.
Another use of backreferences is in replacing. Say you want to replace all {...} with [...], if the text inside { and } is only digits. You can easily do this with capturing groups and backreferences, using the regex
{(\d+)}
and replacing each match with [\1].
The regex matches {123} in the string abc {123} 456, and captures 123 in the first capturing group. The backreference \1 is now the equivalent of 123, so replacing {(\d+)} in abc {123} 456 with [\1] results in abc [123] 456.
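In Java specifically, two details differ from the generic notation used here: the braces have to be escaped in the pattern, and the replacement string refers to the group as $1 rather than \1. A minimal sketch with hypothetical class and method names:

```java
class BackrefReplaceDemo {
    // True when the string is some non-empty text immediately followed by a copy of itself.
    static boolean halvesMatch(String s) {
        return s.matches("(.+)\\1");
    }

    // Replaces {digits} with [digits]; other braces are left alone.
    static String bracesToBrackets(String s) {
        return s.replaceAll("\\{(\\d+)\\}", "[$1]");
    }

    public static void main(String[] args) {
        System.out.println(halvesMatch("abcabc"));
        System.out.println(bracesToBrackets("abc {123} 456"));
    }
}
```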
The reason non-capturing groups exist is because groups in general have more uses than just capturing. The regex (xyz)+ matches a string that consists entirely of the group, xyz, repeated, such as xyzxyzxyz. A group is needed because xyz+ only matches xy and then z repeated, i.e. xyzzzzz. The problem with using capturing groups is that they are slightly less efficient compared to non-capturing groups, and they take up an index. If you have a complicated regex with a lot of groups in it, but you only need to reference a single one somewhere in the middle, it's a lot better to make the others non-capturing and just reference \1 rather than trying to count all the groups up to the one you want.
I hope this helps!
Can't think of an appropriate example at the moment, but I'm assuming someone might need to know the number of sub matches in the RegEx.
Group 0 is always the entire base match. I'm assuming groupCount() just lets you know how many capture groups you've specified in the expression.
A non-capturing group (?:) would be used to, well, not capture a group. Ex. if you need to test if a string contains one of several words and don't want to capture the word in a new group: (?:hello|hi there) world !== hello|hi there world. The first matches "hello world" or "hi there world" but the second matches "hello" or "hi there world".
Back-references can serve a multitude of powerful purposes, such as testing whether a number is prime or composite. :) Or you could simply test that a search parameter isn't repeated, i.e. ^(\d)(?!.*\1)\d+$ would ensure the first digit does not occur again later in the string.
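The points about groupCount() and group 0 can be checked with a few lines of Java. This is only a sketch; the class name, the helper countGroups, and the sample input are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Number of capturing groups in a pattern; group 0 (the whole match) is not counted.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(countGroups("([abc])([def](\\d+))")); // three capturing groups
        System.out.println(countGroups("(?:foo|bar)"));           // non-capturing, so none
        Matcher m = Pattern.compile("([abc])([def](\\d+))").matcher("xx ad42 yy");
        if (m.find()) {
            // Index 0 is the whole match; 1..groupCount() are the capturing groups.
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println(i + ": " + m.group(i));
            }
        }
    }
}
```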

Java regex capturing groups indexes

I have the following line,
typeName="ABC:xxxxx;";
I need to fetch the word ABC,
I wrote the following code snippet,
Pattern pattern4=Pattern.compile("(.*):");
matcher=pattern4.matcher(typeName);
String nameStr="";
if(matcher.find())
{
nameStr=matcher.group(1);
}
So if I put group(0) I get ABC: but if I put group(1) it is ABC, so I want to know
What do this 0 and 1 mean? It would be better if anyone could explain this to me with good examples.
The regex pattern contains a : in it, so why does the group(1) result omit it? Does group 1 capture everything inside the parentheses?
So, if I add two more pairs of parentheses, such as \\s*(\\d*)(.*): then will there be two groups? Will group(1) return the (\\d*) part and group(2) the (.*) part?
The code snippet was given in a purpose to clear my confusions. It is not the code I am dealing with. The code given above can be done with String.split() in a much easier way.
Capturing and grouping
Capturing group (pattern) creates a group that has capturing property.
A related one that you might often see (and use) is (?:pattern), which creates a group without capturing property, hence named non-capturing group.
A group is usually used when you need to repeat a sequence of patterns, e.g. (\.\w+)+, or to specify where alternation should take effect, e.g. ^(0*1|1*0)$ (^, then 0*1 or 1*0, then $) versus ^0*1|1*0$ (^0*1 or 1*0$).
A capturing group, apart from grouping, will also record the text matched by the pattern inside the capturing group (pattern). Using your example, (.*):, .* matches ABC and : matches :, and since .* is inside capturing group (.*), the text ABC is recorded for the capturing group 1.
Group number
The whole pattern is defined to be group number 0.
Capturing groups in the pattern are indexed starting from 1. The indices are defined by the order of the opening parentheses of the capturing groups. As an example, here are all 5 capturing groups in the pattern below:
(group)(?:non-capturing-group)(g(?:ro|u)p( (nested)inside)(another)group)(?=assertion)
|     |                       |          | |      |      ||       |     |
1-----1                       |          | 4------4      |5-------5     |
                              |          3---------------3              |
                              2-----------------------------------------2
The group numbers are used in back-reference \n in pattern and $n in replacement string.
In other regex flavors (PCRE, Perl), they can also be used in sub-routine calls.
You can access the text matched by certain group with Matcher.group(int group). The group numbers can be identified with the rule stated above.
In some regex flavors (PCRE, Perl), there is a branch reset feature which allows you to use the same number for capturing groups in different branches of alternation.
Group name
From Java 7, you can define a named capturing group (?<name>pattern), and you can access the content matched with Matcher.group(String name). The regex is longer, but the code is more meaningful, since it indicates what you are trying to match or extract with the regex.
The group names are used in back-reference \k<name> in pattern and ${name} in replacement string.
Named capturing groups are still numbered with the same numbering scheme, so they can also be accessed via Matcher.group(int group).
Internally, Java's implementation just maps from the name to the group number. Therefore, you cannot use the same name for 2 different capturing groups.
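A brief sketch of named groups applied to the question's input, assuming Java 7 or later. The group name "name" and the class and method names are chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    static final Pattern TYPE = Pattern.compile("(?<name>.*):");

    // Returns the text before the colon, or an empty string when nothing matches.
    static String typeName(String s) {
        Matcher m = TYPE.matcher(s);
        return m.find() ? m.group("name") : "";
    }

    public static void main(String[] args) {
        String typeName = "ABC:xxxxx;";
        Matcher m = TYPE.matcher(typeName);
        if (m.find()) {
            System.out.println(m.group(0));      // whole match, including the colon
            System.out.println(m.group("name")); // by name
            System.out.println(m.group(1));      // the same group, by number
        }
        // ${name} refers to the named group in a replacement string.
        System.out.println(typeName.replaceFirst("(?<name>.*):", "[${name}]"));
    }
}
```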
For The Rest Of Us
Here is a simple and clear example of how this works:
      (     G1     )( G2  )(    G3    )( G4  )(  G5  )
Regex:([a-zA-Z0-9]+)([\s]+)([a-zA-Z ]+)([\s]+)([0-9]+)
String: "!* UserName10 John Smith 01123 *!"
group(0): UserName10 John Smith 01123
group(1): UserName10
group(2):
group(3): John Smith
group(4):
group(5): 01123
As you can see, I have created FIVE groups which are each enclosed in parentheses.
I included the !* and *! on either side to make it clearer. Note that none of those characters are in the RegEx and therefore will not be produced in the results. Group(0) merely gives you the entire matched string (all of my search criteria in one single line). Group 1 stops right before the first space because the space character was not included in the search criteria. Groups 2 and 4 are simply the white space, which in this case is literally a space character, but could also be a tab or a line feed etc. Group 3 includes the space because I put it in the search criteria ... etc.
Hope this makes sense.
Parentheses () are used to enable grouping of regex phrases.
group(1) contains the string matched by the pattern between the parentheses (.*), so by .* in this case,
and group(0) contains the whole matched string.
If you had more groups (read: (...) ), they would be put into groups with the next indexes (2, 3 and so on).

Regex lookaround construct in Java: advise on optimization needed

I am trying to search for filenames in a comma-separated list in:
text.txt,temp_doc.doc,template.tmpl,empty.zip
I use Java's regex implementation. Requirements for output are as follows:
Display only filenames and not their respective extensions
Exclude files that begin with "temp_"
It should look like:
text
template
empty
So far I have managed to write more or less satisfactory regex to cope with the first task:
[^\\.,]++(?=\\.[^,]*+,?+)
I believe to make it comply with the second requirement best option is to use lookaround constructs, but not sure how to write a reliable and optimized expression. While the following regex does seem to do what is required, it is obviously a flawed solution if for no other reason than it relies on explicit maximum filename length.
(?!temp_|emp_|mp_|p_|_)(?<!temp_\\w{0,50})[^\\.,]++(?=\\.[^,]*+,?+)
P.S. I've been studying regexes only for a few days, so please don't laugh at this newbie-style overcomplicated code :)
Display only filenames and not their respective extensions
Exclude files that begin with "temp_"
One variant would be like this:
(?:^|,)(?!temp_)((?:(?!\.[^.]*(?:,|$)).)+)
This allows
file names that do not begin with a "word character" (Tim Pietzcker's solution does not)
file names that contain a dot (sth. like file.name.ext will be matched as file.name)
But actually, this is really complex. You'll be better off writing a small function that splits the input at the commas and strips the extension from the parts.
Anyway, here's the tear-down:
(?:^|,) # filename start: either start of the string or comma
(?!temp_) # negative look-ahead: disallow filenames starting with "temp_"
( # match group 1 (will contain your file name)
(?: # non-capturing group (matches one allowed character)
(?! # negative look-ahead (not followed by):
\. # a dot
[^.]* # any number of non-dots (this matches the extension)
(?:,|$) # filename-end (either end of string or comma)
) # end negative look-ahead
. # this character is valid, match it
)+ # end non-capturing group, repeat
) # end group 1
http://rubular.com/r/4jeHhsDuJG
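The split-based function recommended above might look roughly like the following, shown next to the regex for comparison. The class and method names are hypothetical. One difference to keep in mind: the regex only matches entries that have an extension, while the split version keeps extension-less names unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameFilter {
    // Plain string handling: split on commas, skip temp_ files, cut at the last dot.
    static List<String> baseNames(String csv) {
        List<String> out = new ArrayList<>();
        for (String file : csv.split(",")) {
            if (file.isEmpty() || file.startsWith("temp_")) {
                continue;
            }
            int dot = file.lastIndexOf('.');
            out.add(dot > 0 ? file.substring(0, dot) : file);
        }
        return out;
    }

    // The regex from this answer; group 1 holds the file name without extension.
    static List<String> baseNamesRegex(String csv) {
        Matcher m = Pattern
                .compile("(?:^|,)(?!temp_)((?:(?!\\.[^.]*(?:,|$)).)+)")
                .matcher(csv);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String csv = "text.txt,temp_doc.doc,template.tmpl,empty.zip";
        System.out.println(baseNames(csv));
        System.out.println(baseNamesRegex(csv));
    }
}
```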
How about this:
Pattern regex = Pattern.compile(
"\\b # Start at word boundary\n" +
"(?!temp_) # Exclude words starting with temp_\n" +
"[^,]+ # Match one or more characters except comma\n" +
"(?=\\.) # until the last available dot",
Pattern.COMMENTS);
This also allows dots within filenames.
Another option:
(?:temp_[^,.]*|([^,.]*))\.[^,]*
That pattern will match all file names, but will capture only valid names.
If at the current position the pattern can match temp_file.ext, it matches it and does not capture.
If it cannot match temp_, it tries to match ([^,.]*)\.[^,]*, and captures the file's name.
You can see an example here: http://www.rubular.com/r/QywiDgFxww
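In Java, a group that did not take part in a match is reported as null, so this pattern can be used by skipping matches whose group 1 is null. A minimal sketch, with made-up class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureValidNames {
    static final Pattern FILE = Pattern.compile("(?:temp_[^,.]*|([^,.]*))\\.[^,]*");

    static List<String> names(String csv) {
        List<String> out = new ArrayList<>();
        Matcher m = FILE.matcher(csv);
        while (m.find()) {
            // group(1) is null when the temp_ branch matched instead of the capturing branch
            if (m.group(1) != null) {
                out.add(m.group(1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(names("text.txt,temp_doc.doc,template.tmpl,empty.zip"));
    }
}
```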
