Regex lookaround construct in Java: advise on optimization needed

Regex lookaround construct in Java: advise on optimization needed - java

I am trying to search for filenames in a comma-separated list in:
text.txt,temp_doc.doc,template.tmpl,empty.zip
I use Java's regex implementation. Requirements for output are as follows:
Display only filenames and not their respective extensions
Exclude files that begin with "temp_"
It should look like:
text
template
empty
So far I have managed to write more or less satisfactory regex to cope with the first task:
[^\\.,]++(?=\\.[^,]*+,?+)
I believe to make it comply with the second requirement best option is to use lookaround constructs, but not sure how to write a reliable and optimized expression. While the following regex does seem to do what is required, it is obviously a flawed solution if for no other reason than it relies on explicit maximum filename length.
(?!temp_|emp_|mp_|p_|_)(?<!temp_\\w{0,50})[^\\.,]++(?=\\.[^,]*+,?+)
P.S. I've been studying regexes only for a few days, so please don't laugh at this newbie-style overcomplicated code :)

Display only filenames and not their respective extensions
Exclude files that begin with "temp_"
One variant would be like this:
(?:^|,)(?!temp_)((?:(?!\.[^.]*(?:,|$)).)+)
This allows
file names that do not begin with a "word character" (Tim Pietzcker's solution does not)
file names that contain a dot (sth. like file.name.ext will be matched as file.name)
But actually, this is really complex. You'll be better off writing a small function that splits the input at the commas and strips the extension from the parts.
Anyway, here's the tear-down:
(?:^|,) # filename start: either start of the string or comma
(?!temp_) # negative look-ahead: disallow filenames starting with "temp_"
( # match group 1 (will contain your file name)
(?: # non-capturing group (matches one allowed character)
(?! # negative look-ahead (not followed by):
\. # a dot
[^.]* # any number of non-dots (this matches the extension)
(?:,|$) # filename-end (either end of string or comma)
) # end negative look-ahead
. # this character is valid, match it
)+ # end non-capturing group, repeat
) # end group 1
http://rubular.com/r/4jeHhsDuJG

How about this:
Pattern regex = Pattern.compile(
"\\b # Start at word boundary\n" +
"(?!temp_) # Exclude words starting with temp_\n" +
"[^,]+ # Match one or more characters except comma\n" +
"(?=\\.) # until the last available dot",
Pattern.COMMENTS);
This also allows dots within filenames.

Another option:
(?:temp_[^,.]*|([^,.]*))\.[^,]*
That pattern will match all file names, but will capture only valid names.
If at the current position the pattern can match temp_file.ext, it matches it and does not capture.
It it cannot match temp_, it tires to match ([^,.]*)\.[^,]*, and capture the file's name.
You can see an example here: http://www.rubular.com/r/QywiDgFxww

Related

extract set of lines from file based on pattern match

I have a file that contains thousands of tuples(set of three lines) as follows:
# dev2
SAMETEXT %{URI} ^dev2-00.XXX.XXX.XXX
SAMETEXT %{URI} ^/XXX/
DIFFTEXT ^/XXX/(.*) https://XXX-XXX-XXX-XXX-dev2.XXX.XXX.XXX.XXX.XXX/XXX/$1 [X,Y]
There are multiple sets of same kind with different data such as dev1, dev2, dev3. Now I want to get all lines in same manner as they are in the file except dev2. File have a random or mixed groups but all groups are tuple of same lines as mentioned above.
I tried to get it with the following pattern but it give all other tuples as well which lies inside this span.
Pattern dev2Pattern = Pattern.compile("dev2\\R.*dev2-00.*\\RRewriteRule.*dev2", Pattern.DOTALL);
However, my objective is NOT to get matched pattern in resulted file. Thankx in advance.

If you want to match all the lines after # dev except when it is # dev 2 you could use a negative lookahead to assert what is right after dev is not 2.
Then match all lines that do not start with # dev followed by a digit.
^# dev(?!2\b)[0-9]+(?:\R(?!# dev[0-9]).*)*
^ Start of string
# dev(?!2\b) Match # dev and assert what is directly on the right is not 2 and word boundary
[0-9]+ Match 1+ digits
(?: Non capturing grouop
\R Match unicode newline sequence
(?!# dev[0-9]) Assert what is directly to the right is not # dev and a digit
.* If that is the case, match 0+ times any char except a newline
)* Close group and repeat 0+ times
Regex demo | Java Demo
In java
String regex = "^# dev(?!2\\b)[0-9]+(?:\\R(?!# dev[0-9]).*)*";

Regular expression to extract sub string in groups

I want to extract names from the following input using regular expression.
Student Names:
Name1
Name2
Name3
Parent Names:
Name1
Name2
Name3
I am using the following method to match the data and I am not supposed to modify the method. I have to come up with regular expression that works with this method.
public void parseName(String patternRegX){
Pattern patternDomainStatus = Pattern.compile(patternRegX);
Matcher matcherName = patternName.matcher(inputString);
List<String> tmp=new ArrayList<String>();
while (matcherName.find()){
if (!matcherName.group(2).isEmpty())
tmp.add(matcherName.group(2));
}
}
I came up with a regular expression that could get me the desired result, but the problem I found was that grouping doesn't work inside square brackets([]).
private String studentRegX="(Student Names:\n[ +(\S+)\n]+\n)";
I am using the following regular expression now, but that is getting me only the last name in each set.
private String studentRegX="Student Names:\\n( +(\\S+)\\n)+\\n";
private String parentRegX="Parent Names:\\n( +(\\S+)\\n)+\\n";
Thank you in advance for the help.

First of all, I hope you can change the parseName method a little bit, because it doesn't compile. patternDomainStatus and patternName are probably supposed to refer to the same object:
Pattern pattern = Pattern.compile(patternRegX);
Matcher matcherName = pattern.matcher(inputString);
Secondly, you need to think about your regex a little differently.
Right now, your regexes are trying to match entire chunks with multiple names in them. But matcherName.find() finds "the next subsequence of the input sequence that matches the pattern" (per the javadoc).
So what you want is a regex that matches a single name. matcherName.find() will loop through each part of your string that matches that regex.

Because regex has little to do with algorithmic prowess, here an answer:
On Windows the line break is unfortunately "\r\n".
I check that a newline preceded and that there is at least some white space before the name.
The name may have a space.
With look-behind I check that "Parent Names" follows.
Then
Pattern.compile("(?s)(?<=\n)[ \t]+([^\r\n]*)\r?\n(?=.*Parent Names)");
// ~~~~ '.' also matches newline
// ~~~~~~~ look-behind must be newline
// ~~~~~~ whitespace (spaces/tabs)
// ~~~~~~~~~~ group 1, name
// ~~~~~~~~~~~~~~~~~~~~ look-ahead
Without say, a bit different algorithm would be more solid and understandable.
To make it group(2) instead of the above group(1), you could introduce extra braces before: ([ \t]+)

It can be done using the \G anchor all in a single regex.
This opens it up for a little regex algorithmic prowess.
Each match will be either:
Group 1 is not NULL/empty - New student group, group 3 will contain first student name.
Group 2 is not NULL/empty - New parent group, group 3 will contain first parent name.
Group 3 is never NULL/empty - The first/next either student or parent name depending on which
group 1 or 2 last matched.
In all cases, group 3 will contain a name that has been trimmed and ready to put into an array.
# "~(?mi-)(?:(?!\\A)\\G|^(?:(Student)|(Parent))[ ]Names:)\\s*^(?!(?:Student|Parent)[ ]Names:)[^\\S\\r\\n]*(.+?)[^\\S\\r\\n]*$~"
(?xmi-) # Inline 'Expanded, multiline, case insensitive' modifiers
(?:
(?! \A ) # Here, matched before, give Name a first chance
\G # to match again.
|
^ # BOL
(?:
( Student ) # (1), New 'Student' group
| ( Parent ) # (2), New 'Parent' group
)
[ ] Names:
)
# Name section
\s* # Consume all whitespace up until the start of a Name line
^ # BOL
(?!
(?: Student | Parent ) # Names only, Not the start of Student/Parent group here
[ ] Names:
)
[^\S\r\n]* # Trim leading whitespace ( can use \h if supported )
( .+? ) # (3), the Name
[^\S\r\n]* # Trim trailing whitespace ( can use \h if supported )
$ # EOL

If you're not already familiar with the difference between repeating a capturing group and capturing a repeating group, that's worth reading up on. One resource for that is http://www.regular-expressions.info/captureall.html, but others would be fine too.
If you already knew about that difference and were trying to capture a repeating group already with what you've written above, then please edit your post to explain what you're trying to do (a letter-by-letter explanation would be ideal, so we see what you understand and what you don't, so we can help you with whatever you're stuck on).
I see what I believe is the solution, but since this is clearly homework, I'm not willing to simply give it to you. But I'd be happy to help you figure it out.
--- Edit: ---
You're only getting one match because the regex requires "Student Names:" or "Parent Names:" to be in each match, so you can only match once. For your regex to match multiple times in a row (as required by the while (matcherName.find())), you need to get the "Student Names:" and "Parent Names:" out of the regex, so the regex can match repeatedly.
It's easy to get all of the names (both students and parents), with just a regex that looks for newlines followed by one or more spaces and then text. The challenge is to differentiate the student names (which come before the "Parent Names:" line) from the parent names (which come after the "Parent Names:" line). The key concept for differentiating between them is lookaheads, which can be positive or negative. Take a look at them and see if you can figure out how to implement this using lookaheads.
Also, you may find that group #2 isn't the group you really want to use. It's unfortunate that the group number is hard-coded, but since it is, you can tweak your regex to make groups non-capturing with (?:stuff) syntax. That will let you reduce the number of groups and ensure that the group you actually want is #2.

Simple regex to match strings containing <n> chars

I'm writing this regexp as i need a method to find strings that does not have n dots,
I though that negative look ahead would be the best choice, so far my regexp is:
"^(?!\\.{3})$"
The way i read this is, between start and end of the string, there can be more or less then 3 dots but not 3.
Surprisingly for me this is not matching hello.here.im.greetings
Which instead i would expect to match.
I'm writing in Java so its a Perl like flavor, i'm not escaping the curly braces as its not needed in Java
Any advice?

You're on the right track:
"^(?!(?:[^.]*\\.){3}[^.]*$)"
will work as expected.
Your regex means
^ # Match the start of the string
(?!\\.{3}) # Make sure that there aren't three dots at the current position
$ # Match the end of the string
so it could only ever match the empty string.
My regex means:
^ # Match the start of the string
(?! # Make sure it's impossible to match...
(?: # the following:
[^.]* # any number of characters except dots
\\. # followed by a dot
){3} # exactly three times.
[^.]* # Now match only non-dot characters
$ # until the end of the string.
) # End of lookahead
Use it as follows:
Pattern regex = Pattern.compile("^(?!(?:[^.]*\\.){3}[^.]*$)");
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();

Your regular expression only matches 'not' three consecutive dots. Your example seems to show you want to 'not' match 3 dots anywhere in the sentence.
Try this: ^(?!(?:.*\\.){3})
Demo+explanation: http://regex101.com/r/bS0qW1
Check out Tims answer instead.

Regex for multiple lines

I am looking for a pattern for multiple lines
I am new to regex and heavily using them using in my project
I need to come up with a pattern that will match a few group of lines. The pattern should
match either these lines
* Source: Test *
* *
or
Ord. 429 Tckt. 1
or
Guest:
Yes, it is not clear. I got a pattern for the second line ( Ord. 429 Tckt. 1) which is:
[\s]+[\w]+[\.][\s]+[\d]+[\s]+[\w]+[\.][\s]+[\d]+

If you need one large regex to match all of these, the following should work if you have the Pattern.DOTALL and Pattern.MULTILINE flags set (see Rubular):
^\*[^\n]*\*$.*?^\*[^\n]*\*$|^\w+\.[ \t]+\d+[ \t]+\w+\.[ \t]+\d+$|^Guest:[^\n]*$
Here is a breakdown of the different sections (split by the |):
Your first group of lines:
^\*[^\n]*\*$.*?^\*[^\n]*\*$
---------------------------
^ # start of a line
\* # a literal '*'
[^\n]* # any number of non-newline characters
\* # a literal '*'
$ # end of a line
.*? # any number of characters, as few as possible (includes newlines)
^\*[^\n]*\*$ # repeat of the first six elements of pattern as described above
The second line portion (for lines like 'Ord. 429 Tckt. 1') is adapted from yours with some minor changes.
^\w+\.[ \t]+\d+[ \t]+\w+\.[ \t]+\d+$
As for the third, it should be pretty basic, start of a line followed by 'Guest:' and then any number of non-newline characters.
^Guest:[^\n]*$

Add the multi-line switch (?s) to the front of your regex:
(?s)[\s]+[\w]+[\.][\s]+[\d]+[\s]+[\w]+[\.][\s]+[\d]+

I'm assuming that you are using Java. You would be using java.util.Regex. You are probably looking for the Pattern.DOTALL flag on Pattern. This treats line terminators as a character that you can match with ..
Pattern.compile("^*\sSource: Test\s**\s*", Patther.DOTALL);
It depends on how strict you want to be, but the above will match the first line in the first snippet (including the line terminator).
If you need more help with the API or this is the wrong API, edit your question to be clearer.
Are you trying to match all three in a single regex? It can be done, but the patter will be a bit ugly. I can probably help with that too.
A decent regex tester page is: http://www.fileformat.info/tool/regex.htm. You can do a google search for something like regex java tester.
Just one last thing, the pattern at the bottom won't do what you want if I understand fully.
[\s]+ matches one or more spaces, so whitespace is required on the front. Also, you don't need the square brackets. They work, but are only needed for alternation. If you wanted to match either a or b but not both: [ab]. But, if you want to match just a, you just put a in your pattern.
\s+ one or more spaces
\w+ one or more word chars (no digits or punctuation,etc)
. period
\s+ some whitespace
\d+ some digits
\s+ some whitespace
\w some word chars
. period
\s+ some whitespace
\d+ a single digit
so,
\s+\w+\.\s+\d+\s+\w+\.\s+\d+
Are there supposed to be blank lines in between the Source: Test and the line with just the stars?
You are going to end up with something like this:
(?: # non-capturing group
\s*\* Source: Test\s+\* # first line of the of the first block
\s+\*\s+\* # second line, assuming that there is no space
# between lines or an arbitrary amout of whitespace
) # end of first group
| # or....
(?: # second group (non capturing)
\s+\w+\.\s+\d+\s+\w+\.\s+\d+ # what we discussed before for Org/Tckt
)
|
(?:\s+Guest:) # the last one is easy :)
You may or may not know this, but comments like I have up there can be put into your code via the Pattern.COMMENTS flag. Some people like that. I've also broken up the different groups into their own constant and then pasted them together when compiling the patter. I like that pretty well.
I hope all of this helps.

Differentiating between slashes in a string using a regular expression

A program that I'm writing (in Java) gets input data made up of three kinds of parts, separated by a slash /. The parts can be one of the following:
A name matching the regular expression \w*
A call matching the expression \w*\(.*\)
A path matching the expression <.*>|\".*\". A path can contain slashes.
An example string could look like this:
bar/foo()/foo(bar)/<foo/bar>/bar/"foo/bar"/foo()
which has the following structure
name/call/call/path/name/path/call
I want to split this string into parts, and I'm trying to do this using a regular expression. My current expression captures slashes after calls and paths, but I'm having trouble getting it to capture slashes after names without also including slashes that may exist within paths. My current expression, just capturing slashes after paths and calls looks like this:
(?<=[\)>\"])/
How can I expand this expression to also capture slashes after names without including slashes within paths?

(\w+|\w+\([^/]*\)(?:/\w+\([^/]*\))*|<[^>]*>|"[^"]*")(?=/|$)
captures this from the string 'bar/foo()/foo(bar)/<foo/bar>/bar/"foo/bar"/foo()'
'bar'
'foo()/foo(bar)'
'<foo/bar>'
'bar'
'"foo/bar"'
'foo()'
It does not capture the separating slashes, though (what for? - just assume they are there).
The simpler (\w+|\w+\([^/]*\)|<[^>]*>|"[^"]*")(?=/|$) would capture calls separately:
"foo()"
"foo(bar)"
EDIT: Usually, I do a regex breakdown:
( # begin group 1 (for alternation)
\w+ # at least one word character
| # or...
\w+ # at least one word character
\( # a literal "("
[^/]* # anything but a "/", as often as possible
\) # a literal ")"
| # or...
< # a "<"
[^>]* # anything but a ">", as often as possible
> # a ">"
| # or...
" # a '"'
[^"]* # anything but a '"', as often as possible
" # a '"'
) # end group 1
(?=/|$) # look-ahead: ...followed by a slash or the end of string

My first thought was to match slashes with an even number of quotes to the left of it. (I.e., having a positive look behind of something like (".*")* but this ends up in an exception saying
Look-behind group does not have an obvious maximum length
Honestly I think you'd be better of with a Matcher, using an or:ed together version of your components, (something like \w*|\w*\(.*\)|(<.*>|\".*\")) and do while (matcher.find()).

Having your deliminator for your string not escaped when used inside your input might not be the best choice. However, you do have the luxury of the "false" slash being inside a regular pattern. What I suggest...
Split the whole string on "/"
Parse each part until you get to the start of the path
Put the path elements into a list until the end of the path
Rejoin the path back on "/"
I highly recommend you consider escaping the "/" in your paths to make your life easier.

This pattern captures all parts of your example string separately without including the delimiter into the results:
\w+\(.*?\)|<.*>|\".*\"|\w+

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.