Regex for multiple lines - java

I am looking for a pattern for multiple lines
I am new to regex and am using it heavily in my project.
I need to come up with a pattern that will match a few groups of lines. The pattern should
match either these lines
* Source: Test *
* *
or
Ord. 429 Tckt. 1
or
Guest:
I know this is not very clear. I have a pattern for the second line (Ord. 429 Tckt. 1), which is:
[\s]+[\w]+[\.][\s]+[\d]+[\s]+[\w]+[\.][\s]+[\d]+

If you need one large regex to match all of these, the following should work if you have the Pattern.DOTALL and Pattern.MULTILINE flags set (see Rubular):
^\*[^\n]*\*$.*?^\*[^\n]*\*$|^\w+\.[ \t]+\d+[ \t]+\w+\.[ \t]+\d+$|^Guest:[^\n]*$
Here is a breakdown of the different sections (split by the |):
Your first group of lines:
^\*[^\n]*\*$.*?^\*[^\n]*\*$
---------------------------
^ # start of a line
\* # a literal '*'
[^\n]* # any number of non-newline characters
\* # a literal '*'
$ # end of a line
.*? # any number of characters, as few as possible (includes newlines)
^\*[^\n]*\*$ # repeat of the first six elements of pattern as described above
The second line portion (for lines like 'Ord. 429 Tckt. 1') is adapted from yours with some minor changes.
^\w+\.[ \t]+\d+[ \t]+\w+\.[ \t]+\d+$
As for the third, it should be pretty basic, start of a line followed by 'Guest:' and then any number of non-newline characters.
^Guest:[^\n]*$
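As a rough sketch of how the combined pattern might be used from Java: the class name, method name, and sample text below are made up for illustration, and the sample assumes \n line endings and no leading whitespace on the lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class MultiLineBlocks {
    // The three alternatives from the answer, joined with '|'.
    // DOTALL lets '.' cross line terminators; MULTILINE makes ^ and $ work per line.
    private static final Pattern BLOCKS = Pattern.compile(
        "^\\*[^\\n]*\\*$.*?^\\*[^\\n]*\\*$"
        + "|^\\w+\\.[ \\t]+\\d+[ \\t]+\\w+\\.[ \\t]+\\d+$"
        + "|^Guest:[^\\n]*$",
        Pattern.DOTALL | Pattern.MULTILINE);

    // Returns every matched block, in order of appearance.
    static List<String> findBlocks(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCKS.matcher(text);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String sample = "* Source: Test *\n* *\nOrd. 429 Tckt. 1\nGuest: John\nsome other line\n";
        for (String block : findBlocks(sample)) {
            System.out.println("[" + block + "]");
        }
    }
}
```

Because the first alternative uses a lazy .*? under DOTALL, it will span everything between one starred line and the next starred line, however far apart they are.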

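Add the DOTALL switch (?s) to the front of your regex so that . also matches line terminators (the multi-line switch, which changes ^ and $, is (?m)). Note that \s already matches line terminators regardless of either flag: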
(?s)[\s]+[\w]+[\.][\s]+[\d]+[\s]+[\w]+[\.][\s]+[\d]+

I'm assuming that you are using Java, so you would be using java.util.regex. You are probably looking for the Pattern.DOTALL flag on Pattern. This treats line terminators as characters that you can match with . (dot).
Pattern.compile("^\\*\\sSource: Test\\s*\\*\\s*", Pattern.DOTALL);
It depends on how strict you want to be, but the above will match the first line in the first snippet (including the line terminator).
If you need more help with the API or this is the wrong API, edit your question to be clearer.
Are you trying to match all three in a single regex? It can be done, but the pattern will be a bit ugly. I can probably help with that too.
A decent regex tester page is: http://www.fileformat.info/tool/regex.htm. You can do a google search for something like regex java tester.
Just one last thing, the pattern at the bottom won't do what you want if I understand fully.
[\s]+ matches one or more whitespace characters, so whitespace is required at the front. Also, you don't need the square brackets. They work, but a character class is only needed when you want to match one of several characters: [ab] matches either a or b. If you want to match just a, you just put a in your pattern.
\s+ one or more whitespace chars
\w+ one or more word chars (letters, digits, or underscore; no punctuation)
\. period
\s+ some whitespace
\d+ some digits
\s+ some whitespace
\w+ some word chars
\. period
\s+ some whitespace
\d+ some digits
so,
\s+\w+\.\s+\d+\s+\w+\.\s+\d+
Are there supposed to be blank lines in between the Source: Test and the line with just the stars?
You are going to end up with something like this:
(?: # non-capturing group
\s*\*\s+Source:\s+Test\s+\* # first line of the first block (\s+ instead of literal spaces, which COMMENTS mode ignores)
\s+\*\s+\* # second line, assuming that there is no space
# between lines or an arbitrary amount of whitespace
) # end of first group
| # or....
(?: # second group (non capturing)
\s+\w+\.\s+\d+\s+\w+\.\s+\d+ # what we discussed before for Ord./Tckt.
)
|
(?:\s+Guest:) # the last one is easy :)
You may or may not know this, but comments like the ones I have up there can be put into your code via the Pattern.COMMENTS flag. Some people like that. I've also broken up the different groups into their own constants and then pasted them together when compiling the pattern. I like that pretty well.
I hope all of this helps.
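A minimal sketch of the "one constant per group, pasted together, compiled with Pattern.COMMENTS" idea described above. The class and constant names are hypothetical, and two details differ from the pattern as written: literal spaces are replaced by \s+ (COMMENTS mode ignores unescaped whitespace in the pattern), and the leading whitespace is \s* so lines without indentation are accepted too.

```java
import java.util.regex.Pattern;

// Hypothetical names; only Pattern and Pattern.COMMENTS come from the JDK.
final class TicketPatterns {
    private static final String SOURCE_BLOCK =
        "(?: \\s*\\*\\s+Source:\\s+Test\\s+\\*   # first line: '* Source: Test *'\n"
      + "    \\s+\\*\\s+\\*                        # second line: '* *'\n"
      + ")";

    private static final String ORDER_LINE =
        "(?: \\s*\\w+\\.\\s+\\d+\\s+\\w+\\.\\s+\\d+  # e.g. 'Ord. 429 Tckt. 1'\n"
      + ")";

    private static final String GUEST_LINE =
        "(?: \\s*Guest:                            # the 'Guest:' label\n"
      + ")";

    // In COMMENTS mode, whitespace in the pattern is ignored and '#' starts
    // a comment that runs to the end of the line.
    static final Pattern ANY_BLOCK = Pattern.compile(
        SOURCE_BLOCK + "|" + ORDER_LINE + "|" + GUEST_LINE, Pattern.COMMENTS);

    // True if the whole input is exactly one of the three block types.
    static boolean isKnownBlock(String s) {
        return ANY_BLOCK.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isKnownBlock("* Source: Test *\n* *"));
        System.out.println(isKnownBlock("Ord. 429 Tckt. 1"));
        System.out.println(isKnownBlock("  Guest:"));
    }
}
```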

Related

Modifying existing Java regex

I have the following regex that validates the allowed characters:
^[a-zA-Z0-9-?\/:;(){}\[\]|`~´.\,'+÷ !##$£%^"&*_<>=àáâäçèéêëìíîïñòóôöùúûüýßÀÁÂÄÇÈÉÊËÌÍÎÏÒÓÔÖÙÚÛÜÑ\\]*$
I need to modify it so that the string being validated:
may not begin with space or “/”
may not contain “//”
may not end with “/”
For the space at the beginning I have adapted it to
^[^\s][a-zA-Z0-9-?\\/:;(){}\\[\\]|`~´.\\,'+÷ !##$£%^\"&*_<>=àáâäçèéêëìíîïñòóôöùúûüýßÀÁÂÄÇÈÉÊËÌÍÎÏÒÓÔÖÙÚÛÜÑ\\\\]*$
Not sure what to do about the other two requirements
For the second one I tried combining it with ^((?!//))*$ in various ways but to no success.
Note that ^((?!\/\/))*$ matches any empty string since the lookahead is a non-consuming pattern and here it always returns true.
[^\s] at the start of your pattern will match any chars other than whitespace chars, even those you did not specify in the character class.
You can use
^(?![\s/])(?!.*//)[a-zA-Z0-9?/:;(){}\[\]|`~´.,'+÷ !##$£%^\"&*_<>=àáâäçèéêëìíîïñòóôöùúûüýßÀÁÂÄÇÈÉÊËÌÍÎÏÒÓÔÖÙÚÛÜÑ\\-]*$(?<!/)
See the regex demo. Details:
^(?![\s/])(?!.*//) - at the start of string, two checks are performed:
(?![\s/]) - no whitespace or / allowed (right at the start)
(?!.*//) - no // allowed anywhere after zero or more chars other than line break chars, as many as possible
(?<!/) is the check after the end of string is hit, and it fails the match if the last char in string is /.
Note that in Java regex declarations, you do not need to escape / since regex delimiter notation is not used, and / itself is not a special regex metacharacter.
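To see the three lookaround checks in isolation, here is a small sketch that keeps the structure of the pattern above but swaps in a deliberately short character class ([a-zA-Z0-9 /._-]) as a stand-in for the full list of allowed characters. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical validator; the short character class is an assumption
// standing in for the question's full set of allowed characters.
final class SlashRules {
    private static final Pattern VALID = Pattern.compile(
        "^(?![\\s/])"           // may not begin with whitespace or '/'
        + "(?!.*//)"            // may not contain "//" anywhere
        + "[a-zA-Z0-9 /._-]*$"  // only allowed characters up to the end
        + "(?<!/)");            // the last character may not be '/'

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"abc/def", " abc", "/abc", "a//b", "abc/"};
        for (String s : samples) {
            System.out.println("'" + s + "' -> " + isValid(s));
        }
    }
}
```

Like the original pattern, this one also accepts the empty string, since every check passes trivially and the character class may repeat zero times.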
It seems like the following regexp should be enough and simpler: (?!.*//)^[^ /].*[^/]$
So at the beginning you can use negative lookahead to prevent the occurrence of // anywhere in the text. Then any character but space and / is accepted at the beginning, then anything can be present (besides // which was excluded by negative lookahead) and anything but / is accepted at the end.
Since 95% of the time the special conditions on the space and forward slash
will not occur, it might be better to take those two characters out of your
big class and handle them separately if and when they occur.
The big class can also be condensed to speed things up a bit.
^(?>[a-zA-Z0-9\\!-.:-#\[\]-`{-~£´ÄÖäö÷À-ÂÇ-ÏÑ-ÔÙ-Üß-âç-ïñ-ôù-ý]+|(?:/(?!/|$)|[ ])(?<!^.))*$
https://regex101.com/r/LpCwt6/1
^
(?>
[a-zA-Z0-9\\!-.:-#\[\]-`{-~£´ÄÖäö÷À-ÂÇ-ÏÑ-ÔÙ-Üß-âç-ïñ-ôù-ý]+
| (?:
/
(?! / | $ )
| [ ]
)
(?<! ^ . )
)*
$
And if you want to absorb all the class characters it can get very small.
^(?>[!-.0-~£´ÄÖäö÷À-ÂÇ-ÏÑ-ÔÙ-Üß-âç-ïñ-ôù-ý]+|(?:/(?!/|$)|[ ])(?<!^.))*$
https://regex101.com/r/EYdM5C/1

Regular expression for allowing only 1 of a set of characters

I am trying to use some regex to validate some input inside of Java code. I have been successful in implementing "basic" regex, but this one seems to be out of my scope of knowledge. I am working through RegEgg tutorials to learn more.
Here are the conditions that need to be validated:
Field will always have 8 characters
Can be all spaces
Or
Valid characters: a-zA-Z0-9 -!& or a space
Cannot begin with a space
If one of the special characters is used, it can be the only one used
Legal: "B-123---" "AB&& &" "A!!!!!!!"
Illegal: "B-123!!!" "AB&& -" "A-&! "
Has to have at least one alphanumeric character (Can't be all special characters, i.e. "!!!!!!!!")
This was my regex before additional validations were added:
^(\s{8}|[A-Za-z\-\!\&][ A-Za-z0-9\-\!\&]{7})$
Then came the additional validation for not allowing multiple different special characters, and I am a bit stuck. I have been successful in using a positive lookahead, but I am stuck when trying to use the positive lookbehind. (I think the data before the lookbehind was consumed, but I am speculating, as I am a neophyte with this part of regex.)
Using the or construct (a|b) is a large part of this, and you've begun applying it, so that's a good start.
You've made the rule that it can't start with a digit; nothing in the spec says this. Also, - inside [] has special meaning, so escape it, or make sure it is first or last, because then you don't have to. That gets us to:
^(\s{8}|[A-Za-z0-9-!& -]{8})$
Next up is the rule that it has to be all the same special character if used at all. Given that there are only 3 special characters, it could be easier to just explicitly list them all:
^(\s{8}|[A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8})$
Next up: can't start with a space, and can't be all-special. Confirming the negative (that it ISN'T all special characters) gets complicated; lookahead seems like a better plan here. Some background first:
^ is regexp-ese for "start of line". Note that this doesn't 'consume' a character. 1 is regexp-ese for 'only the exact character 1 will match here, nothing else', but as it matches, it also 'consumes' that character, whereas ^ doesn't do that. 'Start of line' is not a concept that can be consumed.
This notion of 'a match may fail, but if it succeeds, nothing is consumed' isn't limited to ^ and $; you can write your own:
(?=abc) will match if abc would match at this position, but does not consume it. Thus, the regexp ^(?=abc)ab.d$ would match the input string abcd and nothing else. This is called positive lookahead. (It 'looks ahead' and matches if it sees the regular expression in the parens, failing if it does not.)
(?!abc) is negative lookahead. It matches if it DOESNT see the thing in the parens. (?!abc)a.c will match the input adc but not the input abc.
(?<=abc) is positive lookbehind. It matches if the pattern you provide would match such that the match ends at the position you find yourself.
(?<!abc) is negative lookbehind.
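The four constructs can be tried out with a few lines of Java. This is only an illustrative sketch; the class and helper names are made up, and the lookbehind samples ((?<=abc)d and (?<!abc)d) are my own additions in the same spirit as the lookahead examples.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the four lookaround constructs.
final class LookaroundDemo {
    // True if the regex matches the entire input.
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex matches somewhere inside the input.
    static boolean findsIn(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Positive lookahead: "abc" must be ahead, but nothing is consumed.
        System.out.println(matchesAll("^(?=abc)ab.d$", "abcd")); // true
        System.out.println(matchesAll("^(?=abc)ab.d$", "abxd")); // false

        // Negative lookahead: "abc" must NOT be ahead.
        System.out.println(matchesAll("(?!abc)a.c", "adc"));     // true
        System.out.println(matchesAll("(?!abc)a.c", "abc"));     // false

        // Positive / negative lookbehind: look at what ends just before 'd'.
        System.out.println(findsIn("(?<=abc)d", "abcd"));        // true
        System.out.println(findsIn("(?<!abc)d", "abcd"));        // false
    }
}
```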
Note that lookahead and lookbehind can be somewhat limited, in that they may not allow variable length patterns. But, fortunately, your requirements make it easy to limit ourselves to fixed size patterns here. Thus, we can introduce: (?![&!-]{8}) as a non-consuming unit in our regexp that will fail the match if we have all-8 special characters.
We can use this trick to fail on starting space too: (?! ) is all we need for that one.
Let's replace \s, which is any whitespace, with just ' ' (a literal space character), since the problem description says 'space', not 'whitespace'.
Putting it all together:
^( {8}|(?! )(?![&!-]{8})([A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8}))$
That's:
1. 8 spaces, or...
2. not a space, and not all-8 special characters, then,
3. any of the valid chars, any amount of spaces, and any amount of one of the 3 allowed special symbols, as long as we have precisely 8 of them...
4. .. OR the same thing as #3 but with the second of the three special symbols
5. .. OR with the third of the three.
Plug em in at regex101 along with your various examples of 'legal' and 'not legal' and you can play around with it some more.
NB: You can also use backreferences to attempt to solve the 'only one special character is allowed' part of this, but attempting to tackle the 'not all special characters' part seems quite unwieldy if you don't get to use (negative) lookahead.
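For a quick check outside regex101, the combined pattern can be wrapped in a small validator. This is a sketch with invented class and method names; the sample "AB&&   &" is padded to 8 characters on the assumption that the spaces in the question's example were collapsed when it was posted.

```java
import java.util.regex.Pattern;

// Hypothetical validator built around the "putting it all together" pattern.
final class FieldValidator {
    private static final Pattern FIELD = Pattern.compile(
        "^( {8}"                    // exactly 8 spaces, or...
        + "|(?! )(?![&!-]{8})"      // no leading space, not 8 special characters
        + "([A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8}))$");

    static boolean isValid(String field) {
        return FIELD.matcher(field).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "B-123---", "AB&&   &", "A!!!!!!!", "        ",  // expected legal
            "B-123!!!", "AB&&   -", "!!!!!!!!", " ABCDEFG"   // expected illegal
        };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValid(s));
        }
    }
}
```

One thing this exposes: a value made of one special character plus spaces, such as "- ------", still passes, because the second lookahead only rejects eight special characters in a row. If "at least one alphanumeric" has to hold strictly, a positive lookahead like (?=.*[A-Za-z0-9]) in that position would be one way to tighten it.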
It's a matter of asserting the right conditions at the start of the regex.
^(?=[ ]*$|(?![ ]))(?!.*([!&-]).*(?!\1)[!&-])[a-zA-Z0-9 !&-]{8}$
see -> https://regex101.com/r/tN5y4P/1
Some discussion:
^ # Begin of text
(?= # Assert, cannot start with a space
[ ]* $ # unless it's all spaces
| (?! [ ] )
)
(?! # Assert, not mixed special chars
.*
( [!&-] ) # (1)
.*
(?! \1 )
[!&-]
)
[a-zA-Z0-9 !&-]{8} # Consume 8 valid characters from within this class
$ # End of text

extract set of lines from file based on pattern match

I have a file that contains thousands of tuples(set of three lines) as follows:
# dev2
SAMETEXT %{URI} ^dev2-00.XXX.XXX.XXX
SAMETEXT %{URI} ^/XXX/
DIFFTEXT ^/XXX/(.*) https://XXX-XXX-XXX-XXX-dev2.XXX.XXX.XXX.XXX.XXX/XXX/$1 [X,Y]
There are multiple sets of the same kind with different data, such as dev1, dev2, dev3. I want to get all lines exactly as they appear in the file, except those belonging to dev2. The file has the groups in random or mixed order, but every group is a tuple of the same lines as shown above.
I tried the following pattern, but it also returns all the other tuples that lie inside the matched span.
Pattern dev2Pattern = Pattern.compile("dev2\\R.*dev2-00.*\\RRewriteRule.*dev2", Pattern.DOTALL);
However, my objective is NOT to have the matched pattern in the resulting file. Thanks in advance.
If you want to match all the lines after # dev except when it is # dev2, you could use a negative lookahead to assert that what is right after dev is not 2.
Then match all lines that do not start with # dev followed by a digit.
^# dev(?!2\b)[0-9]+(?:\R(?!# dev[0-9]).*)*
^ Start of string (or of each line when Pattern.MULTILINE / (?m) is enabled)
# dev(?!2\b) Match # dev and assert what is directly on the right is not 2 and word boundary
[0-9]+ Match 1+ digits
(?: Non capturing group
\R Match unicode newline sequence
(?!# dev[0-9]) Assert what is directly to the right is not # dev and a digit
.* If that is the case, match 0+ times any char except a newline
)* Close group and repeat 0+ times
In java
String regex = "^# dev(?!2\\b)[0-9]+(?:\\R(?!# dev[0-9]).*)*";
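A possible way to apply this to a whole file's content is sketched below. The class name, method name, and sample data are hypothetical, and it assumes Pattern.MULTILINE is needed so that ^ can match at the start of every "# devN" line rather than only at the start of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that keeps every "# devN" block except dev2.
final class DevBlockFilter {
    private static final Pattern KEEP = Pattern.compile(
        "^# dev(?!2\\b)[0-9]+(?:\\R(?!# dev[0-9]).*)*",
        Pattern.MULTILINE);

    // Collects each block that is not dev2 and joins them back with newlines.
    static String withoutDev2(String config) {
        List<String> blocks = new ArrayList<>();
        Matcher m = KEEP.matcher(config);
        while (m.find()) {
            blocks.add(m.group());
        }
        return String.join("\n", blocks);
    }

    public static void main(String[] args) {
        String config = "# dev1\nA dev1\nB dev1\n"
                      + "# dev2\nA dev2\nB dev2\n"
                      + "# dev3\nA dev3\nB dev3";
        System.out.println(withoutDev2(config));
    }
}
```

The \b in (?!2\b) means that blocks such as # dev20 or # dev21 are kept; only a header that is exactly # dev2 is skipped.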

Simple regex to match strings containing <n> chars

I'm writing this regexp because I need a method to find strings that do not have n dots.
I thought that negative lookahead would be the best choice; so far my regexp is:
"^(?!\\.{3})$"
The way I read this is: between the start and end of the string, there can be more or fewer than 3 dots, but not 3.
Surprisingly for me, this is not matching hello.here.im.greetings,
which I would instead expect to match.
I'm writing in Java, so it's a Perl-like flavor; I'm not escaping the curly braces as it's not needed in Java.
Any advice?
You're on the right track:
"^(?!(?:[^.]*\\.){3}[^.]*$)"
will work as expected.
Your regex means
^ # Match the start of the string
(?!\\.{3}) # Make sure that there aren't three dots at the current position
$ # Match the end of the string
so it could only ever match the empty string.
My regex means:
^ # Match the start of the string
(?! # Make sure it's impossible to match...
(?: # the following:
[^.]* # any number of characters except dots
\\. # followed by a dot
){3} # exactly three times.
[^.]* # Now match only non-dot characters
$ # until the end of the string.
) # End of lookahead
Use it as follows:
Pattern regex = Pattern.compile("^(?!(?:[^.]*\\.){3}[^.]*$)");
Matcher regexMatcher = regex.matcher(subjectString);
boolean foundMatch = regexMatcher.find(); // true unless subjectString has exactly three dots
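Since the question is about n dots in general, the same lookahead can be built for any count. The following is a sketch only; the class and method names are invented, and it simply substitutes n for the 3 in the pattern above.

```java
import java.util.regex.Pattern;

// Hypothetical generalisation of the "not exactly three dots" pattern.
final class DotCounter {
    // True if s does NOT contain exactly n dots (fewer or more is fine).
    static boolean lacksExactlyNDots(String s, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        Pattern p = Pattern.compile("^(?!(?:[^.]*\\.){" + n + "}[^.]*$)");
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(lacksExactlyNDots("hello.here.im.greetings", 3)); // false: exactly 3 dots
        System.out.println(lacksExactlyNDots("hello.here", 3));              // true: only 1 dot
        System.out.println(lacksExactlyNDots("a.b.c.d.e", 3));               // true: 4 dots
    }
}
```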
Your regular expression only matches 'not' three consecutive dots. Your example seems to show you want to 'not' match 3 dots anywhere in the sentence.
Try this: ^(?!(?:.*\\.){3})
Demo+explanation: http://regex101.com/r/bS0qW1
Check out Tims answer instead.

Regex lookaround construct in Java: advise on optimization needed

I am trying to search for filenames in a comma-separated list in:
text.txt,temp_doc.doc,template.tmpl,empty.zip
I use Java's regex implementation. Requirements for output are as follows:
Display only filenames and not their respective extensions
Exclude files that begin with "temp_"
It should look like:
text
template
empty
So far I have managed to write more or less satisfactory regex to cope with the first task:
[^\\.,]++(?=\\.[^,]*+,?+)
I believe to make it comply with the second requirement best option is to use lookaround constructs, but not sure how to write a reliable and optimized expression. While the following regex does seem to do what is required, it is obviously a flawed solution if for no other reason than it relies on explicit maximum filename length.
(?!temp_|emp_|mp_|p_|_)(?<!temp_\\w{0,50})[^\\.,]++(?=\\.[^,]*+,?+)
P.S. I've been studying regexes only for a few days, so please don't laugh at this newbie-style overcomplicated code :)
Display only filenames and not their respective extensions
Exclude files that begin with "temp_"
One variant would be like this:
(?:^|,)(?!temp_)((?:(?!\.[^.]*(?:,|$)).)+)
This allows
file names that do not begin with a "word character" (Tim Pietzcker's solution does not)
file names that contain a dot (sth. like file.name.ext will be matched as file.name)
But actually, this is really complex. You'll be better off writing a small function that splits the input at the commas and strips the extension from the parts.
Anyway, here's the tear-down:
(?:^|,) # filename start: either start of the string or comma
(?!temp_) # negative look-ahead: disallow filenames starting with "temp_"
( # match group 1 (will contain your file name)
(?: # non-capturing group (matches one allowed character)
(?! # negative look-ahead (not followed by):
\. # a dot
[^.]* # any number of non-dots (this matches the extension)
(?:,|$) # filename-end (either end of string or comma)
) # end negative look-ahead
. # this character is valid, match it
)+ # end non-capturing group, repeat
) # end group 1
http://rubular.com/r/4jeHhsDuJG
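The non-regex alternative mentioned above (split at the commas, then strip the extension) could look roughly like this. The class and method names are hypothetical, and the handling of names without a dot or with only a leading dot (both kept unchanged) is my own assumption, since the question does not cover those cases.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split-and-strip instead of one large regex.
final class FileNames {
    static List<String> baseNames(String commaSeparated) {
        List<String> result = new ArrayList<>();
        for (String name : commaSeparated.split(",")) {
            if (name.isEmpty() || name.startsWith("temp_")) {
                continue; // skip empty entries and excluded files
            }
            int lastDot = name.lastIndexOf('.');
            // Strip only the last extension, so "file.name.ext" becomes "file.name".
            result.add(lastDot > 0 ? name.substring(0, lastDot) : name);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(baseNames("text.txt,temp_doc.doc,template.tmpl,empty.zip"));
        // prints [text, template, empty]
    }
}
```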
How about this:
Pattern regex = Pattern.compile(
"\\b # Start at word boundary\n" +
"(?!temp_) # Exclude words starting with temp_\n" +
"[^,]+ # Match one or more characters except comma\n" +
"(?=\\.) # until the last available dot",
Pattern.COMMENTS);
This also allows dots within filenames.
Another option:
(?:temp_[^,.]*|([^,.]*))\.[^,]*
That pattern will match all file names, but will capture only valid names.
If at the current position the pattern can match temp_file.ext, it matches it and does not capture.
If it cannot match temp_, it tries to match ([^,.]*)\.[^,]*, and captures the file's name.
You can see an example here: http://www.rubular.com/r/QywiDgFxww
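In Java, using this pattern means looping over the matches and keeping only those where group 1 actually participated. A minimal sketch, with made-up class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the "match all, capture only valid" pattern.
final class CapturedNames {
    private static final Pattern FILE =
        Pattern.compile("(?:temp_[^,.]*|([^,.]*))\\.[^,]*");

    static List<String> validNames(String commaSeparated) {
        List<String> names = new ArrayList<>();
        Matcher m = FILE.matcher(commaSeparated);
        while (m.find()) {
            // group(1) is null when the temp_ branch matched instead.
            if (m.group(1) != null) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(validNames("text.txt,temp_doc.doc,template.tmpl,empty.zip"));
        // prints [text, template, empty]
    }
}
```

Note that "template" is kept because the first branch requires the literal prefix temp_, underscore included.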
