extract set of lines from file based on pattern match

extract set of lines from file based on pattern match - java

I have a file that contains thousands of tuples(set of three lines) as follows:
# dev2
SAMETEXT %{URI} ^dev2-00.XXX.XXX.XXX
SAMETEXT %{URI} ^/XXX/
DIFFTEXT ^/XXX/(.*) https://XXX-XXX-XXX-XXX-dev2.XXX.XXX.XXX.XXX.XXX/XXX/$1 [X,Y]
There are multiple sets of same kind with different data such as dev1, dev2, dev3. Now I want to get all lines in same manner as they are in the file except dev2. File have a random or mixed groups but all groups are tuple of same lines as mentioned above.
I tried to get it with the following pattern but it give all other tuples as well which lies inside this span.
Pattern dev2Pattern = Pattern.compile("dev2\\R.*dev2-00.*\\RRewriteRule.*dev2", Pattern.DOTALL);
However, my objective is NOT to get matched pattern in resulted file. Thankx in advance.

If you want to match all the lines after # dev except when it is # dev 2 you could use a negative lookahead to assert what is right after dev is not 2.
Then match all lines that do not start with # dev followed by a digit.
^# dev(?!2\b)[0-9]+(?:\R(?!# dev[0-9]).*)*
^ Start of string
# dev(?!2\b) Match # dev and assert what is directly on the right is not 2 and word boundary
[0-9]+ Match 1+ digits
(?: Non capturing grouop
\R Match unicode newline sequence
(?!# dev[0-9]) Assert what is directly to the right is not # dev and a digit
.* If that is the case, match 0+ times any char except a newline
)* Close group and repeat 0+ times
Regex demo | Java Demo
In java
String regex = "^# dev(?!2\\b)[0-9]+(?:\\R(?!# dev[0-9]).*)*";

Related

Regular expression to mask email except the three characters before the domain

I am trying to mask email address in the following different ways.
Mask all characters except first three and the ones follows the # symbol.
This expression works fine.
(?<=.{3}).(?=[^#]*?#)
abcdefgh#gmail.com -> abc*****#gmail.com
Mask all characters except last three before # symbol.
Example : abcdefgh#gmail.com -> *****fgh#gmail.com
I am not sure how to check for # and do reverse match.
Can someone throw pointers on this?

Maybe you could do a positive lookahead:
.(?=.*...#)
See the online Demo
. - Any character other than newline.
(?=.*...#) - Positive lookahead for zero or more characters other than newline followed by three characters other than newline and #.

You could use a negated character class [^\s#] matching a non whitespace char except an #. Then assert what is on the right is that negated character class 3 times followed by matching the # sign.
In the replacement use *
[^\s#](?=[^#\s]*[^#\s]{3}#)
[^\s#] Negated character class, match a non whitespace char except #
(?= Positive lookahead, assert what is on the right is
[^#\s]* Match 0+ times a non whitespace char except #
[^#\s]{3} Match 3 times a non whitespace char except #
# Match the #
) Close lookahead
Regex demo
If there can be only a single # in the email address, you could for example make use of a finite quantifier in the positive lookbehind:
(?<=(?<!\S)[^\s#]{0,1000})[^\s#](?=[^#\s]*[^#\s]{3}#[^\s#]+\.[a-z]{2,}(?!\S))
Regex demo

regex expression to remove eed from string

I am trying to replace 'eed' and 'eedly' with 'ee' from words where there is a vowel before either term ('eed' or 'eedly') appears.
So for example, the word indeed would become indee because there is a vowel ('i') that happens before the 'eed'. On the other hand the word 'feed' would not change because there is no vowel before the suffix 'eed'.
I have this regex: (?i)([aeiou]([aeiou])*[e{2}][d]|[dly]\\b)
You can see what is happening with this here.
As you can see, this is correctly identifying words that end with 'eed', but it is not correctly identifying 'eedly'.
Also, when it does the replace, it is replacing all words that end with 'eed' , even words like feed which it should not remove the eed
What should I be considering here in order to make it correctly identify the words based on the rules I specified?

You can use:
str = str.replaceAll("(?i)\\b(\\w*?[aeiou]\\w*)eed(?:ly)?", "$1ee");
Updated RegEx Demo
\\b(\\w*?[aeiou]\\w*) before eed or eedly makes sure there is at least one vowel in the same word before this.
To expedite this regex you can use negated expression regex:
\\b([^\\Waeiou]*[aeiou]\\w*)eed(?:ly)?
RegEx Breakup:
\\b # word boundary
( # start captured group #`
[^\\Waeiou]* # match 0 or more of non-vowel and non-word characters
[aeiou] # match one vowel
\\w* # followed by 0 or more word characters
) # end captured group #`
eed # followed by literal "eed"
(?: # start non-capturing group
ly # match literal "ly"
)? # end non-capturing group, ? makes it optional
Replacement is:
"$1ee" which means back reference to captured group #1 followed by "ee"

find dly before finding d. otherwise your regex evaluation stops after finding eed.
(?i)([aeiou]([aeiou])*[e{2}](dly|d))

Simple regex to match strings containing <n> chars

I'm writing this regexp as i need a method to find strings that does not have n dots,
I though that negative look ahead would be the best choice, so far my regexp is:
"^(?!\\.{3})$"
The way i read this is, between start and end of the string, there can be more or less then 3 dots but not 3.
Surprisingly for me this is not matching hello.here.im.greetings
Which instead i would expect to match.
I'm writing in Java so its a Perl like flavor, i'm not escaping the curly braces as its not needed in Java
Any advice?

You're on the right track:
"^(?!(?:[^.]*\\.){3}[^.]*$)"
will work as expected.
Your regex means
^ # Match the start of the string
(?!\\.{3}) # Make sure that there aren't three dots at the current position
$ # Match the end of the string
so it could only ever match the empty string.
My regex means:
^ # Match the start of the string
(?! # Make sure it's impossible to match...
(?: # the following:
[^.]* # any number of characters except dots
\\. # followed by a dot
){3} # exactly three times.
[^.]* # Now match only non-dot characters
$ # until the end of the string.
) # End of lookahead
Use it as follows:
Pattern regex = Pattern.compile("^(?!(?:[^.]*\\.){3}[^.]*$)");
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();

Your regular expression only matches 'not' three consecutive dots. Your example seems to show you want to 'not' match 3 dots anywhere in the sentence.
Try this: ^(?!(?:.*\\.){3})
Demo+explanation: http://regex101.com/r/bS0qW1
Check out Tims answer instead.

Regex lookaround construct in Java: advise on optimization needed

I am trying to search for filenames in a comma-separated list in:
text.txt,temp_doc.doc,template.tmpl,empty.zip
I use Java's regex implementation. Requirements for output are as follows:
Display only filenames and not their respective extensions
Exclude files that begin with "temp_"
It should look like:
text
template
empty
So far I have managed to write more or less satisfactory regex to cope with the first task:
[^\\.,]++(?=\\.[^,]*+,?+)
I believe to make it comply with the second requirement best option is to use lookaround constructs, but not sure how to write a reliable and optimized expression. While the following regex does seem to do what is required, it is obviously a flawed solution if for no other reason than it relies on explicit maximum filename length.
(?!temp_|emp_|mp_|p_|_)(?<!temp_\\w{0,50})[^\\.,]++(?=\\.[^,]*+,?+)
P.S. I've been studying regexes only for a few days, so please don't laugh at this newbie-style overcomplicated code :)

Display only filenames and not their respective extensions
Exclude files that begin with "temp_"
One variant would be like this:
(?:^|,)(?!temp_)((?:(?!\.[^.]*(?:,|$)).)+)
This allows
file names that do not begin with a "word character" (Tim Pietzcker's solution does not)
file names that contain a dot (sth. like file.name.ext will be matched as file.name)
But actually, this is really complex. You'll be better off writing a small function that splits the input at the commas and strips the extension from the parts.
Anyway, here's the tear-down:
(?:^|,) # filename start: either start of the string or comma
(?!temp_) # negative look-ahead: disallow filenames starting with "temp_"
( # match group 1 (will contain your file name)
(?: # non-capturing group (matches one allowed character)
(?! # negative look-ahead (not followed by):
\. # a dot
[^.]* # any number of non-dots (this matches the extension)
(?:,|$) # filename-end (either end of string or comma)
) # end negative look-ahead
. # this character is valid, match it
)+ # end non-capturing group, repeat
) # end group 1
http://rubular.com/r/4jeHhsDuJG

How about this:
Pattern regex = Pattern.compile(
"\\b # Start at word boundary\n" +
"(?!temp_) # Exclude words starting with temp_\n" +
"[^,]+ # Match one or more characters except comma\n" +
"(?=\\.) # until the last available dot",
Pattern.COMMENTS);
This also allows dots within filenames.

Another option:
(?:temp_[^,.]*|([^,.]*))\.[^,]*
That pattern will match all file names, but will capture only valid names.
If at the current position the pattern can match temp_file.ext, it matches it and does not capture.
It it cannot match temp_, it tires to match ([^,.]*)\.[^,]*, and capture the file's name.
You can see an example here: http://www.rubular.com/r/QywiDgFxww

Regex for multiple lines

I am looking for a pattern for multiple lines
I am new to regex and heavily using them using in my project
I need to come up with a pattern that will match a few group of lines. The pattern should
match either these lines
* Source: Test *
* *
or
Ord. 429 Tckt. 1
or
Guest:
Yes, it is not clear. I got a pattern for the second line ( Ord. 429 Tckt. 1) which is:
[\s]+[\w]+[\.][\s]+[\d]+[\s]+[\w]+[\.][\s]+[\d]+

If you need one large regex to match all of these, the following should work if you have the Pattern.DOTALL and Pattern.MULTILINE flags set (see Rubular):
^\*[^\n]*\*$.*?^\*[^\n]*\*$|^\w+\.[ \t]+\d+[ \t]+\w+\.[ \t]+\d+$|^Guest:[^\n]*$
Here is a breakdown of the different sections (split by the |):
Your first group of lines:
^\*[^\n]*\*$.*?^\*[^\n]*\*$
---------------------------
^ # start of a line
\* # a literal '*'
[^\n]* # any number of non-newline characters
\* # a literal '*'
$ # end of a line
.*? # any number of characters, as few as possible (includes newlines)
^\*[^\n]*\*$ # repeat of the first six elements of pattern as described above
The second line portion (for lines like 'Ord. 429 Tckt. 1') is adapted from yours with some minor changes.
^\w+\.[ \t]+\d+[ \t]+\w+\.[ \t]+\d+$
As for the third, it should be pretty basic, start of a line followed by 'Guest:' and then any number of non-newline characters.
^Guest:[^\n]*$

Add the multi-line switch (?s) to the front of your regex:
(?s)[\s]+[\w]+[\.][\s]+[\d]+[\s]+[\w]+[\.][\s]+[\d]+

I'm assuming that you are using Java. You would be using java.util.Regex. You are probably looking for the Pattern.DOTALL flag on Pattern. This treats line terminators as a character that you can match with ..
Pattern.compile("^*\sSource: Test\s**\s*", Patther.DOTALL);
It depends on how strict you want to be, but the above will match the first line in the first snippet (including the line terminator).
If you need more help with the API or this is the wrong API, edit your question to be clearer.
Are you trying to match all three in a single regex? It can be done, but the patter will be a bit ugly. I can probably help with that too.
A decent regex tester page is: http://www.fileformat.info/tool/regex.htm. You can do a google search for something like regex java tester.
Just one last thing, the pattern at the bottom won't do what you want if I understand fully.
[\s]+ matches one or more spaces, so whitespace is required on the front. Also, you don't need the square brackets. They work, but are only needed for alternation. If you wanted to match either a or b but not both: [ab]. But, if you want to match just a, you just put a in your pattern.
\s+ one or more spaces
\w+ one or more word chars (no digits or punctuation,etc)
. period
\s+ some whitespace
\d+ some digits
\s+ some whitespace
\w some word chars
. period
\s+ some whitespace
\d+ a single digit
so,
\s+\w+\.\s+\d+\s+\w+\.\s+\d+
Are there supposed to be blank lines in between the Source: Test and the line with just the stars?
You are going to end up with something like this:
(?: # non-capturing group
\s*\* Source: Test\s+\* # first line of the of the first block
\s+\*\s+\* # second line, assuming that there is no space
# between lines or an arbitrary amout of whitespace
) # end of first group
| # or....
(?: # second group (non capturing)
\s+\w+\.\s+\d+\s+\w+\.\s+\d+ # what we discussed before for Org/Tckt
)
|
(?:\s+Guest:) # the last one is easy :)
You may or may not know this, but comments like I have up there can be put into your code via the Pattern.COMMENTS flag. Some people like that. I've also broken up the different groups into their own constant and then pasted them together when compiling the patter. I like that pretty well.
I hope all of this helps.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.