Regex best-practices

I'm just learning how to use regexes.
I'm reading in a text file that is split into sections of two different sorts, demarcated by
<:==]:> and <:==}:>. I need to know for each section whether it's a ] or a }, so I can't just do
Pattern.compile("<:==]:>|<:==}:>").split(text)
Doing this:
Pattern.compile("<:==").split(text)
works, and then I can just look at the first char in each substring, but this seems sloppy to me, and I think I'm only resorting to it because I'm not fully grasping something about regexes.
What would be the best practice here? Also, is there any way to split a string while leaving the delimiter in the resulting strings, such that each begins with the delimiter?
EDIT: the file is laid out like this:
Old McDonald had a farm
<:==}:>
EIEIO. And on that farm he had a cow
<:==]:>
And on that farm he....

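It may be a better idea not to use split() for this. You could instead do a match (note that text before the first delimiter, such as the first line of your sample, is not captured by this pattern):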
List<String> delimList = new ArrayList<String>();
List<String> sectionList = new ArrayList<String>();
Pattern regex = Pattern.compile(
"(<:==[\\]}]:>) # Match a delimiter, capture it in group 1.\n" +
"( # Match and capture in group 2:\n" +
" (?: # the following group which matches...\n" +
" (?!<:==[\\]}]:>) # (unless we're at the start of another delimiter)\n" +
" . # any character\n" +
" )* # any number of times.\n" +
") # End of group 2",
Pattern.COMMENTS | Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
delimList.add(regexMatcher.group(1));
sectionList.add(regexMatcher.group(2));
}
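As for the second part of the question (keeping the delimiter at the start of each piece), one option is to split on a zero-width lookahead, so the delimiter is found but not consumed. The following is only a sketch under that assumption; the class and method names (DelimiterSplit, splitKeepingDelimiters) are made up for illustration.

```java
import java.util.regex.Pattern;

final class DelimiterSplit {
    // Zero-width lookahead: the split point is just before a delimiter,
    // and the delimiter itself stays in the following piece.
    private static final Pattern BEFORE_DELIMITER =
            Pattern.compile("(?=<:==[\\]}]:>)");

    static String[] splitKeepingDelimiters(String text) {
        return BEFORE_DELIMITER.split(text);
    }

    public static void main(String[] args) {
        String text = "Old McDonald had a farm\n<:==}:>\n"
                + "EIEIO. And on that farm he had a cow\n<:==]:>\n"
                + "And on that farm he....";
        for (String part : splitKeepingDelimiters(text)) {
            // Index 4 of "<:==]:>" or "<:==}:>" is the character that tells the two sorts apart.
            char kind = part.startsWith("<:==") ? part.charAt(4) : '-';
            System.out.println(kind + " | " + part.trim());
        }
    }
}
```

Text before the first delimiter comes back as its own piece without a delimiter prefix, which is why the loop checks startsWith before reading the fifth character.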

Related

How can I get non-matching groups using a Matcher in Java?

I'm trying to write a Java regex to catch some groups of words from a String using a Matcher.
Say I have this string: "Hello, we are #happy# to see you today".
I would like to get two groups of matches, one having
Hello, we are
to see you today
and the other
happy
So far, I was only able to match the word between the #s using this Pattern:
Pattern p = Pattern.compile("#(.+?)#");
I've read about negative lookahead and lookaround, played a bit with it but without success.
I assume I should do some sort of negation of the regex so far, but I couldn't come up with anything.
Any help would be really appreciated, thank you.

From comment:
I may encounter a string with more than one instance of words wrapped by #, such as "#Hello# kind #stranger#"
From comment:
I need to apply some different style format to both the text inside and outside.
Since you need to apply different styling, the code needs to process each block of text separately, and needs to know whether the text is inside or outside a #..# section.
Note that the following code silently skips the last # if there is an odd number of them.
String input = ...
for (Matcher m = Pattern.compile("([^#]+)|#([^#]+)#").matcher(input); m.find(); ) {
if (m.start(1) != -1) {
String outsideText = m.group(1);
System.out.println("Outside: \"" + outsideText + "\"");
} else {
String insideText = m.group(2);
System.out.println("Inside: \"" + insideText + "\"");
}
}
Output for input = "Hello, we are #happy# to see you today"
Outside: "Hello, we are "
Inside: "happy"
Outside: " to see you today"
Output for input = "#Hello# kind #stranger#"
Inside: "Hello"
Outside: " kind "
Inside: "stranger"
Output for input = "This #text# has unpaired # characters"
Outside: "This "
Inside: "text"
Outside: " has unpaired "
Outside: " characters"
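To show how the same loop could drive the actual styling, here is a hedged sketch that rebuilds the string with a different wrapper around each kind of segment. The class name HashStyler and the * and _ markers are placeholders I made up; substitute whatever formatting you really need.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashStyler {
    // Group 1: text outside #..#; group 2: text inside #..# (same pattern as above).
    private static final Pattern SEGMENT = Pattern.compile("([^#]+)|#([^#]+)#");

    // Wraps inside text in *...* and outside text in _..._ (placeholder "styles").
    static String style(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = SEGMENT.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                out.append('_').append(m.group(1)).append('_');
            } else {
                out.append('*').append(m.group(2)).append('*');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(style("Hello, we are #happy# to see you today"));
        System.out.println(style("#Hello# kind #stranger#"));
    }
}
```

Because each find() call yields exactly one segment, the segments come out in their original order, so the styled pieces can simply be appended one after another.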

The best I could do is splitting in 3 groups, then merging the group 1 and 4 :
(^.*)(\#(.+?)\#)(.*)
Test it here
EDIT: Taking remarks from the comments :
(^[^\#]*)(?:\#(.+?)\#)([^\#]*)
Thanks to @Lino, we no longer capture the useless group containing the # characters, and the surrounding groups now capture anything except # instead of any non-whitespace character.
Test it here

Is this solution fine?
Pattern pattern = Pattern.compile("([^#]+)|#([^#]*)#");
Matcher matcher = pattern.matcher("Hello, we are #happy# to see you today");
List<String> notBetween = new ArrayList<>(); // not surrounded by #
List<String> between = new ArrayList<>(); // surrounded by #
while (matcher.find()) {
if (Objects.nonNull(matcher.group(1))) notBetween.add(matcher.group(1));
if (Objects.nonNull(matcher.group(2))) between.add(matcher.group(2));
}
System.out.println("Printing group 1");
for (String string : notBetween) {
System.out.println(string);
}
System.out.println("Printing group 2");
for (String string : between) {
System.out.println(string);
}

regex to select specific multiple lines

I'm trying to capture a group of lines, following a specific term, from a large number of lines (up to 100 to 130).
Here is my code.
String inp = "Welcome!\n"
+" Welcome to the Apache ActiveMQ Console of localhost (ID:InternetXXX022-45298-5447895412354475-2:9) \n"
+" You can find more information about Apache ActiveMQ on the Apache ActiveMQ Site \n"
+" Broker\n"
+" Name localhost\n"
+" Version 5.13.3\n"
+" ID ID:InternetXXX022-45298-5447895412354475-2:9\n"
+" Uptime 14 days 14 hours\n"
+" Store percent used 19\n"
+" Memory percent used 0\n"
+" Temp percent used 0\n"
+ "Queue Views\n"
+ "Graph\n"
+ "Topic Views\n"
+ " \n"
+ "Subscribers Views\n";
Pattern rgx = Pattern.compile("(?<=Broker)\n((?:.*\n){1,7})", Pattern.DOTALL);
Matcher mtch = rgx.matcher(inp);
if (mtch.find()) {
String result = mtch.group();
System.out.println(result);
}
I want to capture the lines below from the lines in inp shown above.
Name localhost\n
Version 5.13.3\n
ID ID:InternetXXX022-45298-5447895412354475-2:9\n
Uptime 14 days 14 hours\n
Store percent used 19\n
Memory percent used 0\n
Temp percent used 0\n
But my code gives me all the lines after "Broker". What am I doing wrong?
Secondly, I understand that ?: means a non-capturing group, so why is my regex ((?:.*\n)) still able to capture the lines after Broker?

You must remove Pattern.DOTALL, since it makes . match newlines too: .* then grabs the whole text, and the limiting quantifier has no effect.
Besides, your real data seems to contain CRLF line endings, so it is more convenient to use \R rather than \n to match line breaks. Else, you may use a Pattern.UNIX_LINES modifier (or its embedded flag equivalent, (?d), inside the pattern) and then you may keep your pattern as is (since only \n, LF, will be considered a line break and . will match carriage returns, CRs).
Also, I suggest trimming the result.
Use
Pattern rgx = Pattern.compile("(?<=Broker)\\R((?:.*\\R){1,7})");
// Or,
// Pattern rgx = Pattern.compile("(?d)(?<=Broker)\n((?:.*\n){1,7})");
Matcher mtch = rgx.matcher(inp);
if (mtch.find()) {
String result = mtch.group();
System.out.println(result.trim());
}
See the Java demo online.
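To make the difference concrete, here is a small sketch that runs the same pattern with and without Pattern.DOTALL on a shortened, made-up sample (the class name BrokerLines and the sample text are my own, not your real data). It also touches on the second question: (?:...) only groups its contents for the {1,7} quantifier, while the outer pair of parentheses is an ordinary capturing group, so group(1) still holds the lines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BrokerLines {
    // Returns capture group 1 (the lines after "Broker"), or null if nothing matches.
    static String linesAfterBroker(String input, int flags) {
        Pattern p = Pattern.compile("(?<=Broker)\\R((?:.*\\R){1,7})", flags);
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened stand-in for the console text.
        String sample = "Welcome!\n Broker\nA 1\nB 2\nC 3\nD 4\nE 5\nF 6\nG 7\n"
                + "Queue Views\nGraph\n";
        System.out.println("Without DOTALL:\n" + linesAfterBroker(sample, 0));
        System.out.println("With DOTALL:\n" + linesAfterBroker(sample, Pattern.DOTALL));
    }
}
```

Without the flag, each .* stops at the end of its line, so at most seven lines are taken. With the flag, the first .* runs to the end of the input and backtracks only to the last line break, so a single repetition already swallows everything.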

How to email validate with apostrophe and limit the number of dots after the @ symbol using Regular expression. (JAVA)

Can someone help me please? I am not very familiar with regEx and I am trying to validate email addresses.
The regEx and code that I have is:
public class TestRegExEmail {
public static void main(final String[] args) {
// List of valid email addresses
List<String> validValues = new ArrayList<>();
validValues.add("wiliam@hotmail.com");
validValues.add("wiliam.ferraciolli@hotmail.com");
validValues.add("wiliam@hotmail.co.uk");
validValues.add("wiliam.ferraciolli@hotmail.co.uk");
validValues.add("wiliam_ferraciolli@hotmail.co.uk");
validValues.add("wiliam'ferraciolli@hotmail.co.uk");
validValues.add("wiliam334-1@mydomain.co.uk.me");
// List of invalid email addresses
List<String> invalidValues = new ArrayList<>();
invalidValues.add("wiliam.ferraciolli@hotmail.com.dodge.too.many");
invalidValues.add("hwiliam@hotmail.com.otherdomain.uk.dodge");
invalidValues.add("wiliam.ferraciolli@hotmail.com.com.com.com");
invalidValues.add("wiliam.hotmail.com");
invalidValues.add("wiliam..ferraciolli@hotmail.com");
invalidValues.add("wiliam%ferraciolli.@hotmail.com");
invalidValues.add("wiliam$ferraciolli.@hotmail.com");
invalidValues.add("wiliam/ferraciolli.@hotmail.com");
// Pattern
String regex = "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";
Pattern pattern = Pattern.compile(regex);
// print valid emails
for (String s : validValues) {
Matcher matcher = pattern.matcher(s);
System.out.println(s + " // " + matcher.matches());
}
System.out.println();
// print invalid emails
for (String s : invalidValues) {
Matcher matcher = pattern.matcher(s);
System.out.println(s + " // " + matcher.matches());
}
}
}
This regEx works fine but fails on emails with apostrophes. The other issue is that it would be ideal to allow only 3 dots after the @ symbol.
Any suggestions would be appreciated.
Regards

This regEx works fine but fails on emails with apostrophes.
Followed @Am_I_Helpful comment and found a good solution. "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
Exactly, you need to include all the allowed characters in the character class:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
       \_________________/
    non-alphanumerics allowed
I'd also anchor your pattern to the beginning and end of the string with ^ and $, as you had in your previous version.
The other issue is that it would be ideal to allow only 3 dots after the @ symbol.
This non-capturing group from your regex is repeated once for every dot after the @:
@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
 \                                 \_/ /
  \                      literal dots /
   \_________________________________/
       repeated once for each dot
But you're using the + quantifier, which repeats at least once and as many times as it can match. Instead, limit repetition with the {1,3} quantifier.
@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.){1,3}[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
                                       ^^^^^
Regex
"^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.){1,3}[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$"
regex101 demo
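A quick way to check the final expression against a few of the addresses from the question is a small harness like the one below. This is a sketch only: the class name EmailCheck and the method isValid are made up, and the pattern is the one above with @ as the separator.

```java
import java.util.regex.Pattern;

final class EmailCheck {
    // The anchored pattern from above, allowing 1 to 3 dots after the @.
    private static final Pattern EMAIL = Pattern.compile(
            "^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
            + "@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.){1,3}"
            + "[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$");

    static boolean isValid(String address) {
        return EMAIL.matcher(address).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "wiliam'ferraciolli@hotmail.co.uk",              // apostrophe, 2 dots
            "wiliam334-1@mydomain.co.uk.me",                 // 3 dots
            "wiliam.ferraciolli@hotmail.com.dodge.too.many", // 4 dots
            "wiliam..ferraciolli@hotmail.com"                // empty label in local part
        };
        for (String s : samples) {
            System.out.println(s + " // " + isValid(s));
        }
    }
}
```

Note that the character classes are lowercase only, so an address with capital letters is rejected unless you compile the pattern with Pattern.CASE_INSENSITIVE.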

How to remove dot (.) character using a regex for email addresses of type "abcd.efgh@xyz.com" in java?

I was trying to write a regex to detect email addresses of the type 'abc@xyz.com' in Java. I came up with a simple pattern.
String line = // my line containing email address
Pattern myPattern = Pattern.compile("()(\\w+)( *)@( *)(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);
This will however also detect email addresses of the type 'abcd.efgh@xyz.com'.
I went through http://www.regular-expressions.info/ and links on this site like
How to match only strings that do not contain a dot (using regular expressions)
Java RegEx meta character (.) and ordinary dot?
So I changed my pattern to the following to avoid detecting 'efgh@xyz.com'
Pattern myPattern = Pattern.compile("([^\\.])(\\w+)( *)@( *)(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);
String mailid = myMatcher.group(2) + "@" + myMatcher.group(5) + ".com";
If String 'line' contained the address 'abcd.efgh@xyz.com', my String mailid will come back with 'fgh@xyz.com'. Why does this happen? How do I write the regex to detect only 'abc@xyz.com' and not 'abcd.efgh@xyz.com'?
Also, how do I write a single regex to detect email addresses like 'abc@xyz.com' and 'efg at xyz.com' and 'abc (at) xyz (dot) com' from strings? Basically, how would I implement OR logic in a regex to check for @ OR at OR (at)?
After some comments below I tried the following expression to get the part before the @ squared away.
Pattern myPattern = Pattern.compile("((([\\w]+\\.)+[\\w]+)|([\\w]+))@(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);
What will the groups of myMatcher be? How are the groups numbered when we have nested brackets?
System.out.println(myMatcher.group(1));
System.out.println(myMatcher.group(2));
System.out.println(myMatcher.group(3));
System.out.println(myMatcher.group(4));
System.out.println(myMatcher.group(5));
the output was like
abcd.efgh
abcd.efgh
abcd.
null
xyz
for abcd.efgh@xyz.com
abc
null
null
abc
xyz
for abc@xyz.com
Thanks.

You can use the | operator in your regex to match @ OR at OR (at): @|at|\(at\).
You can avoid matching a dot in the part before the @ by using ^ at the beginning of the pattern.
Try this:
Pattern myPattern = Pattern.compile("^(\\w+)\\s*(@|at|\\(at\\))\\s*(\\w+)\\.(\\w+)");
Matcher myMatcher = myPattern.matcher(line);
if (myMatcher.matches())
{
String mail = myMatcher.group(1) + "@" + myMatcher.group(3) + "." + myMatcher.group(4);
System.out.println(mail);
}

Your first pattern needs to combine the two requirements, word characters and no dots, into a single character class; you currently have them separately. It should be:
[^\\.\\W]+
This is 'not dots' and 'not non-word characters'.
So you have:
Pattern myPattern = Pattern.compile("([^\\.\\W]+)( *)@( *)(\\w+)\\.com");
To answer your second question, you can use OR in a regex with the | character:
(@|at)
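For searching inside a longer line (find() rather than matches()), one possible approach is to combine the alternation with a negative lookbehind, so that a match cannot begin right after a dot or in the middle of a word. The sketch below assumes that approach; the class name EmailExtractor, the extract method and the sample lines are hypothetical, and only .com addresses are handled, as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailExtractor {
    private static final Pattern ADDRESS = Pattern.compile(
            "(?<![\\w.])"                    // not preceded by a word character or a dot
            + "(\\w+)\\s*"                   // group 1: local part without dots
            + "(?:@|\\(at\\)|\\bat\\b)\\s*"  // "@" OR "(at)" OR the word "at"
            + "(\\w+)\\s*"                   // group 2: domain name
            + "(?:\\.|\\(dot\\))\\s*com\\b"); // "." OR "(dot)", then "com"

    // Returns every address found in the line, normalized to the form name@domain.com.
    static List<String> extract(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = ADDRESS.matcher(line);
        while (m.find()) {
            found.add(m.group(1) + "@" + m.group(2) + ".com");
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("contact abc@xyz.com today"));
        System.out.println(extract("abcd.efgh@xyz.com"));
        System.out.println(extract("abc (at) xyz (dot) com"));
    }
}
```

The lookbehind is what rejects 'abcd.efgh@xyz.com': the only position from which \w+ could reach the @ starts right after a dot or inside a word, and both are excluded.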

Regex for String with possible escape characters

I asked this question some time ago here: Regular expression that does not contain quote but can contain escaped quote. I got a response, but somehow I am not able to make it work in Java.
Basically I need to write a regular expression that matches a valid string beginning and ending with quotes, which can have quotes in between provided they are escaped.
In the code below, I essentially want to match all three strings and print true, but cannot.
What should be the correct regex?
Thanks
public static void main(String[] args) {
String[] arr = new String[]
{
"\"tuco\"",
"\"tuco \" ABC\"",
"\"tuco \" ABC \" DEF\""
};
Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
for (String str : arr) {
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());
}
}

The problem is not so much your regex as your test strings. The single backslash before each internal quote in your second and third example strings is consumed when the string literal is parsed. The string passed to the regex engine has no backslash before the quote. (Try printing it out.) Here is a tested version of your function which works as expected:
import java.util.regex.*;
public class TEST
{
public static void main(String[] args) {
String[] arr = new String[]
{
"\"tuco\"",
"\"tuco \\\" ABC\"",
"\"tuco \\\" ABC \\\" DEF\""
};
//old: Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
Pattern pattern = Pattern.compile(
"# Match double quoted substring allowing escaped chars. \n" +
"\" # Match opening quote. \n" +
"( # $1: Quoted substring contents. \n" +
" [^\"\\\\]* # {normal} Zero or more non-quote, non-\\. \n" +
" (?: # Begin {(special normal*)*} construct. \n" +
" \\\\. # {special} Escaped anything. \n" +
" [^\"\\\\]* # more {normal} non-quote, non-\\. \n" +
" )* # End {(special normal*)*} construct. \n" +
") # End $1: Quoted substring contents. \n" +
"\" # Match closing quote. ",
Pattern.DOTALL | Pattern.COMMENTS);
for (String str : arr) {
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());
}
}
}
I've replaced your regex with an improved version (taken from MRE3). Note that this question gets asked a lot. Please see this answer where I compare several functionally equivalent expressions.
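If you want to see the literal-versus-runtime difference for yourself, a minimal check like the following may help. It is a sketch that keeps your original pattern unchanged; the class name EscapedQuoteDemo and the method isQuoted are made up for the example.

```java
import java.util.regex.Pattern;

final class EscapedQuoteDemo {
    // The original pattern from the question, unchanged.
    private static final Pattern QUOTED =
            Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");

    static boolean isQuoted(String s) {
        return QUOTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String noBackslash = "\"tuco \" ABC\"";     // runtime text: "tuco " ABC"
        String withBackslash = "\"tuco \\\" ABC\""; // runtime text: "tuco \" ABC"
        // Printing the strings shows what the regex engine actually receives.
        System.out.println(noBackslash + " -> " + isQuoted(noBackslash));
        System.out.println(withBackslash + " -> " + isQuoted(withBackslash));
    }
}
```

The first string contains no backslash at runtime, so the unescaped inner quote ends the match early and matches() fails; the second one keeps the backslash and matches even with the original pattern.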
