How to optimize a regex pattern?


I'm trying to fetch substring(s) from a text and using regex for it.
Sample text:
bla bla 1:30-2pm bla bla 5-6:30am some text 1-2:15am
I'm looking for the time frame entries (1:30-2pm, 5-6:30am, 1-2:15am).
Here's my regex:
\d{1,2}(:\d{1,2})? – \d{1,2}(:\d{1,2})?(am|pm)
java snippet:
public static List<String> foo(String text, String regex) {
    List<String> entries = new ArrayList<>();
    Matcher matcher = Pattern.compile(regex).matcher(text);
    while (matcher.find()) {
        entries.add(matcher.group());
    }
    return entries;
}
Can you help me optimize the regex pattern? There might be some use cases that I missed.

To make the expression more robust, we can allow optional whitespace around the hyphen in case the input contains extra spaces. Other than that, your expression looks great:
(\d{1,2})(:\d{1,2})?(\s+)?-(\s+)?(\d{1,2})(:\d{1,2})?(am|pm)
We have also added capturing groups, if we wish to get the data.
Or, if the first time may also carry its own am/pm:
(\d{1,2})(:\d{1,2})?(\s+)?(am|pm)?(\s+)?-(\s+)?(\d{1,2})(:\d{1,2})?(\s+)?(am|pm)
Use whichever fits your input.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "(\\d{1,2})(:\\d{1,2})?(\\s+)?-(\\s+)?(\\d{1,2})(:\\d{1,2})?(am|pm)";
final String string = "bla bla 1:30-2pm bla bla 5-6:30am some text 1-2:15am\n"
+ "bla bla 1:30 - 2pm bla bla 5 - 6:30am some text 1 - 2:15am";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}

I suggest using a regex like
String regex = "(?i)(?<!\\d)(?:0?[1-9]|1[0-2])(?::[0-5]\\d)?\\p{Pd}(?:0?[1-9]|1[0-2])(?::[0-5]\\d)?[ap]m\\b";
See the regex demo
Details
(?i) - case insensitive flag (for AM, PM, am, pm values etc.)
(?<!\d) - no digit immediately to the left is allowed
(?:0?[1-9]|1[0-2]) - an optional 0 and then a digit from 1 to 9, or 1 and then 0, 1 or 2
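(?::[0-5]\d)? - an optional group: a colon, then a digit from 0 to 5, then any one digit
\p{Pd} - any dash punctuation character (Unicode category Pd, which includes the ASCII hyphen and the en dash)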
(?:0?[1-9]|1[0-2])(?::[0-5]\d)? - see above
[ap]m\b - a or p and then m and a word boundary.
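As a quick check of the pattern described above, here is a minimal sketch that runs it against the sample text from the question. The class name TimeRangeFinder and the helper findRanges are made up for illustration; only the regex comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimeRangeFinder {
    // Regex taken from the answer above; compiled once and reused.
    private static final Pattern RANGE = Pattern.compile(
        "(?i)(?<!\\d)(?:0?[1-9]|1[0-2])(?::[0-5]\\d)?\\p{Pd}(?:0?[1-9]|1[0-2])(?::[0-5]\\d)?[ap]m\\b");

    // Hypothetical helper: returns every time range found in the text.
    static List<String> findRanges(String text) {
        List<String> ranges = new ArrayList<>();
        Matcher matcher = RANGE.matcher(text);
        while (matcher.find()) {
            ranges.add(matcher.group());
        }
        return ranges;
    }

    public static void main(String[] args) {
        String text = "bla bla 1:30-2pm bla bla 5-6:30am some text 1-2:15am";
        System.out.println(findRanges(text)); // [1:30-2pm, 5-6:30am, 1-2:15am]
    }
}
```

Because hours are limited to 1-12 and the lookbehind rejects a preceding digit, an input such as 13-2pm yields no match at all instead of a partial one.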

Related

Merge two pattern into one

I need to write a pattern that removes the currency symbol and commas, e.g. from Fr.-145,000.01
After matching, the result should be -145000.01.
The pattern I am using:
^[^0-9\\-]*([0-9\\-\\.\\,]*?)[^0-9\\-]*$
This returns -145,000.01
I then remove the comma to get -145000.01. Is it possible to change the pattern so that I get -145000.01 directly?
String pattern = "^[^0-9\\-]*([0-9\\-\\.\\,]*?)[^0-9\\-]*$";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(str);
if (m.matches()) {
    System.out.println(m.group(1));
}
I expect the output to have the commas removed.

You can simplify it with String.replaceAll() and a simpler regex (provided the input is reasonably sane, i.e. without multiple decimal points or multiple minus signs embedded in the number):
String str = "Fr.-145,000.01";
String cleaned = str.replaceAll("[^\\d-.]\\.?", ""); // strings are immutable, so keep the result
If you are going down this route, I would sanity check it by parsing the output with BigDecimal or Double.
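The sanity check suggested above could look roughly like this. It is only a sketch: the class name AmountParser and the method parseAmount are hypothetical, and it assumes that rejecting malformed input with an exception is acceptable.

```java
import java.math.BigDecimal;

class AmountParser {
    // Strips everything except digits, '-' and '.' using the regex from the answer,
    // then lets BigDecimal reject anything that is still not a valid number.
    static BigDecimal parseAmount(String raw) {
        String cleaned = raw.replaceAll("[^\\d-.]\\.?", "");
        return new BigDecimal(cleaned); // throws NumberFormatException on malformed input
    }

    public static void main(String[] args) {
        System.out.println(parseAmount("Fr.-145,000.01")); // -145000.01
    }
}
```

An input such as Fr.1.2.3 is cleaned to 1.2.3, which BigDecimal refuses with a NumberFormatException, so bad data does not slip through silently.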

One approach would be to just collect our desired digits, ., + and - in a capturing group followed by an optional comma, and then join them:
([+-]?[0-9][0-9.]+),?
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "([+-]?[0-9][0-9.]+),?";
final String string = "Fr.-145,000.01\n"
+ "Fr.-145,000\n"
+ "Fr.-145,000,000\n"
+ "Fr.-145\n"
+ "Fr.+145,000.01\n"
+ "Fr.+145,000\n"
+ "Fr.145,000,000\n"
+ "Fr.145\n"
+ "Fr.145,000,000,000.01";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}

String str = "Fr.-145,000.01";
Pattern regex = Pattern.compile("^[^0-9-]*(-?[0-9]+)(?:,([0-9]{3}))?(?:,([0-9]{3}))?(?:,([0-9]{3}))?(\\.[0-9]+)?[^0-9-]*$");
Matcher matcher = regex.matcher(str);
System.out.println(matcher.replaceAll("$1$2$3$4$5"));
Output:
-145000.01
It looks for a number with up to 3 commas (up to 999,999,999,999.99) and replaces the whole input with just the sign, the digits and the decimal part.

My approach would be to remove all the unnecessary parts using replaceAll.
The unnecessary parts are, apparently:
Any sequence which is not digits or minus at the beginning of the string.
Commas
The first pattern is represented by ^[^\\d-]+. The second is merely ,.
Put them together with an |:
Pattern p = Pattern.compile("(^[^\\d-]+)|,");
Matcher m = p.matcher(str);
String result = m.replaceAll("");

You could use 2 capturing groups and make use of repeated matching with the \G anchor, which asserts the position at the end of the previous match.
(?:^[^0-9+-]+(?=[.+,\d-]*\.\d+$)([+-]?\d{1,3})|\G(?!^)),(\d{3})
In Java
String regex = "(?:^[^0-9+-]+(?=[.+,\\d-]*\\.\\d+$)([+-]?\\d{1,3})|\\G(?!^)),(\\d{3})";
Explanation
(?: Non capturing group
^[^0-9+-]+ Match 1+ times not a digit, + or -
(?= Positive lookahead, assert that what follows is:
[.+,\d-]*\.\d+$ Match 0+ times what is allowed and assert ending on . and 1+ digits
) Close positive lookahead
( Capturing group 1
[+-]?\d{1,3}) Match optional + or - followed by 1-3 digits
| Or
\G(?!^) Assert position at the end of the previous match, not at the start
), Close the non-capturing group and match ,
(\d{3}) Capture in group 2 matching 3 digits
In the replacement use the 2 capturing groups $1$2
See the Regex demo | Java demo
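To see the \G pattern in action, here is a small sketch that applies it with String.replaceAll and the $1$2 replacement. The class name CommaStripper and the method strip are invented for this example.

```java
class CommaStripper {
    // Pattern from the answer above, escaped for a Java string literal.
    private static final String REGEX =
        "(?:^[^0-9+-]+(?=[.+,\\d-]*\\.\\d+$)([+-]?\\d{1,3})|\\G(?!^)),(\\d{3})";

    // Hypothetical helper: drops the leading currency text and the thousands separators.
    static String strip(String input) {
        // A group that did not take part in a match is replaced by the empty string in Java.
        return input.replaceAll(REGEX, "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(strip("Fr.-145,000.01"));    // -145000.01
        System.out.println(strip("Fr.145,000,000.01")); // 145000000.01
    }
}
```

Note that the lookahead requires a decimal part, so an input without one, such as Fr.-145,000, is left unchanged by this pattern.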

Regex to match strings in-between double quotes that are not containing some other strings

How to match words between double quotes in lines not containing specific words
input:
System.log("error");
new Exception("error");
view.setText("message");
From the above input, I would like to ignore lines with log and Exception words in them(Case sensitive) and match words in between double quotes.
Expected output
message
I have been trying to use look ahead without luck
(?s)^(?!log)".+"
I need this for a search in IntelliJ using regex

In your pattern (?s)^(?!log)".+" the negative lookahead does not contain a quantifier, so it only asserts that the text directly after the start of the string is not log.
What you could do is use a quantifier .* with an alternation to match either log or Exception and add word boundaries \b to prevent them being part of a larger word.
Then you might use negated character classes [^"] to match not a double quote and use a capturing group ([^"]+) for the value between the double quotes.
^(?!.*\b(?:log|Exception)\b)[^"]*"([^"]+)"
In Java:
String regex = "^(?!.*\\b(?:log|Exception)\\b)[^\"]*\"([^\"]+)\"";
Regex demo
If you want to make the dot to match a newline you can prepend (?s) to the pattern.
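A minimal sketch of how the pattern above could be used from Java on the three input lines from the question. The class name QuotedTextFinder and the method find are hypothetical, and Pattern.MULTILINE is an assumption added here so that ^ matches at the start of every line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTextFinder {
    // MULTILINE (an assumption of this sketch) lets ^ anchor at each line start.
    private static final Pattern QUOTED = Pattern.compile(
        "^(?!.*\\b(?:log|Exception)\\b)[^\"]*\"([^\"]+)\"", Pattern.MULTILINE);

    // Hypothetical helper: collects group 1 (the text between the quotes) of every match.
    static List<String> find(String source) {
        List<String> values = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(source);
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String source = "System.log(\"error\");\n"
            + "new Exception(\"error\");\n"
            + "view.setText(\"message\");";
        System.out.println(find(source)); // [message]
    }
}
```

Thanks to the word boundaries, a line such as logger.info("x"); is still matched, because log is only part of the larger word logger there.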

My guess is that this expression would likely work for capturing the message,
^(?!.*log.*|.*exception.*).*?"(.+?)".*
Example
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "^(?!.*log.*|.*exception.*).*?\"(.+?)\".*"; // inner double quotes must be escaped in a Java string literal
final String string = "System.log(\"error\");\n"
+ "new Exception(\"error\");\n"
+ "view.setText(\"message\");";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}

Java regex to match double quoted substrings

I want to parse the following string:
String text = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
// "w1 w"2" w3 | w4 w"5 "w6 w7"
I'm using Pattern.compile(regex).matcher(text), so what I'm missing here is the proper regex.
The rules are that regex has to:
isolate any single word
any substring surrounded by double quotes is a match
double quotes within a word have to be ignored (I will later replace them with a whitespace).
So the resulting matches should be:
w1 w"2
w3
|
w4
w"5
w6 w7
Whether the double quotes are included or not in the double quotes surrounded substrings is irrelevant (e.g. 1. could be either w1 w"2 or "w1 w"2").
What I came up with is something like this:
"\"(.*)\"|(\\S+)"
I also tried many different variants of the above regex (including lookbehind/lookahead) but none gives me the expected result.
Any idea on how to improve this?

Try this Regex:
(?:(?<=^")|(?<=\s")).*?(?="(?:\s|$))|(?![\s"])\S+
EXPLANATION:
(?:(?<=^")|(?<=\s")) - Positive lookbehind to find a position that is preceded by a ". This " must be either at the start of the string or after a whitespace
.*? - matches 0+ occurrences of any character other than a newline character, lazily
(?="(?:\s|$)) - Positive lookahead to validate that whatever is matched so far is followed by a " which is itself followed by either a whitespace or the end of the string ($)
| - OR (either the above match or the following)
(?![\s"]) - Negative lookahead to validate that the position is not followed by either a whitespace or a "
\S+ - matches 1+ occurrences of a non-whitespace character
Java Code(Generated from here):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class MyClass {
    public static void main(String[] args) {
        final String regex = "(?:(?<=^\")|(?<=\\s\")).*?(?=\"(?:\\s|$))|(?![\\s\"])\\S+";
        final String string = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}

This seems to do the job:
"(?:[^"]|\b"\b)+"|\S+
Note that in Java, because we're using string literals for regexes, a backslash needs to be preceded by another backslash:
String regex = "\"(?:[^\"]|\\b\"\\b)+\"|\\S+";
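For illustration, the pattern above can be tried on the string from the question like this. It is a sketch only; the class name QuoteAwareTokenizer and the method tokenize are made up here. This variant keeps the surrounding double quotes in the quoted matches, which the question says is acceptable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareTokenizer {
    // Either a quoted run (where a quote inside a word may be part of the run) or a plain word.
    private static final Pattern TOKEN =
        Pattern.compile("\"(?:[^\"]|\\b\"\\b)+\"|\\S+");

    // Hypothetical helper: returns the matches in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
        // Prints one token per line: "w1 w"2", w3, |, w4, w"5, "w6 w7"
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```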

Regex matcher not giving expected result. Not matching number properly

I cannot understand why the 2nd group gives me only 0; I expect 3000. Please also point me to a resource where I can learn more about this.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMatches {
    public static void main(String[] args) {
        // String to be scanned to find the pattern.
        String line = "This order was placed for QT3000! OK?";
        String pattern = "(.*)(\\d+)(.*)";
        // Create a Pattern object
        Pattern r = Pattern.compile(pattern);
        // Now create matcher object.
        Matcher m = r.matcher(line);
        if (m.find()) {
            System.out.println("Found value: " + m.group(0));
            System.out.println("Found value: " + m.group(1));
            System.out.println("Found value: " + m.group(2)); // ?
            System.out.println("Found value: " + m.group(3));
        } else {
            System.out.println("NO MATCH");
        }
    }
}

Make the pattern more precise: add QT before the \d pattern, or use .*? instead of the first .* to match as few chars as possible.
String pattern = "(.*QT)(\\d+)(.*)";
or
String pattern = "(.*?)(\\d+)(.*)";
will do. See a Java demo.
The (.*QT)(\\d+)(.*) will match and capture into Group 1 any 0+ chars other than line break chars, as many as possible, up to the last occurrence of QT (followed with the subsequent subpatterns), then will match and capture 1+ digits into Group 2, and then will match and capture into Group 3 the rest of the line.
The .*? in the alternative pattern will match and capture into Group 1 any 0+ chars other than line break chars, as few as possible, up to the first chunk of 1 or more digits.
You may also use a simpler pattern like String pattern = "QT(\\d+)"; to get all digits after QT, and the result will be in Group 1 then (you won't have the text before and after the number).

The * quantifier will try to match as many as possible, because it is a greedy quantifier.
You can make it non-greedy (lazy) by changing it to *?
Then, your regex will become :
(.*?)(\d+)(.*)
And you will match 3000 in the 2nd capturing group.
Here is a regex101 demo
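The difference between the greedy and the lazy quantifier can be reproduced with a short sketch like the following. The class name GreedyVsLazy and the helper secondGroup are hypothetical names used only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: returns capture group 2 of the first match, or null if nothing matches.
    static String secondGroup(String regex, String line) {
        Matcher matcher = Pattern.compile(regex).matcher(line);
        return matcher.find() ? matcher.group(2) : null;
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        // Greedy: the first .* takes as much as it can and leaves only one digit for \d+.
        System.out.println(secondGroup("(.*)(\\d+)(.*)", line));  // 0
        // Lazy: the first .*? stops at the first digit, so \d+ gets the whole number.
        System.out.println(secondGroup("(.*?)(\\d+)(.*)", line)); // 3000
    }
}
```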

Java Regex: Grouping consecutive 1 or 0 in a binary string

I want to capture all the consecutive groups in a binary string
1000011100001100111100001
should give me
1
0000
111
0000
11
00
1111
0000
1
I wrote the regex ([1?|0?]+) in my Java application to group the consecutive 1s or 0s in a string like 10000111000011.
But when I run it in my code, nothing is printed to the console:
String name ="10000111000011";
regex("(\\[1?|0?]+)" ,name);
public static void regex(String regex, String searchedString) {
    Pattern pattern = Pattern.compile(regex);
    Matcher regexMatcher = pattern.matcher(searchedString);
    while (regexMatcher.find())
        if (regexMatcher.group().length() > 0)
            System.out.println(regexMatcher.group());
}
To avoid a syntax error at runtime, I changed ([1?|0?]+) to (\\[1?|0?]+).
Why does the regex produce no groups?

First, as an explanation: your regex defines a character class ([ ... ]) that matches any of the characters 1, ?, | or 0 one or more times (+). I think you meant to use ( ... ) instead, which would make the | an alternation between an optional 1 and an optional 0. But that's not what you want either (I think ;).
Now, the solution might be this:
([01])\1*
which matches a 0 or a 1 and captures it, then matches the same digit any number of times (\1 is a backreference to whatever was captured in the first capture group, in this case the 0 or the 1).
Check it out at ideone.
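A minimal sketch of the backreference approach, run on the longer string from the question. The class name BitRuns and the method split are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BitRuns {
    // Capture one digit, then repeat that same digit via the backreference \1.
    private static final Pattern RUN = Pattern.compile("([01])\\1*");

    // Hypothetical helper: returns each run of identical digits as its own string.
    static List<String> split(String bits) {
        List<String> runs = new ArrayList<>();
        Matcher matcher = RUN.matcher(bits);
        while (matcher.find()) {
            runs.add(matcher.group()); // group() is the whole run, group(1) only its first digit
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(split("1000011100001100111100001"));
        // [1, 0000, 111, 0000, 11, 00, 1111, 0000, 1]
    }
}
```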

You can try this:
(1+|0+)
Sample Code:
final String regex = "(1+|0+)";
final String string = "10000111000011\n"
+ "11001111110011";
final Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println("Group " + 1 + ": " + matcher.group(1));
}
