Regex grouping and matching - java

I have a regex: https://regex101.com/r/PPbhRn/1. There I see that when "and" is captured, some whitespace is captured along with it. Is there any way to get rid of that whitespace? I would also like to know whether the pattern matches only when the groups are captured correctly.
String validRegex="(((?:[(]* ?[a-z][a-z]+ ?[)]*)|[(]* ?(NOT) (?:[(]* ?[a-z][a-z]+ ?[)]*) ?[)]*)( (AND|OR) ((?:[(]* ?[a-z][a-z]+ ?[)]*)|[(]* ?(NOT) (?:[(]* ?[a-z][a-z]+ ?[)]*) ?[)]*))*)";
String formula = "mean AND trip OR (mean OR mango) AND (mean AND orange) OR mango AND (test OR NOT help)";
Pattern p1 = Pattern.compile(validRegex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);
final Matcher matcher = p1.matcher(formula);
boolean result=MarketMeasureUtil.isValidFormula(formula);
System.out.println(result);
while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
    System.out.println(matcher.group() + " starting at index " + matcher.start() + " and ending at index " + matcher.end());
}
I'm not able to capture the groups properly. I need to capture groups like "mean AND trip", "OR", "mean OR mango", etc.
isValidFormula() calls matches() with this regex. In our case matches() works fine, but the grouping does not work as expected.

Regular expressions are not suitable for this task. I doubt that it is even possible to validate the expression with one if you can nest as many parentheses as you like.
You have to write a parser that builds a tree, using a class like:
class Node {
    boolean[] isAnd = null;   // one flag per gap between adjacent children (true = AND, false = OR)
    Node[] children = null;
    String literal = null;

    Node(String literal) {    // constructor for literals
        this.literal = literal;
    }

    Node(boolean[] isAnd) {   // constructor for intermediate nodes
        this.isAnd = isAnd;
        children = new Node[isAnd.length + 1];
    }
}
And the method would look like this:
Node parse(String expression) throws ParseException { // returns the root
    // ...
}
First, remove the superfluous parentheses on the left and right sides by counting all parentheses. Then find the level-0 ANDs and ORs (i.e. those that are not inside parentheses) and create an intermediate node. If there are no level-0 ANDs or ORs, the string has to be a literal, or it is invalid. For an intermediate node, add the children by calling the parse method recursively on the substrings surrounding the level-0 ANDs and ORs.
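The steps above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the class name FormulaParser and the helpers stripOuterParens and closesAtEnd are made up for illustration, operators must be upper case and surrounded by single spaces, literals are two or more letters, and NOT is not handled because the Node class has no field for it.

```java
import java.text.ParseException;
import java.util.ArrayList;
import java.util.List;

class FormulaParser {

    static class Node {
        boolean[] isAnd;   // isAnd[i]: operator between children[i] and children[i + 1]
        Node[] children;
        String literal;

        Node(String literal) { this.literal = literal; }

        Node(boolean[] isAnd) {
            this.isAnd = isAnd;
            this.children = new Node[isAnd.length + 1];
        }
    }

    static Node parse(String input) throws ParseException {
        String s = stripOuterParens(input.trim());
        List<String> operands = new ArrayList<>();
        List<Boolean> ands = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (--depth < 0) throw new ParseException("Unbalanced ')'", i);
            } else if (depth == 0) {
                boolean and = s.startsWith(" AND ", i);
                if (and || s.startsWith(" OR ", i)) {
                    operands.add(s.substring(start, i));
                    ands.add(and);
                    i += and ? 4 : 3;   // move to the space that ends the operator
                    start = i + 1;      // the next operand begins right after it
                }
            }
        }
        if (depth != 0) throw new ParseException("Unbalanced '('", s.length());
        if (operands.isEmpty()) {
            // No level-0 operator: the string must be a literal.
            boolean reserved = s.equals("AND") || s.equals("OR") || s.equals("NOT");
            if (reserved || !s.matches("[A-Za-z]{2,}")) {
                throw new ParseException("Invalid literal: " + s, 0);
            }
            return new Node(s);
        }
        operands.add(s.substring(start));
        boolean[] isAnd = new boolean[ands.size()];
        for (int i = 0; i < isAnd.length; i++) isAnd[i] = ands.get(i);
        Node node = new Node(isAnd);
        for (int i = 0; i < operands.size(); i++) {
            node.children[i] = parse(operands.get(i));
        }
        return node;
    }

    // Removes parentheses that enclose the whole string, e.g. "((a OR b))" -> "a OR b".
    static String stripOuterParens(String s) {
        while (s.length() >= 2 && s.charAt(0) == '(' && closesAtEnd(s)) {
            s = s.substring(1, s.length() - 1).trim();
        }
        return s;
    }

    // True if the '(' at index 0 is closed by the ')' at the last index.
    static boolean closesAtEnd(String s) {
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '(') depth++;
            else if (s.charAt(i) == ')' && --depth == 0) return i == s.length() - 1;
        }
        return false;
    }
}
```

Note that a node keeps all operators of one level in a flat list, as the Node class suggests, so operator precedence between AND and OR is not encoded in the tree; whoever evaluates it has to apply it.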

Looks like you created some kind of DSL.
You should consider using a parser or implementing your own if your "language" is not complex.
I assume you just evaluate OR/AND operations. This is very similar to coding a calculator, where AND (multiplication) takes precedence over OR (addition), so you could implement your own.
You can first tokenize the statement and then validate it, but don't try to do both at the same time with a regex. If validating is the only purpose, you can stop here.
Next, if you have to evaluate the expression, you can create a binary tree from the tokens (the OR operand as the left leaf and the AND operand as the right leaf, for instance) and apply your grammar to evaluate the expression.
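A rough sketch of that idea, with the tokenizing and the grammar kept separate. It is a simplification: instead of building the binary tree it evaluates while parsing (recursive descent), and it assumes upper-case AND/OR/NOT keywords. The class name FormulaEvaluator and the trueWords parameter (the set of words that count as true) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FormulaEvaluator {
    // A token is a parenthesis or a run of letters, optionally preceded by whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\s*(\\(|\\)|[A-Za-z]+)");

    private final List<String> tokens;
    private final Set<String> trueWords;
    private int pos = 0;

    private FormulaEvaluator(List<String> tokens, Set<String> trueWords) {
        this.tokens = tokens;
        this.trueWords = trueWords;
    }

    // Step 1: tokenize only; any character that fits no token is rejected.
    static List<String> tokenize(String formula) {
        String s = formula.trim();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        int end = 0;
        while (end < s.length()) {
            m.region(end, s.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character near index " + end);
            }
            tokens.add(m.group(1));
            end = m.end();
        }
        return tokens;
    }

    // Step 2: apply the grammar. Throws IllegalArgumentException for invalid formulas.
    static boolean evaluate(String formula, Set<String> trueWords) {
        FormulaEvaluator e = new FormulaEvaluator(tokenize(formula), trueWords);
        boolean result = e.parseOr();
        if (e.pos != e.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + e.tokens.get(e.pos));
        }
        return result;
    }

    // or := and ("OR" and)*      (lowest precedence, like addition)
    private boolean parseOr() {
        boolean value = parseAnd();
        while (accept("OR")) value |= parseAnd();
        return value;
    }

    // and := unary ("AND" unary)*   (binds tighter, like multiplication)
    private boolean parseAnd() {
        boolean value = parseUnary();
        while (accept("AND")) value &= parseUnary();
        return value;
    }

    // unary := "NOT" unary | "(" or ")" | word
    private boolean parseUnary() {
        if (accept("NOT")) return !parseUnary();
        if (accept("(")) {
            boolean value = parseOr();
            if (!accept(")")) throw new IllegalArgumentException("Missing ')'");
            return value;
        }
        if (pos == tokens.size()) throw new IllegalArgumentException("Unexpected end of formula");
        String word = tokens.get(pos++);
        if (word.equals(")") || word.equals("AND") || word.equals("OR")) {
            throw new IllegalArgumentException("Operand expected, found: " + word);
        }
        return trueWords.contains(word);
    }

    private boolean accept(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equals(expected)) {
            pos++;
            return true;
        }
        return false;
    }
}
```

With only "mean" counted as true, evaluate("mean OR trip AND mango", ...) returns true, because the formula is read as mean OR (trip AND mango). If validation is all you need, the return value can be ignored and only the exception matters.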

Related

Split a string by commas but no inside parenthesis [duplicate]

I have a string that looks something like the following:
12,44,foo,bar,(23,45,200),6
I'd like to create a regex that matches the commas, but only the commas that are not inside of parentheses (in the example above, all of the commas except for the two after 23 and 45). How would I do this (Java regular expressions, if that makes a difference)?
Assuming that there can be no nested parens (otherwise, you can't use a Java Regex for this task because recursive matching is not supported):
Pattern regex = Pattern.compile(
", # Match a comma\n" +
"(?! # only if it's not followed by...\n" +
" [^(]* # any number of characters except opening parens\n" +
" \\) # followed by a closing parens\n" +
") # End of lookahead",
Pattern.COMMENTS);
This regex uses a negative lookahead assertion to ensure that the next parenthesis (if any) is not a closing parenthesis. Only then is the comma allowed to match.
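For splitting, the same lookahead can be used directly as the delimiter. A small sketch, with the pattern written without COMMENTS mode and a made-up class name SplitOutsideParens; like the regex above, it assumes there are no nested parentheses:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitOutsideParens {
    // A comma that is not followed by ")" before the next "(".
    private static final Pattern COMMA = Pattern.compile(",(?![^(]*\\))");

    static String[] split(String input) {
        return COMMA.split(input);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("12,44,foo,bar,(23,45,200),6")));
        // prints [12, 44, foo, bar, (23,45,200), 6]
    }
}
```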
Paul, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Also the existing solution checks that the comma is not followed by a parenthesis, but that does not guarantee that it is embedded in parentheses.
The regex is very simple:
\(.*?\)|(,)
The left side of the alternation matches a complete set of parentheses. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.
You said you want to match the commas, but you can use the same general idea to split or replace.
To match the commas, you need to inspect Group 1. This full program's only goal in life is to do just that.
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "12,44,foo,bar,(23,45,200),6";
Pattern regex = Pattern.compile("\\(.*?\\)|(,)");
Matcher regexMatcher = regex.matcher(subject);
List<String> group1Caps = new ArrayList<String>();
// put Group 1 captures in a list
while (regexMatcher.find()) {
if(regexMatcher.group(1) != null) {
group1Caps.add(regexMatcher.group(1));
}
} // end of building the list
// What are all the matches?
System.out.println("\n" + "*** Matches ***");
if(group1Caps.size()>0) {
for (String match : group1Caps) System.out.println(match);
}
} // end main
} // end Program
To use the same technique for splitting or replacing, see the code samples in the article in the reference.
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
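For what it's worth, splitting with the same alternation could look something like the sketch below. This is not the code from the referenced article, just one way to apply the idea: only Group 1 commas are treated as split points, and matches of the parenthesized alternative are skipped. The class name AlternationSplit is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationSplit {
    private static final Pattern P = Pattern.compile("\\(.*?\\)|(,)");

    static List<String> split(String subject) {
        List<String> parts = new ArrayList<>();
        Matcher m = P.matcher(subject);
        int start = 0;
        while (m.find()) {
            // Group 1 is set only for commas outside parentheses.
            if (m.group(1) != null) {
                parts.add(subject.substring(start, m.start(1)));
                start = m.end(1);
            }
        }
        parts.add(subject.substring(start)); // remainder after the last comma
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("12,44,foo,bar,(23,45,200),6"));
        // prints [12, 44, foo, bar, (23,45,200), 6]
    }
}
```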
I don’t understand this obsession with regular expressions, given that they are unsuited to most tasks they are used for.
// Text before the opening parenthesis plus text after the closing one.
String outsideParens = longString.substring(0, longString.indexOf('(')) + longString.substring(longString.indexOf(')') + 1);
int firstComma = outsideParens.indexOf(',');
while (firstComma != -1) {
    /* do something. */
    firstComma = outsideParens.indexOf(',', firstComma + 1);
}
(Of course this assumes that there is always exactly one opening parenthesis and one matching closing parenthesis somewhere after it.)

Regex extract last numbers from String

I have some strings which are indexed and are dynamic.
For example:
name01,
name02,
name[n]
Now I need to separate the name from the index.
I've come up with this regex which works OK to extract index.
([0-9]+(?!.*[0-9]))
But there are some exceptions among these names. Some of them may end in a number that is not the index. (These strings are limited and I know them, meaning I can add them as "exceptions" in the regex.)
For example,
panLast4[01]
Here the last '4' is not part of the index, so I need to distinguish.
So I tried:
[^panLast4]([0-9]+(?!.*[0-9]))
Which works for panLast4[123] but not panLast4[43]
Note: the "[" and "]" is for explanation purposes only, it's not present in the strings
What is wrong?
Thanks
You can use the split method with this pattern:
(?<!^panLast(?=4)|^nm(?=14)|^nm1(?=4))(?=[0-9]+$)
The idea is to find the position from which there are only digits until the end of the string: (?=[0-9]+$). The match succeeds only if the negative lookbehind allows it; the lookbehind excludes particular names that end with digits (panLast4 and nm14 here). When one of these particular names is found, the regex engine must go on to the next position to obtain a match.
Example:
String s ="panLast412345";
String[] res = s.split("(?<!^panLast(?=4)|^nm(?=14)|^nm1(?=4))(?=[0-9]+$)", 2);
if ( res.length==2 ) {
System.out.println("name: " + res[0]);
System.out.println("ID: " + res[1]);
}
Another method uses matches() with a lazy quantifier as the last alternative:
Pattern p = Pattern.compile("(panLast4|nm14|.*?)([0-9]+)");
String s = "panLast42356";
Matcher m = p.matcher(s);
if ( m.matches() && m.group(1).length()>0 ) {
System.out.println("name: "+ m.group(1));
System.out.println("ID: "+ m.group(2));
}

Java recursive(?) repeated(?) deep(?) pattern matching

I'm trying to get ALL the substrings in the input string that match the given pattern.
For example,
Given string: aaxxbbaxb
Pattern: a[a-z]{0,3}b
(What I actually want to express is: all substrings that start with a and end with b, with up to 2 letters in between)
Exact results that I want (with their indexes):
aaxxb: index 0~4
axxb: index 1~4
axxbb: index 1~5
axb: index 6~8
But when I run it through the Pattern and Matcher classes using Pattern.compile() and Matcher.find(), it only gives me:
aaxxb : index 0~4
axb : index 6~8
This is the piece of code I used.
Pattern pattern = Pattern.compile("a[a-z]{0,3}b", Pattern.CASE_INSENSITIVE);
Matcher match = pattern.matcher("aaxxbbaxb");
while (match.find()) {
System.out.println(match.group());
}
How can I retrieve every single piece of string that matches the pattern?
Of course, it doesn't have to use Pattern and Matcher classes, as long as it's efficient :)
(see: All overlapping substrings matching a java regex )
Here is the full solution that I came up with. It can handle zero-width patterns, boundaries, etc. in the original regular expression. It looks through all substrings of the text string and checks whether the regular expression matches only at the specific position by padding the pattern with the appropriate number of wildcards at the beginning and end. It seems to work for the cases I tried -- although I haven't done extensive testing. It is most certainly less efficient than it could be.
public static void allMatches(String text, String regex)
{
for (int i = 0; i < text.length(); ++i) {
for (int j = i + 1; j <= text.length(); ++j) {
String positionSpecificPattern = "((?<=^.{"+i+"})("+regex+")(?=.{"+(text.length() - j)+"}$))";
Matcher m = Pattern.compile(positionSpecificPattern).matcher(text);
if (m.find())
{
System.out.println("Match found: \"" + (m.group()) + "\" at position [" + i + ", " + j + ")");
}
}
}
}
You are in effect searching for the strings ab, a_b, and a__b in an input string, where _ denotes a non-whitespace character whose value you do not care about.
That's three search targets. The most efficient way I can think of to do this would be to use a search algorithm like the Knuth-Morris-Pratt algorithm, with a few modifications. In effect, your pseudocode would be something like:
for i in 0 to sourcestring.length
check sourcestring[i] - is it a? if so, check sourcestring[i+x]
// where x is the index of the search string - 1
if matches then save i to output list
else i = i + searchstring.length
Obviously, if you have a position match you must then check the inner characters of the substring to make sure they are alphabetic.
Run the algorithm 3 times, once for each search term. It will doubtless be much faster than trying to do the search using pattern matching.
Edit: sorry, I didn't read the question properly. If you have to use regex then the above will not work for you.
One thing you could do is:
Create all possible substrings that are 4 characters or longer (good luck with that if your String is large)
Create a new Matcher for each of these substrings
Do a matches() instead of a find()
Calculate the absolute offset from the substring's relative offset and the matcher info
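A minimal sketch of that approach. Instead of creating real substrings and a new Matcher for each, it reuses one Matcher and restricts it with region(), so the indexes are already absolute. The class name OverlappingMatches is made up, every (start, end) pair is tried without a minimum length, and the loop is quadratic in the input length, so this is only reasonable for short strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingMatches {

    static List<String> allMatches(String text, String regex) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                m.region(i, j);        // look only at text[i, j)
                if (m.matches()) {     // the whole region must match
                    results.add(m.group() + ": index " + i + "~" + (j - 1));
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        for (String s : allMatches("aaxxbbaxb", "a[a-z]{0,3}b")) {
            System.out.println(s);
        }
        // prints:
        // aaxxb: index 0~4
        // axxb: index 1~4
        // axxbb: index 1~5
        // axb: index 6~8
    }
}
```

One difference from the padded-pattern solution further up: with the default region settings, anchors and lookarounds treat the region as if it were the whole input, so patterns that rely on surrounding context (word boundaries, lookbehind) may behave differently.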

Validating an infix notation possibly using regex

I am thinking of validating an infix notation which consists of letters as operands and +-*/$ as operators [e.g. A+B-(C/D)$(E+F)] using regex in Java. Is there any better way? Is there any regex pattern which I can use?
I am not familiar with the language syntax of infix, but you can certainly do a first pass validation check which simply verifies that all of the characters in the string are valid (i.e. acceptable characters = A-Z, +, -, *, /, $, ( and )). Here is a Java program which checks for valid characters and also includes a function which checks for unbalanced (possibly nested) parentheses:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "A+B-(C/D)$(E+F)";
Pattern regex = Pattern.compile(
"# Verify that a string contains only specified characters.\n" +
"^ # Anchor to start of string\n" +
"[A-Z+\\-*/$()]+ # Match one or more valid characters\n" +
"$ # Anchor to end of string\n",
Pattern.COMMENTS);
Matcher m = regex.matcher(s);
if (m.find()) {
System.out.print("OK: String has only valid characters.\n");
} else {
System.out.print("ERROR: String has invalid characters.\n");
}
// Verify the string contains only balanced parentheses.
if (checkParens(s)) {
System.out.print("OK: String has no unbalanced parentheses.\n");
} else {
System.out.print("ERROR: String has unbalanced parentheses.\n");
}
}
// Checks whether the string contains any unbalanced parentheses.
public static boolean checkParens(String s) {
Pattern regex = Pattern.compile("\\(([^()]*)\\)");
Matcher m = regex.matcher(s);
// Loop removes matching nested parentheses from inside out.
while (m.find()) {
// quoteReplacement keeps a '$' operator inside the group from being read as a group reference.
s = m.replaceFirst(Matcher.quoteReplacement(m.group(1)));
m.reset(s);
}
regex = Pattern.compile("[()]");
m = regex.matcher(s);
// Check if there are any erroneous parentheses left over.
if (m.find()) {
return false; // String has unbalanced parens.
}
return true; // String has balanced parens.
}
}
This does not validate the grammar, but may be useful as a first test to filter out obviously bad strings.
Possibly overkill, but you might consider using a fully fledged parser generator such as ANTLR (http://www.antlr.org/). With ANTLR you can create rules that will generate the java code for you automatically. Assuming you have only got valid characters in the input this is a syntax analysis problem, otherwise you would want to validate the character stream with lexical analysis first.
For syntax analysis you might have rules like:
PLUS : '+' ;
etc...
expression:
term ( ( PLUS | MINUS | MULTIPLY | DIVIDE )^ term )*
;
term:
constant
| OPENPAREN! expression CLOSEPAREN!
;
With constant being integers/reals whatever. If the ANTLR generated parser code can't match the input with your parser rules it will throw an exception so you can determine whether code is valid.
You probably could do it with recursive PCRE..but this may be a PITA.
Since you only want to validate it, you can do it very simply. Just use a stack, push all the elements one by one and remove valid expressions.
Define some rules, for example:
an operator is only allowed if there is a letter on top of the stack
a letter or a parenthesis is only allowed if there is an operator on top of the stack
everything is allowed if the stack is empty
Then:
if you encounter a closing parenthesis, remove everything up to the opening parenthesis.
if you encounter a letter, remove the expression
after every removal of an expression, add a dummy letter. Repeat the previous steps.
if the result is a letter, the expression is valid.
Or something like that..
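Since only validity matters, the rules above can be boiled down further. The sketch below is a simplified variant and not the stack algorithm itself: a depth counter stands in for the stack, and a flag remembers whether an operand or an operator is expected next. It assumes single upper-case letters as operands, no whitespace, and the operators + - * / $; the class name InfixValidator is made up.

```java
class InfixValidator {

    static boolean isValid(String expr) {
        int depth = 0;                 // number of currently open parentheses
        boolean expectOperand = true;  // true at the start, after an operator and after '('
        for (char c : expr.toCharArray()) {
            if (expectOperand) {
                if (c == '(') {
                    depth++;
                } else if (c >= 'A' && c <= 'Z') {
                    expectOperand = false;
                } else {
                    return false;      // operator or ')' where an operand should be
                }
            } else {
                if (c == ')') {
                    if (--depth < 0) return false;  // ')' without a matching '('
                } else if ("+-*/$".indexOf(c) >= 0) {
                    expectOperand = true;
                } else {
                    return false;      // operand or '(' where an operator should be
                }
            }
        }
        // Valid only if everything is closed and the expression does not end on an operator.
        return depth == 0 && !expectOperand;
    }

    public static void main(String[] args) {
        System.out.println(isValid("A+B-(C/D)$(E+F)")); // prints true
        System.out.println(isValid("A+(B-"));           // prints false
    }
}
```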

Regular expression, match content of specific XML tag, but without the tag itself

I am banging my head against this regular expression the whole day.
The task looks simple, I have a number of XML tag names and I must replace (mask) their content.
For example
<Exony_Credit_Card_ID>242394798</Exony_Credit_Card_ID>
Must become
<Exony_Credit_Card_ID>filtered</Exony_Credit_Card_ID>
There are multiple such tags with different names
How do I match any text inside but without matching the tag itself?
EDIT: I should clarify again. Grouping and then using the group to avoid replacing the text inside does not work in my case, because when I add the other tags to the expression, the group number is different for the subsequent matches. For example:
"(<Exony_Credit_Card_ID>).+(</Exony_Credit_Card_ID>)|(<Billing_Postcode>).+(</Billing_Postcode>)"
replaceAll with the string "$1filtered$2" does not work because when the regex matches Billing_Postcode its groups are 3 and 4 instead of 1 and 2
String resultString = subjectString.replaceAll(
"(?x) # (multiline regex): Match...\n" +
"<(Exony_Credit_Card_ID|Billing_Postcode)> # one of these opening tags\n" +
"[^<>]* # Match whatever is contained within\n" +
"</\\1> # Match corresponding closing tag",
"<$1>filtered</$1>");
In your situation, I'd use this:
(?<=<(Exony_Credit_Card_ID|tag1|tag2)>)(\\d+)(?=</(Exony_Credit_Card_ID|tag1|tag2)>)
And then replace the matches with filtered, as the tags are excluded from the returned match. As your goal is to hide sensitive data, it's better to be safe and use "aggressive" matching that masks anything that might be sensitive, even if some of it turns out not to be.
You may need to adjust the tag content matcher ( \\d+ ) if the data contains other characters, like whitespaces, slashes, dashes and such.
I have not debugged this code but you should use something like this:
Pattern p = Pattern.compile("<\\w+>([^<]*)</\\w+>");
Matcher m = p.matcher(str);
if (m.find()) {
String tagContent = m.group(1);
}
I hope it is a good start.
I would use something like this :
private static final Pattern PAT = Pattern.compile("<(\\w+)>(.*?)</\\1>");
private static String replace(String s, Set<String> toReplace) {
Matcher m = PAT.matcher(s);
if (m.matches() && toReplace.contains(m.group(1))) {
return '<' + m.group(1) + '>' + "filtered" + "</" + m.group(1) + '>';
}
return s;
}
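Note that matches() only succeeds when the whole string is a single element. For a string that contains several elements, a find() loop with appendReplacement might be closer to what is needed. This is only a sketch: the class name TagMasker is made up, and the element content is narrowed to [^<]* (an assumption) so that only elements without child elements are matched.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMasker {
    // Matches <name>text</name> where the text contains no further tags.
    private static final Pattern TAG = Pattern.compile("<(\\w+)>([^<]*)</\\1>");

    static String mask(String xml, Set<String> toMask) {
        Matcher m = TAG.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String replacement = toMask.contains(name)
                    ? "<" + name + ">filtered</" + name + ">"
                    : m.group();   // leave other elements untouched
            // quoteReplacement: '$' or '\' in untouched content must not be treated specially
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the tag name is looked up in a set, adding another tag to mask does not change any group numbers.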
I know you said that relying on group numbers does not work in your case ... but I can't really see why. Could you not use something like this:
xmlString.replaceAll("<(Exony_Credit_Card_ID|tag2|tag3)>([^<]+)</(\\1)>", "<$1>filtered</$1>");
This works on the basic samples I used as a test.
Edit: just to decompose:
"<(Exony_Credit_Card_ID|tag2|tag3)>" + // matches the tag itself
"([^<]+)" + // then anything in between the opening and closing of the tag
"</(\\1)>" // and finally the end tag corresponding to what we matched as the first group (Exony_Credit_Card_ID, tag2 or tag3)
"<$1>" + // Replace using the first captured group (tag name)
"filtered" + // the "filtered" text
"</$1>" // and the closing tag corresponding to the first captured group
