What regex allows ')' if and only if '(' was used earlier, that is, makes ')' mandatory once '(' has appeared? I have tried ^[a-zA-Z]+(([)]?[,]?[a-zA-Z0-9 ][. -/']?[(]?[a-zA-Z0-9][)]?)?[a-zA-Z:.])$ but I can't make it accept ')' only when '(' has been used.
A regex cannot keep track of context, and context is exactly what you're trying to match here. Regex is not meant for that; you need to write a function that checks this.
Citing from this link:
In the context of formal language theory, something is called
“regular” when it has a grammar where all production rules have one of
the following forms:
B -> a
B -> aC
B -> ε
You can read those -> rules as “The left hand side can be replaced
with the right hand side”. So the first rule would be “B can be
replaced with a”, the second one “B can be replaced with aC” and the
third one “B can be replaced with the empty string” (ε is the symbol
for the empty string).
So what are B, C and a? By convention, uppercase characters denote so
called “non-terminals” - symbols which can be broken down further -
and lowercase characters denote “terminals” - symbols which cannot be
broken down any further.
In your case you are looking for something like:
(\([x].*\)[x])*
I added the [x] to stand for "x number of times" (it's not real regex syntax, of course). As the definition above shows, there's no way to write such an expression within the rules of a regular grammar.
This is not just a "grey" definition issue. Creating a regex-like language to solve problems like the one you noted here is much more complicated (algorithmic and complexity wise). It's a totally different problem domain to try and patternize the type of problems as the one you mentioned here.
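The "write a function" advice above could look something like the following minimal Java sketch. The class and method names (ParenCheck, isBalanced) are made up for illustration, and it only tracks round parentheses; the other character rules from the question would still need their own checks.

```java
final class ParenCheck {
    /** Returns true when every ')' closes an earlier '(' and no '(' is left open. */
    static boolean isBalanced(String text) {
        int depth = 0; // number of '(' currently open
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (depth == 0) {
                    return false; // ')' with no '(' before it
                }
                depth--;
            }
        }
        return depth == 0; // anything still open is an error
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("John (Jr.) Smith")); // true
        System.out.println(isBalanced("John Jr.) Smith"));  // false
        System.out.println(isBalanced("John (Jr. Smith"));  // false
    }
}
```

A counter is enough here because only one kind of bracket is involved; mixing several bracket types would call for a stack instead.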
Related
This regex: \p{L}+ matches these characters "ASKJKSDJKDSJÄÖÅüé" of the example string "ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé" which is great but is the exact opposite of what I want. Which leads me to negating regexes.
Goal:
I want to match any and all characters that are neither a letter nor a number, in multiple languages.
Could a negative regex be a natural direction for this?
I should mention one intended use for the regex I'd like to find is to validate passwords for the rule:
that it needs to contain at least one special character, which I
define as not being a number nor a letter.
It would seem defining ranges of special characters should be avoided if possible, because why limit the possibilities? Thus my definition. I assume there could be some problems with such a wide definition, but it is a first step.
If you have suggestions for a better solution than the one I'm giving below, or just have some thoughts on the subject, I'm sure I'm not the only one who would like to learn about it. Thanks.
Note I'm using double \\ in the Java code. Platform is Java 11.
You can shove those \\p things in []. And thus, use the fact that you can negate chargroups. This is all you need:
Pattern p = Pattern.compile("[^\\p{L}\\p{N}]"); // neither a letter nor a number
Matcher m = p.matcher("ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé");
while (m.find()) System.out.print(m.group(0));
That prints:
_-.;,!”#€%&/()=?`¨’<>
Which is exactly what you're looking for, no?
No need to mess with lookaheads here.
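For the password rule mentioned in the question, the same negated-class idea might be wrapped up as below. This is only a sketch: the class and method names are invented here, and the class excludes both \p{L} and \p{N} so that digits do not count as special characters.

```java
import java.util.regex.Pattern;

final class SpecialCharRule {
    // One character that is neither a Unicode letter nor a Unicode number.
    private static final Pattern SPECIAL = Pattern.compile("[^\\p{L}\\p{N}]");

    /** Returns true if the password contains at least one "special" character. */
    static boolean hasSpecialCharacter(String password) {
        // find() succeeds as soon as any single character matches.
        return SPECIAL.matcher(password).find();
    }
}
```

Note that with this wide definition whitespace also counts as special, so "pass word" would satisfy the rule; whether that is acceptable is a policy decision.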
So after having read similar, though not identical questions and some equally great answers, I came up with this solution: (?=\P{L})(?=\P{N}) meaning match both not letters and not numbers. Even if I'm asserting numbers separately I need to negate both to meet the specification of special characters (See question).
This uses non-consuming (lookahead) groups, written with parentheses and ?=: the expression in the first group is checked, and then the second group is checked from the same position. Thanks to @Jason Cohen for this detail in the "Regular Expressions: Is there an AND operator?" discussion.
The uppercase P in \P{L} and \P{N} means "not belonging to the category" in Unicode categories, i.e. the opposite of a lowercase p.
It's not perfect for a real world solution, but works as a starting point at least. Note I'm using double \\ in the Java code. Platform is Java 11.
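As a rough illustration of how the double lookahead behaves in Java, here is a small sketch. The class and method names are hypothetical. Because both lookaheads are zero-width, a successful find() only says that some position is followed by a character that is neither a letter nor a number; nothing is consumed.

```java
import java.util.regex.Pattern;

final class LookaheadSpecials {
    // Two lookaheads at the same position: the next character must be
    // a non-letter AND a non-number.
    private static final Pattern NOT_LETTER_AND_NOT_NUMBER =
            Pattern.compile("(?=\\P{L})(?=\\P{N})");

    /** Returns true if at least one character is neither a letter nor a number. */
    static boolean hasSpecialCharacter(String input) {
        return NOT_LETTER_AND_NOT_NUMBER.matcher(input).find();
    }
}
```

The lookaheads need an actual character to inspect, so the empty string and purely alphanumeric strings are rejected.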
I'm writing a custom assembler in Java for the 6502 microprocessor instruction set. Obviously, one of the main parts of an assembler is checking that the syntax of the assembly program is valid before it is converted into byte form.
So far I have worked out the following rules that will all need to be checked for each line of the assembly program
All instructions must be three letters long and match an instruction in the instruction table.
Branch labels cannot contain any non-alphanumeric characters.
Operands cannot contain symbols outside of "( ) $ # , + -"
Opening parentheses in operands must be closed.
Operands can only contain one pair of parentheses
$ and # must be followed by numeric characters in operands.
Commas must exist between a value and a value OR a parenthesis and a value i.e. (xxx,yyy) or (xxx),yyy
I am coding the assembler in Java, and as such I was thinking about using regex patterns in order to check the validity of the above rules. Is this something that regex can be used for? I have used regex in the past but usually just single checks and nothing as extensive as this.
I'm not asking anyone to work out the regex patterns that could be used for these rules (although I would be grateful if anyone could as I really have no idea how to do some of them), I just want to know if checking these rules is something that is possible with regex.
A regular expression can check that a string contains exactly 3 letters, but it can't tell you whether that string exists in a table.
A regular expression can check that a string contains only alphanumeric characters.
A regular expression can check that a string contains only certain symbols.
A regular expression cannot, in general, tell you whether every opening parenthesis has a matching closing one; however, since your operands may contain only one pair, a pattern can check for that single pair.
The last three rules can also be checked via regular expression.
See javadoc for class java.util.regex.Pattern.
For example, a regex for the three-letter part of the first rule is \p{Alpha}{3} (\p{Alnum}{3} would also accept digits).
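To make the split between "regex can do this" and "regex needs help" concrete, here is a hedged sketch of the first four kinds of check. Everything named here (AsmSyntax, the tiny INSTRUCTIONS set, the method names) is an assumption for illustration; a real assembler would load the full 6502 mnemonic table and would still need further patterns for the $, # and comma rules.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

final class AsmSyntax {
    // Hypothetical, deliberately tiny instruction table.
    private static final Set<String> INSTRUCTIONS = Set.of("LDA", "STA", "JMP", "BNE");

    private static final Pattern THREE_LETTERS = Pattern.compile("\\p{Alpha}{3}");
    private static final Pattern LABEL = Pattern.compile("\\p{Alnum}+");
    // Alphanumerics plus the symbols ( ) $ # , + - and nothing else.
    private static final Pattern OPERAND_CHARS = Pattern.compile("[\\p{Alnum}()$#,+-]*");
    // Either no parentheses, or exactly one "(" closed by exactly one ")".
    private static final Pattern AT_MOST_ONE_PAIR =
            Pattern.compile("[^()]*(\\([^()]*\\)[^()]*)?");

    static boolean isInstruction(String s) {
        // The regex checks the shape; the table lookup has to be done in code.
        return THREE_LETTERS.matcher(s).matches()
                && INSTRUCTIONS.contains(s.toUpperCase(Locale.ROOT));
    }

    static boolean isLabel(String s) {
        return LABEL.matcher(s).matches();
    }

    static boolean isOperand(String s) {
        return OPERAND_CHARS.matcher(s).matches()
                && AT_MOST_ONE_PAIR.matcher(s).matches();
    }
}
```

Keeping one small pattern per rule, rather than a single large expression, makes it easier to report which rule a line violated.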
I have a pattern, which I'll refer to as Z (actual pattern is a bit long and not important to the question). Simply put, I want to be able to match either \*\sZ, or Z\:, but not both nor neither.
I attempted using lookaheads (similar to below), however because of the pattern between the prefix and suffix they wouldn't work.
(\*\s(?!\:))Z((?<!\*)\:)
Is there a way of accomplishing this without having to duplicate the pattern (e.g. (\*\sZ|Z\:))?
A quick note about my pattern: there is no \* in the Z pattern, only in the prefix. Conversely, there's also no \: in the Z pattern; it's only in the suffix if it immediately follows Z, but not after any other characters (there's a .* capture after the suffix).
Is there a way of accomplishing this without having to duplicate the
pattern?
The answer is "NO". Unlike and and or, xor is not a fundamental operation of regular expressions. You can easily build an and expression with concatenation and an or expression with |, but there is no equivalent operator for xor.
But if you still want to get the job done, I suggest the following.
First, you already have two patterns here:
\*\sZ
and
Z\:
So, as you said, these two patterns must not both occur at the same time.
So from properties of xor:
A xor B = (A & ~B)|(~A & B).
Finally, we can get
\*\sZ(?!\:)|(?<!\*\s)Z\:
See a DEMO
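In Java the duplication can at least be kept out of the source by building the expression from a string that holds Z. The sketch below makes several assumptions: Z is replaced by the stand-in [a-z]++ (the real Z is not shown in the question), the quantifier is possessive so that backtracking cannot shorten Z to slip past the (?!:) check, and the whole line is matched with matches(), including the trailing .* capture the question mentions.

```java
import java.util.regex.Pattern;

final class PrefixXorSuffix {
    // Hypothetical stand-in for Z; possessive so it never gives characters back.
    private static final String Z = "[a-z]++";

    // "\*\sZ" not followed by ":", or "Z:" not preceded by "* ", then the rest.
    private static final Pattern EITHER_NOT_BOTH = Pattern.compile(
            "(?:\\*\\s" + Z + "(?!:)|(?<!\\*\\s)" + Z + ":)(.*)");

    /** Returns true when the line has the prefix or the suffix, but not both. */
    static boolean accepts(String line) {
        return EITHER_NOT_BOTH.matcher(line).matches();
    }
}
```

With a plain greedy Z such as [a-z]+ the first alternative could match a shortened Z (for example "* fo" in "* foo:"), which is why the stand-in is possessive; whether the real Z can be written that way is something only its author can judge.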
To clarify, I want to match:
ab
aabb
aaabbb
...
This works in Perl:
if ($exp =~ /^(a(?1)?b)$/)
To understand this, look at the string as though it grows from the outside-in, not left-right:
ab
a(ab)b
aa(ab)bb
(?1) is a reference to the outer set of parentheses. We need the ? afterwards for the last case (going from outside in), where nothing is left; ? means 0 or 1 of the preceding expression, so it essentially acts as our base case.
I posted a similar question asking what the equivalent of (?1) is in Java. Today I found out that \\1 refers to the first capturing group. So, I assumed that this would work:
String pattern = "^(a(?:\\1)?b)$";
but it did not. Does anyone know why?
NB: I know there are other, better, ways to do this. This is strictly an educational question. As in I want to know why this particular way does not work and if there is a way to fix it.
The \\1 is a backreference and refers to the value of the group, not to the pattern as the recursion (?1) does in Perl. Unfortunately, Java regexes do not support recursion, but the pattern can be expressed using lookarounds and backrefs.
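One known way to express this with a lookahead and a backreference is sketched below; it is a different technique from the Perl recursion, not a translation of it, and the class and method names are made up. Each a that is consumed runs a lookahead that grows group 1 by one b (the \1 inside the group refers to the value captured on the previous iteration), and the final \1 then has to match exactly that many b characters. The second method keeps the pattern from the question so the difference can be observed.

```java
final class AnBn {
    // For every 'a' consumed, the lookahead extends group 1 by one 'b'.
    // "\1?+" is possessive, so an already captured run of b's is never given up.
    private static final String LOOKAHEAD_VERSION = "^(?:a(?=a*(\\1?+b)))+\\1$";

    // The attempt from the question: \1 is a backreference to the text captured
    // by group 1, and that group has captured nothing while it is still open.
    private static final String BACKREFERENCE_ATTEMPT = "^(a(?:\\1)?b)$";

    static boolean isAnBn(String s) {
        return s.matches(LOOKAHEAD_VERSION);
    }

    static boolean backreferenceAttempt(String s) {
        return s.matches(BACKREFERENCE_ATTEMPT);
    }
}
```

The backreference attempt still accepts "ab", because the optional (?:\1)? is simply skipped, but it rejects "aabb" since there is never a captured value to repeat.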
Which regular expression engine does Java use?
In a tool like RegexBuddy if I use
[a-z&&[^bc]]
that expression is valid in Java, but RegexBuddy does not understand it.
In fact it reports:
Match a single character present in
the list below [a-z&&[^bc]
A character in the range between a and z : a-z
One of the characters &[^bc : &&[^bc
Match the character ] literally : ]
But I want to match a character between a and z intersected with a character that is not b or c.
Like most regex flavors, java.util.regex.Pattern has its own specific features with syntax that may not be fully compatible with others; this includes character class union, intersection and subtraction:
[a-d[m-p]] : a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] : d, e, or f (intersection)
[a-z&&[^bc]] : a through z, except for b and c: [ad-z] (subtraction)
The most important "caveat" of Java regex is that the matches method attempts to match the pattern against the whole string. This is atypical of most engines, and can be a source of confusion at times.
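A small sketch can make both points tangible: the three character class operations listed above, and the difference between whole-string matching and searching. The helper names wholeMatch and partialMatch are invented for this example; the calls underneath (Pattern.matches and Matcher.find) are the standard java.util.regex ones.

```java
import java.util.regex.Pattern;

final class JavaClassOps {
    /** True only if the entire input matches the regex. */
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    /** True if the regex matches anywhere inside the input. */
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("[a-d[m-p]]", "n"));     // true  (union)
        System.out.println(wholeMatch("[a-z&&[def]]", "a"));   // false (intersection)
        System.out.println(wholeMatch("[a-z&&[^bc]]", "b"));   // false (subtraction)
        System.out.println(wholeMatch("[a-z&&[^bc]]", "ab"));  // false (whole string is two chars)
        System.out.println(partialMatch("[a-z&&[^bc]]", "ab")); // true  (finds the 'a')
    }
}
```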
See also
regular-expressions.info/Flavor Comparison and Java Flavor Notes
On character class subtraction
Subtraction allows you to define for example "all consonants" in Java as [a-z&&[^aeiou]].
This syntax is specific to Java. In XML Schema, .NET, JGSoft and RegexBuddy, it's [a-z-[aeiou]]. Other flavors may not support this feature at all.
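As a quick illustration of the Java form of the consonant class, here is a sketch with hypothetical helper names. Note that the class is ASCII lowercase only and treats y as a consonant, simply because y is not in the excluded set.

```java
final class Consonants {
    private static final String CONSONANT = "[a-z&&[^aeiou]]";

    /** Removes every lowercase ASCII consonant from the input. */
    static String stripConsonants(String input) {
        return input.replaceAll(CONSONANT, "");
    }

    /** True if the input is a single lowercase ASCII consonant. */
    static boolean isConsonant(String input) {
        return input.matches(CONSONANT);
    }

    public static void main(String[] args) {
        System.out.println(stripConsonants("regexbuddy")); // eeu
    }
}
```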
References
regular-expressions.info/Character Classes in XML Regular Expressions
MSDN - Regular Expression Character Classes - Subtraction
Related questions
What is the point behind character class intersections in Java’s Regex?
Java uses its own regular expression engine, whose behaviour is defined in the Pattern class.
You can test it with an Eclipse plugin or online.
RegexBuddy does not yet support the character class union, intersection, and subtraction syntax that is unique to the Java regular expression flavor. This is the only part of the Java regex syntax that RegexBuddy does not yet support. We're planning to implement this in a future version of RegexBuddy. The reason this has been postponed is because no other regular expression flavor supports this syntax.
P.S.: If you have a question about RegexBuddy in particular, please add the "regexbuddy" tag to your question. Then the question automatically shows up in my RSS reader. I don't follow the "regex" tag because far too many questions use that tag, and most are already answered by the time I see them.