Which regular expression engine does Java use?
In a tool like RegexBuddy if I use
[a-z&&[^bc]]
that expression works in Java, but RegexBuddy does not understand it.
In fact it reports:
Match a single character present in
the list below [a-z&&[^bc]
A character in the range between a and z : a-z
One of the characters &[^bc : &&[^bc
Match the character ] literally : ]
But I want to match a character between a and z, intersected with a character that is not b or c.
Like most regex flavors, java.util.regex.Pattern has its own specific features with syntax that may not be fully compatible with others; this includes character class union, intersection and subtraction:
[a-d[m-p]] : a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] : d, e, or f (intersection)
[a-z&&[^bc]] : a through z, except for b and c: [ad-z] (subtraction)
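As a quick way to check the three forms above, here is a minimal sketch (the class and method names are made up for illustration) that filters the lowercase alphabet through each character class:

```java
import java.util.regex.Pattern;

class CharClassSetOps {
    // Returns the characters of `input` that the given single-character class accepts.
    static String keep(String charClass, String input) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String letters = "abcdefghijklmnopqrstuvwxyz";
        System.out.println(keep("[a-d[m-p]]", letters));   // union: abcdmnop
        System.out.println(keep("[a-z&&[def]]", letters)); // intersection: def
        System.out.println(keep("[a-z&&[^bc]]", letters)); // subtraction: adefghijklmnopqrstuvwxyz
    }
}
```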
The most important "caveat" of Java regex is that the matches method attempts to match a pattern against the whole string. This is atypical of most engines, and can be a source of confusion at times.
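To illustrate the whole-string behaviour, here is a small sketch (helper names are hypothetical) contrasting Matcher.matches() with Matcher.find(), which looks for the pattern anywhere in the input:

```java
import java.util.regex.Pattern;

class MatchesVersusFind {
    // True only if the entire input is matched by the regex.
    static boolean wholeString(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex matches some substring of the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String regex = "[a-z&&[^bc]]";
        System.out.println(wholeString(regex, "x"));   // true: the one character is the whole string
        System.out.println(wholeString(regex, "xyz")); // false: the class matches a single character only
        System.out.println(anywhere(regex, "xyz"));    // true: find() is satisfied by the first character
    }
}
```

String.matches(regex) follows the same whole-string rule as Matcher.matches().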
See also
regular-expressions.info/Flavor Comparison and Java Flavor Notes
On character class subtraction
Subtraction allows you to define for example "all consonants" in Java as [a-z&&[^aeiou]].
This syntax is specific to Java. In XML Schema, .NET, JGSoft and RegexBuddy, it's [a-z-[aeiou]]. Other flavors may not support this feature at all.
References
regular-expressions.info/Character Classes in XML Regular Expressions
MSDN - Regular Expression Character Classes - Subtraction
Related questions
What is the point behind character class intersections in Java’s Regex?
Java uses its own regular expression engine, whose behaviour is defined in the Pattern class.
You can test it with an Eclipse plugin or online.
RegexBuddy does not yet support the character class union, intersection, and subtraction syntax that is unique to the Java regular expression flavor. This is the only part of the Java regex syntax that RegexBuddy does not yet support. We're planning to implement this in a future version of RegexBuddy. It has been postponed because no other regular expression flavor supports this syntax.
P.S.: If you have a question about RegexBuddy in particular, please add the "regexbuddy" tag to your question. Then the question automatically shows up in my RSS reader. I don't follow the "regex" tag because far too many questions use that tag, and most are already answered by the time I see them.
This regex: \p{L}+ matches these characters "ASKJKSDJKDSJÄÖÅüé" of the example string "ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé" which is great but is the exact opposite of what I want. Which leads me to negating regexes.
Goal:
I want to match any and all characters that are not a letter nor a number in multiple languages.
Could a negative regex be a natural direction for this?
I should mention one intended use for the regex I'd like to find is to validate passwords for the rule:
that it needs to contain at least one special character, which I
define as not being a number nor a letter.
It would seem defining ranges of special characters should be avoided if possible, because why limit the possibilities? Thus my definition. I assume there could be some problems with such a wide definition, but it is a first step.
If you have suggestions for a better solution than the one I'm giving below, or just have some thoughts on the subject, I'm sure I'm not the only one who would like to learn about it. Thanks.
Note I'm using double \\ in the Java code. Platform is Java 11.
You can shove those \\p things in []. And thus, use the fact that you can negate chargroups. This is all you need:
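import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Negated class: any character that is neither a letter (\p{L}) nor a number (\p{N}).
Pattern p = Pattern.compile("[^\\p{L}\\p{N}]");
Matcher m = p.matcher("ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé");
while (m.find()) System.out.print(m.group(0));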
That prints:
_-.;,!”#€%&/()=?`¨’<>
Which is exactly what you're looking for, no?
No need to mess with lookaheads here.
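For the password rule from the question, a check along these lines might be enough. This is only a sketch: the class and method names are invented, and the class excludes both letters (\p{L}) and numbers (\p{N}) to follow the question's definition of a special character.

```java
import java.util.regex.Pattern;

class SpecialCharacterRule {
    // Any single character that is neither a Unicode letter nor a Unicode number.
    private static final Pattern SPECIAL = Pattern.compile("[^\\p{L}\\p{N}]");

    // True if the password contains at least one such character.
    static boolean hasSpecialCharacter(String password) {
        return SPECIAL.matcher(password).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSpecialCharacter("\u00C4bc123")); // false: only letters and digits
        System.out.println(hasSpecialCharacter("abc_123"));     // true: the underscore qualifies
    }
}
```

Note that under this definition whitespace also counts as a special character, which may or may not be what you want.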
So after having read similar, though not identical questions and some equally great answers, I came up with this solution: (?=\P{L})(?=\P{N}) meaning match both not letters and not numbers. Even if I'm asserting numbers separately I need to negate both to meet the specification of special characters (See question).
This makes use of non-consuming (lookahead) assertions, written with parentheses and ?=: the expression in the first parenthesis is checked first, and then the expression in the second is checked at the same position. Thanks to @Jason Cohen for this detail in the Regular Expressions: Is there an AND operator? discussion.
The uppercase P in \P{L} and \P{N} means "not belonging to the category" in Unicode categories, i.e. the opposite of a lowercase p.
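One thing worth knowing is that the two lookaheads on their own match a position rather than a character. The sketch below (names are illustrative) collects what each variant actually consumes; appending a dot to the lookahead pair makes it consume the character, at which point it behaves like the negated class [^\p{L}\p{N}] for input without line terminators.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadVersusClass {
    // Concatenates the text of every match that find() reports.
    static String collect(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "ab_1-2;c";
        System.out.println(collect("(?=\\P{L})(?=\\P{N})", input));  // empty: the matches are zero-width
        System.out.println(collect("(?=\\P{L})(?=\\P{N}).", input)); // _-;
        System.out.println(collect("[^\\p{L}\\p{N}]", input));       // _-;
    }
}
```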
It's not perfect for a real world solution, but works as a starting point at least. Note I'm using double \\ in the Java code. Platform is Java 11.
I'm looking to match a string to the format of an expression in formal logic, where two alphabetic characters are operated on by v|^|>|=, where the characters can be preceded by ~|!|?, and where the characters may be surrounded by brackets and preceded again by ~|!|?. At first I thought that the following expression might do it:
s.matches("^[!?~]*[(]*[!?~]*[a-z]{1}\\s[v>=^]{1}\\s[!?~]*[a-z]{1}[)]*$")
However, I have realised that these expressions can be stacked onto one another, and I don't know how to account for that in the regex.
Examples of acceptable matches:
~p v q
~?(p ^ ~r)
!p
p v ~(!r ^ t)
~!(p = (~!q ^ t))
It is possible to add as many operators as you want, to create an enormously long expression. How do I account for this with the regex in a general format?
Thanks heaps :)
You can't fully describe that language with a plain regular expression. The problem is that any letter can be replaced by an expression. You need recursive regular expressions, and these aren't supported by Java's java.util.regex package.
This is a feature that, as far as I know, became available in Perl with version 5.10 and in libraries that advertise "Perl-Compatible Regular Expressions" (PCRE). It's not part of standard Java, Python, or C++. Ruby (1.9 and later) does support recursive subexpression calls, and the .NET libraries for C#, VB.Net, C++/CLI etc. offer balancing groups rather than recursion.
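Since java.util.regex cannot recurse, one alternative is a small hand-written recursive-descent checker. The following is only a sketch under an assumed grammar inferred from the examples in the question (operators must be surrounded by single spaces, variables are single lowercase letters); the class name is made up.

```java
class LogicExpressionChecker {
    private final String s;
    private int pos = 0;

    private LogicExpressionChecker(String s) {
        this.s = s;
    }

    // True if the whole input is a well-formed expression under the assumed grammar.
    static boolean isValid(String input) {
        LogicExpressionChecker c = new LogicExpressionChecker(input);
        return c.expression() && c.pos == input.length();
    }

    // expression := term (' ' operator ' ' term)*   with operator in v ^ > =
    private boolean expression() {
        if (!term()) return false;
        while (pos + 2 < s.length()
                && s.charAt(pos) == ' '
                && "v^>=".indexOf(s.charAt(pos + 1)) >= 0
                && s.charAt(pos + 2) == ' ') {
            pos += 3;
            if (!term()) return false;
        }
        return true;
    }

    // term := prefix* (letter | '(' expression ')')   with prefix in ~ ! ?
    private boolean term() {
        while (pos < s.length() && "~!?".indexOf(s.charAt(pos)) >= 0) pos++;
        if (pos >= s.length()) return false;
        char ch = s.charAt(pos);
        if (ch >= 'a' && ch <= 'z') {
            pos++;
            return true;
        }
        if (ch == '(') {
            pos++;
            if (!expression()) return false;          // recursion handles arbitrary nesting
            if (pos < s.length() && s.charAt(pos) == ')') {
                pos++;
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isValid("~!(p = (~!q ^ t))")); // true
        System.out.println(isValid("(p v q"));            // false: unbalanced bracket
    }
}
```

The call from term() back into expression() is the part a plain regex cannot express.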
I am trying the regex ([[.ch.]]*)c against the test string chchch. According to the spec:
[[.ch.]]*c matches the first to fifth character in the string chchch
When I test it in Java, it indeed matches those characters, but so does [[ch]]*c. Thus I am not sure if the collating symbol is respected. Is it?
TL;DR - No.
The specification you are reading/quoting is the Open Group's SUS (Single UNIX® Specification) version of the regular expression part of IEEE's POSIX (Portable Operating System Interface for uniX) collection of standards. (See https://www.regular-expressions.info/posix.html ¹)
In general, only POSIX-compliant regular expression engines fully support POSIX bracket expressions, which are essentially what other regex flavors call character classes but with a few special features, one being that [. and .] are interpreted as the start and end of a collating sequence when used within the expressions.
Unfortunately, very few regex engines are POSIX-compliant and, in fact, some claiming to implement POSIX regexes just use the regular expression syntax defined by POSIX and don't have full locale support. Thus they don't implement all/any of the bracket expression features/quirks.
Java's regular expressions are in no way POSIX-compliant, as can be seen from this Regular Expression Engine Comparison Chart ². Its regex package implements a "Perl-like" regex engine, missing a few features (e.g. conditional expressions and comments), but including some extra ones (e.g. possessive quantifiers and variable-length, but finite, look-behind assertions).
Neither Perl nor Java supports the collation-related bracket delimiters [= and =] (character equivalence), or [. and .] (collating sequence). Perl does support character classes using the POSIX [: and :] delimiters, but Java only supports them using the \p operator (with a few caveats as explained here).
So, what is going on with the regex [[.ch.]]*c in Java? (I'm ignoring the capturing group as it doesn't change the analysis.)
Well, it turns out that Java's regex package supports unions in its character classes. This is achieved by nesting. For example, [set1[set2]] is equivalent to [set3] where the characters in set3 are the union of the characters in set1 and the characters in set2. (As an aside, note that [[set1][set2]] and [[set1]set2] also produce the same result.)
So, [[.ch.]] is simply the character class containing the union of an empty set of characters with the set of characters in the character class [.ch.], so basically it's the same as the character class [.ch.]. This is equivalent to [.ch] (since the second . is redundant) and thus [[.ch.]]*c is the same as [.ch]*c.
Similarly, [[ch]]*c simplifies to [ch]*c.
Finally, since there aren't any . characters in the string chchch, the regexes [.ch]*c and [ch]*c will produce the same result. (Try testing against the string c.hchch to see the difference and prove the above.)
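The equivalences above can be checked with a short Java sketch (the class and helper names are just for illustration), which prints the first match of each regex against both test strings:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedClassDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] regexes = {"[[.ch.]]*c", "[.ch]*c", "[[ch]]*c", "[ch]*c"};
        for (String regex : regexes) {
            // All four give "chchc" for the first input; only the
            // first two can match across the '.' in the second input.
            System.out.println(regex + " -> " + firstMatch(regex, "chchch")
                    + " / " + firstMatch(regex, "c.hchch"));
        }
    }
}
```

Against c.hchch, the two regexes that contain a literal . in the class match c.hchc, while the other two stop at the first c.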
Notes:
This is not a very good example for either demonstrating collating sequences or for detecting if they are implemented, as [[.ch.]]*c will match chchc in chchch both when collating sequences are supported (and ch is a valid sequence in the current locale) and when they are not but unions are.
A much better demo/test is to use the regex [[.ch.]] with the test string ch:
Collating sequences are supported if ch is matched.
Any other match means they are not.
They may be supported if an error is returned, as this is what happens if ch is not a valid sequence in the current locale (it's a valid collating sequence in the Czech locale):
If the error specifies that ch is not a valid collating sequence, then they are supported.
If the error returned is that the delimiter/token [. and/or .] is invalid/unsupported, then collating sequences are not supported.
If the error is ambiguous, or for a guaranteed way to check for support, you need to switch to the Czech locale (and confirm that ch is indeed a valid collating sequence) or switch to any other locale that has at least one defined collating sequence which can be used instead of ch.
¹ I am neither Jan Goyvaerts nor in any way affiliated with the Regular-Expressions.info site.
² Nor am I CMCDragonkai.
Is there a simple way to match all characters in a class except a certain set of them? For example, in a language where I can use \w to match the set of all Unicode word characters, is there a way to just exclude a character like an underscore "_" from that match?
The only idea that came to mind was to use negative lookahead/behind around each character, but that seems more complex than necessary when I effectively just want to match a character against a positive match AND a negative match. For example, if & was an AND operator I could do this...
^(\w&[^_])+$
It really depends on your regex flavor.
.NET
... provides only one simple character class set operation: subtraction. This is enough for your example, so you can simply use
[\w-[_]]
If a - is followed by a nested character class, it's subtracted. Simple as that...
Java
... provides a much richer set of character class set operations. In particular you can get the intersection of two sets like [[abc]&&[cde]] (which would give c in this case). Intersection and negation together give you subtraction:
[\w&&[^_]]
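A minimal Java sketch of both points follows (class and method names are invented). One assumption worth flagging: by default Java's \w covers only ASCII word characters, so the sketch optionally passes Pattern.UNICODE_CHARACTER_CLASS to get Unicode word characters as the question asks.

```java
import java.util.regex.Pattern;

class IntersectionDemo {
    // True if the single-character string is in both [abc] and [cde].
    static boolean inBoth(String s) {
        return Pattern.matches("[[abc]&&[cde]]", s);
    }

    // True if s consists only of word characters other than the underscore.
    static boolean isWord(String s, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("[\\w&&[^_]]+", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inBoth("c"));                       // true
        System.out.println(inBoth("a"));                       // false
        System.out.println(isWord("abc_123", false));          // false: underscore is subtracted
        System.out.println(isWord("\u00E9t\u00E9", true));     // true with the Unicode flag
    }
}
```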
Perl
... supports set operations on extended character classes as an experimental feature (available since Perl 5.18). In particular, you can directly subtract arbitrary character classes:
(?[ \w - [_] ])
All other flavors
... (that support lookaheads) allow you to mimic the subtraction by using a negative lookahead:
(?!_)\w
This first checks that the next character is not a _ and then matches any \w (which can't be _ due to the negative lookahead).
Note that each of these approaches is completely general in that you can subtract two arbitrarily complex character classes.
You can use a negation of the \w class (--> \W) and exclude it:
^([^\W_]+)$
A negative lookahead is the correct way to go insofar as I understand your question:
^((?!_)\w)+$
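For comparison, the negated-class form and the lookahead form can be tried side by side in Java. This is just a sketch with made-up names; both patterns should accept and reject the same simple inputs.

```java
import java.util.regex.Pattern;

class UnderscoreFreeWord {
    // "Not a non-word character and not an underscore", repeated.
    static final Pattern NEGATED = Pattern.compile("^([^\\W_]+)$");
    // A word character, provided a lookahead has ruled out the underscore, repeated.
    static final Pattern LOOKAHEAD = Pattern.compile("^((?!_)\\w)+$");

    static boolean byNegatedClass(String s) {
        return NEGATED.matcher(s).find();
    }

    static boolean byLookahead(String s) {
        return LOOKAHEAD.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc123", "abc_123", "a b"}) {
            System.out.println(s + ": " + byNegatedClass(s) + " " + byLookahead(s));
        }
    }
}
```

The negated-class version works in practically every flavor, which makes it a handy fallback when neither subtraction nor lookahead is available.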
This can be done in python with the regex module. Something like:
import regex as re
pattern = re.compile(r'[\W_--[ ]]+')
cleanString = pattern.sub('', rawString)
You'd typically install the regex module with pip:
pip install regex
EDIT:
The regex module has two behaviours, version 0 and version 1. Set subtraction (as above) is a version 1 behaviour. The pypi docs claim version 1 is the default behaviour, but you may find this is not the case. You can check with
import regex
if regex.DEFAULT_VERSION == regex.VERSION1:
    print("version 1")
To set it to version 1:
regex.DEFAULT_VERSION = regex.VERSION1
or to use version one in a single expression:
pattern = re.compile(r'(?V1)[\W_--[ ]]+')
Try using subtraction:
[\w&&[^_]]+
Note: This will work in Java, but might not in some other Regex engine.
I am writing a tool to help students learn regular expressions. I will probably be writing it in Java.
The idea is this: the student types in a regular expression and the tool shows which parts of a text will get matched by the regex. Simple enough.
But I want to support several different regex "flavors" such as:
Basic regular expressions (think: grep)
Extended regular expressions (think: egrep)
A subset of Perl regular expressions, including the character classes \w, \s, etc.
Sed-style regular expressions
Java has the java.util.regex package, but it supports only Perl-style regular expressions, which are a superset of the basic and extended REs. What I think I need is a way to take any given regular expression and escape the meta-characters that aren't part of a given flavor. Then I could pass it to java.util.regex and it would behave as if it had been written for the selected RE interpreter.
For example, given the following regex:
^\w+[0-9]{5}-(\d{4})?$
As a basic regular expression, it would be interpreted as:
^\\w\+[0-9]\{5\}-\(\\d\{4\}\)\?$
As an extended regular expression, it would be:
^\\w+[0-9]{5}-(\\d{4})?$
And as a Perl-style regex, it would be the same as the original expression.
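A rough sketch of that escaping idea in Java is shown below. The class name, method name, and parameters are all hypothetical, and it only reproduces the transformation in the example above: it does not treat bracket expressions specially, and it leaves already-escaped punctuation such as \( untouched, whereas a real basic-RE engine would treat \( as the start of a group.

```java
class FlavorEscaper {
    // Escapes every character listed in unsupportedMeta, and, when the target
    // flavor has no shorthand classes, turns backslash-letter pairs such as \w
    // into a literal backslash followed by the letter.
    static String escapeFor(String regex, String unsupportedMeta, boolean shorthandClasses) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(++i);
                if (Character.isLetter(next) && !shorthandClasses) {
                    out.append("\\\\").append(next); // literal backslash, then the letter
                } else {
                    out.append('\\').append(next);   // keep the existing escape as it is
                }
            } else if (unsupportedMeta.indexOf(c) >= 0) {
                out.append('\\').append(c);          // meta-character unknown to the flavor
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String regex = "^\\w+[0-9]{5}-(\\d{4})?$";
        System.out.println(escapeFor(regex, "+?{}()|", false)); // basic:    ^\\w\+[0-9]\{5\}-\(\\d\{4\}\)\?$
        System.out.println(escapeFor(regex, "", false));        // extended: ^\\w+[0-9]{5}-(\\d{4})?$
        System.out.println(escapeFor(regex, "", true));         // Perl:     unchanged
    }
}
```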
Is there a "regular expression for regular expressions" that I could run through a regex search-and-replace to quote the non-meta characters? What else could I do? Are there alternative Java classes I could use?
Alternatively, you could use Jakarta ORO?
This supports the following regex 'flavors':
Perl5 compatible regular expressions
AWK-like regular expressions
glob expressions
Check out this post for a 'regular expression for regular expressions': Is there a regular expression to detect a valid regular expression?
You can use this as a basis for your module.
I have written something similar: Is there a regular expression to detect a valid regular expression?
You could take part of that expression, and match each token separately:
[^?+*{}()[\]\\] # literal characters
\\[A-Za-z] # Character classes
\\\d+ # Back references
\\\W # Escaped characters
\[\^?(?:\\.|[^\\])+?\] # Character class
\((?:\?[:=!>]|\?<[=!])? # Beginning of a group
\) # End of a group
(?:[?+*]|\{\d+(?:,\d*)?\})\?? # Repetition
\| # Alternation
For each match, you could have some dictionary of appropriate replacements in the target flavor.
If your target is a Unix / Linux system, why not just shell out to the definitive host of each regex? That is, use grep for BRE, egrep for ERE, perl for PCRE, etc. The only thing your module would need to do is the UI. Most of the regex testers that I have seen (that are decent) use a variant of this approach.
If you want yet another library suggestion, look at TRE for the BRE / ERE / POSIX / AWK part. It does not support back references, so PCRE / Python / Ruby / JS / Java is out...
If you want your students to learn regex, why not use a freely available tool such as Regex Coach (http://www.weitz.de/regex-coach/)? It is pretty good for learning and evaluating regexes.
Also look at this SO thread on a similar issue: https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world