I'm looking to match a string against the format of an expression in formal logic, where two alphabetic characters are operated on by v|^|>|=, where the characters can be preceded by ~|!|?, and where the characters may be surrounded by brackets and preceded again by ~|!|?. At first I thought that the following expression might do it:
s.matches("^[!?~]*[(]*[!?~]*[a-z]{1}\\s[v>=^]{1}\\s[!?~]*[a-z]{1}[)]*$")
However, I have realised that these expressions can be nested inside one another, and I don't know how to account for that in the regex.
Examples of acceptable matches:
~p v q
~?(p ^ ~r)
!p
p v ~(!r ^ t)
~!(p = (~!q ^ t))
It is possible to add as many operators as you want, to create an enormously long expression. How do I account for this with the regex in a general format?
Thanks heaps :)
You can't fully describe that language with a plain regular expression. The problem is that any letter can be replaced by a whole (possibly parenthesised) expression, so the nesting can be arbitrarily deep. You need recursive regular expressions, and these aren't supported by Java's java.util.regex package.
Recursion is a feature that, as far as I know, comes from PCRE ("Perl-Compatible Regular Expressions") and has been in Perl itself since 5.10. Ruby's engine has a similar subexpression-call syntax (\g<name>), and .NET offers balancing groups rather than true recursion. It's not part of the standard regex libraries of Java, Python or C++.
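Because the nesting is unbounded, a small hand-written recursive-descent checker is one way to handle this in plain Java. The sketch below is only an illustration: the class name LogicExpressionValidator and the grammar are assumptions inferred from the examples in the question (single lowercase letters, any number of ~ ! ? prefixes directly before a letter or bracket, binary operators v ^ > =, optional spaces around operators).

```java
// Hypothetical helper, not part of any library.
// Assumed grammar:
//   expression := term (operator term)*
//   term       := prefix* (letter | '(' expression ')')
class LogicExpressionValidator {
    private final String s;
    private int pos = 0;

    private LogicExpressionValidator(String s) {
        this.s = s;
    }

    static boolean isValid(String input) {
        LogicExpressionValidator v = new LogicExpressionValidator(input);
        return v.expression() && v.atEnd();
    }

    private boolean expression() {
        if (!term()) return false;
        skipSpaces();
        // After a complete term, one of v ^ > = can only be an operator.
        while (pos < s.length() && "v^>=".indexOf(s.charAt(pos)) >= 0) {
            pos++;
            if (!term()) return false;
            skipSpaces();
        }
        return true;
    }

    private boolean term() {
        skipSpaces();
        while (pos < s.length() && "~!?".indexOf(s.charAt(pos)) >= 0) pos++;
        if (pos >= s.length()) return false;
        char c = s.charAt(pos);
        if (c >= 'a' && c <= 'z') {
            pos++;
            return true;
        }
        if (c == '(') {
            pos++;
            if (!expression()) return false;
            skipSpaces();
            if (pos < s.length() && s.charAt(pos) == ')') {
                pos++;
                return true;
            }
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < s.length() && s.charAt(pos) == ' ') pos++;
    }

    private boolean atEnd() {
        skipSpaces();
        return pos == s.length();
    }
}
```

Calling LogicExpressionValidator.isValid("~!(p = (~!q ^ t))") accepts the deepest example from the question, while an unbalanced bracket or a dangling operator is rejected. The recursion in term() is exactly the part a non-recursive regex cannot express.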
Related
This regex: \p{L}+ matches these characters "ASKJKSDJKDSJÄÖÅüé" of the example string "ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé" which is great but is the exact opposite of what I want. Which leads me to negating regexes.
Goal:
I want to match any and all characters that are not a letter nor a number in multiple languages.
Could a negative regex be a natural direction for this?
I should mention one intended use for the regex I'd like to find is to validate passwords for the rule:
that it needs to contain at least one special character, which I
define as not being a number nor a letter.
It would seem defining ranges of special characters should be avoided if possible, because why limit the possibilities? Thus my definition. I assume there could be some problems with such a wide definition, but it is a first step.
If you have some suggestions for a better solution I'm giving below or just have some thoughts on the subject, I'm sure I'm not the only one that would like to learn about it. Thanks.
Note I'm using double \\ in the Java code. Platform is Java 11.
You can put those \\p{...} classes inside [], and thus use the fact that you can negate character classes. Since you want neither letters nor numbers, list both categories. This is all you need:
Pattern p = Pattern.compile("[^\\p{L}\\p{N}]");
Matcher m = p.matcher("ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé");
while (m.find()) System.out.print(m.group(0));
That prints:
_-.;,!”#€%&/()=?`¨’<>
Which is exactly what you're looking for, no?
No need to mess with lookaheads here.
So after having read similar, though not identical questions and some equally great answers, I came up with this solution: (?=\P{L})(?=\P{N}) meaning match both not letters and not numbers. Even if I'm asserting numbers separately I need to negate both to meet the specification of special characters (See question).
This makes use of non-consuming (lookahead) groups, written with parentheses and ?=. The engine first matches the expression in the first group and then, from the same position, the expression in the second. Thanks to @Jason Cohen for this detail in the "Regular Expressions: Is there an AND operator?" discussion.
The uppercase P in \P{L} and \P{N} expresses "not belonging to" a Unicode category, i.e. the opposite of a lowercase p.
It's not perfect for a real world solution, but works as a starting point at least. Note I'm using double \\ in the Java code. Platform is Java 11.
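For the password rule from the question, both formulations can be wrapped in a small check. This is a minimal sketch: the class and method names (PasswordRules, hasSpecialCharacter, hasSpecialCharacterLookahead) are made up for illustration, and note that under this definition whitespace also counts as a "special" character.

```java
import java.util.regex.Pattern;

// Hypothetical helper for the "at least one special character" rule.
class PasswordRules {
    // One character that is neither a Unicode letter nor a Unicode number.
    private static final Pattern SPECIAL =
            Pattern.compile("[^\\p{L}\\p{N}]");

    // The lookahead form: zero-width, but it can only succeed at a position
    // that is followed by a character outside both categories.
    private static final Pattern SPECIAL_LOOKAHEAD =
            Pattern.compile("(?=\\P{L})(?=\\P{N})");

    static boolean hasSpecialCharacter(String password) {
        // find() searches anywhere in the input; matches() would require
        // the whole password to be a single special character.
        return SPECIAL.matcher(password).find();
    }

    static boolean hasSpecialCharacterLookahead(String password) {
        return SPECIAL_LOOKAHEAD.matcher(password).find();
    }
}
```

Both methods should agree on every input; the negated character class is simply the shorter way to write the same test.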
I'm trying to write a regex for:
Strings of characters beginning and ending with a double quote character, that do not contain control characters, and for which the backslash is used to escape the next character.
The paren-star form of comments in Pascal: strings beginning with (* and ending with *) that do not contain *)
I'm trying to write a version in Ruby, then another in Java, but I'm having trouble finding the differences in regex expressions for both. Any help is appreciated!
Here is a good place to start:
specifics for Java (mostly usage of regex in general)
specifics for Ruby (mostly usage of regex in general)
flavor comparison (mostly regex syntax and features)
Mostly note that in Ruby you write regexes by delimiting them with /, and in Java you need to double-escape everything (\\ instead of \) so that the backslashes get through to the regex engine. Everything else you should find within the links I gave you above.
For the sake of completeness of this answer, I would also like to include Tom's Link to this online regex tester, that supports a multitude of regex flavors.
You should go ahead and give both regexes a go. If you encounter any problems, you are more than welcome to ask a new (specific) question, showing your own attempts.
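To illustrate the double-escaping point on the Java side, here is one possible reading of the two specifications. Treat it as a sketch under assumptions rather than the expected solution: the class and constant names are invented, "control characters" is taken to mean the POSIX class \p{Cntrl}, and an escaped character is assumed not to be a control character either.

```java
import java.util.regex.Pattern;

// Hypothetical holder for the two patterns.
class PascalLexemes {
    // Regex as the engine sees it:  "(?:[^"\\\p{Cntrl}]|\\[^\p{Cntrl}])*"
    // Every backslash is doubled once more for the Java string literal.
    static final Pattern QUOTED_STRING = Pattern.compile(
            "\"(?:[^\"\\\\\\p{Cntrl}]|\\\\[^\\p{Cntrl}])*\"");

    // Regex as the engine sees it:  \(\*(?:[^*]|\*+[^*)])*\*+\)
    // Inside the comment, a run of stars must not be followed by ')'.
    static final Pattern PAREN_STAR_COMMENT = Pattern.compile(
            "\\(\\*(?:[^*]|\\*+[^*)])*\\*+\\)");

    static boolean isQuotedString(String s) {
        return QUOTED_STRING.matcher(s).matches();
    }

    static boolean isComment(String s) {
        return PAREN_STAR_COMMENT.matcher(s).matches();
    }
}
```

In Ruby the same two expressions could be written between / delimiters with single backslashes, which is the main visible difference between the two versions.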
I'm trying to validate a simple arithmetic expression to ensure it fits within the format operand operator operand : 234.34 + 5. I figured out how to validate this easily enough, but I want the users to be able to continue inputting more than 2 values such as: 234.34 + 5 / 6 * 7 - -34. So far my regex is as follows:
[-]*\d+[.\d+[E\d+]*]*[\s+[*+/-]\s+[-]*\d+[.\d+[E\d+]*]*]*
This partially works, but the problem I have is it allows for some strange things I don't want such as -4.34.1 - 34 +
Any suggestions?
Try this. It's ugly as hell but it should work (if you aren't using any parentheses):
-?\d+(?:\.\d+(?:E\d+)?)?(\s*[-+/\*]\s+-?\d+(?:\.\d+(?:E\d+)?)?)+
Explanation
This will match a number followed by one or more operator-and-number pairs:
-?\d+(?:\.\d+(?:E\d+)?)? Match a number
(
\s* optional whitespace
[-+/\*] any operator: +, -, *, /
\s+ at least one whitespace (to avoid a --b)
-?\d+(?:\.\d+(?:E\d+)?)? match another number
)+ repeat this block one or more times
And the number expression:
-? optional -
\d+ digits (one or more)
(?: start of optional part
\. dot
\d+ digits
(?: start of optional scientific notation part
E match E char
\d+ match digits
)? close of the optional scientific notation part
)? close optional group
But I strongly suggest trying to write a proper parser for this; it will also allow you to support parentheses: a + (b + c).
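In Java the pattern above has to be double-escaped and is best applied with matches(), which requires the whole input to fit. A minimal sketch, with the class and method names (ArithmeticFormat, isWellFormed) made up for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from this answer.
class ArithmeticFormat {
    // A number: optional minus, digits, optional fraction with optional exponent.
    private static final String NUMBER = "-?\\d+(?:\\.\\d+(?:E\\d+)?)?";

    // A number followed by one or more "operator number" pairs.
    private static final Pattern EXPRESSION = Pattern.compile(
            NUMBER + "(?:\\s*[-+/*]\\s+" + NUMBER + ")+");

    static boolean isWellFormed(String input) {
        return EXPRESSION.matcher(input).matches();
    }
}
```

Note that the trailing + means a lone number such as 5 is rejected; change it to * if a single operand should also be accepted.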
I hate to be "that guy" but why not just write a simple validator that parses the string without using regular expressions? What's the reasoning behind using regular expressions for this? If you were to write your own parser, not only will the solution be easier to understand and maintain but with a little bit more work you would be able to evaluate the expression as well.
It may be best to just write a parser. I know, that sounds scary, but this is actually a second-year homework exercise at college.
See Dijkstra's Shunting-yard algorithm. This will allow you to both verify and evaluate the expression, so if that is where you're going with this project, you're going to have to implement it anyways...
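As a rough idea of what that involves, here is a compact evaluator in the style of the shunting-yard algorithm. It is a sketch under assumptions, not a reference implementation: the class name ShuntingYard is invented, only + - * / and parentheses are handled, a minus sign is treated as part of a number whenever an operand is expected, and scientific notation is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical evaluator; throws IllegalArgumentException on malformed input
// (NumberFormatException is a subclass of it).
class ShuntingYard {
    private static int precedence(char op) {
        return (op == '+' || op == '-') ? 1 : 2;
    }

    private static void apply(Deque<Double> values, char op) {
        if (values.size() < 2) throw new IllegalArgumentException("missing operand");
        double b = values.pop();
        double a = values.pop();
        switch (op) {
            case '+': values.push(a + b); break;
            case '-': values.push(a - b); break;
            case '*': values.push(a * b); break;
            default:  values.push(a / b);
        }
    }

    static double evaluate(String s) {
        Deque<Double> values = new ArrayDeque<>();
        Deque<Character> ops = new ArrayDeque<>();
        boolean expectOperand = true;
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (expectOperand) {
                if (c == '(') { ops.push(c); i++; continue; }
                int start = i;
                if (c == '-') i++;  // sign belongs to the number here
                while (i < s.length()
                        && (Character.isDigit(s.charAt(i)) || s.charAt(i) == '.')) i++;
                // parseDouble rejects "", "-" and "1.2.3" for us.
                values.push(Double.parseDouble(s.substring(start, i)));
                expectOperand = false;
            } else if (c == ')') {
                while (!ops.isEmpty() && ops.peek() != '(') apply(values, ops.pop());
                if (ops.isEmpty()) throw new IllegalArgumentException("unbalanced ')'");
                ops.pop();  // discard the matching '('
                i++;
            } else if ("+-*/".indexOf(c) >= 0) {
                // Left-associative: apply pending operators of equal or higher precedence.
                while (!ops.isEmpty() && ops.peek() != '('
                        && precedence(ops.peek()) >= precedence(c)) apply(values, ops.pop());
                ops.push(c);
                expectOperand = true;
                i++;
            } else {
                throw new IllegalArgumentException("unexpected '" + c + "'");
            }
        }
        if (expectOperand) throw new IllegalArgumentException("operand expected at end");
        while (!ops.isEmpty()) {
            char op = ops.pop();
            if (op == '(') throw new IllegalArgumentException("unbalanced '('");
            apply(values, op);
        }
        return values.pop();
    }
}
```

Validation then falls out for free: an input is well-formed exactly when evaluate returns without throwing.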
I released an expression evaluator based on Dijkstra's Shunting Yard algorithm, under the terms of the Apache License 2.0:
http://projects.congrace.de/exp4j/index.html
Why not use String.split to get each operand and operator by itself? Then you can check each piece with a much simpler regex ([\d*.\d*|\d|+|-|*|/]) or just use Integer.parseInt (or Double.parseDouble for decimal values) for your operands.
Which regular expression engine does Java use?
In a tool like RegexBuddy if I use
[a-z&&[^bc]]
the expression works in Java, but RegexBuddy does not understand it.
In fact it reports:
Match a single character present in
the list below [a-z&&[^bc]
A character in the range between a and z : a-z
One of the characters &[^bc : &&[^bc
Match the character ] literally : ]
but I want to match a character between a and z, intersected with a character that is not b or c.
Like most regex flavors, java.util.regex.Pattern has its own specific features with syntax that may not be fully compatible with others; this includes character class union, intersection and subtraction:
[a-d[m-p]] : a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] : d, e, or f (intersection)
[a-z&&[^bc]] : a through z, except for b and c: [ad-z] (subtraction)
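The three forms listed above can be tried directly. A minimal sketch; the class and constant names are chosen here purely for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical demo of Java's character class set operations.
class CharacterClassDemo {
    static final Pattern UNION = Pattern.compile("[a-d[m-p]]");          // same as [a-dm-p]
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[def]]"); // d, e or f
    static final Pattern SUBTRACTION = Pattern.compile("[a-z&&[^bc]]");  // same as [ad-z]

    static boolean matchesChar(Pattern p, char c) {
        return p.matcher(String.valueOf(c)).matches();
    }
}
```

For example, matchesChar(SUBTRACTION, 'b') is false while matchesChar(SUBTRACTION, 'd') is true, which is the behaviour the question is after.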
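The most important "caveat" of Java regex is that the matches() method attempts to match a pattern against the whole string, as if it were anchored at both ends. This is atypical of most engines, and can be a source of confusion at times; use Matcher.find() to search for a match anywhere in the input.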
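The difference is easy to see side by side. A small sketch, with hypothetical helper names:

```java
import java.util.regex.Pattern;

// Hypothetical helpers contrasting whole-string and substring matching.
class MatchesVsFind {
    // True only if the entire input is matched by the regex.
    static boolean wholeStringMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if any substring of the input is matched by the regex.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }
}
```

With the regex \d+ (written "\\d+" in Java source) and the input abc123, wholeStringMatches returns false but containsMatch returns true.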
See also
regular-expressions.info/Flavor Comparison and Java Flavor Notes
On character class subtraction
Subtraction allows you to define for example "all consonants" in Java as [a-z&&[^aeiou]].
This syntax is specific to Java. In XML Schema, .NET, JGSoft and RegexBuddy, it's [a-z-[aeiou]]. Other flavors may not support this feature at all.
References
regular-expressions.info/Character Classes in XML Regular Expressions
MSDN - Regular Expression Character Classes - Subtraction
Related questions
What is the point behind character class intersections in Java’s Regex?
Java uses its own regular expression engine, whose behaviour is defined in the Pattern class.
You can test it with an Eclipse plugin or online.
RegexBuddy does not yet support the character class union, intersection, and subtraction syntax that is unique to the Java regular expression flavor. This is the only part of the Java regex syntax that RegexBuddy does not yet support. We're planning to implement this in a future version of RegexBuddy. The reason this has been postponed is because no other regular expression flavor supports this syntax.
P.S.: If you have a question about RegexBuddy in particular, please add the "regexbuddy" tag to your question. Then the question automatically shows up in my RSS reader. I don't follow the "regex" tag because far too many questions use that tag, and most are already answered by the time I see them.
I am writing a tool to help students learn regular expressions. I will probably be writing it in Java.
The idea is this: the student types in a regular expression and the tool shows which parts of a text will get matched by the regex. Simple enough.
But I want to support several different regex "flavors" such as:
Basic regular expressions (think: grep)
Extended regular expressions (think: egrep)
A subset of Perl regular expressions, including the character classes \w, \s, etc.
Sed-style regular expressions
Java has the java.util.regex package, but it supports only Perl-style regular expressions, which are roughly a superset of the basic and extended REs. What I think I need is a way to take any given regular expression and escape the meta-characters that aren't part of a given flavor. Then I could give it to the Pattern class and it would behave as if it was written for the selected RE interpreter.
For example, given the following regex:
^\w+[0-9]{5}-(\d{4})?$
As a basic regular expression, it would be interpreted as:
^\\w\+[0-9]\{5\}-\(\\d\{4\}\)\?$
As an extended regular expression, it would be:
^\\w+[0-9]{5}-(\\d{4})?$
And as a Perl-style regex, it would be the same as the original expression.
Is there a "regular expression for regular expressions" that I could run through a regex search-and-replace to quote the non-meta characters? What else could I do? Are there alternative Java classes I could use?
Alternatively, you could use Jakarta ORO?
This supports the following regex 'flavors':
Perl5 compatible regular expressions
AWK-like regular expressions
glob expressions
Check out this post for a 'regular expression for regular expressions': Is there a regular expression to detect a valid regular expression?
You can use this as a basis for your module.
I have written something similar: Is there a regular expression to detect a valid regular expression?
You could take part of that expression, and match each token separately:
[^?+*{}()[\]\\] # literal characters
\\[A-Za-z] # Character classes
\\\d+ # Back references
\\\W # Escaped characters
\[\^?(?:\\.|[^\\])+?\] # Bracketed character class
\((?:\?[:=!>]|\?<[=!])? # Beginning of a group
\) # End of a group
(?:[?+*]|\{\d+(?:,\d*)?\})\?? # Repetition
\| # Alternation
For each match, you could have some dictionary of appropriate replacements in the target flavor.
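To make the replacement idea concrete, here is a deliberately simplified converter from POSIX basic regular expressions to java.util.regex syntax. It swaps in a plain character loop instead of the token regexes above, and it rests on several assumptions: the class name BreToJava is invented, only the ( ) { } + ? | differences are handled, and bracket expressions and GNU extensions such as \+ are not treated specially.

```java
// Hypothetical translator: BRE syntax in, java.util.regex syntax out.
class BreToJava {
    static String convert(String bre) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < bre.length(); i++) {
            char c = bre.charAt(i);
            if (c == '\\' && i + 1 < bre.length()) {
                char next = bre.charAt(++i);
                if ("(){}".indexOf(next) >= 0) {
                    out.append(next);                 // \( \) \{ \} are operators in BRE
                } else {
                    out.append('\\').append(next);    // keep other escapes, e.g. \. or \*
                }
            } else if ("(){}+?|".indexOf(c) >= 0) {
                out.append('\\').append(c);           // literal in BRE, operator in Java
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For instance, the BRE \(ab\)\{2\} becomes (ab){2}, and the BRE a+b becomes a\+b, so that Pattern treats the plus sign as a literal. A converter for extended regular expressions would need a different, smaller table.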
If your target is a Unix / Linux system, why not just shell out to the definitive host of each regex? I.e., use grep for BRE, egrep for ERE, perl for PCRE, etc. The only thing your module would need to do is the UI. Most of the regex testers that I have seen (that are decent) use a variant of this approach.
If you want yet another library suggestion, look at TRE for the BRE / ERE / POSIX / AWK part. It does not support back references, so PCRE / Python / Ruby / JS / Java is out...
If you want your students to learn regex, why not use a freely available tool, Regex Coach (http://www.weitz.de/regex-coach/), which is pretty good for learning and evaluating regexes?
Also look at this SO thread on a similar issue: https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world
BR,
~A