I have a syntactic doubt about a Java grammar rule given in Oracle's Java specification manual. Here is an approximation of that rule, to the extent that SO's HTML restrictions allow it.
ArrayInitializer:
{ [VariableInitializerList] [,] }
VariableInitializerList:
VariableInitializer {, VariableInitializer}
The notation is explained in section 2.4 of the Java manual: [x] denotes zero or one occurrences of x, and {x} denotes zero or more.
However, I have the following doubts:
1. For the ArrayInitializer non-terminal, does the first curly brace { denote a terminal curly brace, or does it have the syntactic meaning I mentioned above?
2. For the VariableInitializerList non-terminal, I know that {, VariableInitializer} means something equivalent to the regex (a,b)*, but won't this kind of grammar also accept some other strings that do not actually fit the criterion?
3. I also want to confirm whether the square brackets in the first production are grammar operators (as in a regex) or simple terminals.
I find this grammar specification confusing. Can you help me understand it?
The Java specification uses font styles to distinguish between literal characters, as found in the input, and grammatical symbols (non-terminals and grammar operators). Literals are shown in fixed width, while grammar symbols are shown in italics.
That's a pretty subtle distinction, particularly for certain punctuation symbols [Note 1]. Fortunately, the only punctuation used as grammar operators are brackets and braces, and it's not that hard to see whether a brace is slanting (italic) or upright. The brace in ArrayInitializer is upright, and the bracket is slanting, as is the brace in VariableInitializerList. So the brace in ArrayInitializer is a literal character. The brackets in that production indicate that the enclosed grammar symbols are optional, and the braces in VariableInitializerList indicate that the enclosed symbols can be repeated any number of times, including zero. (That's effectively the Kleene *-operator, which, as you say, is used in regular expressions.)
I trust that answers your questions (1) and (3). I don't really understand your question 2. Note that the comma in VariableInitializer { , VariableInitializer } is a literal character (it's not in italic) so what's being described is a non-empty comma-separated list of initializers. I don't know why you think that differs from other Kleene star operators.
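To make the two productions concrete, here is a small sketch of array initializers that the grammar accepts, assuming the reading above (upright braces are literal, the bracketed parts are optional). The class and field names are hypothetical and only there for illustration.

```java
class ArrayInitializerDemo {
    // { }          : both optional parts omitted
    static final int[] EMPTY = { };
    // { , }        : VariableInitializerList omitted, optional comma present
    static final int[] ONLY_COMMA = { , };
    // { 1 }        : a list with a single VariableInitializer
    static final int[] ONE = { 1 };
    // { 1, 2, 3, } : the {, VariableInitializer} part repeated twice, plus the optional comma
    static final int[] THREE = { 1, 2, 3, };

    public static void main(String[] args) {
        System.out.println(EMPTY.length + " " + ONLY_COMMA.length + " "
                + ONE.length + " " + THREE.length);
    }
}
```

In every case the outer braces appear literally in the source, which is what you would expect if they are terminals rather than the "zero or more" operator.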
Notes
It doesn't help that a CSS bug affects the examples in section 2.4, which supposedly illustrate the grammar. The CSS forces everything in a "note" to be italicized, thereby hiding the distinction between grammar operators and literal characters.
I am trying the regex ([[.ch.]]*)c against the test string chchch. According to the spec:
[[.ch.]]*c matches the first to fifth character in the string chchch
When I test it in Java, it indeed matches those characters, but so does [[ch]]*c. Thus I am not sure if the collating symbol is respected. Is it?
TL;DR - No.
The specification you are reading/quoting is the Open Group's SUS (Single UNIX® Specification) version of the regular expression part of IEEE's POSIX (Portable Operating System Interface for uniX) collection of standards. (See https://www.regular-expressions.info/posix.html ¹)
In general, only POSIX-compliant regular expression engines fully support POSIX bracket expressions, which are essentially what other regex flavors call character classes but with a few special features, one being that [. and .] are interpreted as the start and end of a collating sequence when used within the expressions.
Unfortunately, very few regex engines are POSIX-compliant and, in fact, some claiming to implement POSIX regexes just use the regular expression syntax defined by POSIX and don't have full locale support. Thus they don't implement all/any of the bracket expression features/quirks.
Java's regular expressions are in no way POSIX-compliant, as can be seen from this Regular Expression Engine Comparison Chart ². Its regex package implements a "Perl-like" regex engine, missing a few features (e.g. conditional expressions and comments), but including some extra ones (e.g. possessive quantifiers and variable-length, but finite, look-behind assertions).
Neither Perl nor Java supports the collation-related bracket delimiters [= and =] (character equivalence), or [. and .] (collating sequence). Perl does support character classes using the POSIX [: and :] delimiters, but Java only supports them using the \p operator (with a few caveats as explained here).
So, what is going on with the regex [[.ch.]]*c in Java? (I'm ignoring the capturing group as it doesn't change the analysis.)
Well, it turns out that Java's regex package supports unions in its character classes. This is achieved by nesting. For example, [set1[set2]] is equivalent to [set3] where the characters in set3 are the union of the characters in set1 and the characters in set2. (As an aside, note that [[set1][set2]] and [[set1]set2] also produce the same result.)
So, [[.ch.]] is simply the character class containing the union of an empty set of characters with the set of characters in the character class [.ch.], so basically it's the same as the character class [.ch.]. This is equivalent to [.ch] (since the second . is redundant) and thus [[.ch.]]*c is the same as [.ch]*c.
Similarly, [[ch]]*c simplifies to [ch]*c.
Finally, since there aren't any . characters in the string chchch, the regexes [.ch]*c and [ch]*c will produce the same result. (Try testing against the string c.hchch to see the difference and prove the above.)
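If you want to check this yourself, here is a minimal sketch using java.util.regex. The helper name firstMatch is my own (hypothetical); it just returns the first substring that the pattern finds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedClassDemo {
    // Returns the first substring matched by regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // No '.' in the input: both patterns behave the same.
        System.out.println(firstMatch("[[.ch.]]*c", "chchch"));
        System.out.println(firstMatch("[[ch]]*c", "chchch"));
        // A '.' in the input: only the class that contains '.' can run across it.
        System.out.println(firstMatch("[[.ch.]]*c", "c.hchch"));
        System.out.println(firstMatch("[[ch]]*c", "c.hchch"));
    }
}
```

With c.hchch the first pattern matches c.hchc while the second matches only the leading c, which is consistent with . being treated as an ordinary member of the class rather than as part of a collating delimiter.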
Notes:
This is not a very good example for either demonstrating collating sequences or for detecting if they are implemented, as [[.ch.]]*c will match chchc in chchch both when collating sequences are supported (and ch is a valid sequence in the current locale) and when they are not but unions are.
A much better demo/test is to use the regex [[.ch.]] with the test string ch:
Collating sequences are supported if ch is matched.
Any other match means they are not.
They may be supported if an error is returned, as this is what happens if ch is not a valid sequence in the current locale (it's a valid collating sequence in the Czech locale):
If the error specifies that ch is not a valid collating sequence, then they are supported.
If the error returned is that the delimiter/token [. and/or .] is invalid/unsupported, then collating sequences are not supported.
If the error is ambiguous, or for a guaranteed way to check for support, you need to switch to the Czech locale (and confirm that ch is indeed a valid collating sequence) or switch to any other locale that has at least one defined collating sequence which can be used instead of ch.
¹ I am neither Jan Goyvaerts nor in any way affiliated with the Regular-Expressions.info site.
² Nor am I CMCDragonkai.
I'm writing a custom assembler in Java for the 6502 microprocessor instruction set. One of the main parts of an assembler is checking that the syntax of the assembly program is valid before it is converted into byte form.
So far I have worked out the following rules that will all need to be checked for each line of the assembly program
All instructions must be three letters long and match an instruction in the instruction table.
Branch labels may contain only alphanumeric characters.
Operands cannot contain symbols other than "( ) $ # , + -".
Opening parentheses in operands must be closed.
Operands can contain at most one pair of parentheses.
$ and # must be followed by numeric characters in operands.
Commas must exist between a value and a value OR a parenthesis and a value i.e. (xxx,yyy) or (xxx),yyy
I am coding the assembler in Java, and as such I was thinking about using regex patterns in order to check the validity of the above rules. Is this something that regex can be used for? I have used regex in the past but usually just single checks and nothing as extensive as this.
I'm not asking anyone to work out the regex patterns that could be used for these rules (although I would be grateful if anyone could as I really have no idea how to do some of them), I just want to know if checking these rules is something that is possible with regex.
A regular expression can check that a string consists of exactly 3 letters, but it can't tell you whether that string exists in a table.
A regular expression can check that a string contains only alphanumeric characters.
A regular expression can check that a string contains only certain symbols.
A regular expression cannot match parentheses nested to an arbitrary depth, but since your rules allow at most one pair, it can check that a single opening parenthesis is followed by a closing one.
The last three rules can also be checked via regular expressions.
See the javadoc for class java.util.regex.Pattern.
For example, a regex for the first rule is \p{Alpha}{3} (note that \p{Alnum}{3} would also accept digits).
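As a rough sketch of how a few of these checks could look in Java (Java 9+ for Set.of): the instruction table here is a hypothetical three-entry stand-in, the class and method names are my own, and I use \p{Alpha} for the mnemonic because the rule asks for letters. Since at most one pair of parentheses is allowed, a single pattern is enough for the parenthesis rule.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class AsmSyntaxSketch {
    // Hypothetical, deliberately tiny instruction table (assumption, not the real 6502 set).
    static final Set<String> INSTRUCTIONS = Set.of("LDA", "STA", "JMP");

    static final Pattern MNEMONIC = Pattern.compile("\\p{Alpha}{3}");
    static final Pattern LABEL = Pattern.compile("\\p{Alnum}+");
    // Alphanumerics plus the symbols ( ) $ # , + - and nothing else.
    static final Pattern OPERAND_CHARS = Pattern.compile("[\\p{Alnum}()$#,+-]*");
    // Either no parentheses at all, or exactly one properly closed pair.
    static final Pattern ONE_PAIR = Pattern.compile("[^()]*(\\([^()]*\\)[^()]*)?");

    // The regex checks the shape; the table lookup has to be ordinary code.
    static boolean isValidMnemonic(String s) {
        return MNEMONIC.matcher(s).matches()
                && INSTRUCTIONS.contains(s.toUpperCase(Locale.ROOT));
    }

    static boolean isValidLabel(String s) {
        return LABEL.matcher(s).matches();
    }

    static boolean isValidOperand(String s) {
        return OPERAND_CHARS.matcher(s).matches() && ONE_PAIR.matcher(s).matches();
    }
}
```

Keeping one small pattern per rule, rather than one giant expression, also makes it easier to report which rule a line violated.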
I don't really understand regular expressions. Can you explain them to me in an easy-to-follow manner? If there are any online tools or books, could you also link to them?
The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.
If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn how to write and debug your own patterns but also how to understand patterns written by others.
Start simple
Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.
Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.
If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)
Order from the menu
Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.
The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].
Think of character classes as menus: pick just one.
Helpful shortcuts
Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).
The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
Once is not enough
From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are
* (zero or more times)
+ (one or more times)
{n} (exactly n times)
{n,} (at least n times)
{n,m} (at least n times but no more than m times)
Putting some of these blocks together, the pattern [Nn]*ick matches all of
ick
Nick
nick
Nnick
nNick
nnick
(and so on)
The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
A few other useful examples:
[0-9]+ (and its equivalent \d+) matches any non-negative integer
\d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01
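Here is a small Java sketch of the quantifier examples above. Two things to keep in mind: in a Java string literal each backslash is doubled, and String.matches only succeeds when the whole string matches the pattern. The class and method names are hypothetical.

```java
class QuantifierDemo {
    // True if the whole input consists of zero or more N/n followed by "ick".
    static boolean isIck(String s) {
        return s.matches("[Nn]*ick");
    }

    // True if the whole input is 'a', an optional 'b', then 'c'.
    static boolean isAbc(String s) {
        return s.matches("ab?c");
    }

    // True if the whole input is one or more digits.
    static boolean isNonNegativeInteger(String s) {
        return s.matches("\\d+");
    }

    // True if the whole input is shaped like 2019-01-01 (shape only, not calendar validity).
    static boolean looksLikeDate(String s) {
        return s.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    public static void main(String[] args) {
        System.out.println(isIck("ick") + " " + isIck("nNick") + " " + isIck("Rick"));
        System.out.println(isAbc("ac") + " " + isAbc("abc") + " " + isAbc("abbc"));
        System.out.println(isNonNegativeInteger("42") + " " + isNonNegativeInteger(""));
        System.out.println(looksLikeDate("2019-01-01") + " " + looksLikeDate("2019-1-1"));
    }
}
```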
Grouping
A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.
To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
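A short Java sketch of the difference between quantifying a single character and quantifying a group, plus one capture. The date pattern and the helper name yearOf are my own additions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingDemo {
    // The + applies only to 'c'.
    static boolean ungrouped(String s) {
        return Pattern.matches("0abc+0", s);
    }

    // The + applies to the whole group "abc".
    static boolean grouped(String s) {
        return Pattern.matches("0(abc)+0", s);
    }

    // Captures the year from a date shaped like 2019-01-01, or returns null.
    static String yearOf(String date) {
        Matcher m = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})").matcher(date);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(ungrouped("0abccc0") + " " + ungrouped("0abcabc0"));
        System.out.println(grouped("0abcabc0") + " " + grouped("0abccc0"));
        System.out.println(yearOf("2019-01-01"));
    }
}
```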
Alternation
Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).
For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
Escaping
Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
Greediness
Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.
For example, say the input is
"Hello," she said, "How are you?"
You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.
To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question, works. It matches a literal left-parenthesis, followed by one or more characters, and terminated by a right-parenthesis.
If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
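The greedy and non-greedy behaviour can be compared side by side in Java. This is only a sketch; the helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Whole text of the first match, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Text of capture group 1 in the first match, or null.
    static String firstCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String quote = "\"Hello,\" she said, \"How are you?\"";
        System.out.println(firstMatch("\".+\"", quote));   // greedy: runs to the last quote
        System.out.println(firstMatch("\".+?\"", quote));  // non-greedy: stops at the next quote
        System.out.println(firstCapture("\\((.+?)\\)", "(123) (456)"));
        System.out.println(firstCapture("\\((.+)\\)", "(123) (456)"));
    }
}
```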
(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)
Anchors
Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.
Say you want to match comments of the form
-- This is a comment --
you'd write ^--\s+(.+)\s+--$.
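In Java, the comment pattern could be used roughly like this; the method name commentText is hypothetical. The capture group holds everything between the bookends.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    static final Pattern COMMENT = Pattern.compile("^--\\s+(.+)\\s+--$");

    // Returns the text between the leading and trailing "--", or null if the line is not a comment.
    static String commentText(String line) {
        Matcher m = COMMENT.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(commentText("-- This is a comment --"));
        // The anchors reject input that has extra text before the opening "--".
        System.out.println(commentText("x -- This is a comment --"));
    }
}
```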
Build your own
Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.
Tools for writing and debugging regexes:
RegExr (for JavaScript)
Perl: YAPE: Regex Explain
Regex Coach (engine backed by CL-PPCRE)
RegexPal (for JavaScript)
Regular Expressions Online Tester
Regex Buddy
Regex 101 (for PCRE, JavaScript, Python, Golang, Java 8)
I Hate Regex
Visual RegExp
Expresso (for .NET)
Rubular (for Ruby)
Regular Expression Library (Predefined Regexes for common scenarios)
Txt2RE
Regex Tester (for JavaScript)
Regex Storm (for .NET)
Debuggex (visual regex tester and helper)
Books
Mastering Regular Expressions, the 2nd Edition, and the 3rd edition.
Regular Expressions Cheat Sheet
Regex Cookbook
Teach Yourself Regular Expressions
Free resources
RegexOne - Learn with simple, interactive exercises.
Regular Expressions - Everything you should know (PDF Series)
Regex Syntax Summary
How Regexes Work
JavaScript Regular Expressions
Footnote
†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.
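A quick Java sketch of the footnote: by default . stops at a newline, Pattern.DOTALL lifts that restriction, and [\s\S] works without any flag. The helper name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotDemo {
    // First match of regex in input using the given Pattern flags, or null.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String twoLines = "a\nb";
        System.out.println(firstMatch(".+", 0, twoLines).length());              // stops before the newline
        System.out.println(firstMatch(".+", Pattern.DOTALL, twoLines).length()); // crosses the newline
        System.out.println(firstMatch("[\\s\\S]+", 0, twoLines).length());       // flag-free alternative
    }
}
```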
I have a question about removing unwanted characters or, better put, keeping only certain ones. I came across the following note about a String literal some time ago, but I don't understand how to use it or how it can help me achieve my goal.
The String literal "[^\p{Alpha}-']" may be used to match any
character that is NOT alphabetic, a dash, or apostrophe; you may find
this useful when using replaceAll()
I understand what replaceAll() does, but I don't understand the little codes like [a-zA-Z] that you can use in it, or where to look to find more of them. So I pretty much want to do what the quote says, and keep only the letters and some punctuation.
What you are describing are regular expressions, or regex for short. They are a tool implemented in many programming languages (including Java) that lets you handle strings in one line of code where the job would otherwise be more complicated and tedious.
I suggest this link for a more in depth tutorial.
replaceAll() uses regexes.
There's too much to explain in a single post, but I will explain a little.
Here's a regex: [^A-Za-z.?!]
[] signifies a character class. It will match one of the contained characters (as modified by meta-characters).
^ When this is the first character in a char class, it is a meta-character meaning NOT.
A-Z signifies a range. Anything between those ASCII/Unicode values will be matched
The ., ?, ! are treated as literals (in other contexts they can become meta-characters).
So this regex, quoted and passed to replaceAll(), will replace every character that is not a letter, ., ?, or !.
The second parameter of replaceAll() also treats some characters specially: for example, $1 does not literally mean the text $1 but refers to the first capture group.
You'll need to learn about more advanced regex features (capture groups) before you use $1.
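To tie this together, here is a hedged Java sketch. The method names are hypothetical, and I wrote the quoted pattern as [^\p{Alpha}'-] with the dash last, which is my own choice to keep it from being read as a range. Matcher.quoteReplacement is the standard way to use a replacement containing $ literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceAllDemo {
    // Removes everything that is not an ASCII letter, '.', '?' or '!'.
    static String keepLettersAndPunctuation(String s) {
        return s.replaceAll("[^A-Za-z.?!]", "");
    }

    // Removes everything that is not alphabetic, an apostrophe, or a dash.
    static String keepNameCharacters(String s) {
        return s.replaceAll("[^\\p{Alpha}'-]", "");
    }

    // $1 and $2 in the replacement refer to the first and second capture groups.
    static String swapTwoWords(String s) {
        return s.replaceAll("(\\w+) (\\w+)", "$2 $1");
    }

    // Treats both the target and the replacement as plain text, so "$1" stays "$1".
    static String replaceLiterally(String s, String target, String replacement) {
        return s.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(keepLettersAndPunctuation("Hello, World! 123?"));
        System.out.println(keepNameCharacters("O'Neil-Smith 42!"));
        System.out.println(swapTwoWords("John Smith"));
        System.out.println(replaceLiterally("price: X", "X", "$1"));
    }
}
```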
What is the regex for allowing ')' if and only if '(' was used earlier, i.e. making ')' mandatory once '(' has been used? I have tried ^[a-zA-Z]+(([)]?[,]?[a-zA-Z0-9 ][. -/']?[(]?[a-zA-Z0-9][)]?)?[a-zA-Z:.])$ . But I can't make it accept ')' only when '(' has been used.
Regex cannot take care of context. In your case you're seeking to find context. Regex is not meant for that. You need to write a function that checks this.
Citing from this link:
In the context of formal language theory, something is called
“regular” when it has a grammar where all production rules have one of
the following forms:
B -> a
B -> aC
B -> ε
You can read those -> rules as “The left hand side can be replaced
with the right hand side”. So the first rule would be “B can be
replaced with a”, the second one “B can be replaced with aC” and the
third one “B can be replaced with the empty string” (ε is the symbol
for the empty string).
So what are B, C and a? By convention, uppercase characters denote so
called “non-terminals” - symbols which can be broken down further -
and lowercase characters denote “terminals” - symbols which cannot be
broken down any further.
In your case you are looking for something like:
(\([x].*\)[x])*
I added the [x] to stand for "x times" (it's not part of regex syntax, of course). As you can see from the definition of "regular" above, there is no way to write such an expression so that it complies with that definition.
This is not just a "grey" definition issue. A regex-like language that could solve problems like the one you describe would be much more complicated, both algorithmically and in terms of complexity. Matching this kind of pattern belongs to a different problem domain altogether.
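For the function suggested above, a minimal sketch in Java could keep a running depth counter. The class and method names are hypothetical, and it only looks at parentheses, ignoring every other character.

```java
class ParenCheck {
    // True if every ')' has an earlier unmatched '(' and every '(' is eventually closed.
    static boolean isBalanced(String s) {
        int depth = 0;
        for (char ch : s.toCharArray()) {
            if (ch == '(') {
                depth++;
            } else if (ch == ')') {
                depth--;
                if (depth < 0) {
                    return false; // a ')' appeared before its '('
                }
            }
        }
        return depth == 0; // any '(' still open means the input is invalid
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("abc (def) ghi"));
        System.out.println(isBalanced("abc ) def ("));
        System.out.println(isBalanced("abc (def"));
    }
}
```

You could run this check first and then use a simpler regex for the remaining character rules.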