Related
This question already has answers here:
Regular expression to match balanced parentheses
(21 answers)
Closed 3 years ago.
Is it possible to write a regular expression that matches a nested pattern that occurs an unknown number of times? For example, can a regular expression match an opening and closing brace when there are an unknown number of open/close braces nested within the outer braces?
For example:
public MyMethod()
{
if (test)
{
// More { }
}
// More { }
} // End
Should match:
{
if (test)
{
// More { }
}
// More { }
}
No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.
You can match nested/paired elements up to a fixed depth, where the depth is only limited by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: O(n^3) vs. O(n), with n = length(input).
There are many parser generators avialable, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.
For more background: Automata Theory at Wikipedia
Using regular expressions to check for nested patterns is very easy.
'/(\((?>[^()]+|(?1))*\))/'
Probably working Perl solution, if the string is on one line:
my $NesteD ;
$NesteD = qr/ \{( [^{}] | (??{ $NesteD }) )* \} /x ;
if ( $Stringy =~ m/\b( \w+$NesteD )/x ) {
print "Found: $1\n" ;
}
HTH
EDIT: check:
http://dev.perl.org/perl6/rfc/145.html
ruby information: http://www.ruby-forum.com/topic/112084
more perl: http://www.perlmonks.org/?node_id=660316
even more perl: https://metacpan.org/pod/Text::Balanced
perl, perl, perl: http://perl.plover.com/yak/regex/samples/slide083.html
And one more thing by Torsten Marek (who had pointed out correctly, that it's not a regex anymore):
http://coding.derkeiler.com/Archive/Perl/comp.lang.perl.misc/2008-03/msg01047.html
The Pumping lemma for regular languages is the reason why you can't do that.
The generated automaton will have a finite number of states, say k, so a string of k+1 opening braces is bound to have a state repeated somewhere (as the automaton processes the characters). The part of the string between the same state can be duplicated infinitely many times and the automaton will not know the difference.
In particular, if it accepts k+1 opening braces followed by k+1 closing braces (which it should) it will also accept the pumped number of opening braces followed by unchanged k+1 closing brases (which it shouldn't).
Yes, if it is .NET RegEx-engine. .Net engine supports finite state machine supplied with an external stack. see details
Proper Regular expressions would not be able to do it as you would leave the realm of Regular Languages to land in the Context Free Languages territories.
Nevertheless the "regular expression" packages that many languages offer are strictly more powerful.
For example, Lua regular expressions have the "%b()" recognizer that will match balanced parenthesis. In your case you would use "%b{}"
Another sophisticated tool similar to sed is gema, where you will match balanced curly braces very easily with {#}.
So, depending on the tools you have at your disposal your "regular expression" (in a broader sense) may be able to match nested parenthesis.
YES
...assuming that there is some maximum number of nestings you'd be happy to stop at.
Let me explain.
#torsten-marek is right that a regular expression cannot check for nested patterns like this, BUT it is possible to define a nested regex pattern which will allow you to capture nested structures like this up to some maximum depth. I created one to capture EBNF-style comments (try it out here), like:
(* This is a comment (* this is nested inside (* another level! *) hey *) yo *)
The regex (for single-depth comments) is the following:
m{1} = \(+\*+(?:[^*(]|(?:\*+[^)*])|(?:\(+[^*(]))*\*+\)+
This could easily be adapted for your purposes by replacing the \(+\*+ and \*+\)+ with { and } and replacing everything in between with a simple [^{}]:
p{1} = \{(?:[^{}])*\}
(Here's the link to try that out.)
To nest, just allow this pattern within the block itself:
p{2} = \{(?:(?:p{1})|(?:[^{}]))*\}
...or...
p{2} = \{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\}
To find triple-nested blocks, use:
p{3} = \{(?:(?:p{2})|(?:[^{}]))*\}
...or...
p{3} = \{(?:(?:\{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\})|(?:[^{}]))*\}
A clear pattern has emerged. To find comments nested to a depth of N, simply use the regex:
p{N} = \{(?:(?:p{N-1})|(?:[^{}]))*\}
where N > 1 and
p{1} = \{(?:[^{}])*\}
A script could be written to recursively generate these regexes, but that's beyond the scope of what I need this for. (This is left as an exercise for the reader. 😉)
Using the recursive matching in the PHP regex engine is massively faster than procedural matching of brackets. especially with longer strings.
http://php.net/manual/en/regexp.reference.recursive.php
e.g.
$patt = '!\( (?: (?: (?>[^()]+) | (?R) )* ) \)!x';
preg_match_all( $patt, $str, $m );
vs.
matchBrackets( $str );
function matchBrackets ( $str, $offset = 0 ) {
$matches = array();
list( $opener, $closer ) = array( '(', ')' );
// Return early if there's no match
if ( false === ( $first_offset = strpos( $str, $opener, $offset ) ) ) {
return $matches;
}
// Step through the string one character at a time storing offsets
$paren_score = -1;
$inside_paren = false;
$match_start = 0;
$offsets = array();
for ( $index = $first_offset; $index < strlen( $str ); $index++ ) {
$char = $str[ $index ];
if ( $opener === $char ) {
if ( ! $inside_paren ) {
$paren_score = 1;
$match_start = $index;
}
else {
$paren_score++;
}
$inside_paren = true;
}
elseif ( $closer === $char ) {
$paren_score--;
}
if ( 0 === $paren_score ) {
$inside_paren = false;
$paren_score = -1;
$offsets[] = array( $match_start, $index + 1 );
}
}
while ( $offset = array_shift( $offsets ) ) {
list( $start, $finish ) = $offset;
$match = substr( $str, $start, $finish - $start );
$matches[] = $match;
}
return $matches;
}
as zsolt mentioned, some regex engines support recursion -- of course, these are typically the ones that use a backtracking algorithm so it won't be particularly efficient. example: /(?>[^{}]*){(?>[^{}]*)(?R)*(?>[^{}]*)}/sm
No, you are getting into the realm of Context Free Grammars at that point.
This seems to work: /(\{(?:\{.*\}|[^\{])*\})/m
I have a Lexer Rule as follows:
PREFIX : [abcd]'_';
EXTRA : ('xyz' | 'XYZ' );
SUFFIX : [ab];
TCHAN : PREFIX EXTRA? DIGIT+ SUFFIX?;
and a parser rule:
tpin : TCHAN
;
In the exit_tpin() Listiner method, is there a syntax where I can extract the DIGIT component of the token? Right now I can get the ctx.TCHAN() element, but this is a string. I just want the digit portion of TCHAN.
Or should I remove TCHAN as a TOKEN and move that rule to be tpin (i.e)
tpin : PREFIX EXTRA? DIGIT+ SUFFIX?
Where I know how to extract DIGIT from the listener.
My guess is that by the time the TOKEN is presented to the parser it is too late to deconstruct it... but I was wondering if some ANTLR guru's out there knew of a technique.
If I re-write my TOKENIZER, there is a possiblity that TCHAN tokens will be missed for INT/ID tokens (I think thats why I ended up parsing as I do).
I can always do some regexp work in the listener method... but that seemed like bad form ... as I had the individual components earlier. I'm just lazy, and was wondering if a techniqe other than refactoring the parsing grammar was possible.
In The Definitive ANTLR Reference you can find examples of complex lexers where much of the work is done. But when learning ANTLR, I would advise to consider the lexer mostly for its splitting function of the input stream into small tokens. Then do the big work in the parser. In the present case I would do :
grammar Question;
/* extract digit */
question
: tpin EOF
;
tpin
// : PREFIX EXTRA? DIGIT+ SUFFIX?
// {System.out.println("The only useful information is " + $DIGIT.text);}
: PREFIX EXTRA? number SUFFIX?
{System.out.println("The only useful information is " + $number.text);}
;
number
: DIGIT+
;
PREFIX : [abcd]'_';
EXTRA : ('xyz' | 'XYZ' );
DIGIT : [0-9] ;
SUFFIX : [ab];
WS : [ \t\r\n]+ -> skip ;
Say the input is d_xyz123456b. With the first version
: PREFIX EXTRA? DIGIT+ SUFFIX?
you get
$ grun Question question -tokens data.txt
[#0,0:1='d_',<PREFIX>,1:0]
[#1,2:4='xyz',<EXTRA>,1:2]
[#2,5:5='1',<DIGIT>,1:5]
[#3,6:6='2',<DIGIT>,1:6]
[#4,7:7='3',<DIGIT>,1:7]
[#5,8:8='4',<DIGIT>,1:8]
[#6,9:9='5',<DIGIT>,1:9]
[#7,10:10='6',<DIGIT>,1:10]
[#8,11:11='b',<SUFFIX>,1:11]
[#9,13:12='<EOF>',<EOF>,2:0]
The only useful information is 6
Because the parsing of DIGIT+ translates to a loop which reuses DIGIT
setState(12);
_errHandler.sync(this);
_la = _input.LA(1);
do {
{
{
setState(11);
((TpinContext)_localctx).DIGIT = match(DIGIT);
}
}
setState(14);
_errHandler.sync(this);
_la = _input.LA(1);
} while ( _la==DIGIT );
and $DIGIT.text translates to ((TpinContext)_localctx).DIGIT.getText(), only the last digit is retained. That's why I define a subrule number
: PREFIX EXTRA? number SUFFIX?
which makes it easy to capture the value :
[#0,0:1='d_',<PREFIX>,1:0]
[#1,2:4='xyz',<EXTRA>,1:2]
[#2,5:5='1',<DIGIT>,1:5]
[#3,6:6='2',<DIGIT>,1:6]
[#4,7:7='3',<DIGIT>,1:7]
[#5,8:8='4',<DIGIT>,1:8]
[#6,9:9='5',<DIGIT>,1:9]
[#7,10:10='6',<DIGIT>,1:10]
[#8,11:11='b',<SUFFIX>,1:11]
[#9,13:12='<EOF>',<EOF>,2:0]
The only useful information is 123456
You can even make it simpler :
tpin
: PREFIX EXTRA? INT SUFFIX?
{System.out.println("The only useful information is " + $INT.text);}
;
PREFIX : [abcd]'_';
EXTRA : ('xyz' | 'XYZ' );
INT : [0-9]+ ;
SUFFIX : [ab];
WS : [ \t\r\n]+ -> skip ;
$ grun Question question -tokens data.txt
[#0,0:1='d_',<PREFIX>,1:0]
[#1,2:4='xyz',<EXTRA>,1:2]
[#2,5:10='123456',<INT>,1:5]
[#3,11:11='b',<SUFFIX>,1:11]
[#4,13:12='<EOF>',<EOF>,2:0]
The only useful information is 123456
In the listener you have a direct access to these values through the rule context TpinContext :
public static class TpinContext extends ParserRuleContext {
public Token INT;
public TerminalNode PREFIX() { return getToken(QuestionParser.PREFIX, 0); }
public TerminalNode INT() { return getToken(QuestionParser.INT, 0); }
public TerminalNode EXTRA() { return getToken(QuestionParser.EXTRA, 0); }
public TerminalNode SUFFIX() { return getToken(QuestionParser.SUFFIX, 0); }
I cannot get JavaCC to properly disambiguate tokens by their place in a grammar. I have the following JJTree file (I'll call it bug.jjt):
options
{
LOOKAHEAD = 3;
CHOICE_AMBIGUITY_CHECK = 2;
OTHER_AMBIGUITY_CHECK = 1;
SANITY_CHECK = true;
FORCE_LA_CHECK = true;
}
PARSER_BEGIN(MyParser)
import java.util.*;
public class MyParser {
public static void main(String[] args) throws ParseException {
MyParser parser = new MyParser(new java.io.StringReader(args[0]));
SimpleNode root = parser.production();
root.dump("");
}
}
PARSER_END(MyParser)
SKIP:
{
" "
}
TOKEN:
{
<STATE: ("state")>
|<PROD_NAME: (["a"-"z"])+ >
}
SimpleNode production():
{}
{
(
<PROD_NAME>
<STATE>
<EOF>
)
{return jjtThis;}
}
Generate the parser code with the following:
java -cp C:\path\to\javacc.jar jjtree bug.jjt
java -cp C:\path\to\javacc.jar javacc bug.jj
Now after compiling this, you can give run MyParser from the command line with a string to parse as the argument. It prints production if successful and spews an error if it fails.
I tried two simple inputs: foo state and state state. The first one parses, but the second one does not, since both state strings are tokenized as <STATE>. As I set LOOKAHEAD to 3, I expected it to use the grammar and see that one string state must be <STATE> and the other must be <PROD_NAME. However, no such luck. I have tried changing the various lookahead parameters to no avail. I am also not able to use tokenizer states (where you define different tokens allowable in different states), as this example is part of a more complicated system that will probably have a lot of these types of ambiguities.
Can anyone tell me how to make JavaCC properly disambiguate these tokens, without using tokenizer states?
This is covered in the FAQ under question 4.19.
There are three strategies outlined there
Putting choices in the grammar. See Bart Kiers's answer.
Using semantic look ahead. For this approach you get rid of the production defining STATE and write your grammar like this
void SimpleNode production():
{}
{
(
<PROD_NAME>
( LOOKAHEAD({getToken(1).kind == PROD_NAME && getToken(1).image.equals("state")})
<PROD_NAME>
...
|
...other choices...
)
)
{return jjtThis;}
}
If there are no other choices, then
void SimpleNode production():
{}
{
(
<PROD_NAME>
( LOOKAHEAD({getToken(1).kind == PROD_NAME && getToken(1).image.equals("state")})
<PROD_NAME>
...
|
{ int[][] expTokSeqs = new int[][] { new int[] {STATE } } ;
throw new ParseException(token, expTokSeqs, tokenImage) ; }
)
)
{return jjtThis;}
}
But, in this case, you need a production for STATE, as it is mentioned in the initialization of expTokSeqs. So you need a production.
< DUMMY > TOKEN : { < STATE : "state" > }
where DUMMY is a state that is never gone to.
Using lexical states. The title of the OP's question suggests he doesn't want to do this, but not why. It can be done if the state switching can be contained in the token manager. Suppose a file is a sequence of productions and each of production look like this.
name state : "a" | "b" name ;
That is it starts with a name, then the keyword "state" a colon, some tokens and finally a semicolon. (I'm just making this up as I have no idea what sort of language the OP is trying to parse.) Then you can use three lexical states DEFAULT, S0, and S1.
In the DEFAULT any sequence of letters (including "state") is a PROD_NAME. In DEFAULT, recognizing a PROD_NAME switches the state to S0.
In S0 any sequence of letters except "state" is a PROD_NAME and "state" is a STATE. In S0, recognizing a STATE token causes the tokenizer to switch to state S1.
In S1 any any sequence of letters (including "state") is a PROD_NAME. In S1, recognizing a SEMICOLON switches the state to DEFAULT.
So our example is tokenized like this
name state : "a" | "b" name ;
|__||______||_________________||_________
DEF- S0 S1 DEFAULT
AULT
The productions are written like this
<*> SKIP: { " " }
<S0> TOKEN: { <STATE: "state"> : S1 }
<DEFAULT> TOKEN:{ <PROD_NAME: (["a"-"z"])+ > : S0 }
<S0,S1> TOKEN:{ <PROD_NAME: (["a"-"z"])+ > }
<S1> TOKEN: { <SEMICOLON : ";" > : DEFAULT
<S0, DEFAULT> TOKEN : { <SEMICOLON : ";" > }
<*> TOKEN {
COLON : ":"
| ...etc...
}
It is possible for the parser to send state switching commands back to the tokenizer, but it is tricky to get it right and fragile. Se question 3.12 of the FAQ.
Lookahead does not concern the lexer while it composes characters to tokens. It is used by the parser when it matches non-terminals as composed from terminals (tokens).
If you define "state" to result in a token STATE, well, then that's what it is.
I agree with you, that tokenizer states aren't a good solution for permitting keywords to be used as identifiers. Is this really necessary? There's a good reason for HLL's not to permit this.
OTOH, if you can rewrite your grammar using just <PROD_NAME>s you might postpone the recognitions of the keywords during semantic analysis.
The LOOKAHEAD option only applies to the parser (production rules). The tokenizer is not affected by this: it will produce tokens without worrying what a production rule is trying to match. The input "state" will always be tokenized as a STATE, even if the parser is trying to match a PROD_NAME.
You could do something like this (untested, pseudo-ish grammar code ahead!):
SimpleNode production():
{}
{
(
prod_name_or_state()
<STATE>
<EOF>
)
{return jjtThis;}
}
SimpleNode prod_name_or_state():
{}
{
(
<PROD_NAME>
| <STATE>
)
{return jjtThis;}
}
which would match both "foo state" and "state state".
Or the equivalent, but more compact:
SimpleNode production():
{}
{
(
( <PROD_NAME> | <STATE> )
<STATE>
<EOF>
)
{return jjtThis;}
}
I'm missing some basic knowledge. Started playing around with ATLR today missing any source telling me how to do the following:
I'd like to parse a configuration file a program of mine currently reads in a very ugly way. Basically it looks like:
A [Data] [Data]
B [Data] [Data] [Data]
where A/B/... are objects with their associated data following (dynamic amount, only simple digits).
A grammar should not be that hard but how to use ANTLR now?
lexer only: A/B are tokens and I ask for the tokens he read. How to ask this and how to detect malformatted input?
lexer & parser: A/B are parser rules and... how to know the parser processed successfully A/B? The same object could appear multiple times in the file and I need to consider every single one. It's more like listing instances in the config file.
Edit:
My problem is not the grammer but how to get informed by parser/lexer what they actually found/parsed? Best would be: invoke a function upon recognition of a rule like recursive descent
ANTLR production rules can have return value(s) you can use to get the contents of your configuration file.
Here's a quick demo:
grammar T;
parse returns [java.util.Map<String, List<Integer>> map]
#init{$map = new java.util.HashMap<String, List<Integer>>();}
: (line {$map.put($line.key, $line.values);} )+ EOF
;
line returns [String key, List<Integer> values]
: Id numbers (NL | EOF)
{
$key = $Id.text;
$values = $numbers.list;
}
;
numbers returns [List<Integer> list]
#init{$list = new ArrayList<Integer>();}
: (Num {$list.add(Integer.parseInt($Num.text));} )+
;
Num : '0'..'9'+;
Id : ('a'..'z' | 'A'..'Z')+;
NL : '\r'? '\n' | '\r';
Space : (' ' | '\t')+ {skip();};
If you runt the class below:
import org.antlr.runtime.*;
import java.util.*;
public class Main {
public static void main(String[] args) throws Exception {
String input = "A 12 34\n" +
"B 5 6 7 8\n" +
"C 9";
TLexer lexer = new TLexer(new ANTLRStringStream(input));
TParser parser = new TParser(new CommonTokenStream(lexer));
Map<String, List<Integer>> values = parser.parse();
System.out.println(values);
}
}
the following will be printed to the console:
{A=[12, 34], B=[5, 6, 7, 8], C=[9]}
The grammar should be something like this (it's pseudocode not ANTLR):
FILE ::= STATEMENT ('\n' STATEMENT)*
STATEMENT ::= NAME ITEM*
ITEM = '[' \d+ ']'
NAME = \w+
If you are looking for way to execute code when something is parsed, you should either use actions or AST (look them up in the documentation).
I have some code in php I made using preg_grep for matching several words in any order that can exist in any context. I'm trying to convert it to java but i can't seem to figure it out.
My php code for converting a keyword to a regex string is:
function createRegexSearch($keywords)
{
$regex = '';
foreach ($keywords as $key)
$regex .= '(?=.*' . $key . ')';
return '/^' . $regex . '/i';
}
It would create a regex string similar to: /^(?=.*bot)/i - which should match robot, robots, bots etc. The same regex string doesn't seem to work in java which is leaving me confused. Currently in java I created a similar effect with contains but would rather use regex.
for (Map.Entry<String, String> entry : mKeyList.entrySet())
{
boolean found = true;
String val = entry.getValue().toLowerCase();
for (int i = 0; i < keywords.length; i++)
{
if (!val.contains(keywords[i].toLowerCase()))
found = false;
}
if (found)
ret.add(entry.getValue());
}
One thing that Java does differently than many languages is have two different ways of "matching" a regex against a target - "matches()" vs "find()" - matches is the equivalent of putting ^ and $ at the beginning and end of your expression, while find finds the first match (wherever it might be in the string) - for example while you might be able to find() .*bot in the target string robots, it would not be true to say that it matches() the target... I'm not entirely sure how the lookahead might affect this...
Without posted Java code (containing the problem), it's hard to tell you where you might be going wrong, but my guess is that it could very easily be in this area.
Also, the equivalent of putting /i at the end of your expression in Java (and .Net) is putting (?i) at the beginning of your expression (or any region you want to be case sensitive). Thus, /[a-f0-9]/i is equivalent to (?i)[a-f0-9]
The String contains is case sensitive, so the first set (PHP Code) will behave case in-sensitive since the usage of \i. But the java code will behave case sensitive. So there will be differences in behavior.
So if this is difference, you convert both the end to specific char set, say toUpperCase() before the contains check.
Also you are using a regex in PHP code and not in Java, any specific reason behind this?
Regards
Ajai G
You can use the embedded flag extension (?i) so the regex you should be using to match bot, robot, bots and robots is (?i)^(.*bots?)$ This should work with either String.matches or Pattern/Matcher
JMPL is simple java library, which could emulate some of the features pattern matching, using Java 8 features.
import org.kl.state.Else;
import static org.kl.pattern.DeconstructPattern.matches;
import static org.kl.pattern.DeconstructPattern.foreach;
import static org.kl.pattern.DeconstructPattern.let;
let(figure, (int w, int h) -> {
System.out.println("border: " + w + " " + h));
});
matches(figure).as(
Rectangle.class, (int w, int h) -> System.out.println("square: " + (w * h)),
Circle.class, (int r) -> System.out.println("square: " + (2 * Math.PI * r)),
Else.class, () -> System.out.println("Default square: " + 0)
);
foreach(listRectangles, (int w, int h) -> {
System.out.println("square: " + (w * h));
});