How to parse this grammar? - java

I want to create a recursive descent parser in Java for the following grammar (I have already managed to create the tokens). This is the relevant part of the grammar:
expression ::= numeric_expression | identifier | "null"
identifier ::= "a..z,$,_"
numeric_expression ::= ( ( "-" | "++" | "--" ) expression )
| ( expression ( "++" | "--" ) )
| ( expression ( "+" | "+=" | "-" | "-=" | "*" | "*=" | "/" | "/=" | "%" | "%=" ) expression )
arglist ::= expression { "," expression }
I have written code for parsing numeric_expression (assuming that on an invalid token it returns null):
NumericAST<? extends OpAST> parseNumericExpr() {
    OpAST op;
    if (token.getCodes() == Lexer.CODES.UNARY_OP) { // Check for a unary operator like "++" or "--"
        op = new UnaryOpAST(token.getValue());
        token = getNextToken();
        AST expr = parseExpr(); // Method that returns an expression node.
        if (expr == null) {
            op = null;
            return null;
        } else {
            if (checkSemi()) {
                System.out.println("UNARY AST CREATED");
                return new NumericAST<OpAST>(expr, op, false);
            } else {
                return null;
            }
        }
    } else { // Binary operation like "a+b", where a, b -> expression
        AST expr = parseExpr();
        if (expr == null) {
            return null;
        } else {
            token = getNextToken();
            if (token.getCodes() == Lexer.CODES.UNARY_OP) {
                op = new UnaryOpAST(token.getValue());
                return new NumericAST<OpAST>(expr, op, true);
            } else if (token.getCodes() == Lexer.CODES.BIN_OP) {
                op = new BinaryOpAST(token.getValue());
                token = getNextToken();
                AST expr2 = parseExpr();
                if (expr2 == null) {
                    op = null;
                    expr = null;
                    return null;
                } else {
                    if (checkSemi()) {
                        System.out.println("BINARY AST CREATED");
                        return new NumericAST<OpAST>(expr, op, expr2);
                    } else {
                        return null;
                    }
                }
            } else {
                expr = null;
                return null;
            }
        }
    }
}
Now, if I get a unary operator like ++ I can directly call this method, but I don't know how to recognize the other parts of the grammar that start with the same production, such as arglist and numeric_expression, which both start with expression.
My question is:
How do I know whether to call parseNumericExpr() or parseArgList() (a method not shown above) when I get an expression token?

In order to write a recursive descent parser, you need an appropriate top-down grammar, normally an LL(1) grammar, although it's common to write the grammar using EBNF operators, as in the example grammar on Wikipedia's page on recursive descent parsers.
Unfortunately, your grammar is not LL(1), and the question you raise is a consequence of that fact. An LL(1) grammar has the property that the parser can always determine which production to use by examining only the next input token, which puts some severe constraints on the grammar, including:
No two productions for the same non-terminal can start with the same symbol.
No production can be left-recursive (i.e. the first symbol on the right-hand side is the defining non-terminal).
Here's a small rearrangement of your grammar which will work:
-- I added number here in order to be explicit.
atom ::= identifier | number | "null" | "(" expression ")"
-- I added function calls here, but it's arguable that this syntax accepts
-- a lot of invalid expressions
primary ::= atom { "++" | "--" | "(" [ arglist ] ")" }
factor ::= [ "-" | "++" | "--" ] primary
term ::= factor { ( "*" | "/" | "%" ) factor }
value ::= term { ( "+" | "-" ) term }
-- This adds the ordinary "=" assignment to the list in case it was
-- omitted by accident. Also, see the note below.
expression ::= { value ( "=" | "+=" | "-=" | "*=" | "/=" | "%=" ) } value
arglist ::= expression { "," expression }
The last expression rule is an attempt to capture the usual syntax of assignment operators (which associate to the right, not to the left), but it suffers from a classic problem addressed by this highly related question. I don't think I have a better answer to this issue than the one I wrote three years ago, so I hope it is still useful.
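To see how the EBNF repetition in rules like term ::= factor { ("*" | "/" | "%") factor } turns into code, here is a minimal sketch in plain Java. It is not the asker's parser: the Lexer, AST and NumericAST classes are replaced by single-character tokens and strings, and the unary/increment operators are omitted, purely to show the shape of the loop.

```java
// A minimal sketch of recursive descent over the rearranged grammar.
// Tokens are just the characters of the input string; a real parser
// would use the asker's Lexer and build NumericAST nodes instead of
// parenthesized strings.
class ExprSketch {
    private final String src;
    private int pos = 0;

    ExprSketch(String src) { this.src = src; }

    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    // value ::= term { ("+" | "-") term }
    String value() {
        String left = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            left = "(" + left + op + term() + ")"; // loop = left associativity
        }
        return left;
    }

    // term ::= factor { ("*" | "/" | "%") factor }
    String term() {
        String left = factor();
        while (peek() == '*' || peek() == '/' || peek() == '%') {
            char op = src.charAt(pos++);
            left = "(" + left + op + factor() + ")";
        }
        return left;
    }

    // factor ::= identifier | "(" value ")"   (unary operators omitted here)
    String factor() {
        if (peek() == '(') {
            pos++;                 // consume "("
            String inner = value();
            pos++;                 // consume ")"
            return inner;
        }
        return String.valueOf(src.charAt(pos++)); // single-letter identifier
    }
}
```

For example, new ExprSketch("a+b*c").value() yields "(a+(b*c))": the term loop binds * tighter than +, with no ambiguity and only one token of lookahead per decision.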

Related

Integrate OOPath Syntax in a drools like rule, using Xtext

I'm trying to implement OOPath syntax in a Drools-like rule, but I have some issues with the non-variable OOPaths. For example, here is what I try to generate in the when block:
when $rt : string(Variables
$yr1t : /path1/F
$yr2t : /path2/F
$yr3t : path3.path4/PATH5
$yr4t : path3
$yr5t : /path3
Conditions
$yr4t == $yr5t + 3
$yr3t != $yr2t
//FROM HERE IS THE PROBLEM:
$yr3t == p/path/f
$yr3t == /g/t
/path2/F[g==$yr1t]
)
The problem I am facing is that my grammar doesn't support this format and I don't know how to modify the existing one in order to support even the last 3 statements.
Here is what I've tried so far:
Model:
declarations+=Declaration*
;
Temp:
elementType=ElementType
;
ElementType:
typeName=('string'|'int'|'boolean');
Declaration:
Rule
;
@Override
terminal ID: ('^'|'$')('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
terminal PATH: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
Rule:
'Filter'
'rule' ruleDescription=STRING
'#specification' QualifiedNameSpecification
'ruleflow-group' ruleflowDescription=STRING
'when' (name += ID ':' atribute += Temp '(' 'Variables'?
//(varName += ID ':' QualifiedNameVariabilePath)*
(variablesList += Variable)*
'Conditions'?
(exp += EvalExpression)*
')'
)*
;
QualifiedNameSpecification: '(' STRING ')';
QualifiedNameVariabilePath: (('/'|'.')? PATH)* ;
ExpressionsModel:
elements += AbstractElement*;
AbstractElement:
Variable | EvalExpression ;
Variable:
//'var'
name=ID ':' QualifiedNameVariabilePath; //expression=Expression;
EvalExpression:
//'eval'
expression=Expression;
Expression: Or;
Or returns Expression:
And ({Or.left=current} "||" right=And)*
;
And returns Expression:
Equality ({And.left=current} "&&" right=Equality)*
;
Equality returns Expression:
Comparison (
{Equality.left=current} op=("=="|"!=")
right=Comparison
)*
;
Comparison returns Expression:
PlusOrMinus (
{Comparison.left=current} op=(">="|"<="|">"|"<")
right=PlusOrMinus
)*
;
PlusOrMinus returns Expression:
MulOrDiv (
({Plus.left=current} '+' | {Minus.left=current} '-')
right=MulOrDiv
)*
;
MulOrDiv returns Expression:
Primary (
{MulOrDiv.left=current} op=('*'|'/')
right=Primary
)*
;
Primary returns Expression:
'(' Expression ')' |
{Not} "!" expression=Primary |
Atomic
;
Atomic returns Expression:
{IntConstant} value=INT |
{StringConstant} value=STRING |
{BoolConstant} value=('true'|'false') |
{VariableRef} variable=[Variable]
;
EDIT: To make it more clear, the question is how do I modify my grammar in order to support oopath syntax without binding them to a variable, something like a temporary object.

Indentation management in ANTLR4 for a python interpreter

I'm implementing a Python interpreter using ANTLR4 as lexer and parser generator. I used the grammar defined at this link:
https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4.
However, the implementation of indentation with the INDENT and DEDENT tokens within the lexer::members does not work when I define a compound statement.
For example if i define the following statement:
x=10
while x>2 :
    print("hello")
        x=x-3
So on the line where I reassign the x variable I should get an indentation error, which I don't get in my current state.
Should I edit something in the lexer code, or what?
This is the grammar I'm using, with the lexer::members and the NEWLINE rule defined as at the link above.
grammar python;
tokens { INDENT, DEDENT }
@lexer::members {
  // A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
  private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>();
  // The stack that keeps track of the indentation level.
  private java.util.Stack<Integer> indents = new java.util.Stack<>();
  // The amount of opened braces, brackets and parentheses.
  private int opened = 0;
  // The most recently produced token.
  private Token lastToken = null;

  @Override
  public void emit(Token t) {
    super.setToken(t);
    tokens.offer(t);
  }

  @Override
  public Token nextToken() {
    // Check if the end-of-file is ahead and there are still some DEDENTs expected.
    if (_input.LA(1) == EOF && !this.indents.isEmpty()) {
      // Remove any trailing EOF tokens from our buffer.
      for (int i = tokens.size() - 1; i >= 0; i--) {
        if (tokens.get(i).getType() == EOF) {
          tokens.remove(i);
        }
      }
      // First emit an extra line break that serves as the end of the statement.
      this.emit(commonToken(pythonParser.NEWLINE, "\n"));
      // Now emit as many DEDENT tokens as needed.
      while (!indents.isEmpty()) {
        this.emit(createDedent());
        indents.pop();
      }
      // Put the EOF back on the token stream.
      this.emit(commonToken(pythonParser.EOF, "<EOF>"));
      //throw new Exception("unexpected indentation at line " + this.getLine());
    }
    Token next = super.nextToken();
    if (next.getChannel() == Token.DEFAULT_CHANNEL) {
      // Keep track of the last token on the default channel.
      this.lastToken = next;
    }
    return tokens.isEmpty() ? next : tokens.poll();
  }

  private Token createDedent() {
    CommonToken dedent = commonToken(pythonParser.DEDENT, "");
    dedent.setLine(this.lastToken.getLine());
    return dedent;
  }

  private CommonToken commonToken(int type, String text) {
    int stop = this.getCharIndex() - 1;
    int start = text.isEmpty() ? stop : stop - text.length() + 1;
    return new CommonToken(this._tokenFactorySourcePair, type, DEFAULT_TOKEN_CHANNEL, start, stop);
  }

  // Calculates the indentation of the provided spaces, taking the
  // following rules into account:
  //
  // "Tabs are replaced (from left to right) by one to eight spaces
  //  such that the total number of characters up to and including
  //  the replacement is a multiple of eight [...]"
  //
  // -- https://docs.python.org/3.1/reference/lexical_analysis.html#indentation
  static int getIndentationCount(String spaces) {
    int count = 0;
    for (char ch : spaces.toCharArray()) {
      switch (ch) {
        case '\t':
          count += 8 - (count % 8);
          break;
        default:
          // A normal space char.
          count++;
      }
    }
    return count;
  }

  boolean atStartOfInput() {
    return super.getCharPositionInLine() == 0 && super.getLine() == 1;
  }
}
parse
:( NEWLINE parse
| block ) EOF
;
block
: (statement NEWLINE?| functionDecl)*
;
statement
: assignment
| functionCall
| ifStatement
| forStatement
| whileStatement
| arithmetic_expression
;
assignment
: IDENTIFIER indexes? '=' expression
;
functionCall
: IDENTIFIER OPAREN exprList? CPAREN #identifierFunctionCall
| PRINT OPAREN? exprList? CPAREN? #printFunctionCall
;
arithmetic_expression
: expression
;
ifStatement
: ifStat elifStat* elseStat?
;
ifStat
: IF expression COLON NEWLINE INDENT block DEDENT
;
elifStat
: ELIF expression COLON NEWLINE INDENT block DEDENT
;
elseStat
: ELSE COLON NEWLINE INDENT block DEDENT
;
functionDecl
: DEF IDENTIFIER OPAREN idList? CPAREN COLON NEWLINE INDENT block DEDENT
;
forStatement
: FOR IDENTIFIER IN expression COLON NEWLINE INDENT block DEDENT elseStat?
;
whileStatement
: WHILE expression COLON NEWLINE INDENT block DEDENT elseStat?
;
idList
: IDENTIFIER (',' IDENTIFIER)*
;
exprList
: expression (COMMA expression)*
;
expression
: '-' expression #unaryMinusExpression
| '!' expression #notExpression
| expression '**' expression #powerExpression
| expression '*' expression #multiplyExpression
| expression '/' expression #divideExpression
| expression '%' expression #modulusExpression
| expression '+' expression #addExpression
| expression '-' expression #subtractExpression
| expression '>=' expression #gtEqExpression
| expression '<=' expression #ltEqExpression
| expression '>' expression #gtExpression
| expression '<' expression #ltExpression
| expression '==' expression #eqExpression
| expression '!=' expression #notEqExpression
| expression '&&' expression #andExpression
| expression '||' expression #orExpression
| expression '?' expression ':' expression #ternaryExpression
| expression IN expression #inExpression
| NUMBER #numberExpression
| BOOL #boolExpression
| NULL #nullExpression
| functionCall indexes? #functionCallExpression
| list indexes? #listExpression
| IDENTIFIER indexes? #identifierExpression
| STRING indexes? #stringExpression
| '(' expression ')' indexes? #expressionExpression
| INPUT '(' STRING? ')' #inputExpression
;
list
: '[' exprList? ']'
;
indexes
: ('[' expression ']')+
;
PRINT : 'print';
INPUT : 'input';
DEF : 'def';
IF : 'if';
ELSE : 'else';
ELIF : 'elif';
RETURN : 'return';
FOR : 'for';
WHILE : 'while';
IN : 'in';
NULL : 'null';
OR : '||';
AND : '&&';
EQUALS : '==';
NEQUALS : '!=';
GTEQUALS : '>=';
LTEQUALS : '<=';
POW : '**';
EXCL : '!';
GT : '>';
LT : '<';
ADD : '+';
SUBTRACT : '-';
MULTIPLY : '*';
DIVIDE : '/';
MODULE : '%';
OBRACE : '{' {opened++;};
CBRACE : '}' {opened--;};
OBRACKET : '[' {opened++;};
CBRACKET : ']' {opened--;};
OPAREN : '(' {opened++;};
CPAREN : ')' {opened--;};
SCOLON : ';';
ASSIGN : '=';
COMMA : ',';
QMARK : '?';
COLON : ':';
BOOL
: 'true'
| 'false'
;
NUMBER
: INT ('.' DIGIT*)?
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]*
;
STRING
: ["] (~["\r\n] | '\\\\' | '\\"')* ["]
| ['] (~['\r\n] | '\\\\' | '\\\'')* [']
;
SKIPS
: ( SPACES | COMMENT | LINE_JOINING ){firstLine();} -> skip
;
NEWLINE
 : ( {atStartOfInput()}? SPACES
   | ( '\r'? '\n' | '\r' | '\f' ) SPACES?
   )
   {
     String newLine = getText().replaceAll("[^\r\n\f]+", "");
     String spaces = getText().replaceAll("[\r\n\f]+", "");
     int next = _input.LA(1);
     if (opened > 0 || next == '\r' || next == '\n' || next == '\f' || next == '#') {
       // If we're inside a list or on a blank line, ignore all indents,
       // dedents and line breaks.
       skip();
     }
     else {
       emit(commonToken(NEWLINE, newLine));
       int indent = getIndentationCount(spaces);
       int previous = indents.isEmpty() ? 0 : indents.peek();
       if (indent == previous) {
         // Skip indents of the same size as the present indent-size.
         skip();
       }
       else if (indent > previous) {
         indents.push(indent);
         emit(commonToken(pythonParser.INDENT, spaces));
       }
       else {
         // Possibly emit more than 1 DEDENT token.
         while (!indents.isEmpty() && indents.peek() > indent) {
           this.emit(createDedent());
           indents.pop();
         }
       }
     }
   }
 ;
fragment INT
: [1-9] DIGIT*
| '0'
;
fragment DIGIT
: [0-9]
;
fragment SPACES
: [ \t]+
;
fragment COMMENT
: '#' ~[\r\n\f]*
;
fragment LINE_JOINING
: '\\' SPACES? ( '\r'? '\n' | '\r' | '\f' )
;
No, this should not be handled in the grammar. The lexer should simply emit the (faulty) INDENT token. The parser should, at runtime, produce an error. Something like this:
String source = "x=10\n" +
                "while x>2 :\n" +
                "    print(\"hello\")\n" +
                "        x=x-3\n";
Python3Lexer lexer = new Python3Lexer(CharStreams.fromString(source));
Python3Parser parser = new Python3Parser(new CommonTokenStream(lexer));
// Remove default error-handling
parser.removeErrorListeners();
// Add custom error-handling
parser.addErrorListener(new BaseErrorListener() {
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object o, int i, int i1, String s, RecognitionException e) {
        CommonToken token = (CommonToken) o;
        if (token.getType() == Python3Parser.INDENT) {
            // The parser encountered an unexpected INDENT token
            // TODO throw your exception
        }
        // TODO handle other errors
    }
});
// Trigger the error
parser.file_input();
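The stack discipline behind those INDENT/DEDENT tokens can be sketched independently of ANTLR. This is a simplified model, not the generated lexer (blank lines, brackets and tab expansion are ignored): it only compares each line's indentation width against a stack, but it shows why the over-indented x=x-3 line produces the extra INDENT token that the error listener above would catch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class IndentSketch {
    // Emits "INDENT"/"DEDENT"/"LINE" markers for a sequence of per-line
    // indentation widths, mirroring the stack logic in the NEWLINE action.
    static List<String> tokens(int[] indents) {
        Deque<Integer> stack = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        for (int indent : indents) {
            int previous = stack.isEmpty() ? 0 : stack.peek();
            if (indent > previous) {
                // Deeper than before: open a new block.
                stack.push(indent);
                out.add("INDENT");
            } else {
                // Shallower: close blocks until the levels match again.
                while (!stack.isEmpty() && stack.peek() > indent) {
                    stack.pop();
                    out.add("DEDENT");
                }
            }
            out.add("LINE");
        }
        // At EOF, close any blocks still open.
        while (!stack.isEmpty()) { stack.pop(); out.add("DEDENT"); }
        return out;
    }
}
```

For the asker's input the widths are {0, 0, 4, 8}; the width 8 after 4 yields a second INDENT, and since the while-block grammar rule expects no INDENT there, the parser reports the error at that token.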

Cannot parse System.out.println() in JavaCC

I was trying to parse the System.out.println() statement as an OutputStatement for a Java grammar. Here's the production rule in EBNF:
Statement::=( LabeledStatement | AssertStatement | Block | EmptyStatement | StatementExpression | SwitchStatement | IfStatement | WhileStatement | DoStatement | ForStatement | BreakStatement | ContinueStatement | ReturnStatement | ThrowStatement | SynchronizedStatement | TryStatement|OutputStatement)
OutputStatement::="System.out.print"["ln"]"("Arguments")" ";"
This is strictly according to the Java grammar as specified in the JavaCC examples file C:\javacc-6.0\examples\JavaGrammars\Java 1.0.2.jj.
Now when I coded the production rule in JavaCC, it came out as:
OutputStmt OutputStatement():
{
Token tk;
Expression args;
boolean ln=false;
int line;
int column;
}
{
{line=token.beginLine;column=token.beginColumn;args=null;ln=false;}
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("System")})
"."
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("out")})
"."
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("print")})
[
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("ln")})
{
ln=true;
}
]
"("
args=Expression()
")" ";"
{
return new OutputStmt(line,column,token.endLine,token.endColumn,ln,args);
}
}
Now this throws LOOKAHEAD warnings and errors in the generated parser. Can anyone please help?
EDIT: The main problem, it seems, is that JavaCC generates methods that do not initialize Token tk, which gives me the error that tk cannot be resolved.
The following will work.
OutputStmt OutputStatement() :
{
Token tk;
Expression args;
boolean ln;
int line;
int column;
}
{
{line=token.beginLine;column=token.beginColumn;args=null;ln=false;}
LOOKAHEAD({getToken(1).image.equals("System")})
<ID>
"."
LOOKAHEAD({getToken(1).image.equals("out")})
<ID>
"."
LOOKAHEAD({getToken(1).image.equals("println") || getToken(1).image.equals("print") })
tk=<ID> { ln = tk.image.equals("println" ) ; }
"("
args=Expression()
")" ";"
{ return new OutputStmt(line,column,token.endLine,token.endColumn,ln,args); }
}
Note that I changed STRING_LITERAL to the more traditional ID.
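What those semantic LOOKAHEAD specifications do is peek at the images of upcoming tokens before committing to the production. A rough plain-Java sketch of the idea, using a hypothetical token list in place of JavaCC's generated token stream:

```java
import java.util.List;

class LookaheadSketch {
    // Hypothetical stand-in for the parser's token stream.
    private final List<String> images;
    private int pos = 0;

    LookaheadSketch(List<String> images) { this.images = images; }

    // Mirrors JavaCC's getToken(k): the k-th token ahead, 1-based.
    String getToken(int k) { return images.get(pos + k - 1); }

    // The decision the LOOKAHEADs in OutputStatement() encode: commit to
    // this production only if the upcoming tokens spell out
    // System . out . print / println
    boolean looksLikeOutputStatement() {
        return getToken(1).equals("System")
            && getToken(2).equals(".")
            && getToken(3).equals("out")
            && getToken(4).equals(".")
            && (getToken(5).equals("print") || getToken(5).equals("println"));
    }
}
```

The key difference from the asker's version: the check runs on getToken(1) before the token is consumed, so the generated lookahead code never touches an uninitialized tk variable.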

how can I make this JAVACC grammar work with [ ]?

I'm trying to change a grammar in the JSqlParser project, which uses a JavaCC grammar file (.jj) specifying the standard SQL syntax. I had difficulty getting one section to work, so I narrowed it down to the following, much simplified grammar.
Basically I have a definition of Column : [ table "." ] field
but table itself can also contain the "." character, which causes the confusion.
I think intuitively the following grammar should accept all of the following sentences:
select mytable.myfield
select myfield
select mydb.mytable.myfield
But in practice it only accepts the 2nd and 3rd of the above. Whenever it sees a ".", it commits to the two-dot version of table (i.e. the first alternative of the Table rule).
How can I make this grammar work?
Thanks a lot
Yang
options{
IGNORE_CASE=true ;
STATIC=false;
DEBUG_PARSER=true;
DEBUG_LOOKAHEAD=true;
DEBUG_TOKEN_MANAGER=false;
// FORCE_LA_CHECK=true;
UNICODE_INPUT=true;
}
PARSER_BEGIN(TT)
import java.util.*;
public class TT {
}
PARSER_END(TT)
///////////////////////////////////////////// main stuff concerned
void Statement() :
{ }
{
<K_SELECT> Column()
}
void Column():
{
}
{
[LOOKAHEAD(3) Table() "." ]
//[
//LOOKAHEAD(2) (
// LOOKAHEAD(5) <S_IDENTIFIER> "." <S_IDENTIFIER>
// |
// LOOKAHEAD(3) <S_IDENTIFIER>
//)
//
//
//
//]
Field()
}
void Field():
{}{
<S_IDENTIFIER>
}
void Table():
{}{
LOOKAHEAD(5) <S_IDENTIFIER> "." <S_IDENTIFIER>
|
LOOKAHEAD(3) <S_IDENTIFIER>
}
////////////////////////////////////////////////////////
SKIP:
{
" "
| "\t"
| "\r"
| "\n"
}
TOKEN: /* SQL Keywords. prefixed with K_ to avoid name clashes */
{
<K_CREATE: "CREATE">
|
<K_SELECT: "SELECT">
}
TOKEN : /* Numeric Constants */
{
< S_DOUBLE: ((<S_LONG>)? "." <S_LONG> ( ["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> "." (["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> ["e","E"] (["+", "-"])? <S_LONG>
)>
| < S_LONG: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}
TOKEN:
{
< S_IDENTIFIER: ( <LETTER> | <ADDITIONAL_LETTERS> )+ ( <DIGIT> | <LETTER> | <ADDITIONAL_LETTERS> | <SPECIAL_CHARS>)* >
| < #LETTER: ["a"-"z", "A"-"Z", "_", "$"] >
| < #SPECIAL_CHARS: "$" | "_" | "#" | "@">
| < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")*>
| < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])+ "\"" | ("`" (~["\n","\r","`"])+ "`") | ( "[" ~["0"-"9","]"] (~["\n","\r","]"])* "]" ) >
/*
To deal with database names (columns, tables) using not only latin base characters, one
can expand the following rule to accept additional letters. Here is the addition of german umlauts.
There seems to be no way to recognize letters by an external function to allow
a configurable addition. One must rebuild JSqlParser with this new "Letterset".
*/
| < #ADDITIONAL_LETTERS: ["ä","ö","ü","Ä","Ö","Ü","ß"] >
}
You could rewrite your grammar like this
Statement --> "select" Column
Column --> Prefix <ID>
Prefix --> (<ID> ".")*
Now the only choice is whether to iterate or not. Assuming a "." can't follow a Column, this is easily done with a lookahead of 2:
Statement --> "select" Column
Column --> Prefix <ID>
Prefix --> (LOOKAHEAD( <ID> ".") <ID> ".")*
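The effect of LOOKAHEAD(<ID> ".") is that the loop iterates only when the identifier is actually followed by a dot, so the last identifier is always left over for the field. A hand-rolled sketch of the same decision over a plain token list (hypothetical names, not JavaCC output):

```java
import java.util.List;

class PrefixSketch {
    // Returns how many <ID> "." pairs form the prefix of a column
    // reference such as mydb.mytable.myfield
    // (tokens: ["mydb", ".", "mytable", ".", "myfield"]).
    static int prefixLength(List<String> tokens) {
        int pos = 0;
        // Equivalent of: ( LOOKAHEAD(<ID> ".") <ID> "." )*
        // Iterate only while the token AFTER the identifier is a dot.
        while (pos + 1 < tokens.size() && tokens.get(pos + 1).equals(".")) {
            pos += 2; // consume <ID> "."
        }
        return pos / 2;
    }
}
```

With a lookahead of 2 the choice is always unambiguous: "myfield" alone consumes no prefix, "mytable.myfield" consumes one pair, "mydb.mytable.myfield" two, which is exactly why the rewritten grammar accepts all three sentences.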
Indeed, the following grammar in flex+bison (an LR parser) works fine, recognizing all the following sentences correctly:
create mydb.mytable
create mytable
select mydb.mytable.myfield
select mytable.myfield
select myfield
So it is indeed due to a limitation of LL parsing.
%%
statement:
create_sentence
|
select_sentence
;
create_sentence: CREATE table
;
select_sentence: SELECT table '.' ID
|
SELECT ID
;
table : table '.' ID
|
ID
;
%%
If you need Table to be its own nonterminal, you can do this by using a boolean parameter that says whether the table is expected to be followed by a dot.
void Statement():{}{
"select" Column() | "create" "table" Table(false) }
void Column():{}{
[LOOKAHEAD(<ID> ".") Table(true) "."] <ID> }
void Table(boolean expectDot):{}{
<ID> MoreTable(expectDot) }
void MoreTable(boolean expectDot) : {} {
LOOKAHEAD("." <ID> ".", {expectDot}) "." <ID> MoreTable(expectDot)
|
LOOKAHEAD(".", {!expectDot}) "." <ID> MoreTable(expectDot)
|
{}
}
Doing it this way precludes using Table in any syntactic lookahead specifications either directly or indirectly. E.g. you shouldn't have LOOKAHEAD( Table()) anywhere in your grammar, because semantic lookahead is not used during syntactic lookahead. See the FAQ for more information on that.
Your examples are parsed perfectly well using JSqlParser V0.9.x (https://github.com/JSQLParser/JSqlParser)
CCJSqlParserUtil.parse("SELECT mycolumn");
CCJSqlParserUtil.parse("SELECT mytable.mycolumn");
CCJSqlParserUtil.parse("SELECT mydatabase.mytable.mycolumn");

Regular Expressions - tree grammar Antlr Java

I'm trying to write a program in ANTLR (Java) for simplifying regular expressions. I have already written some code (grammar file contents below):
grammar Regexp_v7;
options{
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens{
DOT;
REPEAT;
RANGE;
NULL;
}
fragment
ZERO
: '0'
;
fragment
DIGIT
: '1'..'9'
;
fragment
EPSILON
: '#'
;
fragment
FI
: '%'
;
ID
: EPSILON
| FI
| 'a'..'z'
| 'A'..'Z'
;
NUMBER
: ZERO
| DIGIT (ZERO | DIGIT)*
;
WHITESPACE
: ('\r' | '\n' | ' ' | '\t' ) + {$channel = HIDDEN;}
;
list
: (reg_exp ';'!)*
;
term
: ID -> ID
| '('! reg_exp ')'!
;
repeat_exp
: term ('{' range_exp '}')+ -> ^(REPEAT term (range_exp)+)
| term -> term
;
range_exp
: NUMBER ',' NUMBER -> ^(RANGE NUMBER NUMBER)
| NUMBER (',') -> ^(RANGE NUMBER NULL)
| ',' NUMBER -> ^(RANGE NULL NUMBER)
| NUMBER -> ^(RANGE NUMBER NUMBER)
;
kleene_exp
: repeat_exp ('*'^)*
;
concat_exp
: kleene_exp (kleene_exp)+ -> ^(DOT kleene_exp (kleene_exp)+)
| kleene_exp -> kleene_exp
;
reg_exp
: concat_exp ('|'^ concat_exp)*
;
My next goal is to write the tree grammar, which should be able to simplify regular expressions (e.g. a|a -> a, etc.). I have done some coding (see text below), but I have trouble defining a rule that treats nodes as subtrees (in order to simplify expressions such as (a|a)|(a|a) to a, etc.).
tree grammar Regexp_v7Walker;
options{
language = Java;
tokenVocab = Regexp_v7;
ASTLabelType = CommonTree;
output=AST;
backtrack = true;
}
tokens{
NULL;
}
bottomup
: ^('*' ^('*' e=.)) -> ^('*' $e) //a** -> a*
| ^('|' i=.* j=.* {$i.tree.toStringTree() == $j.tree.toStringTree()} )
-> $i // There are 3 errors while this line is up and running:
// 1. CommonTree cannot be resolved,
// 2. i.tree cannot be resolved or is not a field,
// 3. i cannot be resolved.
;
Small driver class:
public class Regexp_Test_v7 {
    public static void main(String[] args) throws RecognitionException {
        CharStream stream = new ANTLRStringStream("a***;a|a;(ab)****;ab|ab;ab|aa;");
        Regexp_v7Lexer lexer = new Regexp_v7Lexer(stream);
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        Regexp_v7Parser parser = new Regexp_v7Parser(tokenStream);
        list_return list = parser.list();
        CommonTree t = (CommonTree) list.getTree();
        System.out.println("Original tree: " + t.toStringTree());
        CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
        Regexp_v7Walker s = new Regexp_v7Walker(nodes);
        t = (CommonTree) s.downup(t);
        System.out.println("Simplified tree: " + t.toStringTree());
    }
}
Can anyone help me with solving this case?
Thanks in advance and regards.
Now, I'm no expert, but in your tree grammar:
add filter=true
change the second line of bottomup rule to:
^('|' i=. j=. {i.toStringTree().equals(j.toStringTree()) }? ) -> $i }
If I'm not mistaken, by using i=.* you're allowing i to be non-existent, and you'll get a NullPointerException on the conversion to a String.
Both i and j are of type CommonTree because you've set it up this way: ASTLabelType = CommonTree, so you should call i.toStringTree().
And since it's Java and you're comparing Strings, use equals().
Also to make the expression in curly brackets a predicate, you need a question mark after the closing one.
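The simplification the predicate aims at can be mocked up in plain Java without ANTLR. Node below is a hypothetical stand-in for CommonTree; the point is that the two subtrees are compared by their printed form with equals(), and that the rule is applied bottom-up so (a|a)|(a|a) collapses all the way down to a.

```java
import java.util.List;

class Node {
    final String text;
    final List<Node> children;

    Node(String text, Node... children) {
        this.text = text;
        this.children = List.of(children);
    }

    // LISP-style printout, like CommonTree.toStringTree(): (| a a)
    String toStringTree() {
        if (children.isEmpty()) return text;
        StringBuilder sb = new StringBuilder("(").append(text);
        for (Node c : children) sb.append(' ').append(c.toStringTree());
        return sb.append(')').toString();
    }

    // a|a -> a, applied bottom-up: children are simplified first, so
    // nested duplicates such as (a|a)|(a|a) reduce step by step to a.
    static Node simplify(Node n) {
        Node[] kids = n.children.stream().map(Node::simplify).toArray(Node[]::new);
        Node rebuilt = new Node(n.text, kids);
        if (rebuilt.text.equals("|") && kids.length == 2
                && kids[0].toStringTree().equals(kids[1].toStringTree())) {
            return kids[0]; // equals(), not ==, compares the printed subtrees
        }
        return rebuilt;
    }
}
```

Distinct alternatives are left alone: simplify(a|b) still prints as (| a b), which is the same behavior the predicate-guarded rewrite rule would give in the filtered tree walker.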
