ANTLR4 doesn't manage with left recursion properly - java

I'm trying to describe the grammar for logical expressions using ANTLR4.
Surely, this grammar has direct left recursion, and as I've read ANTLR4 supports it.
grammar Logic;
#header {
package parser;
import expression.*;
}
expression returns [Expression value] : disjunction {$value = $disjunction.value;}
| disjunction IMPLIES expression {$value = new Implication($disjunction.value, $expression.value);};
disjunction returns [Expression value] : conjunction {$value = $conjunction.value;}
| disjunction OR conjunction {$value = new Disjunction($disjunction.value, $conjunction.value);};
conjunction returns [Expression value] : negation {$value = $negation.value;}
| conjunction AND negation {$value = new Conjunction($conjunction.value, $negation.value);};
negation returns [Expression value] : variable {$value = $variable.value;}
| NOT negation {$value = new Negation($negation.value);}
| OB expression CB {$value = $expression.value;};
variable returns [Expression value] : VAR {$value = new Variable($VAR.text);};
IMPLIES : '->';
OR : '|';
AND : '&';
NOT : '!';
OB : '(';
CB : ')';
VAR : [A-Z]([0-9])*;
But when I'm running antlr to generate parser for my grammar, it gives some strange errors:
error(65): Logic.g4:5:125: unknown attribute value for rule disjunction in $disjunction.value
error(65): Logic.g4:5:125: unknown attribute value for rule conjunction in $conjunction.value
When I swap disjunction and conjunction in disjunction rule, making disjunction right-associative, it works, but this can cause some mistakes in my work.
As it might be important, I generate parsers using ANTLR4-plugin for Intellij Idea.
What am I doing wrong?
Thank you in advance.

When referring to a disjunction in your code:
disjunction returns [Expression value]
: conjunction {$value = $conjunction.value;}
| disjunction OR conjunction {$value = new Disjunction($disjunction.value, $conjunction.value);}
;
ANTLR tries to get the value of the entire enclosing rule if you do $disjunction instead of the disjunction in disjunction OR ....
If you want to refer to disjunction in disjunction OR ..., you need to label it before referencing it:
disjunction returns [Expression value]
: c1=conjunction {$value = $c1.value;}
| d=disjunction OR c2=conjunction {$value = new Disjunction($d.value, $c2.value);}
;
Here's a full working (tested) example:
grammar Logic;
#header {
import expression.*;
}
expression returns [Expression value]
: d1=disjunction {$value = $d1.value;}
| d2=disjunction IMPLIES e=expression {$value = new Implication($d2.value, $e.value);}
;
disjunction returns [Expression value]
: c1=conjunction {$value = $c1.value;}
| d=disjunction OR c2=conjunction {$value = new Disjunction($d.value, $c2.value);}
;
conjunction returns [Expression value]
: n1=negation {$value = $n1.value;}
| c=conjunction AND n2=negation {$value = new Conjunction($c.value, $n2.value);}
;
negation returns [Expression value]
: variable {$value = $variable.value;}
| NOT n=negation {$value = new Negation($n.value);}
| OB expression CB {$value = $expression.value;}
;
variable returns [Expression value]
: VAR {$value = new Variable($VAR.text);}
;
IMPLIES : '->';
OR : '|';
AND : '&';
NOT : '!';
OB : '(';
CB : ')';
VAR : [A-Z]([0-9])*;
SPACE : [ \t\r\n] -> skip;

Related

ANTLR grammar returning java method name which can be executed using java compiler

I have the below ANTLR grammar:
tree grammar findApp;
options {
tokenVocab=TSRTemplate;
ASTLabelType=CommonTree;
backtrack=true;
}
#header {
}
#members {
}
prog returns [String value]
: ^(ROOT f=formula) {$value = $f.value; }
;
formula returns [String value]
: ^(DEFINITION INTEGER t=calc_expr) {$value = $t.value;}
;
calculation_expr returns [String value]
#init {
$value = null;
}
: calculation_unit {$value = $calculation_unit.value;}
| ^(PLUS n1=calculation_expr n2=calculation_expr) {
$value="addApproach(<calculation_expr1>, <calculation_expr2>)";
}
;
calc_unit returns [String value]
: cf_expr {$value = $cf_expr.value;}
;
cf_expr returns [String value]
: ^(GET_APP string_expr)
{
$value = "getApp("+$STRING.text + ")";
}
;
What I need to achieve is as below:
For example, the below grammar:
cf_expr returns [String value]
: ^(GET_APP string_expr)
{
$value = "getApp("+$STRING.text + ")";
}
;
will return getApp(<some parameter>), which is a method name.
This method is implemented in a java class (Method.java),
I want my grammar to return this method name and compile this java class (Method.java) and execute the method implementation in this class.
Any suggestions on how to integrate java class compiling with ANTLR will be helpful.

ANTLR4 Grammar only matching first part of parser rule

I'm using ANTLR 4 to try and parse task definitions. The task definitions look a little like the following:
task = { priority = 10; };
My grammar file then looks like the following:
grammar TaskGrammar;
/* Parser rules */
task : 'task' ASSIGNMENT_OP block EOF;
logical_entity : (TRUE | FALSE) # LogicalConst
| IDENTIFIER # LogicalVariable
;
numeric_entity : DECIMAL # NumericConst
| IDENTIFIER # NumericVariable
;
block : LBRACE (statement)* RBRACE SEMICOLON;
assignment : IDENTIFIER ASSIGNMENT_OP DECIMAL SEMICOLON
| IDENTIFIER ASSIGNMENT_OP block SEMICOLON
| IDENTIFIER ASSIGNMENT_OP QUOTED_STRING SEMICOLON
| IDENTIFIER ASSIGNMENT_OP CONSTANT SEMICOLON;
functionCall : IDENTIFIER LPAREN (parameter)*? RPAREN SEMICOLON;
parameter : DECIMAL
| QUOTED_STRING;
statement : assignment
| functionCall;
/* Lexxer rules */
IF : 'if' ;
THEN : 'then';
AND : 'and' ;
OR : 'or' ;
TRUE : 'true' ;
FALSE : 'false' ;
MULT : '*' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '==' ;
ASSIGNMENT_OP : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
SEMICOLON : ';' ;
// DECIMAL, IDENTIFIER, COMMENTS, WS are set using regular expressions
DECIMAL : '-'?[0-9]+('.'[0-9]+)? ;
IDENTIFIER : [a-zA-Z_][a-zA-Z_0-9]* ;
Value: STR_EXT | QUOTED_STRING | SINGLE_QUOTED
;
STR_EXT
:
[a-zA-Z0-9_/\.,\-:=~+!?$&^*\[\]#|]+;
Comment
:
'#' ~[\r\n]*;
CONSTANT : StringCharacters;
QUOTED_STRING
:
'"' StringCharacters? '"'
;
fragment
StringCharacters
: (~["\\] | EscapeSequence)+
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]?
;
SINGLE_QUOTED
:
'\'' ~['\\]* '\'';
// COMMENT and WS are stripped from the output token stream by sending
// to a different channel 'skip'
COMMENT : '//' .+? ('\n'|EOF) -> skip ;
WS : [ \r\t\u000C\n]+ -> skip ;
This grammar compiles fine in ANTLR, but when it comes to trying to use the parser, I get the following error:
line 1:0 mismatched input 'task = { priority = 10; return = AND; };' expecting 'task'
org.antlr.v4.runtime.InputMismatchException
It looks like the parser isn't recognising the block part of the definition, but I can't quite see why. The block parse rule definition should match as far as I can tell. I would expect to have a TaskContext, with a child BlockContext containing a single AssignmentContext. I get the TaskContext, but it has the above exception.
Am I missing something here? This is my first attempt at using Antler, so may be getting confused between Lexxer and Parser rules...
Your STR_EXT consumes the entire input. That rule has to go: ANTLR's lexer will always try to match as much characters as possible.
I also see that CONSTANT might consume that entire input. It has to go to, or at least be changed to consume less chars.

Regular Expressions - tree grammar Antlr Java

I'm trying to write a program in ANTLR (Java) concerning simplifying regular expression. I have already written some code (grammar file contents below)
grammar Regexp_v7;
options{
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens{
DOT;
REPEAT;
RANGE;
NULL;
}
fragment
ZERO
: '0'
;
fragment
DIGIT
: '1'..'9'
;
fragment
EPSILON
: '#'
;
fragment
FI
: '%'
;
ID
: EPSILON
| FI
| 'a'..'z'
| 'A'..'Z'
;
NUMBER
: ZERO
| DIGIT (ZERO | DIGIT)*
;
WHITESPACE
: ('\r' | '\n' | ' ' | '\t' ) + {$channel = HIDDEN;}
;
list
: (reg_exp ';'!)*
;
term
: ID -> ID
| '('! reg_exp ')'!
;
repeat_exp
: term ('{' range_exp '}')+ -> ^(REPEAT term (range_exp)+)
| term -> term
;
range_exp
: NUMBER ',' NUMBER -> ^(RANGE NUMBER NUMBER)
| NUMBER (',') -> ^(RANGE NUMBER NULL)
| ',' NUMBER -> ^(RANGE NULL NUMBER)
| NUMBER -> ^(RANGE NUMBER NUMBER)
;
kleene_exp
: repeat_exp ('*'^)*
;
concat_exp
: kleene_exp (kleene_exp)+ -> ^(DOT kleene_exp (kleene_exp)+)
| kleene_exp -> kleene_exp
;
reg_exp
: concat_exp ('|'^ concat_exp)*
;
My next goal is to write down tree grammar code, which is able to simplify regular expressions (e.g. a|a -> a , etc.). I have done some coding (see text below), but I have troubles with defining rule that treats nodes as subtrees (in order to simplify following kind of expressions e.g.: (a|a)|(a|a) to a, etc.)
tree grammar Regexp_v7Walker;
options{
language = Java;
tokenVocab = Regexp_v7;
ASTLabelType = CommonTree;
output=AST;
backtrack = true;
}
tokens{
NULL;
}
bottomup
: ^('*' ^('*' e=.)) -> ^('*' $e) //a** -> a*
| ^('|' i=.* j=.* {$i.tree.toStringTree() == $j.tree.toStringTree()} )
-> $i // There are 3 errors while this line is up and running:
// 1. CommonTree cannot be resolved,
// 2. i.tree cannot be resolved or is not a field,
// 3. i cannot be resolved.
;
Small driver class:
public class Regexp_Test_v7 {
public static void main(String[] args) throws RecognitionException {
CharStream stream = new ANTLRStringStream("a***;a|a;(ab)****;ab|ab;ab|aa;");
Regexp_v7Lexer lexer = new Regexp_v7Lexer(stream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
Regexp_v7Parser parser = new Regexp_v7Parser(tokenStream);
list_return list = parser.list();
CommonTree t = (CommonTree) list.getTree();
System.out.println("Original tree: " + t.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
Regexp_v7Walker s = new Regexp_v7Walker(nodes);
t = (CommonTree)s.downup(t);
System.out.println("Simplified tree: " + t.toStringTree());
Can anyone help me with solving this case?
Thanks in advance and regards.
Now, I'm no expert, but in your tree grammar:
add filter=true
change the second line of bottomup rule to:
^('|' i=. j=. {i.toStringTree().equals(j.toStringTree()) }? ) -> $i }
If I'm not mistaken by using i=.* you're allowing i to be non-existent and you'll get a NullPointerException on conversion to a String.
Both i and j are of type CommonTree because you've set it up this way: ASTLabelType = CommonTree, so you should call i.toStringTree().
And since it's Java and you're comparing Strings, use equals().
Also to make the expression in curly brackets a predicate, you need a question mark after the closing one.

Recursively Processing Rules in ANTLR

Okay, for my third ANTLR question in two days:
My Grammar is meant to parse boolean statements, something like this:
AGE > 21 AND AGE < 35
Since this is a relatively simple grammar, I embedded the code rather than using an AST. The rule looks like this:
: a=singleEvaluation { $evalResult = $a.evalResult;}
(('AND') b=singleEvaluation {$evalResult = $evalResult && $b.evalResult;})+
{
// code
}
;
Now I need to implement order of operations using parenthesis, to parse something like this:
AGE >= 21 AND (DEPARTMENT=1000 OR DEPARTMENT=1001)
or even worse:
AGE >= 21 AND (DEPARTMENT=1000 OR (EMPID=1000 OR EMPID=1001))
Can anyone suggest a way to implement the recursion needed? I'd rather not switch to an AST at this late stage, and I'm still a relative noob at this.
Jason
Since some of your rules evaluate to a boolean, and others to an integer (or only compare integers), you'd best let your rules return a generic Object, and cast accordingly.
Here's a quick demo (including making a recursive call in case of parenthesized expressions):
grammar T;
#parser::members {
private java.util.Map<String, Integer> memory = new java.util.HashMap<String, Integer>();
}
parse
#init{
// initialize some test values
memory.put("AGE", 42);
memory.put("DEPARTMENT", 999);
memory.put("EMPID", 1001);
}
: expression EOF {System.out.println($text + " -> " + $expression.value);}
;
expression returns [Object value]
: logical {$value = $logical.value;}
;
logical returns [Object value]
: e1=equality {$value = $e1.value;} ( 'AND' e2=equality {$value = (Boolean)$value && (Boolean)$e2.value;}
| 'OR' e2=equality {$value = (Boolean)$value || (Boolean)$e2.value;}
)*
;
equality returns [Object value]
: r1=relational {$value = $r1.value;} ( '=' r2=relational {$value = $value.equals($r2.value);}
| '!=' r2=relational {$value = !$value.equals($r2.value);}
)*
;
relational returns [Object value]
: a1=atom {$value = $a1.value;} ( '>=' a2=atom {$value = (Integer)$a1.value >= (Integer)$a2.value;}
| '>' a2=atom {$value = (Integer)$a1.value > (Integer)$a2.value;}
| '<=' a2=atom {$value = (Integer)$a1.value <= (Integer)$a2.value;}
| '<' a2=atom {$value = (Integer)$a1.value < (Integer)$a2.value;}
)?
;
atom returns [Object value]
: INTEGER {$value = Integer.valueOf($INTEGER.text);}
| ID {$value = memory.get($ID.text);}
| '(' expression ')' {$value = $expression.value;}
;
INTEGER : '0'..'9'+;
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : ' ' {$channel=HIDDEN;};
Parsing the input "AGE >= 21 AND (DEPARTMENT=1000 OR (EMPID=1000 OR EMPID=1001))" would result in the following output:
AGE >= 21 AND (DEPARTMENT=1000 OR (EMPID=1000 OR EMPID=1001)) -> true
I would do it like this:
program : a=logicalExpression {System.out.println($a.evalResult);}
;
logicalExpression returns [boolean evalResult] : a=andExpression { $evalResult = $a.evalResult;} (('OR') b=andExpression {$evalResult = $evalResult || $b.evalResult;})*
;
andExpression returns [boolean evalResult] : a=atomicExpression { $evalResult = $a.evalResult;} (('AND') b=atomicExpression {$evalResult = $evalResult && $b.evalResult;})*
;
atomicExpression returns [boolean evalResult] : a=singleEvaluation {$evalResult = $a.evalResult;}
| '(' b=logicalExpression ')' {$evalResult = $b.evalResult;}
;
singleEvaluation returns [boolean evalResult ] : 'TRUE' {$evalResult = true;}
| 'FALSE' {$evalResult = false;}
;

lexer that takes "not" but not "not like"

I need a small trick to get my parser completely working.
I use antlr to parse boolean queries.
a query is composed of elements, linked together by ands, ors and nots.
So I can have something like :
"(P or not Q or R) or (( not A and B) or C)"
Thing is, an element can be long, and is generally in the form :
a an_operator b
for example :
"New-York matches NY"
Trick, one of the an_operator is "not like"
So I would like to modify my lexer so that the not checks that there is no like after it, to avoid parsing elements containing "not like" operators.
My current grammar is here :
// save it in a file called Logic.g
grammar Logic;
options {
output=AST;
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: orexp
;
orexp
: andexp ('or'^ andexp)* // make `or` the root
;
andexp
: notexp ('and'^ notexp)* // make `and` the root
;
notexp
: 'not'^ atom // make `not` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` andexp `)`
;
// lexer/terminal rules start with an upper case letter
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
Any help would be appreciated.
Thanks !
Here's a possible solution:
grammar Logic;
options {
output=AST;
}
tokens {
NOT_LIKE;
}
parse
: expression EOF!
;
expression
: orexp
;
orexp
: andexp (Or^ andexp)*
;
andexp
: fuzzyexp (And^ fuzzyexp)*
;
fuzzyexp
: (notexp -> notexp) ( Matches e=notexp -> ^(Matches $fuzzyexp $e)
| Not Like e=notexp -> ^(NOT_LIKE $fuzzyexp $e)
| Like e=notexp -> ^(Like $fuzzyexp $e)
)?
;
notexp
: Not^ atom
| atom
;
atom
: ID
| '('! expression ')'!
;
And : 'and';
Or : 'or';
Not : 'not';
Like : 'like';
Matches : 'matches';
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
which will parse the input "A not like B or C like D and (E or not F) and G matches H" into the following AST:

Categories

Resources