I'm learning ANTLR4 in order to write a language for basic arithmetic. So far, I have written an ANTLR4 grammar for the basic arithmetic operators * + - /.
Here is my grammar:
grammar Expr; // rename to distinguish from Expr.g4
prog: stat (';' stat)* ;
stat: ID '=' expr (';'|',')? # assign
| expr (';')? # printExpr
;
expr: op=('-'|'+') expr # signed
| expr op=('*'|'/') expr # MulDiv
| expr op=('+'|'-') expr # AddSub
| ID # id
| DOUBLE # Double
| '(' expr ')' # parens
;
MUL : '*' ; // assigns token name to '*' used above in grammar
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
ID : [a-zA-Z]+ [0-9]* ; // match identifiers
DOUBLE : [0-9]+ ('.' [0-9]+)? ;
WS : [ \t\r\n]+ -> skip ;
The problem is that my grammar accepts inputs like 2++++3 because of the rule op=('-'|'+') expr. However, I haven't found another way to implement signed expressions such as -2 + 3, x = 6; y = -x, +3 -2.
How can I fix this bug?
Try breaking up your grammar; right now expr is a bit of a monster rule. You probably don't want to sign an entire expression, but rather a single value. How about something like this:
expr: add value
| expr mult expr
| expr add expr
| value
;
value: ID
| DOUBLE
| '(' expr ')'
;
add: '+' | '-';
mult: '*' | '/';
This way, you can build signed expressions like -2, +x or -(2+3), but not 2+++3 or 2++++3. Note that 2++3 still parses, as 2 + (+3), because each operand may carry one sign.
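To check which inputs a grammar of this shape accepts, here is a minimal hand-written recognizer in Java that mirrors it. This is only a sketch and not what ANTLR generates: the class name SignedExprRecognizer and its methods are made up for illustration, whitespace is simply stripped, and a value is lexed loosely as any run of letters, digits and dots. Because a sign may precede each value once, 2++3 is read as 2 + (+3), while 2+++3 and 2++++3 are rejected.

```java
// Hypothetical recognizer mirroring:
//   expr  : signed (('+'|'-'|'*'|'/') signed)* ;
//   signed: ('+'|'-')? value ;
//   value : ID | DOUBLE | '(' expr ')' ;
// It only answers "does this input parse?", it does not evaluate anything.
final class SignedExprRecognizer {
    private final String src;
    private int pos = 0;

    private SignedExprRecognizer(String input) {
        // Simplification: drop all whitespace instead of tokenizing properly.
        this.src = input.replaceAll("\\s+", "");
    }

    static boolean accepts(String input) {
        SignedExprRecognizer r = new SignedExprRecognizer(input);
        return r.expr() && r.pos == r.src.length();
    }

    private boolean expr() {
        if (!signedValue()) return false;
        while (peekIs("+-*/")) {
            pos++;                       // consume the binary operator
            if (!signedValue()) return false;
        }
        return true;
    }

    private boolean signedValue() {
        if (peekIs("+-")) pos++;         // at most one sign per value
        return value();
    }

    private boolean value() {
        if (peekIs("(")) {
            pos++;
            if (!expr() || !peekIs(")")) return false;
            pos++;
            return true;
        }
        int start = pos;                 // loose ID / DOUBLE lexing
        while (pos < src.length()
                && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        return pos > start;
    }

    private boolean peekIs(String chars) {
        return pos < src.length() && chars.indexOf(src.charAt(pos)) >= 0;
    }
}
```

For example, SignedExprRecognizer.accepts("-(2+3)") returns true and SignedExprRecognizer.accepts("2++++3") returns false.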
I am making a simple programming language. It has the following grammar:
program: declaration+;
declaration: varDeclaration
| statement
;
varDeclaration: 'var' IDENTIFIER ('=' expression)?';';
statement: exprStmt
| assertStmt
| printStmt
| block
;
exprStmt: expression';';
assertStmt: 'assert' expression';';
printStmt: 'print' expression';';
block: '{' declaration* '}';
//expression without left recursion
/*
expression: assignment
;
assignment: IDENTIFIER '=' assignment
| equality;
equality: comparison (op=('==' | '!=') comparison)*;
comparison: addition (op=('>' | '>=' | '<' | '<=') addition)* ;
addition: multiplication (op=('-' | '+') multiplication)* ;
multiplication: unary (op=( '/' | '*' ) unary )* ;
unary: op=( '!' | '-' ) unary
| primary
;
*/
//expression with left recursion
expression: IDENTIFIER '=' expression
| expression op=('==' | '!=') expression
| expression op=('>' | '>=' | '<' | '<=') expression
| expression op=('-' | '+') expression
| expression op=( '/' | '*' ) expression
| op=( '!' | '-' ) expression
| primary
;
primary: intLiteral
| booleanLiteral
| stringLiteral
| identifier
| group
;
intLiteral: NUMBER;
booleanLiteral: value=('True' | 'False');
stringLiteral: STRING;
identifier: IDENTIFIER;
group: '(' expression ')';
TRUE: 'True';
FALSE: 'False';
NUMBER: [0-9]+ ;
STRING: '"' ~('\n'|'"')* '"' ;
IDENTIFIER : [a-zA-Z]+ ;
This left recursive grammar is useful because it ensures every node in the parse tree has at most 2 children. For example,
var a = 1 + 2 + 3 will turn into two nested addition expressions, rather than one addition expression with three children. That behavior is useful because it makes writing an interpreter easy, since I can just do (highly simplified):
public Object visitAddition(AdditionContext ctx) {
    // Highly simplified: assumes both operands evaluate to Integer
    return (Integer) visit(ctx.addition(0)) + (Integer) visit(ctx.addition(1));
}
instead of iterating through all the child nodes.
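For comparison, evaluating the flat form produced by a rule such as addition: multiplication (op=('-' | '+') multiplication)* only needs a left-to-right fold over the children. The sketch below shows that fold on plain lists; the class FlatAdditionFold is hypothetical and assumes the operands have already been evaluated to integers, so it is not tied to any generated ANTLR context class.

```java
import java.util.List;

// Hypothetical helper: folds "a op b op c ..." left to right,
// which gives the same result as two nested binary nodes.
final class FlatAdditionFold {
    static int evaluate(List<Integer> operands, List<String> operators) {
        int result = operands.get(0);
        for (int i = 0; i < operators.size(); i++) {
            int rhs = operands.get(i + 1);
            // Only '+' and '-' occur at this precedence level.
            result = operators.get(i).equals("+") ? result + rhs : result - rhs;
        }
        return result;
    }
}
```

Folding from the left keeps subtraction left-associative, so 10 - 4 - 3 is evaluated as (10 - 4) - 3.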
However, this left recursive grammar has one flaw, which is that it accepts invalid statements.
For example:
var a = 3;
var b = 4;
a = b == b = a;
is valid under this grammar even though the expected behavior would be
b == b is parsed first since == has higher precedence than assignment (=).
Because b == b is parsed first, the expression becomes incoherent. Parsing fails.
Instead, the following undesired behavior occurs: the final line is parsed as (a = b) == (b = a).
How can I prevent left recursion from parsing incoherent statements, such as a = b == b = a?
The non-left-recursive grammar recognizes that this input is incorrect and throws a parsing error, which is the desired behavior.
I apologize for the extremely long explanation, but I have been stuck for a month now and I really can't figure out how to solve this.
As a project, I have to write a compiler with ANTLR4 for a regex grammar. Given a regex as input, the compiler must generate a Java program that can decide whether a word belongs to the language generated by that regex.
The grammar that we have to use is this one:
RE ::= union | simpleRE
union ::= simpleRE + RE
simpleRE ::= concatenation | basicRE
concatenation ::= basicRE simpleRE
basicRE ::= group | any | char
group ::= (RE) | (RE)∗ | (RE)+
any ::= ?
char ::= a | b | c | ··· | z | A | B | C | ··· | Z | 0 | 1 | 2 | ··· | 9 | . | − | _
and from that, I gave this grammar to ANTLR4:
Regxp.g4
grammar Regxp;
start_rule
: re # start
;
re
: union
| simpleRE
;
union
: simpleRE '+' re # unionOfREs
;
simpleRE
: concatenation
| basicRE
;
concatenation
: basicRE simpleRE #concatOfREs
;
basicRE
: group
| any
| cHAR
;
group
: LPAREN re RPAREN '*' # star
| LPAREN re RPAREN '+' # plus
| LPAREN re RPAREN # singleWithParenthesis
;
any
: '?'
;
cHAR
: CHAR #singleChar
;
WS : [ \t\r\n]+ -> skip ;
LPAREN : '(' ;
RPAREN : ')' ;
CHAR : LETTER | DIGIT | DOT | D | UNDERSCORE
;
/* tokens */
fragment LETTER: [a-zA-Z]
;
fragment DIGIT: [0-9]
;
fragment DOT: '.'
;
fragment D: '-'
;
fragment UNDERSCORE: '_'
;
Then I generated the Java files from ANTLR4, with visitors.
As far as I understand the logic of the project, while the visitor traverses the parse tree it has to generate lines of code that fill the transition table of the NFA obtained by applying the Thompson rules to the input regexp.
These lines of code are then saved as a .java text file and compiled into a program that takes a string (word) as input and tells whether or not the word belongs to the language generated by the regex.
The result should be like this:
RE word Result
a+b a OK
b OK
ac KO
a∗b aab OK
b OK
aaaab OK
abb KO
So I'm asking: how can I represent the transition table so that it can be filled during the visit of the parse tree and then exported for use by a simple Java program implementing the acceptance algorithm for an NFA? (I'm considering this pseudo-code):
S = ε−closure(s0);
c = nextChar();
while (c ≠ eof) do
S = ε−closure(move(S,c));
c = nextChar();
end while
if (S ∩ F ≠ ∅) then return “yes”;
else return “no”;
end if
So far I have managed to make the visitor do something like this when it is, for example, in the unionOfREs rule:
MyVisitor.java
private List<String> generatedCode = new ArrayList<String>();
/* ... */
@Override
public String visitUnionOfREs(RegxpParser.UnionOfREsContext ctx) {
    System.out.println("unionOfRExps");
    String char1 = visit(ctx.simpleRE());
    String char2 = visit(ctx.re());
    generatedCode.add("tTable.addUnion("+char1+","+char2+");");
    // then this line of code will populate the transition table
    return char1+"+"+char2;
}
/* ... */
The addUnion method lives in a Java file that will contain all the methods for filling the transition table. I wrote code for the union, but I don't like it because it amounts to writing out the transition table of the NFA the way you would on paper (example).
I arrived at this when I noticed that, by building the table iteratively, you can define two "pointers" on the table, currentBeginning and currentEnd, that tell you where to expand the character already written in the table with the next rule the visitor finds in the parse tree. This is needed because that character can be another production or just a single character. The link shows the written-on-paper example that convinced me to use this approach.
TransitionTable.java
/* ... */
public void addUnion(String char1, String char2) {
    // An empty list means "no transition from this state".
    // (Arrays.asList(null) would throw a NullPointerException; needs java.util.Collections.)
    List<Integer> none = Collections.emptyList();
    if (transitionTable.isEmpty()) {
        List<List<Integer>> lc1 = Arrays.asList(none,
                Arrays.asList(currentBeginning+3), none, none, none, none);
        List<List<Integer>> lc2 = Arrays.asList(none, none,
                Arrays.asList(currentBeginning+4), none, none, none);
        List<List<Integer>> le = Arrays.asList(
                Arrays.asList(currentBeginning+1, currentBeginning+2),
                none,
                none,
                Arrays.asList(currentBeginning+5),
                Arrays.asList(currentBeginning+5),
                none);
        transitionTable.put(char1, lc1);
        transitionTable.put(char2, lc2);
        transitionTable.put("epsilon", le);
        //currentBeginning += 2;
        //currentEnd = transitionTable.get(char2).get(currentBeginning).get(0);
        currentEnd = transitionTable.get("epsilon").size()-1; // i.e. 5
    } else { // not the first time it encounters this rule, beginning and end changed
        // needs to add 2 fewer states
    }
}
/* ... */
At the moment I'm representing the transition table as a HashMap<String, List<List<Integer>>>. The strings are the characters on the edges of the NFA, and the values are List<List<Integer>> because, the automaton being non-deterministic, a single state can have several transitions.
But going this way, for a parse tree like this I will obtain this line of code for the union: "tTable.addUnion("tTable.addConcat(a,b)","+char2+");"
And I'm stuck here. I don't know how to solve this, and I really can't think of a different way to represent the transition table or to fill it while visiting the parse tree.
Thank you.
Using Thompson's construction, every regular (sub-)expression produces an NFA, and every regular expression operator (union, cat, *) can be implemented by adding a couple of states and connecting them to states that already exist. See:
https://en.wikipedia.org/wiki/Thompson%27s_construction
So, when parsing the regex, every terminal or non-terminal production should add the required states and transitions to the NFA, and return its start and end state to the containing production. Non-terminal productions will combine their children and return their own start+end states so that your NFA can be built from the leaves of the regular expression up.
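The bottom-up construction described above could be sketched in Java roughly as follows. Everything here is hypothetical: ThompsonBuilder, Frag and Transition are illustrative names, states are plain integers, '\0' is assumed to mark an epsilon edge, and concatenation links the two fragments with an epsilon edge instead of merging states. In a visitor, each visit method would return a Frag (start and end state) to its parent instead of a String.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: builds NFA fragments bottom-up.
final class ThompsonBuilder {
    static final char EPSILON = '\0'; // assumption: '\0' marks an epsilon edge

    /** One edge of the NFA: from --label--> to. */
    static final class Transition {
        final int from; final char label; final int to;
        Transition(int from, char label, int to) { this.from = from; this.label = label; this.to = to; }
    }

    /** What every (sub-)expression returns to its parent. */
    static final class Frag {
        final int start; final int end;
        Frag(int start, int end) { this.start = start; this.end = end; }
    }

    final List<Transition> transitions = new ArrayList<>();
    private int stateCount = 0;

    int stateCount() { return stateCount; }
    private int newState() { return stateCount++; }
    private void edge(int from, char label, int to) { transitions.add(new Transition(from, label, to)); }

    Frag literal(char c) {               // s --c--> e
        int s = newState(), e = newState();
        edge(s, c, e);
        return new Frag(s, e);
    }

    Frag concat(Frag a, Frag b) {        // a.end --eps--> b.start
        edge(a.end, EPSILON, b.start);
        return new Frag(a.start, b.end);
    }

    Frag union(Frag a, Frag b) {         // new start and end around both branches
        int s = newState(), e = newState();
        edge(s, EPSILON, a.start);
        edge(s, EPSILON, b.start);
        edge(a.end, EPSILON, e);
        edge(b.end, EPSILON, e);
        return new Frag(s, e);
    }

    Frag star(Frag a) {                  // zero or more repetitions
        int s = newState(), e = newState();
        edge(s, EPSILON, a.start);
        edge(s, EPSILON, e);
        edge(a.end, EPSILON, a.start);
        edge(a.end, EPSILON, e);
        return new Frag(s, e);
    }
}
```

Existing states and edges are never modified, only new ones are appended, so the order in which the visitor reaches the sub-expressions does not matter. Building a+b (union of two literals) this way yields six states, the same size as the table in the question.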
The representation of the state table is not critical for building. Thompson's construction will never require you to modify a state or transition that you built before, so you just need to be able to add new ones. You will also never need more than one transition from a state on the same character, or even more than one non-epsilon transition. In fact, if all your operators are binary you will never need more than 2 transitions on a state. Usually the representation is designed to make it easy to do the next steps, like DFA generation or direct execution of the NFA against strings.
For example, a class like this can completely represent a state:
class State
{
    public char matchChar;
    public State matchState; // where to go if you match matchChar, or null
    public State epsilon1;   // or null
    public State epsilon2;   // or null
}
This would actually be a pretty reasonable representation for directly executing an NFA. But if you already have code for directly executing an NFA, then you should probably just build whatever it uses so you don't have to do another transformation.
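As a rough illustration of direct execution, the sketch below runs the ε-closure/move loop from the question's pseudocode over states of that shape. The NfaRunner class and its method names are assumptions made for this example, the State class is repeated only so the block compiles on its own, the '?' wildcard is not handled, and the NFA for a*b is wired by hand rather than produced by a parser.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical runner for NFAs made of State objects.
final class NfaRunner {
    static final class State {
        char matchChar;
        State matchState; // where to go on matchChar, or null
        State epsilon1;   // or null
        State epsilon2;   // or null
    }

    /** All states reachable from the given set through epsilon edges only. */
    static Set<State> closure(Set<State> states) {
        Set<State> result = new HashSet<>(states);
        Deque<State> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            State s = work.pop();
            for (State next : new State[] { s.epsilon1, s.epsilon2 }) {
                if (next != null && result.add(next)) work.push(next);
            }
        }
        return result;
    }

    /** S = closure(s0); for each char: S = closure(move(S, c)); accept if the final state is in S. */
    static boolean accepts(State start, State accept, String word) {
        Set<State> current = closure(Set.of(start));
        for (char c : word.toCharArray()) {
            Set<State> moved = new HashSet<>();
            for (State s : current) {
                if (s.matchState != null && s.matchChar == c) moved.add(s.matchState);
            }
            current = closure(moved);
        }
        return current.contains(accept);
    }

    /** Hand-wired Thompson NFA for a*b, used only to exercise accepts(). */
    static boolean matchesAStarB(String word) {
        State q0 = new State(), q1 = new State(), q2 = new State();
        State q3 = new State(), q4 = new State(), q5 = new State();
        q0.epsilon1 = q1; q0.epsilon2 = q3;      // enter the loop or skip it
        q1.matchChar = 'a'; q1.matchState = q2;  // the literal a
        q2.epsilon1 = q1; q2.epsilon2 = q3;      // repeat or leave the loop
        q3.epsilon1 = q4;                        // concatenation with b
        q4.matchChar = 'b'; q4.matchState = q5;  // the literal b
        return accepts(q0, q5, word);
    }
}
```

With this wiring, matchesAStarB("aab") and matchesAStarB("b") return true, while matchesAStarB("abb") returns false, matching the expected results listed in the question.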
I would like to return an ExprData. ExprData is a class inside my project. When I try to compile the grammar I get:
SASGrammarParser.java:684: error: cannot find symbol
It looks like an import problem. Also, how do I instantiate the ExprData?
expr returns [ExprData exprData]
: expr AND expr #AndExpr
| expr OR expr #OrExpr
| expr IN '(' constant_list ')' #InExpr
| expr (EQ | ASSIGN) expr #EqualExpr
| expr op=(MULT | DIV) expr #DivMultExpr
| expr op=(PLUS | MINUS) expr #PlusMinusExpr
| expr LTEQ expr #LessEqualExpr
| expr LT expr #LessExpr
| expr GT expr #GreaterExpr
| expr GTEQ expr #GreaterEqualExpr
| '-' expr #MinusExpr
| '(' expr ')' #SimpleExpr
| variable #VariableExp
| constant #ConstantExp
| function #FunctionExp
;
If you want to use a class in the grammar (and therefore in the generated parser), you need to import it in the grammar with
@parser::header {
    import packageName.ExprData;
}
I'm not sure what you mean by "how to instantiate". exprData is the return variable here, so you can assign to it from an action by referring to it as $exprData. Just off the top of my head (the action has to come before the alternative label, since the label must end the alternative):
expr OR expr {$exprData = someFunctionThatReturnsExprDataObject();} #OrExpr
I keep getting MissingTokenException, NullPointerException and, if I remember correctly, NoViableAltException. The log file and console output from ANTLRWorks are not helpful enough for me.
What I'm after is a rewrite such as the following:
(expression | FLOAT) '(' -> (expression | FLOAT) '*('
Here below is a sample of my grammar that I snatched out to create a test file with.
grammar Test;
expression
: //FLOAT '(' -> (FLOAT '*(')+
| add EOF!
;
term
:
| '(' add ')'
| FLOAT
| IMULT
;
IMULT
: (add ('(' add)*) -> (add ('*' add)*)
;
negation
: '-'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/') unary)*
;
add
: mult (('+' | '-') mult)*
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')*// EXPONENT?
| '.' ('0'..'9')+ //EXPONENT?
| ('0'..'9')+ //EXPONENT
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
I've also tried :
imult
: FLOAT '(' -> FLOAT '*('
;
And this:
IMULT / imult
: expression '(' -> expression '*'
;
As well as countless other versions (hacks) that I have lost count of.
Can anyone help me out with this ?
I've run into this problem before. The basic answer is that ANTLR doesn't allow you to use tokens on the right-hand side of a '->' rewrite that weren't present on the left-hand side. However, what you can do is use extra tokens defined specifically for ASTs.
Just create a tokens block before the grammar rules as follows:
tokens { ABSTRACTTOKEN; }
You can use them on the right hand side of the grammar statement like this.
imult
: FLOAT '(' -> ^(ABSTRACTTOKEN FLOAT)
;
Hope that helps.
I'd like to parse an UTF8 encoded text file that may contain something like this:
int 1
text " some text with \" and \\ "
int list[-45,54, 435 ,-65]
float list [ 4.0, 5.2,-5.2342e+4]
The numbers in the list are separated by commas. Whitespace is permitted but not required between any number and any symbol like commas and brackets here. Similarly for words and symbols, like in the case of list[
I've done the quoted string reading by forcing Scanner to give me single chars (setting its delimiter to an empty pattern) because I still thought it'll be useful for reading the ints and floats, but I'm not sure anymore.
The Scanner always takes a complete token and then tries to match it. What I need is try to match as much (or as little) as possible, disregarding delimiters.
Basically for this input
int list[-45,54, 435 ,-65]
I'd like to be able to call and get this
s.nextWord() // int
s.nextWord() // list
s.nextSymbol() // [
s.nextInt() // -45
s.nextSymbol() // ,
s.nextInt() // 54
s.nextSymbol() // ,
s.nextInt() // 435
s.nextSymbol() // ,
s.nextInt() // -65
s.nextSymbol() // ]
and so on.
Or, if it couldn't parse doubles and other types itself, at least a method that takes a regex, returns the biggest string that matches it (or an error) and sets the stream position to just after what it matched.
Can the Scanner somehow be used for this? Or is there another approach? I feel this must be quite a common thing to do, but I don't seem to be able to find the right tool for it.
I'm not an ANTLR expert, but this ANTLR grammar can parse your input:
grammar Expressions;
expressions
: expression+ EOF
;
expression
: intExpression
| intListExpression
| floatExpression
| floatListExpression
| textExpression
| textListExpression
;
intExpression : intType INT;
intListExpression : intType listType '[' ( INT (',' INT)* )? ']';
floatExpression : floatType FLOAT;
floatListExpression : floatType listType '[' ( (INT|FLOAT) (',' (INT|FLOAT))* )? ']';
textExpression : textType STRING;
textListExpression : textType listType '[' ( STRING (',' STRING)* )? ']';
intType : 'int';
floatType : 'float';
textType : 'text';
listType : 'list';
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
Of course you will need to improve it, but I think that with this structure it is easy to insert code into the parser to do what you want (a kind of token stream). Try it in the ANTLRWorks debugger to see what happens.
For your input, this is the parse tree:
Edit: I changed it to support empty lists.
Initialize the Scanner with the file in the class constructor. Then, for the nextWord method, do this:
public static String nextWord() {
    return sc.findInLine("\\w+");
}
You can derive the other methods from this example by using the findInLine method of the Scanner class with a different regex pattern.
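The question also asks for a method that takes a regex, returns what it matches at the current position, and moves the position to just after the match. A minimal sketch of that idea, built on java.util.regex rather than Scanner, is shown below. The class RegexCursor and its patterns are assumptions made for illustration; it reads the whole input as a CharSequence, skips whitespace before each token, and relies on greedy matching rather than a true longest-match search.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a cursor that matches regexes exactly at the current position.
final class RegexCursor {
    private static final Pattern WHITESPACE = Pattern.compile("\\s*");
    private static final Pattern WORD = Pattern.compile("[A-Za-z_]\\w*");
    private static final Pattern INT = Pattern.compile("[+-]?\\d+");
    private static final Pattern SYMBOL = Pattern.compile("[\\[\\],]");

    private final CharSequence input;
    private int pos = 0;

    RegexCursor(CharSequence input) {
        this.input = input;
    }

    /** Matches the pattern at the current position (after whitespace) and advances past it. */
    String next(Pattern pattern) {
        Matcher ws = WHITESPACE.matcher(input).region(pos, input.length());
        if (ws.lookingAt()) pos = ws.end();
        Matcher m = pattern.matcher(input).region(pos, input.length());
        if (!m.lookingAt()) {
            throw new IllegalStateException("No match for " + pattern + " at offset " + pos);
        }
        pos = m.end();
        return m.group();
    }

    String nextWord() { return next(WORD); }
    String nextSymbol() { return next(SYMBOL); }
    int nextInt() { return Integer.parseInt(next(INT)); }
}
```

On the input int list[-45,54, 435 ,-65], successive calls to nextWord(), nextWord(), nextSymbol(), nextInt() return int, list, [ and -45. Floats and quoted strings could be added as further patterns passed to next.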