Issues with ANTLR rewrite statement (simple?)

Issues with ANTLR rewrite statement (simple?) - java

I keep getting MissingTokenException, NullPointerException, and if I remember correctly NoViableAlterativeException. The logfile / console output from ANTLRWorks is not helpful enough for me.
What I'm after is a rewrite such as the following:
(expression | FLOAT) '(' -> (expression | FLOAT) '*('
Here below is a sample of my grammar that I snatched out to create a test file with.
grammar Test;
expression
: //FLOAT '(' -> (FLOAT '*(')+
| add EOF!
;
term
:
| '(' add ')'
| FLOAT
| IMULT
;
IMULT
: (add ('(' add)*) -> (add ('*' add)*)
;
negation
: '-'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/') unary)*
;
add
: mult (('+' | '-') mult)*
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')*// EXPONENT?
| '.' ('0'..'9')+ //EXPONENT?
| ('0'..'9')+ //EXPONENT
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
I've also tried :
imult
: FLOAT '(' -> FLOAT '*('
;
And this:
IMULT / imult
: expression '(' -> expression '*'
;
As well as countless other versions (hacks) that I have lost count of.
Can anyone help me out with this ?

I've run into this problem before. The basic answer is that ANTLR doesn't allow you to use tokens on the right hand side of a '->' statement that weren't present on the left hand side. However, what you can do is use extra tokens defined specifically for AST's.
Just create a tokens block before the grammar rules as follows:
tokens { ABSTRACTTOKEN; }
You can use them on the right hand side of the grammar statement like this.
imult
: FLOAT '(' -> ^(ABSTRACTTOKEN FLOAT)
;
Hope that helps.

Related

How to fix the error in left-recursion used with semantic predicates?

I would like to parse two type of expression with boolean :
- the first would be an init expression with boolean like : init : false
- and the last one would be a derive expression with boolean like : derive : !express or (express and (amount >= 100))
My idea is to put semantic predicates in a set of rules,
the goal is when I'm parsing a boolean expression beginning with the word 'init' then it has to go to only one alternative rule proposed who is boolliteral, the last alternative in boolExpression. And if it's an expression beginning with the word 'derive' then it could have access to all alternatives of boolExpression.
I know that I could make two type of boolExpression without semantic predicates like boolExpressionInit and boolExpressionDerive... But I would like to try with my idea if it's could work with a only one boolExpression with semantic predicates.
Here's my grammar
grammar TestExpression;
#header
{
package testexpressionparser;
}
#parser::members {
int vConstraintType;
}
/* SYNTAX RULES */
textInput : initDefinition
| derDefinition ;
initDefinition : t=INIT {vConstraintType = $t.type;} ':' boolExpression ;
derDefinition : t=DERIVE {vConstraintType = $t.type;} ':' boolExpression ;
boolExpression : {vConstraintType != INIT || vConstraintType == DERIVE}? boolExpression (boolOp|relOp) boolExpression
| {vConstraintType != INIT || vConstraintType == DERIVE}? NOT boolExpression
| {vConstraintType != INIT || vConstraintType == DERIVE}? '(' boolExpression ')'
| {vConstraintType != INIT || vConstraintType == DERIVE}? attributeName
| {vConstraintType != INIT || vConstraintType == DERIVE}? numliteral
| {vConstraintType == INIT || vConstraintType == DERIVE}? boolliteral
;
boolOp : OR | AND ;
relOp : EQ | NEQ | GT | LT | GEQT | LEQT ;
attributeName : WORD;
numliteral : intliteral | decliteral;
intliteral : INT ;
decliteral : DEC ;
boolliteral : BOOLEAN;
/* LEXICAL RULES */
INIT : 'init';
DERIVE : 'derive';
BOOLEAN : 'true' | 'false' ;
BRACKETSTART : '(' ;
BRACKETSTOP : ')' ;
BRACESTART : '{' ;
BRACESTOP : '}' ;
EQ : '=' ;
NEQ : '!=' ;
NOT : '!' ;
GT : '>' ;
LT : '<' ;
GEQT : '>=' ;
LEQT : '<=' ;
OR : 'or' ;
AND : 'and' ;
DEC : [0-9]* '.' [0-9]* ;
INT : ZERO | POSITIF;
ZERO : '0';
POSITIF : [1-9] [0-9]* ;
WORD : [a-zA-Z] [_0-9a-zA-Z]* ;
WS : (SPACE | NEWLINE)+ -> skip ;
SPACE : [ \t] ; /* Space or tab */
NEWLINE : '\r'? '\n' ; /* Carriage return and new line */
I except that the grammar would run successfully, but what i receive is : "error(119): TestExpression.g4::: The following sets of rules are mutually left-recursive [boolExpression]
1 error(s)
BUILD FAIL"

Apparently ANTLR4's support for (direct) left-recursion does not work when a predicate appears before a left-recursive rule invocation. So you can fix the error by moving the predicate after the first boolExpression in the left-recursive alternatives.
That said, it seems like the predicates aren't really necessary in the first place - at least not in the example you've shown us (or the one before your edit as far as I could tell). Since a boolExpression with the constraint type INIT can apparently only match boolLiteral, you can just change initDefinition as follows:
initDefinition : t=INIT ':' boolLiteral ;
Then boolExpression will always have the constraint type DERIVE and no predicates are necessary anymore.
Generally, if you want to allow different alternatives in non-terminal x based on whether it was invoked by y or z, you should simply have multiple versions of x and then call one from y and the other from z. That's usually a lot less hassle than littering the code with actions and predicates.
Similarly it can also make sense to have a rule that matches more than it should and then detect illegal expressions in a later phase instead of trying to reject them at the syntax level. Specifically beginners often try to write grammars that only allow well-typed expressions (rejecting something like 1+true with a syntax error) and that never works out well.

Prevent left recursion in ANTLR 4 from matching invalid inputs

I am making a simple programming language. It has the following grammar:
program: declaration+;
declaration: varDeclaration
| statement
;
varDeclaration: 'var' IDENTIFIER ('=' expression)?';';
statement: exprStmt
| assertStmt
| printStmt
| block
;
exprStmt: expression';';
assertStmt: 'assert' expression';';
printStmt: 'print' expression';';
block: '{' declaration* '}';
//expression without left recursion
/*
expression: assignment
;
assignment: IDENTIFIER '=' assignment
| equality;
equality: comparison (op=('==' | '!=') comparison)*;
comparison: addition (op=('>' | '>=' | '<' | '<=') addition)* ;
addition: multiplication (op=('-' | '+') multiplication)* ;
multiplication: unary (op=( '/' | '*' ) unary )* ;
unary: op=( '!' | '-' ) unary
| primary
;
*/
//expression with left recursion
expression: IDENTIFIER '=' expression
| expression op=('==' | '!=') expression
| expression op=('>' | '>=' | '<' | '<=') expression
| expression op=('-' | '+') expression
| expression op=( '/' | '*' ) expression
| op=( '!' | '-' ) expression
| primary
;
primary: intLiteral
| booleanLiteral
| stringLiteral
| identifier
| group
;
intLiteral: NUMBER;
booleanLiteral: value=('True' | 'False');
stringLiteral: STRING;
identifier: IDENTIFIER;
group: '(' expression ')';
TRUE: 'True';
FALSE: 'False';
NUMBER: [0-9]+ ;
STRING: '"' ~('\n'|'"')* '"' ;
IDENTIFIER : [a-zA-Z]+ ;
This left recursive grammar is useful because it ensures every node in the parse tree has at most 2 children. For example,
var a = 1 + 2 + 3 will turn into two nested addition expressions, rather than one addition expression with three children. That behavior is useful because it makes writing an interpreter easy, since I can just do (highly simplified):
public Object visitAddition(AdditionContext ctx) {
return visit(ctx.addition(0)) + visit(ctx.addition(1));
}
instead of iterating through all the child nodes.
However, this left recursive grammar has one flaw, which is that it accepts invalid statements.
For example:
var a = 3;
var b = 4;
a = b == b = a;
is valid under this grammar even though the expected behavior would be
b == b is parsed first since == has higher precedence than assignment (=).
Because b == b is parsed first, the expression becomes incoherent. Parsing fails.
Instead, the following undesired behavior occurs: the final line is parsed as (a = b) == (b = a).
How can I prevent left recursion from parsing incoherent statements, such as a = b == b = a?
The non-left-recursive grammar recognizes this input is correct and throws a parsing error, which is the desired behavior.

Antlr4 token ambiguity for single character

I have a problem with the rule mnemonic_format.
Instead to recognize a simple text like A100 it gives the following error :
mismatched input 'A100' expecting 'A'
The grammar is:
grammar SimpleMathGrammar;
INTEGER : [0-9]+;
FLOAT : [0-9]+ '.' [0-9]+;
ADD : '+';
SUB : '-';
DOT : '.';
AND : 'AND';
BACKSLASH : '\\';
fragment SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
VARIABLENAME
: SINGLELETTER
| (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
HASH : '#';
/* PARSER */
operation
: (INTEGER | FLOAT) ADD (INTEGER | FLOAT)
| (INTEGER | FLOAT) SUB (INTEGER | FLOAT);
operation_with_backslash : BACKSLASH operation BACKSLASH;
mnemonic: HASH VARIABLENAME HASH;
mnemonic_format
// Example: A100
: 'A' INTEGER;
At this point, i know that the token VARIABLENAME should not include the character A (correct me if im wrong)
So what can i do for include a single character (o fixed sequence) in distinct rule? (and which is my error?)
EDIT: I found the origin of the problem (by remove all of the other tokens and rules) in the following token case:
VARIABLENAME: (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
So how can i create a token or a lexer rule that give me the basic for detect some generic text (like a Class name or a Variable name) by also create rules where i must accept a fixed sequence of characters?

Ok,
The trick was the "general scope" of the token VARIABLENAME.
In other terms, the token is too much generic.
In my case the sub-condition VARIABLENAME: SINGLELETTER NUMBER* crash/collide with the condition mnemonic_format: 'A' INTEGER
(Indeed i can create the string A100 with VARIABLENAME or mnemonic_format and this create an ambiguity)
So i "specialize" VARIABLENAME for accept a prefix, for example:
VARIABLENAME
: HASH (SINGLELETTER|UNDERSCORE)(SINGLELETTER|UNDERSCORE|DOLLAR|NUMBER)*
| 'class ' (SINGLELETTER|UNDERSCORE)(SINGLELETTER|UNDERSCORE|DOLLAR|NUMBER)*
...
This should avoid an ambiguity between the token and the rule

Antlr4 grammar with basic arithmetic and signed expressions

I'm learning Antlr4 to write a language for basic arithmetics. Currently, I have written a grammar with Antlr4 for the basic arithmetic operators * + - /.
Here is my grammar:
grammar Expr; // rename to distinguish from Expr.g4
prog: stat (';' stat)* ;
stat: ID '=' expr (';'|',')? # assign
| expr (';')? # printExpr
;
expr: op=('-'|'+') expr # signed
| expr op=('*'|'/') expr # MulDiv
| expr op=('+'|'-') expr # AddSub
| ID # id
| DOUBLE # Double
| '(' expr ')' # parens
;
MUL : '*' ; // assigns token name to '*' used above in grammar
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
ID : [a-zA-Z]+ [0-9]* ; // match identifiers
DOUBLE : [0-9]+ ('.' [0-9]+)? ;
WS : [ \t\r\n]+ -> skip ;
The Problem is that my grammar accepts inputs like 2++++3 due to rule: op=('-'|'+') expr. However, I didn't find another way to implements signed expressions such as -2 + 3, x = 6; y = -x, +3 -2.
How can I fix the bug?

Try breaking up your grammar, now it is a bit of a monster rule (expr). You probably don't want to sign an entire expression, but rather a single value. How about something like this
expr: add value
| expr mult expr
| expr add expr
| value
;
value: ID
| DOUBLE
| '(' expr ')'
;
add: '+' | '-';
mult: '*' | '/';
This way, you can build signed expressions like -2, +x or -(2+3), but not 2++3.

Java - parsing text file - Scanner, Reader or something else?

I'd like to parse an UTF8 encoded text file that may contain something like this:
int 1
text " some text with \" and \\ "
int list[-45,54, 435 ,-65]
float list [ 4.0, 5.2,-5.2342e+4]
The numbers in the list are separated by commas. Whitespace is permitted but not required between any number and any symbol like commas and brackets here. Similarly for words and symbols, like in the case of list[
I've done the quoted string reading by forcing Scanner to give me single chars (setting its delimiter to an empty pattern) because I still thought it'll be useful for reading the ints and floats, but I'm not sure anymore.
The Scanner always takes a complete token and then tries to match it. What I need is try to match as much (or as little) as possible, disregarding delimiters.
Basically for this input
int list[-45,54, 435 ,-65]
I'd like to be able to call and get this
s.nextWord() // int
s.nextWord() // list
s.nextSymbol() // [
s.nextInt() // -45
s.nextSymbol() // ,
s.nextInt() // 54
s.nextSymbol() // ,
s.nextInt() // 435
s.nextSymbol() // ,
s.nextInt() // -65
s.nextSymbol() // ]
and so on.
Or, if it couldn't parse doubles and other types itself, at least a method that takes a regex, returns the biggest string that matches it (or an error) and sets the stream position to just after what it matched.
Can the Scanner somehow be used for this? Or is there another approach? I feel this must be quite a common thing to do, but I don't seem to be able to find the right tool for it.

I'm not an ANTLR expert, but this ANTLR grammar is capable to parse your code:
grammar Expressions;
expressions
: expression+ EOF
;
expression
: intExpression
| intListExpression
| floatExpression
| floatListExpression
| textExpression
| textListExpression
;
intExpression : intType INT;
intListExpression : intType listType '[' ( INT (',' INT)* )? ']';
floatExpression : floatType FLOAT;
floatListExpression : floatType listType '[' ( (INT|FLOAT) (',' (INT|FLOAT))* )? ']';
textExpression : textType STRING;
textListExpression : textType listType '[' ( STRING (',' STRING)* )? ']';
intType : 'int';
floatType : 'float';
textType : 'text';
listType : 'list';
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
Of course you will need to improve it, but I think that with this structure is easy to insert code in the parser to do what you want (a kind of token stream). Try it in ANTLRWorks debug to see what happens.
For your input, this is the parse tree:
Edit: I changed it to support empty lists.

Initiate the scanner with the file in the class constructor. then for the nextWord Method, do this,
public static nextWord(){
return(sc.findInLine("\\w+"));
}
You can derive the code for other methods using the above example with the findInLine method of the Scanner class and changing the regex pattern.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Issues with ANTLR rewrite statement (simple?) - java

Related

How to fix the error in left-recursion used with semantic predicates?

Prevent left recursion in ANTLR 4 from matching invalid inputs

Antlr4 token ambiguity for single character

Antlr4 grammar with basic arithmetic and signed expressions

Java - parsing text file - Scanner, Reader or something else?

Categories

Resources