Here is the beginning of my lexer rules:
F_TEXT_START
: {! matchingFText}? 'f"' {matchingFText = true;}
;
F_TEXT_PH_ESCAPE
: {matchingFText && ! matchingFTextPh}? '{=/'
;
F_TEXT_PH_START
: {matchingFText && ! matchingFTextPh}? '{=' {matchingFTextPh = true;}
;
F_TEXT_PH_END
: {matchingFText && matchingFTextPh}? '}' {matchingFTextPh = false;}
;
F_TEXT_CHAR
: {matchingFText && ! matchingFTextPh}? (~('"' | '{')+ | '""' | '{' ~'=')
;
F_TEXT_END
: {matchingFText && ! matchingFTextPh}? '"' {matchingFText = false;}
;
IF
: {! matchingFText || matchingFTextPh}? 'if'
;
ELIF
: {! matchingFText || matchingFTextPh}? 'elif'
;
// Lots of other keywords
fragment LETTER
: ('A' .. 'Z' | 'a' .. 'z' | '_')
;
VARIABLE
: {! matchingFText || matchingFTextPh}? LETTER (LETTER | DIGIT)*
;
What I am doing is putting my formatted text not just like a normal text token but with a f before, but I add it to my parse tree, to be able to tell if there are errors while parsing (with just parser.start()). So a formatted text starts with f", finishes with a ", any " must be replaced by "", and can contain placeholders starting with {= and finishing with } but if you want to actually write {=, you'll have to replace it by {=/.
The problem is that in a normal formatted text content (not placeholder), the lexer started to mach not only F_TEXT_CHAR but other lexer rules too, like variables. What I did seems pretty dumb, I just put semantic predicates for every other rule to avoid them to be matched in a formatted text's content (but still in a placeholder).
Isn't there a better way ?
I'd use a lexical mode for this. To use lexical modes, you'll have to define separate lexer- and parser grammars. Here's a quick demo:
lexer grammar TestLexer;
F_TEXT_START
: 'f"' -> pushMode(F_TEXT)
;
VARIABLE
: LETTER (LETTER | DIGIT)*
;
F_TEXT_PH_ESCAPE
: '{=/'
;
F_TEXT_PH_END
: '}' -> popMode
;
SPACES
: [ \t\r\n]+ -> skip
;
fragment LETTER
: [a-zA-Z_]
;
fragment DIGIT
: [0-9]
;
mode F_TEXT;
F_TEXT_CHAR
: ~["{]+ | '""' | '{' ~'='
;
F_TEXT_PH_START
: '{=' -> pushMode(DEFAULT_MODE)
;
F_TEXT_END
: '"' -> popMode
;
Use the lexer in your parser like this:
parser grammar TestParser;
options {
tokenVocab=TestLexer;
}
// ...
If you now tokenise the input f"mu {=mu}" mu, you'd get the following tokens:
F_TEXT_START `f"`
F_TEXT_CHAR `mu `
F_TEXT_PH_START `{=`
VARIABLE `mu`
F_TEXT_PH_END `}`
F_TEXT_END `"`
VARIABLE `mu`
Related
I try to implement oopath syntax in a drools like rule, but I have some issues regarding the non variables oopaths. For example, here is what I try to generate in the when block:
when $rt : string(Variables
$yr1t : /path1/F
$yr2t : /path2/F
$yr3t : path3.path4/PATH5
$yr4t : path3
$yr5t : /path3
Conditions
$yr4t == $yr5t + 3
$yr3t != $yr2t
//FROM HERE IS THE PROBLEM:
$yr3t == p/path/f
$yr3t == /g/t
/path2/F[g==$yr1t]
)
The problem I am facing is that my grammar doesn't support this format and I don't know how to modify the existing one in order to support even the last 3 statements.
Here is what I've tried so far:
Model:
declarations+=Declaration*
;
Temp:
elementType=ElementType
;
ElementType:
typeName=('string'|'int'|'boolean');
Declaration:
Rule
;
#Override
terminal ID: ('^'|'$')('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
terminal PATH: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
Rule:
'Filter'
'rule' ruleDescription=STRING
'#specification' QualifiedNameSpecification
'ruleflow-group' ruleflowDescription=STRING
'when' (name += ID ':' atribute += Temp '(' 'Variables'?
//(varName += ID ':' QualifiedNameVariabilePath)*
(variablesList += Variable)*
'Conditions'?
(exp += EvalExpression)*
')'
)*
;
QualifiedNameSpecification: '(' STRING ')';
QualifiedNameVariabilePath: (('/'|'.')? PATH)* ;
ExpressionsModel:
elements += AbstractElement*;
AbstractElement:
Variable | EvalExpression ;
Variable:
//'var'
name=ID ':' QualifiedNameVariabilePath; //expression=Expression;
EvalExpression:
//'eval'
expression=Expression;
Expression: Or;
Or returns Expression:
And ({Or.left=current} "||" right=And)*
;
And returns Expression:
Equality ({And.left=current} "&&" right=Equality)*
;
Equality returns Expression:
Comparison (
{Equality.left=current} op=("=="|"!=")
right=Comparison
)*
;
Comparison returns Expression:
PlusOrMinus (
{Comparison.left=current} op=(">="|"<="|">"|"<")
right=PlusOrMinus
)*
;
PlusOrMinus returns Expression:
MulOrDiv (
({Plus.left=current} '+' | {Minus.left=current} '-')
right=MulOrDiv
)*
;
MulOrDiv returns Expression:
Primary (
{MulOrDiv.left=current} op=('*'|'/')
right=Primary
)*
;
Primary returns Expression:
'(' Expression ')' |
{Not} "!" expression=Primary |
Atomic
;
Atomic returns Expression:
{IntConstant} value=INT |
{StringConstant} value=STRING |
{BoolConstant} value=('true'|'false') |
{VariableRef} variable=[Variable]
;
EDIT: To make it more clear, the question is how do I modify my grammar in order to support oopath syntax without binding them to a variable, something like a temporary object.
I'm implementing a python interpreter using ANTLR4 like lexer and parser generator. I used the BNF defined at this link:
https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4.
However the implementation of indentation with the INDENT and DEDENT tokens within the lexer::members do not work when i define a compound statement.
For example if i define the following statement:
x=10
while x>2 :
print("hello")
x=x-3
So in the line when i reassign the value of x variable i should have an indentation error that i don't have in my currest state.
Should i edit something into the lexer code or what?
This is the BNF that i'm using with the lexer::members and the NEWLINE rules defined in the above link.
grammar python;
tokens { INDENT, DEDENT }
#lexer::members {
// A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>();
// The stack that keeps track of the indentation level.
private java.util.Stack<Integer> indents = new java.util.Stack<>();
// The amount of opened braces, brackets and parenthesis.
private int opened = 0;
// The most recently produced token.
private Token lastToken = null;
#Override
public void emit(Token t) {
super.setToken(t);
tokens.offer(t);
}
#Override
public Token nextToken() {
// Check if the end-of-file is ahead and there are still some DEDENTS expected.
if (_input.LA(1) == EOF && !this.indents.isEmpty()) {
// Remove any trailing EOF tokens from our buffer.
for (int i = tokens.size() - 1; i >= 0; i--) {
if (tokens.get(i).getType() == EOF) {
tokens.remove(i);
}
}
// First emit an extra line break that serves as the end of the statement.
this.emit(commonToken(pythonParser.NEWLINE, "\n"));
// Now emit as much DEDENT tokens as needed.
while (!indents.isEmpty()) {
this.emit(createDedent());
indents.pop();
}
// Put the EOF back on the token stream.
this.emit(commonToken(pythonParser.EOF, "<EOF>"));
//throw new Exception("indentazione inaspettata in riga "+this.getLine());
}
Token next = super.nextToken();
if (next.getChannel() == Token.DEFAULT_CHANNEL) {
// Keep track of the last token on the default channel.
this.lastToken = next;
}
return tokens.isEmpty() ? next : tokens.poll();
}
private Token createDedent() {
CommonToken dedent = commonToken(pythonParser.DEDENT, "");
dedent.setLine(this.lastToken.getLine());
return dedent;
}
private CommonToken commonToken(int type, String text) {
int stop = this.getCharIndex() - 1;
int start = text.isEmpty() ? stop : stop - text.length() + 1;
return new CommonToken(this._tokenFactorySourcePair, type, DEFAULT_TOKEN_CHANNEL, start, stop);
}
// Calculates the indentation of the provided spaces, taking the
// following rules into account:
//
// "Tabs are replaced (from left to right) by one to eight spaces
// such that the total number of characters up to and including
// the replacement is a multiple of eight [...]"
//
// -- https://docs.python.org/3.1/reference/lexical_analysis.html#indentation
static int getIndentationCount(String spaces) {
int count = 0;
for (char ch : spaces.toCharArray()) {
switch (ch) {
case '\t':
count += 8 - (count % 8);
break;
default:
// A normal space char.
count++;
}
}
return count;
}
boolean atStartOfInput() {
return super.getCharPositionInLine() == 0 && super.getLine() == 1;
}
}
parse
:( NEWLINE parse
| block ) EOF
;
block
: (statement NEWLINE?| functionDecl)*
;
statement
: assignment
| functionCall
| ifStatement
| forStatement
| whileStatement
| arithmetic_expression
;
assignment
: IDENTIFIER indexes? '=' expression
;
functionCall
: IDENTIFIER OPAREN exprList? CPAREN #identifierFunctionCall
| PRINT OPAREN? exprList? CPAREN? #printFunctionCall
;
arithmetic_expression
: expression
;
ifStatement
: ifStat elifStat* elseStat?
;
ifStat
: IF expression COLON NEWLINE INDENT block DEDENT
;
elifStat
: ELIF expression COLON NEWLINE INDENT block DEDENT
;
elseStat
: ELSE COLON NEWLINE INDENT block DEDENT
;
functionDecl
: DEF IDENTIFIER OPAREN idList? CPAREN COLON NEWLINE INDENT block DEDENT
;
forStatement
: FOR IDENTIFIER IN expression COLON NEWLINE INDENT block DEDENT elseStat?
;
whileStatement
: WHILE expression COLON NEWLINE INDENT block DEDENT elseStat?
;
idList
: IDENTIFIER (',' IDENTIFIER)*
;
exprList
: expression (COMMA expression)*
;
expression
: '-' expression #unaryMinusExpression
| '!' expression #notExpression
| expression '**' expression #powerExpression
| expression '*' expression #multiplyExpression
| expression '/' expression #divideExpression
| expression '%' expression #modulusExpression
| expression '+' expression #addExpression
| expression '-' expression #subtractExpression
| expression '>=' expression #gtEqExpression
| expression '<=' expression #ltEqExpression
| expression '>' expression #gtExpression
| expression '<' expression #ltExpression
| expression '==' expression #eqExpression
| expression '!=' expression #notEqExpression
| expression '&&' expression #andExpression
| expression '||' expression #orExpression
| expression '?' expression ':' expression #ternaryExpression
| expression IN expression #inExpression
| NUMBER #numberExpression
| BOOL #boolExpression
| NULL #nullExpression
| functionCall indexes? #functionCallExpression
| list indexes? #listExpression
| IDENTIFIER indexes? #identifierExpression
| STRING indexes? #stringExpression
| '(' expression ')' indexes? #expressionExpression
| INPUT '(' STRING? ')' #inputExpression
;
list
: '[' exprList? ']'
;
indexes
: ('[' expression ']')+
;
PRINT : 'print';
INPUT : 'input';
DEF : 'def';
IF : 'if';
ELSE : 'else';
ELIF : 'elif';
RETURN : 'return';
FOR : 'for';
WHILE : 'while';
IN : 'in';
NULL : 'null';
OR : '||';
AND : '&&';
EQUALS : '==';
NEQUALS : '!=';
GTEQUALS : '>=';
LTEQUALS : '<=';
POW : '**';
EXCL : '!';
GT : '>';
LT : '<';
ADD : '+';
SUBTRACT : '-';
MULTIPLY : '*';
DIVIDE : '/';
MODULE : '%';
OBRACE : '{' {opened++;};
CBRACE : '}' {opened--;};
OBRACKET : '[' {opened++;};
CBRACKET : ']' {opened--;};
OPAREN : '(' {opened++;};
CPAREN : ')' {opened--;};
SCOLON : ';';
ASSIGN : '=';
COMMA : ',';
QMARK : '?';
COLON : ':';
BOOL
: 'true'
| 'false'
;
NUMBER
: INT ('.' DIGIT*)?
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]*
;
STRING
: ["] (~["\r\n] | '\\\\' | '\\"')* ["]
| ['] (~['\r\n] | '\\\\' | '\\\'')* [']
;
SKIPS
: ( SPACES | COMMENT | LINE_JOINING ){firstLine();} -> skip
;
NEWLINE
: ( {atStartOfInput()}? SPACES
| ( '\r'? '\n' | '\r' | '\f' ) SPACES?
)
{
String newLine = getText().replaceAll("[^\r\n\f]+", "");
String spaces = getText().replaceAll("[\r\n\f]+", "");
int next = _input.LA(1);
if (opened > 0 || next == '\r' || next == '\n' || next == '\f' || next == '#') {
// If we're inside a list or on a blank line, ignore all indents,
// dedents and line breaks.
skip();
}
else {
emit(commonToken(NEWLINE, newLine));
int indent = getIndentationCount(spaces);
int previous = indents.isEmpty() ? 0 : indents.peek();
if (indent == previous) {
// skip indents of the same size as the present indent-size
skip();
}
else if (indent > previous) {
indents.push(indent);
emit(commonToken(pythonParser.INDENT, spaces));
}
else {
// Possibly emit more than 1 DEDENT token.
while(!indents.isEmpty() && indents.peek() > indent) {
this.emit(createDedent());
indents.pop();
}
}
}
}
;
fragment INT
: [1-9] DIGIT*
| '0'
;
fragment DIGIT
: [0-9]
;
fragment SPACES
: [ \t]+
;
fragment COMMENT
: '#' ~[\r\n\f]*
;
fragment LINE_JOINING
: '\\' SPACES? ( '\r'? '\n' | '\r' | '\f' )
;
No, this should not be handled in the grammar. The lexer should simply emit the (faulty) INDENT token. The parser should, at runtime, produce an error. Something like this:
String source = "x=10\n" +
"while x>2 :\n" +
" print(\"hello\")\n" +
" x=x-3\n";
Python3Lexer lexer = new Python3Lexer(CharStreams.fromString(source));
Python3Parser parser = new Python3Parser(new CommonTokenStream(lexer));
// Remove default error-handling
parser.removeErrorListeners();
// Add custom error-handling
parser.addErrorListener(new BaseErrorListener() {
#Override
public void syntaxError(Recognizer<?, ?> recognizer, Object o, int i, int i1, String s, RecognitionException e) {
CommonToken token = (CommonToken) o;
if (token.getType() == Python3Parser.INDENT) {
// The parser encountered an unexpected INDENT token
// TODO throw your exception
}
// TODO handle other errors
}
});
// Trigger the error
parser.file_input();
I'm using ANTLR 4 to try and parse task definitions. The task definitions look a little like the following:
task = { priority = 10; };
My grammar file then looks like the following:
grammar TaskGrammar;
/* Parser rules */
task : 'task' ASSIGNMENT_OP block EOF;
logical_entity : (TRUE | FALSE) # LogicalConst
| IDENTIFIER # LogicalVariable
;
numeric_entity : DECIMAL # NumericConst
| IDENTIFIER # NumericVariable
;
block : LBRACE (statement)* RBRACE SEMICOLON;
assignment : IDENTIFIER ASSIGNMENT_OP DECIMAL SEMICOLON
| IDENTIFIER ASSIGNMENT_OP block SEMICOLON
| IDENTIFIER ASSIGNMENT_OP QUOTED_STRING SEMICOLON
| IDENTIFIER ASSIGNMENT_OP CONSTANT SEMICOLON;
functionCall : IDENTIFIER LPAREN (parameter)*? RPAREN SEMICOLON;
parameter : DECIMAL
| QUOTED_STRING;
statement : assignment
| functionCall;
/* Lexxer rules */
IF : 'if' ;
THEN : 'then';
AND : 'and' ;
OR : 'or' ;
TRUE : 'true' ;
FALSE : 'false' ;
MULT : '*' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '==' ;
ASSIGNMENT_OP : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
SEMICOLON : ';' ;
// DECIMAL, IDENTIFIER, COMMENTS, WS are set using regular expressions
DECIMAL : '-'?[0-9]+('.'[0-9]+)? ;
IDENTIFIER : [a-zA-Z_][a-zA-Z_0-9]* ;
Value: STR_EXT | QUOTED_STRING | SINGLE_QUOTED
;
STR_EXT
:
[a-zA-Z0-9_/\.,\-:=~+!?$&^*\[\]#|]+;
Comment
:
'#' ~[\r\n]*;
CONSTANT : StringCharacters;
QUOTED_STRING
:
'"' StringCharacters? '"'
;
fragment
StringCharacters
: (~["\\] | EscapeSequence)+
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]?
;
SINGLE_QUOTED
:
'\'' ~['\\]* '\'';
// COMMENT and WS are stripped from the output token stream by sending
// to a different channel 'skip'
COMMENT : '//' .+? ('\n'|EOF) -> skip ;
WS : [ \r\t\u000C\n]+ -> skip ;
This grammar compiles fine in ANTLR, but when it comes to trying to use the parser, I get the following error:
line 1:0 mismatched input 'task = { priority = 10; return = AND; };' expecting 'task'
org.antlr.v4.runtime.InputMismatchException
It looks like the parser isn't recognising the block part of the definition, but I can't quite see why. The block parse rule definition should match as far as I can tell. I would expect to have a TaskContext, with a child BlockContext containing a single AssignmentContext. I get the TaskContext, but it has the above exception.
Am I missing something here? This is my first attempt at using Antler, so may be getting confused between Lexxer and Parser rules...
Your STR_EXT consumes the entire input. That rule has to go: ANTLR's lexer will always try to match as much characters as possible.
I also see that CONSTANT might consume that entire input. It has to go to, or at least be changed to consume less chars.
I need a small trick to get my parser completely working.
I use antlr to parse boolean queries.
a query is composed of elements, linked together by ands, ors and nots.
So I can have something like :
"(P or not Q or R) or (( not A and B) or C)"
Thing is, an element can be long, and is generally in the form :
a an_operator b
for example :
"New-York matches NY"
Trick, one of the an_operator is "not like"
So I would like to modify my lexer so that the not checks that there is no like after it, to avoid parsing elements containing "not like" operators.
My current grammar is here :
// save it in a file called Logic.g
grammar Logic;
options {
output=AST;
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: orexp
;
orexp
: andexp ('or'^ andexp)* // make `or` the root
;
andexp
: notexp ('and'^ notexp)* // make `and` the root
;
notexp
: 'not'^ atom // make `not` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` andexp `)`
;
// lexer/terminal rules start with an upper case letter
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
Any help would be appreciated.
Thanks !
Here's a possible solution:
grammar Logic;
options {
output=AST;
}
tokens {
NOT_LIKE;
}
parse
: expression EOF!
;
expression
: orexp
;
orexp
: andexp (Or^ andexp)*
;
andexp
: fuzzyexp (And^ fuzzyexp)*
;
fuzzyexp
: (notexp -> notexp) ( Matches e=notexp -> ^(Matches $fuzzyexp $e)
| Not Like e=notexp -> ^(NOT_LIKE $fuzzyexp $e)
| Like e=notexp -> ^(Like $fuzzyexp $e)
)?
;
notexp
: Not^ atom
| atom
;
atom
: ID
| '('! expression ')'!
;
And : 'and';
Or : 'or';
Not : 'not';
Like : 'like';
Matches : 'matches';
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
which will parse the input "A not like B or C like D and (E or not F) and G matches H" into the following AST:
I'm trying to create a very simple grammar to learn to use ANTLR but I get the following message:
"The following alternatives can never be reached: 2"
This is my grammar attempt:
grammar Robot;
file : command+;
command : ( delay|type|move|click|rclick) ;
delay : 'wait' number ';';
type : 'type' id ';';
move : 'move' number ',' number ';';
click : 'click' ;
rclick : 'rlick' ;
id : ('a'..'z'|'A'..'Z')+ ;
number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n' ) { skip();} ;
I'm using ANTLRWorks plugin for IDEA:
The .. (range) inside parser rules means something different than inside lexer rules. Inside lexer rules, it means: "from char X to char Y", and inside parser rule it matches "from token M to token N". And since you made number a parser rule, it does not do what you think it does (and are therefor receiving an obscure error message).
The solution: make number a lexer rule instead (so, capitalize it: Number):
grammar Robot;
file : command+;
command : (delay | type | move | Click | RClick) ;
delay : 'wait' Number ';';
type : 'type' Id ';';
move : 'move' Number ',' Number ';';
Click : 'click' ;
RClick : 'rlick' ;
Id : ('a'..'z'|'A'..'Z')+ ;
Number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n') { skip();} ;
And as you can see, I also made id, click and rclick lexer rules instead. If you're not sure what the difference is between parser- and lexer rules, please say so and I'll add an explanation to this answer.