I am writing a lexer and parser to my own language to process operations on lists. I started with that:
list_Declaration : L_LIST L_ID ASSIGN LBRACE NUMBER (COMA NUMBER)* RBRACE SEMI;
NUMBER : [0-9]+;
L_BOOLEAN_LITERAL
: 'true'
| 'false'
;
L_ID : [a-z]+;
L_IF : 'if';
L_ELSE : 'else';
L_THEN : 'then';
L_FOREACH : 'foreach';
L_VAR : 'var';
L_IN : 'in';
L_LIST : 'list';
L_NUMBER : 'number';
L_RETURN : 'return';
ASSIGN : '=';
LPAREN : '(';
RPAREN : ')';
LBRACE : '{';
RBRACE : '}';
COMA : ',';
SEMI : ';';
WS: [ \t\n\r]+ ->skip;
And when i try to parse this with example text:
list a = {2,3};
It says:
line 1:0 token recognition error at: ''
line 1:1 missing 'list' at 'list'
line 1:6 extraneous input 'a' expecting '='
What am I doing wrong?
I usually use this way to define lexer, and it always work:
fragment I : [iI];
fragment L : [lL];
fragment S : [sS];
fragment T : [tT];
L_List : L I S T;
Your 'list' has been matched as L_ID, so extraneous input 'a' expecting '='
Put your L_ID below almost all of the lexers.
Related
I have a problem with my antlr grammar or(lexer). In my case I need to parse a string with custom text and find functions in it. Format of function $foo($bar(3),'strArg'). I found solution in this post ANTLR Nested Functions and little bit improved it for my needs. But while testing different cases I found one that brakes parser: $foo($3,'strArg'). This will throw IncorectSyntax exception. I tried many variants(for example not to skip $ and include it in parsing tree) but it all these attempts were unsuccessfully
Lexer
lexer grammar TLexer;
TEXT
: ~[$]
;
FUNCTION_START
: '$' -> pushMode(IN_FUNCTION), skip
;
mode IN_FUNCTION;
FUNTION_NESTED : '$' -> pushMode(IN_FUNCTION), skip;
ID : [a-zA-Z_]+;
PAR_OPEN : '(';
PAR_CLOSE : ')' -> popMode;
NUMBER : [0-9]+;
STRING : '\'' ( ~'\'' | '\'\'' )* '\'';
COMMA : ',';
SPACE : [ \t\r\n]-> skip;
Parser
options {
tokenVocab=TLexer;
}
parse
: atom* EOF
;
atom
: text
| function
;
text
: TEXT+
;
function
: ID params
;
params
: PAR_OPEN ( param ( COMMA param )* )? PAR_CLOSE
;
param
: NUMBER
| STRING
| function
;
The parser does not fail on $foo($3,'strArg'), because when it encounters the second $ it is already in IN_FUNCTION mode and it is expecting a parameter. It skips the character and reads a NUMBER.
If you want it to fail you need to unskip the dollar signs in the Lexer:
FUNCTION_START : '$' -> pushMode(IN_FUNCTION);
mode IN_FUNCTION;
FUNTION_START : '$' -> pushMode(IN_FUNCTION);
and modify the function rule:
function : FUNCTION_START ID params;
I'm trying to use antlr4 and I have the following grammar :
grammar Comp;
start : 'ca\n';
ID : [a-zA-Z][a-zA-Z0-9]* ; // match identifiers
INT : [0-9]+ ; // match integers
NEWLINE : '\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace
OTHER : (~'\n')* '\n' ;
If I send the lexem 'ca\n' it works.
But with the rule :
start : 'ca' '\n';
or
start : 'ca' NEWLINE;
the lexem is not recongnized. Why ?
Thanks for your help. ;)
I'm using ANTLR 4 to try and parse task definitions. The task definitions look a little like the following:
task = { priority = 10; };
My grammar file then looks like the following:
grammar TaskGrammar;
/* Parser rules */
task : 'task' ASSIGNMENT_OP block EOF;
logical_entity : (TRUE | FALSE) # LogicalConst
| IDENTIFIER # LogicalVariable
;
numeric_entity : DECIMAL # NumericConst
| IDENTIFIER # NumericVariable
;
block : LBRACE (statement)* RBRACE SEMICOLON;
assignment : IDENTIFIER ASSIGNMENT_OP DECIMAL SEMICOLON
| IDENTIFIER ASSIGNMENT_OP block SEMICOLON
| IDENTIFIER ASSIGNMENT_OP QUOTED_STRING SEMICOLON
| IDENTIFIER ASSIGNMENT_OP CONSTANT SEMICOLON;
functionCall : IDENTIFIER LPAREN (parameter)*? RPAREN SEMICOLON;
parameter : DECIMAL
| QUOTED_STRING;
statement : assignment
| functionCall;
/* Lexxer rules */
IF : 'if' ;
THEN : 'then';
AND : 'and' ;
OR : 'or' ;
TRUE : 'true' ;
FALSE : 'false' ;
MULT : '*' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '==' ;
ASSIGNMENT_OP : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
SEMICOLON : ';' ;
// DECIMAL, IDENTIFIER, COMMENTS, WS are set using regular expressions
DECIMAL : '-'?[0-9]+('.'[0-9]+)? ;
IDENTIFIER : [a-zA-Z_][a-zA-Z_0-9]* ;
Value: STR_EXT | QUOTED_STRING | SINGLE_QUOTED
;
STR_EXT
:
[a-zA-Z0-9_/\.,\-:=~+!?$&^*\[\]#|]+;
Comment
:
'#' ~[\r\n]*;
CONSTANT : StringCharacters;
QUOTED_STRING
:
'"' StringCharacters? '"'
;
fragment
StringCharacters
: (~["\\] | EscapeSequence)+
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]?
;
SINGLE_QUOTED
:
'\'' ~['\\]* '\'';
// COMMENT and WS are stripped from the output token stream by sending
// to a different channel 'skip'
COMMENT : '//' .+? ('\n'|EOF) -> skip ;
WS : [ \r\t\u000C\n]+ -> skip ;
This grammar compiles fine in ANTLR, but when it comes to trying to use the parser, I get the following error:
line 1:0 mismatched input 'task = { priority = 10; return = AND; };' expecting 'task'
org.antlr.v4.runtime.InputMismatchException
It looks like the parser isn't recognising the block part of the definition, but I can't quite see why. The block parse rule definition should match as far as I can tell. I would expect to have a TaskContext, with a child BlockContext containing a single AssignmentContext. I get the TaskContext, but it has the above exception.
Am I missing something here? This is my first attempt at using Antler, so may be getting confused between Lexxer and Parser rules...
Your STR_EXT consumes the entire input. That rule has to go: ANTLR's lexer will always try to match as much characters as possible.
I also see that CONSTANT might consume that entire input. It has to go to, or at least be changed to consume less chars.
One of our web applications regulary dies because its out of memory. The sparse data we gathered from memory dumps suggests there is an issue in our antlr parsing implementation. What we see is a antlr tokenstream containing more than a million items. The input text which causes this has yet to be found.
Is it possible this is somehow related to an zero width item beeing matched?
Could there be another issue in the grammer resulting in excessive memory usage?
Here is the current grammar we use:
grammar AdvancedQueries;
options {
language = Java;
output = AST;
ASTLabelType=CommonTree;
}
tokens {
FOR;
END;
FIELDSEARCH;
TARGETFIELD;
RELATION;
NOTNODE;
ANDNODE;
NEARDISTANCE;
OUTOFPLACE;
}
#header {
package de.bsmo.fast.parsing;
}
#lexer::header {
package de.bsmo.fast.parsing;
}
startExpression : orEx;
expressionLevel4
: LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression | outofplace;
expressionLevel3
: (fieldExpression) | expressionLevel4 ;
expressionLevel2
: (nearExpression) | expressionLevel3 ;
expressionLevel1
: (countExpression) | expressionLevel2 ;
notEx : NOT^? a=expressionLevel1 ;
andEx : (notEx -> notEx)
(AND? a=notEx -> ^(ANDNODE $andEx $a))*;
orEx : andEx (OR^ andEx)*;
countExpression : COUNT LPARENTHESIS countSub RPARENTHESIS RELATION NUMBERS -> ^(COUNT countSub RELATION NUMBERS);
countSub
: orEx;
nearExpression : NEAR LPARENTHESIS (WORD|PHRASE) MULTIPLESEPERATOR (WORD|PHRASE) MULTIPLESEPERATOR NUMBERS RPARENTHESIS -> ^(NEAR WORD* PHRASE* ^(NEARDISTANCE NUMBERS));
fieldExpression : WORD PROPERTYSEPERATOR fieldSub -> ^(FIELDSEARCH ^(TARGETFIELD WORD) fieldSub );
fieldSub
: WORD | PHRASE | LPARENTHESIS! orEx RPARENTHESIS!;
atomicExpression
: WORD
| PHRASE
| NUMBERS
;
//Out of place are elements captured that may be in the parseable input but need to be ommited from output later
//Those unwanted elements are captured here.
//MULTIPLESEPERATOR capture unwanted ","
outofplace
: MULTIPLESEPERATOR -> ^(OUTOFPLACE ^(MULTIPLESEPERATOR));
fragment NUMBER : ('0'..'9');
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'?');
fragment QUOTE : ('"');
fragment LESSTHEN : '<';
fragment MORETHEN: '>';
fragment EQUAL: '=';
fragment SPACE : ('\u0009'|'\u0020'|'\u000C'|'\u00A0');
fragment WORDMATTER: ('!'|'0'..'9'|'\u0023'..'\u0027'|'*'|'+'|'\u002D'..'\u0039'|'\u003F'..'\u007E'|'\u00A1'..'\uFFFE');
LPARENTHESIS : '(';
RPARENTHESIS : ')';
AND : ('A'|'a')('N'|'n')('D'|'d');
OR : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT : ('N'|'n')('O'|'o')('T'|'t');
COUNT:('C'|'c')('O'|'o')('U'|'u')('N'|'n')('T'|'t');
NEAR:('N'|'n')('E'|'e')('A'|'a')('R'|'r');
PROPERTYSEPERATOR : ':';
MULTIPLESEPERATOR : ',';
WS : (SPACE) { $channel=HIDDEN; };
NUMBERS : (NUMBER)+;
RELATION : (LESSTHEN | MORETHEN)? EQUAL // '<=', '>=', or '='
| (LESSTHEN | MORETHEN); // '<' or '>'
PHRASE : (QUOTE)(.)*(QUOTE);
WORD : WORDMATTER* ;
The most common cause of this is a token that can have length 0. There can be an infinite number of such a token between any two other tokens in the file. Defining a token like this now results in a compiler warning in ANTLR 4.
The following rule can match the empty string:
WORD : WORDMATTER*;
Perhaps you meant to use the following instead?
WORD : WORDMATTER+;
I'm trying to create a very simple grammar to learn to use ANTLR but I get the following message:
"The following alternatives can never be reached: 2"
This is my grammar attempt:
grammar Robot;
file : command+;
command : ( delay|type|move|click|rclick) ;
delay : 'wait' number ';';
type : 'type' id ';';
move : 'move' number ',' number ';';
click : 'click' ;
rclick : 'rlick' ;
id : ('a'..'z'|'A'..'Z')+ ;
number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n' ) { skip();} ;
I'm using ANTLRWorks plugin for IDEA:
The .. (range) inside parser rules means something different than inside lexer rules. Inside lexer rules, it means: "from char X to char Y", and inside parser rule it matches "from token M to token N". And since you made number a parser rule, it does not do what you think it does (and are therefor receiving an obscure error message).
The solution: make number a lexer rule instead (so, capitalize it: Number):
grammar Robot;
file : command+;
command : (delay | type | move | Click | RClick) ;
delay : 'wait' Number ';';
type : 'type' Id ';';
move : 'move' Number ',' Number ';';
Click : 'click' ;
RClick : 'rlick' ;
Id : ('a'..'z'|'A'..'Z')+ ;
Number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n') { skip();} ;
And as you can see, I also made id, click and rclick lexer rules instead. If you're not sure what the difference is between parser- and lexer rules, please say so and I'll add an explanation to this answer.