I'm trying to use antlr4 and I have the following grammar :
grammar Comp;
start : 'ca\n';
ID : [a-zA-Z][a-zA-Z0-9]* ; // match identifiers
INT : [0-9]+ ; // match integers
NEWLINE : '\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace
OTHER : (~'\n')* '\n' ;
If I send the lexem 'ca\n' it works.
But with the rule :
start : 'ca' '\n';
or
start : 'ca' NEWLINE;
the lexem is not recongnized. Why ?
Thanks for your help. ;)
I'm using ANTLR 4 to try and parse task definitions. The task definitions look a little like the following:
task = { priority = 10; };
My grammar file then looks like the following:
grammar TaskGrammar;
/* Parser rules */
task : 'task' ASSIGNMENT_OP block EOF;
logical_entity : (TRUE | FALSE) # LogicalConst
| IDENTIFIER # LogicalVariable
;
numeric_entity : DECIMAL # NumericConst
| IDENTIFIER # NumericVariable
;
block : LBRACE (statement)* RBRACE SEMICOLON;
assignment : IDENTIFIER ASSIGNMENT_OP DECIMAL SEMICOLON
| IDENTIFIER ASSIGNMENT_OP block SEMICOLON
| IDENTIFIER ASSIGNMENT_OP QUOTED_STRING SEMICOLON
| IDENTIFIER ASSIGNMENT_OP CONSTANT SEMICOLON;
functionCall : IDENTIFIER LPAREN (parameter)*? RPAREN SEMICOLON;
parameter : DECIMAL
| QUOTED_STRING;
statement : assignment
| functionCall;
/* Lexxer rules */
IF : 'if' ;
THEN : 'then';
AND : 'and' ;
OR : 'or' ;
TRUE : 'true' ;
FALSE : 'false' ;
MULT : '*' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '==' ;
ASSIGNMENT_OP : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
SEMICOLON : ';' ;
// DECIMAL, IDENTIFIER, COMMENTS, WS are set using regular expressions
DECIMAL : '-'?[0-9]+('.'[0-9]+)? ;
IDENTIFIER : [a-zA-Z_][a-zA-Z_0-9]* ;
Value: STR_EXT | QUOTED_STRING | SINGLE_QUOTED
;
STR_EXT
:
[a-zA-Z0-9_/\.,\-:=~+!?$&^*\[\]#|]+;
Comment
:
'#' ~[\r\n]*;
CONSTANT : StringCharacters;
QUOTED_STRING
:
'"' StringCharacters? '"'
;
fragment
StringCharacters
: (~["\\] | EscapeSequence)+
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]?
;
SINGLE_QUOTED
:
'\'' ~['\\]* '\'';
// COMMENT and WS are stripped from the output token stream by sending
// to a different channel 'skip'
COMMENT : '//' .+? ('\n'|EOF) -> skip ;
WS : [ \r\t\u000C\n]+ -> skip ;
This grammar compiles fine in ANTLR, but when it comes to trying to use the parser, I get the following error:
line 1:0 mismatched input 'task = { priority = 10; return = AND; };' expecting 'task'
org.antlr.v4.runtime.InputMismatchException
It looks like the parser isn't recognising the block part of the definition, but I can't quite see why. The block parse rule definition should match as far as I can tell. I would expect to have a TaskContext, with a child BlockContext containing a single AssignmentContext. I get the TaskContext, but it has the above exception.
Am I missing something here? This is my first attempt at using Antler, so may be getting confused between Lexxer and Parser rules...
Your STR_EXT consumes the entire input. That rule has to go: ANTLR's lexer will always try to match as much characters as possible.
I also see that CONSTANT might consume that entire input. It has to go to, or at least be changed to consume less chars.
How do I skip sql single line comments in antlr4 grammar?
This is the input which I have given:
--
-- $INPUT.sql$
--
CREATE TABLE table_one ( customer_number integer, address character varying(30));
create table table_two ( id integer, city character varying(50));
Like this:
SINGLE_LINE_COMMENT
: '--' ~[\r\n]* -> skip
;
If I parse your example input with the following grammar:
grammar Hello;
parse
: .*? EOF
;
INTEGER
: [0-9]+
;
IDENTIFIER
: [a-zA-Z_]+
;
SINGLE_LINE_COMMENT
: '--' ~[\r\n]* -> skip
;
SPACES
: [ \t\r\n]+ -> skip
;
OTHER
: .
;
and let ANTLRWorks2 print the tokens, I see the following:
[#0,23:28='CREATE',<2>,6:0]
[#1,30:34='TABLE',<2>,6:7]
[#2,36:44='table_one',<2>,6:13]
[#3,46:46='(',<5>,6:23]
[#4,48:62='customer_number',<2>,6:25]
[#5,64:70='integer',<2>,6:41]
[#6,71:71=',',<5>,6:48]
[#7,73:79='address',<2>,6:50]
[#8,81:89='character',<2>,6:58]
[#9,91:97='varying',<2>,6:68]
[#10,98:98='(',<5>,6:75]
[#11,99:100='30',<1>,6:76]
[#12,101:101=')',<5>,6:78]
[#13,102:102=')',<5>,6:79]
[#14,103:103=';',<5>,6:80]
[#15,106:111='create',<2>,8:0]
[#16,113:117='table',<2>,8:7]
[#17,119:127='table_two',<2>,8:13]
[#18,129:129='(',<5>,8:23]
[#19,131:132='id',<2>,8:25]
[#20,134:140='integer',<2>,8:28]
[#21,141:141=',',<5>,8:35]
[#22,143:146='city',<2>,8:37]
[#23,148:156='character',<2>,8:42]
[#24,158:164='varying',<2>,8:52]
[#25,165:165='(',<5>,8:59]
[#26,166:167='50',<1>,8:60]
[#27,168:168=')',<5>,8:62]
[#28,169:169=')',<5>,8:63]
[#29,170:170=';',<5>,8:64]
[#30,172:171='<EOF>',<-1>,9:0]
I.e.: the line comment are discarded properly. If it does not with happen in your case, something else is going wrong.
try EOF on the end of the rule as one of the options not just \r\n, like:
LINE_COMMENT
: '//' ~[\r\n]* (EOF|'\r'? '\n') -> channel(HIDDEN)
;
One of our web applications regulary dies because its out of memory. The sparse data we gathered from memory dumps suggests there is an issue in our antlr parsing implementation. What we see is a antlr tokenstream containing more than a million items. The input text which causes this has yet to be found.
Is it possible this is somehow related to an zero width item beeing matched?
Could there be another issue in the grammer resulting in excessive memory usage?
Here is the current grammar we use:
grammar AdvancedQueries;
options {
language = Java;
output = AST;
ASTLabelType=CommonTree;
}
tokens {
FOR;
END;
FIELDSEARCH;
TARGETFIELD;
RELATION;
NOTNODE;
ANDNODE;
NEARDISTANCE;
OUTOFPLACE;
}
#header {
package de.bsmo.fast.parsing;
}
#lexer::header {
package de.bsmo.fast.parsing;
}
startExpression : orEx;
expressionLevel4
: LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression | outofplace;
expressionLevel3
: (fieldExpression) | expressionLevel4 ;
expressionLevel2
: (nearExpression) | expressionLevel3 ;
expressionLevel1
: (countExpression) | expressionLevel2 ;
notEx : NOT^? a=expressionLevel1 ;
andEx : (notEx -> notEx)
(AND? a=notEx -> ^(ANDNODE $andEx $a))*;
orEx : andEx (OR^ andEx)*;
countExpression : COUNT LPARENTHESIS countSub RPARENTHESIS RELATION NUMBERS -> ^(COUNT countSub RELATION NUMBERS);
countSub
: orEx;
nearExpression : NEAR LPARENTHESIS (WORD|PHRASE) MULTIPLESEPERATOR (WORD|PHRASE) MULTIPLESEPERATOR NUMBERS RPARENTHESIS -> ^(NEAR WORD* PHRASE* ^(NEARDISTANCE NUMBERS));
fieldExpression : WORD PROPERTYSEPERATOR fieldSub -> ^(FIELDSEARCH ^(TARGETFIELD WORD) fieldSub );
fieldSub
: WORD | PHRASE | LPARENTHESIS! orEx RPARENTHESIS!;
atomicExpression
: WORD
| PHRASE
| NUMBERS
;
//Out of place are elements captured that may be in the parseable input but need to be ommited from output later
//Those unwanted elements are captured here.
//MULTIPLESEPERATOR capture unwanted ","
outofplace
: MULTIPLESEPERATOR -> ^(OUTOFPLACE ^(MULTIPLESEPERATOR));
fragment NUMBER : ('0'..'9');
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'?');
fragment QUOTE : ('"');
fragment LESSTHEN : '<';
fragment MORETHEN: '>';
fragment EQUAL: '=';
fragment SPACE : ('\u0009'|'\u0020'|'\u000C'|'\u00A0');
fragment WORDMATTER: ('!'|'0'..'9'|'\u0023'..'\u0027'|'*'|'+'|'\u002D'..'\u0039'|'\u003F'..'\u007E'|'\u00A1'..'\uFFFE');
LPARENTHESIS : '(';
RPARENTHESIS : ')';
AND : ('A'|'a')('N'|'n')('D'|'d');
OR : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT : ('N'|'n')('O'|'o')('T'|'t');
COUNT:('C'|'c')('O'|'o')('U'|'u')('N'|'n')('T'|'t');
NEAR:('N'|'n')('E'|'e')('A'|'a')('R'|'r');
PROPERTYSEPERATOR : ':';
MULTIPLESEPERATOR : ',';
WS : (SPACE) { $channel=HIDDEN; };
NUMBERS : (NUMBER)+;
RELATION : (LESSTHEN | MORETHEN)? EQUAL // '<=', '>=', or '='
| (LESSTHEN | MORETHEN); // '<' or '>'
PHRASE : (QUOTE)(.)*(QUOTE);
WORD : WORDMATTER* ;
The most common cause of this is a token that can have length 0. There can be an infinite number of such a token between any two other tokens in the file. Defining a token like this now results in a compiler warning in ANTLR 4.
The following rule can match the empty string:
WORD : WORDMATTER*;
Perhaps you meant to use the following instead?
WORD : WORDMATTER+;
I'm trying to create a very simple grammar to learn to use ANTLR but I get the following message:
"The following alternatives can never be reached: 2"
This is my grammar attempt:
grammar Robot;
file : command+;
command : ( delay|type|move|click|rclick) ;
delay : 'wait' number ';';
type : 'type' id ';';
move : 'move' number ',' number ';';
click : 'click' ;
rclick : 'rlick' ;
id : ('a'..'z'|'A'..'Z')+ ;
number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n' ) { skip();} ;
I'm using ANTLRWorks plugin for IDEA:
The .. (range) inside parser rules means something different than inside lexer rules. Inside lexer rules, it means: "from char X to char Y", and inside parser rule it matches "from token M to token N". And since you made number a parser rule, it does not do what you think it does (and are therefor receiving an obscure error message).
The solution: make number a lexer rule instead (so, capitalize it: Number):
grammar Robot;
file : command+;
command : (delay | type | move | Click | RClick) ;
delay : 'wait' Number ';';
type : 'type' Id ';';
move : 'move' Number ',' Number ';';
Click : 'click' ;
RClick : 'rlick' ;
Id : ('a'..'z'|'A'..'Z')+ ;
Number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n') { skip();} ;
And as you can see, I also made id, click and rclick lexer rules instead. If you're not sure what the difference is between parser- and lexer rules, please say so and I'll add an explanation to this answer.