How to skip single-line SQL comments (--) in ANTLR4 - java

How do I skip SQL single-line comments in an ANTLR4 grammar?
This is the input which I have given:
--
-- $INPUT.sql$
--
CREATE TABLE table_one ( customer_number integer, address character varying(30));
create table table_two ( id integer, city character varying(50));

Like this:
SINGLE_LINE_COMMENT
: '--' ~[\r\n]* -> skip
;
If I parse your example input with the following grammar:
grammar Hello;
parse
: .*? EOF
;
INTEGER
: [0-9]+
;
IDENTIFIER
: [a-zA-Z_]+
;
SINGLE_LINE_COMMENT
: '--' ~[\r\n]* -> skip
;
SPACES
: [ \t\r\n]+ -> skip
;
OTHER
: .
;
and let ANTLRWorks2 print the tokens, I see the following:
[#0,23:28='CREATE',<2>,6:0]
[#1,30:34='TABLE',<2>,6:7]
[#2,36:44='table_one',<2>,6:13]
[#3,46:46='(',<5>,6:23]
[#4,48:62='customer_number',<2>,6:25]
[#5,64:70='integer',<2>,6:41]
[#6,71:71=',',<5>,6:48]
[#7,73:79='address',<2>,6:50]
[#8,81:89='character',<2>,6:58]
[#9,91:97='varying',<2>,6:68]
[#10,98:98='(',<5>,6:75]
[#11,99:100='30',<1>,6:76]
[#12,101:101=')',<5>,6:78]
[#13,102:102=')',<5>,6:79]
[#14,103:103=';',<5>,6:80]
[#15,106:111='create',<2>,8:0]
[#16,113:117='table',<2>,8:7]
[#17,119:127='table_two',<2>,8:13]
[#18,129:129='(',<5>,8:23]
[#19,131:132='id',<2>,8:25]
[#20,134:140='integer',<2>,8:28]
[#21,141:141=',',<5>,8:35]
[#22,143:146='city',<2>,8:37]
[#23,148:156='character',<2>,8:42]
[#24,158:164='varying',<2>,8:52]
[#25,165:165='(',<5>,8:59]
[#26,166:167='50',<1>,8:60]
[#27,168:168=')',<5>,8:62]
[#28,169:169=')',<5>,8:63]
[#29,170:170=';',<5>,8:64]
[#30,172:171='<EOF>',<-1>,9:0]
I.e., the line comments are discarded properly. If that does not happen in your case, something else is going wrong.
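To see the effect of the comment rule without running ANTLR, here is a minimal sketch in plain Java that imitates the five lexer rules of the Hello grammar with `java.util.regex`. The class name `SqlCommentSkipSketch` and the method `tokenize` are made up for illustration and are not part of ANTLR. Regex alternation takes the first alternative that matches rather than the longest one, which happens to give the same result for this particular rule set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SqlCommentSkipSketch {
    // One alternative per lexer rule of the Hello grammar (an imitation, not ANTLR itself).
    // The "comment" alternative mirrors '--' ~[\r\n]* : it stops before the line
    // terminator and does not require one, so it also matches at the end of the input.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<comment>--[^\\r\\n]*)"
        + "|(?<spaces>[ \\t\\r\\n]+)"
        + "|(?<integer>[0-9]+)"
        + "|(?<identifier>[a-zA-Z_]+)"
        + "|(?<other>.)",
        Pattern.DOTALL);

    // Returns the text of every token that is not skipped.
    static List<String> tokenize(String input) {
        List<String> kept = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            boolean skipped = m.group("comment") != null || m.group("spaces") != null;
            if (!skipped) {
                kept.add(m.group());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String sql = "--\n-- $INPUT.sql$\n--\n"
            + "create table table_two ( id integer, city character varying(50));\n"
            + "-- trailing comment without a newline";
        // Only the tokens of the CREATE statement are printed; all comments are gone.
        System.out.println(tokenize(sql));
    }
}
```

Note that the last comment in the sample has no trailing newline and is still dropped, because the rule only consumes characters up to (not including) the line terminator.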

Try allowing EOF at the end of the rule as an alternative to the line terminator, not just \r\n, like this (shown for // comments; use '--' for SQL):
LINE_COMMENT
: '//' ~[\r\n]* (EOF|'\r'? '\n') -> channel(HIDDEN)
;

Related

ANTLR4 - arguments in nested functions

I have a problem with my ANTLR grammar (or lexer). I need to parse a string containing custom text and find the functions in it. The function format is $foo($bar(3),'strArg'). I found a solution in the post "ANTLR Nested Functions" and improved it a little for my needs. But while testing different cases I found one that breaks the parser: $foo($3,'strArg'). This throws an IncorectSyntax exception. I tried many variants (for example, not skipping $ and including it in the parse tree), but all of these attempts were unsuccessful.
Lexer
lexer grammar TLexer;
TEXT
: ~[$]
;
FUNCTION_START
: '$' -> pushMode(IN_FUNCTION), skip
;
mode IN_FUNCTION;
FUNTION_NESTED : '$' -> pushMode(IN_FUNCTION), skip;
ID : [a-zA-Z_]+;
PAR_OPEN : '(';
PAR_CLOSE : ')' -> popMode;
NUMBER : [0-9]+;
STRING : '\'' ( ~'\'' | '\'\'' )* '\'';
COMMA : ',';
SPACE : [ \t\r\n]-> skip;
Parser
options {
tokenVocab=TLexer;
}
parse
: atom* EOF
;
atom
: text
| function
;
text
: TEXT+
;
function
: ID params
;
params
: PAR_OPEN ( param ( COMMA param )* )? PAR_CLOSE
;
param
: NUMBER
| STRING
| function
;
The parser does not fail on $foo($3,'strArg'), because when it encounters the second $ it is already in IN_FUNCTION mode and it is expecting a parameter. It skips the character and reads a NUMBER.
If you want it to fail you need to unskip the dollar signs in the Lexer:
FUNCTION_START : '$' -> pushMode(IN_FUNCTION);
mode IN_FUNCTION;
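// A rule name cannot be defined twice, so the nested rule reuses the FUNCTION_START token type.
FUNCTION_NESTED : '$' -> pushMode(IN_FUNCTION), type(FUNCTION_START);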
and modify the function rule:
function : FUNCTION_START ID params;
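The effect of keeping the dollar sign can be traced with a small hand-written lexer. The following is a sketch in plain Java, not generated ANTLR code: the class `ModeStackSketch` and its methods are hypothetical, and it assumes that '$' produces the same FUNCTION_START token type in both modes. It imitates pushMode and popMode with a stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class ModeStackSketch {
    // Returns the token types a TLexer-like lexer would emit when '$' is not skipped.
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        Deque<String> modes = new ArrayDeque<>();
        modes.push("DEFAULT_MODE");
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '$') {
                // Emitted in either mode, then pushMode(IN_FUNCTION).
                types.add("FUNCTION_START");
                modes.push("IN_FUNCTION");
                i++;
            } else if (modes.peek().equals("DEFAULT_MODE")) {
                types.add("TEXT"); // TEXT : ~[$] matches one character at a time
                i++;
            } else if (isIdChar(c)) {
                while (i < input.length() && isIdChar(input.charAt(i))) i++;
                types.add("ID");
            } else if (isDigit(c)) {
                while (i < input.length() && isDigit(input.charAt(i))) i++;
                types.add("NUMBER");
            } else if (c == '(') {
                types.add("PAR_OPEN");
                i++;
            } else if (c == ')') {
                types.add("PAR_CLOSE");
                modes.pop(); // popMode
                i++;
            } else if (c == ',') {
                types.add("COMMA");
                i++;
            } else if (c == '\'') {
                i = endOfString(input, i);
                types.add("STRING");
            } else if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++; // SPACE -> skip
            } else {
                throw new IllegalArgumentException("unexpected character at index " + i);
            }
        }
        return types;
    }

    private static boolean isIdChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns the index just past the closing quote; '' inside the string is an escaped quote.
    private static int endOfString(String s, int start) {
        int j = start + 1;
        while (j < s.length()) {
            if (s.charAt(j) == '\'') {
                if (j + 1 < s.length() && s.charAt(j + 1) == '\'') {
                    j += 2;
                    continue;
                }
                return j + 1;
            }
            j++;
        }
        throw new IllegalArgumentException("unterminated string at index " + start);
    }

    public static void main(String[] args) {
        System.out.println(tokenTypes("$foo($bar(3),'strArg')"));
        System.out.println(tokenTypes("$foo($3,'strArg')"));
    }
}
```

For $foo($3,'strArg') the second FUNCTION_START is followed directly by NUMBER instead of ID, so a rule of the form FUNCTION_START ID params can no longer match it and the parser reports an error, which is the behaviour the question asks for.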

ANTLR4: Whitespace and Space lexical handling

In my (simplified) grammar
grammar test;
prog: stat+;
stat:
sourceDef ';'
;
sourceDef:
SRC COLON ID
;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
LINE_COMMENT
: '//' (~('\n'|'\r'))* -> skip;
WS : [ \t\n\r]+ -> skip;
//SP : ' ' -> skip;
COMMENT : '/*' .*? '*/' -> skip;
LE: '<';
MINUS: '-';
GR: '>';
COLON: ':' ;
HASH: '#';
EQ: '=';
SEMI: ';';
COMMA: ',';
AND: [Aa][Nn][Dd];
SRC: [Ss][Rr][Cc];
NUMBER: [0-9];
ID: [a-zA-Z][a-zA-z0-9]+;
DAY: ('0'[1-9]|[12][0-9]|'3'[01]);
MONTH: ('0' [1-9]|'1'[012]);
YEAR: [0-2] [890] NUMBER NUMBER;
DATE: DAY [- /.] MONTH [- /.] YEAR;
the code
src : xxx;
shows a syntax error:
extraneous input ' ' expecting ':'
The code
src:xxx;
resolves fine.
The modified version with
WS : [\t\n\r]+ -> skip;
SP : ' ' -> skip;
works fine with both syntax versions (with and without spaces).
So the spaces seem to be skipped only if they are defined in a separate rule.
Is something wrong with this
WS : [ \t\n\r]+ -> skip;
definition?
Or what else could cause this (to me) unexpected behavior?
I assume you have already found the solution, but for the record:
Your whitespace lexer rule should be:
WS : (' '|'\r'|'\n'|'\t') -> channel(HIDDEN);
In your grammar the space character simply is not specified, that is all.

Antlr4 'no viable alternative at input' with my grammar

I'm trying to use ANTLR4 and I have the following grammar:
grammar Comp;
start : 'ca\n';
ID : [a-zA-Z][a-zA-Z0-9]* ; // match identifiers
INT : [0-9]+ ; // match integers
NEWLINE : '\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace
OTHER : (~'\n')* '\n' ;
If I send the lexeme 'ca\n' it works.
But with the rule :
start : 'ca' '\n';
or
start : 'ca' NEWLINE;
the lexeme is not recognized. Why?
Thanks for your help. ;)

ANTLR4 Grammar only matching first part of parser rule

I'm using ANTLR 4 to try and parse task definitions. The task definitions look a little like the following:
task = { priority = 10; };
My grammar file then looks like the following:
grammar TaskGrammar;
/* Parser rules */
task : 'task' ASSIGNMENT_OP block EOF;
logical_entity : (TRUE | FALSE) # LogicalConst
| IDENTIFIER # LogicalVariable
;
numeric_entity : DECIMAL # NumericConst
| IDENTIFIER # NumericVariable
;
block : LBRACE (statement)* RBRACE SEMICOLON;
assignment : IDENTIFIER ASSIGNMENT_OP DECIMAL SEMICOLON
| IDENTIFIER ASSIGNMENT_OP block SEMICOLON
| IDENTIFIER ASSIGNMENT_OP QUOTED_STRING SEMICOLON
| IDENTIFIER ASSIGNMENT_OP CONSTANT SEMICOLON;
functionCall : IDENTIFIER LPAREN (parameter)*? RPAREN SEMICOLON;
parameter : DECIMAL
| QUOTED_STRING;
statement : assignment
| functionCall;
/* Lexer rules */
IF : 'if' ;
THEN : 'then';
AND : 'and' ;
OR : 'or' ;
TRUE : 'true' ;
FALSE : 'false' ;
MULT : '*' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '==' ;
ASSIGNMENT_OP : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
SEMICOLON : ';' ;
// DECIMAL, IDENTIFIER, COMMENTS, WS are set using regular expressions
DECIMAL : '-'?[0-9]+('.'[0-9]+)? ;
IDENTIFIER : [a-zA-Z_][a-zA-Z_0-9]* ;
Value: STR_EXT | QUOTED_STRING | SINGLE_QUOTED
;
STR_EXT
:
[a-zA-Z0-9_/\.,\-:=~+!?$&^*\[\]#|]+;
Comment
:
'#' ~[\r\n]*;
CONSTANT : StringCharacters;
QUOTED_STRING
:
'"' StringCharacters? '"'
;
fragment
StringCharacters
: (~["\\] | EscapeSequence)+
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]?
;
SINGLE_QUOTED
:
'\'' ~['\\]* '\'';
// COMMENT and WS are stripped from the output token stream by sending
// to a different channel 'skip'
COMMENT : '//' .+? ('\n'|EOF) -> skip ;
WS : [ \r\t\u000C\n]+ -> skip ;
This grammar compiles fine in ANTLR, but when it comes to trying to use the parser, I get the following error:
line 1:0 mismatched input 'task = { priority = 10; return = AND; };' expecting 'task'
org.antlr.v4.runtime.InputMismatchException
It looks like the parser isn't recognising the block part of the definition, but I can't quite see why. The block parse rule definition should match as far as I can tell. I would expect to have a TaskContext, with a child BlockContext containing a single AssignmentContext. I get the TaskContext, but it has the above exception.
Am I missing something here? This is my first attempt at using ANTLR, so I may be getting confused between lexer and parser rules...
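Your STR_EXT rule is far too greedy: ANTLR's lexer will always try to match as many characters as possible, so a catch-all rule like this swallows input that should become several separate tokens. That rule has to go.
CONSTANT is even greedier: StringCharacters matches any character except a double quote or a backslash, spaces included, so it can consume the entire input as a single token, which is what the error message shows. It has to go too, or at least be changed to consume fewer characters.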
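The longest-match behaviour can be illustrated with a short sketch in plain Java. It is not ANTLR code: the class `LongestMatchSketch`, its methods, and the reduced rule set are assumptions made for the example, and CONSTANT is simplified to "one or more characters that are neither a double quote nor a backslash" (escape sequences are left out). Each rule is tried at the start of the input, the longest match wins, and on a tie the rule defined first wins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchSketch {
    // A reduced, ordered rule set modelled on the TaskGrammar lexer (illustrative only).
    static Map<String, String> rules(boolean withConstant) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("TASK", "task");                  // implicit token for the literal 'task'
        rules.put("ASSIGNMENT_OP", "=");
        rules.put("IDENTIFIER", "[a-zA-Z_][a-zA-Z_0-9]*");
        if (withConstant) {
            rules.put("CONSTANT", "[^\"\\\\]+");    // simplified StringCharacters
        }
        return rules;
    }

    // Returns "RULE:text" for the first token, or null if no rule matches.
    static String firstToken(String input, Map<String, String> rules) {
        String bestRule = null;
        int bestLength = 0;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Matcher m = Pattern.compile(rule.getValue()).matcher(input);
            // Strictly longer matches replace the current best, so ties keep the earlier rule.
            if (m.lookingAt() && m.end() > bestLength) {
                bestLength = m.end();
                bestRule = rule.getKey();
            }
        }
        return bestRule == null ? null : bestRule + ":" + input.substring(0, bestLength);
    }

    public static void main(String[] args) {
        String input = "task = { priority = 10; };";
        System.out.println(firstToken(input, rules(true)));  // CONSTANT takes the whole line
        System.out.println(firstToken(input, rules(false))); // TASK:task
    }
}
```

With the CONSTANT-like rule present, the first token spans the whole line, so the parser never sees a separate 'task' token. Without it, 'task' and IDENTIFIER both match four characters and the earlier rule wins.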

The following alternatives can never be reached: 2

I'm trying to create a very simple grammar to learn to use ANTLR but I get the following message:
"The following alternatives can never be reached: 2"
This is my grammar attempt:
grammar Robot;
file : command+;
command : ( delay|type|move|click|rclick) ;
delay : 'wait' number ';';
type : 'type' id ';';
move : 'move' number ',' number ';';
click : 'click' ;
rclick : 'rlick' ;
id : ('a'..'z'|'A'..'Z')+ ;
number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n' ) { skip();} ;
I'm using the ANTLRWorks plugin for IDEA.
The .. (range) operator means something different inside parser rules than inside lexer rules. Inside a lexer rule it means "from char X to char Y"; inside a parser rule it matches "from token M to token N". Since you made number a parser rule, it does not do what you think it does (and you are therefore receiving an obscure error message).
The solution: make number a lexer rule instead (so, capitalize it: Number):
grammar Robot;
file : command+;
command : (delay | type | move | Click | RClick) ;
delay : 'wait' Number ';';
type : 'type' Id ';';
move : 'move' Number ',' Number ';';
Click : 'click' ;
RClick : 'rlick' ;
Id : ('a'..'z'|'A'..'Z')+ ;
Number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n') { skip();} ;
And as you can see, I also made id, click and rclick lexer rules instead. If you're not sure what the difference is between parser- and lexer rules, please say so and I'll add an explanation to this answer.
