ANTLR4: Whitespace and Space lexical handling


In my (simplified) grammar
grammar test;
prog: stat+;
stat:
sourceDef ';'
;
sourceDef:
SRC COLON ID
;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
LINE_COMMENT
: '//' (~('\n'|'\r'))* -> skip;
WS : [ \t\n\r]+ -> skip;
//SP : ' ' -> skip;
COMMENT : '/*' .*? '*/' -> skip;
LE: '<';
MINUS: '-';
GR: '>';
COLON: ':' ;
HASH: '#';
EQ: '=';
SEMI: ';';
COMMA: ',';
AND: [Aa][Nn][Dd];
SRC: [Ss][Rr][Cc];
NUMBER: [0-9];
ID: [a-zA-Z][a-zA-z0-9]+;
DAY: ('0'[1-9]|[12][0-9]|'3'[01]);
MONTH: ('0' [1-9]|'1'[012]);
YEAR: [0-2] [890] NUMBER NUMBER;
DATE: DAY [- /.] MONTH [- /.] YEAR;
the code
src : xxx;
shows a syntax error:
extraneous input ' ' expecting ':'
The code
src:xxx;
resolves fine.
The modified version with
WS : [\t\n\r]+ -> skip;
SP : ' ' -> skip;
works fine with both syntax versions (with and without spaces).
So the spaces seem to be skipped only if they are defined in a
separate rule.
Is something wrong with this
WS : [ \t\n\r]+ -> skip;
definition?
Or what else could cause this (to me) unexpected behavior?

I assume you have already found the solution, but for the record:
Your whitespace lexer rule should be:
WS : (' '|'\r'|'\n'|'\t') -> channel(HIDDEN);
In your grammar the space character is simply not specified; that is all.

Related

ANTLR4 - arguments in nested functions

I have a problem with my ANTLR grammar (or lexer). I need to parse a string with custom text and find functions in it. The format of a function is $foo($bar(3),'strArg'). I found a solution in the post "ANTLR Nested Functions" and improved it a little for my needs. But while testing different cases I found one that breaks the parser: $foo($3,'strArg'). This throws an IncorectSyntax exception. I tried many variants (for example, not skipping $ and including it in the parse tree), but all of these attempts were unsuccessful.
Lexer
lexer grammar TLexer;
TEXT
: ~[$]
;
FUNCTION_START
: '$' -> pushMode(IN_FUNCTION), skip
;
mode IN_FUNCTION;
FUNTION_NESTED : '$' -> pushMode(IN_FUNCTION), skip;
ID : [a-zA-Z_]+;
PAR_OPEN : '(';
PAR_CLOSE : ')' -> popMode;
NUMBER : [0-9]+;
STRING : '\'' ( ~'\'' | '\'\'' )* '\'';
COMMA : ',';
SPACE : [ \t\r\n]-> skip;
Parser
options {
tokenVocab=TLexer;
}
parse
: atom* EOF
;
atom
: text
| function
;
text
: TEXT+
;
function
: ID params
;
params
: PAR_OPEN ( param ( COMMA param )* )? PAR_CLOSE
;
param
: NUMBER
| STRING
| function
;

The parser does not fail on $foo($3,'strArg'), because when it encounters the second $ it is already in IN_FUNCTION mode and it is expecting a parameter. It skips the character and reads a NUMBER.
If you want it to fail you need to unskip the dollar signs in the Lexer:
FUNCTION_START : '$' -> pushMode(IN_FUNCTION);
mode IN_FUNCTION;
FUNCTION_NESTED : '$' -> type(FUNCTION_START), pushMode(IN_FUNCTION);
and modify the function rule:
function : FUNCTION_START ID params;
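To see why skipping the dollar sign hides the error, here is a rough Python sketch of the lexer's mode stack. It is not ANTLR code: the `token_types` helper and the `skip_dollar` flag are made up for illustration, and the rules are only a transcription of the IN_FUNCTION mode from the question. The nested case is emitted with the same FUNCTION_START type, which is what the modified function rule needs.

```python
import re

# Rules of the IN_FUNCTION mode, transcribed from the lexer in the question.
IN_FUNCTION_RULES = [
    ("FUNCTION_START", re.compile(r"\$")),
    ("ID", re.compile(r"[a-zA-Z_]+")),
    ("PAR_OPEN", re.compile(r"\(")),
    ("PAR_CLOSE", re.compile(r"\)")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("STRING", re.compile(r"'(?:[^']|'')*'")),
    ("COMMA", re.compile(r",")),
    ("SPACE", re.compile(r"[ \t\r\n]")),
]


def token_types(text, skip_dollar):
    """Return the token types a parser would see (hypothetical helper)."""
    modes = []  # an empty stack stands for the default mode
    seen = []
    pos = 0
    while pos < len(text):
        if not modes:
            # Default mode: '$' enters a function, anything else is TEXT.
            if text[pos] == "$":
                modes.append("IN_FUNCTION")
                if not skip_dollar:
                    seen.append("FUNCTION_START")
            else:
                seen.append("TEXT")
            pos += 1
            continue
        for name, pattern in IN_FUNCTION_RULES:
            match = pattern.match(text, pos)
            if match:
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
        pos = match.end()
        if name == "FUNCTION_START":
            modes.append("IN_FUNCTION")  # pushMode(IN_FUNCTION)
            if skip_dollar:
                continue  # '-> skip': the parser never sees the '$'
        elif name == "PAR_CLOSE":
            modes.pop()  # popMode
        elif name == "SPACE":
            continue
        seen.append(name)
    return seen


# With '$' skipped, the stray dollar leaves no trace in the token stream.
print(token_types("$foo($3,'strArg')", skip_dollar=True))
# With '$' kept, FUNCTION_START appears directly before NUMBER,
# so a rule of the form FUNCTION_START ID params can reject the input.
print(token_types("$foo($3,'strArg')", skip_dollar=False))
```

With the dollar signs skipped, $foo($3,'strArg') and $foo(3,'strArg') give the parser identical tokens, so it has nothing to complain about.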

Antlr4 'no viable alternative at input' with my grammar

I'm trying to use antlr4 and I have the following grammar :
grammar Comp;
start : 'ca\n';
ID : [a-zA-Z][a-zA-Z0-9]* ; // match identifiers
INT : [0-9]+ ; // match integers
NEWLINE : '\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace
OTHER : (~'\n')* '\n' ;
If I send the lexeme 'ca\n' it works.
But with the rule :
start : 'ca' '\n';
or
start : 'ca' NEWLINE;
the lexeme is not recognized. Why?
Thanks for your help. ;)

ANTLR4 Grammar only matching first part of parser rule

I'm using ANTLR 4 to try and parse task definitions. The task definitions look a little like the following:
task = { priority = 10; };
My grammar file then looks like the following:
grammar TaskGrammar;
/* Parser rules */
task : 'task' ASSIGNMENT_OP block EOF;
logical_entity : (TRUE | FALSE) # LogicalConst
| IDENTIFIER # LogicalVariable
;
numeric_entity : DECIMAL # NumericConst
| IDENTIFIER # NumericVariable
;
block : LBRACE (statement)* RBRACE SEMICOLON;
assignment : IDENTIFIER ASSIGNMENT_OP DECIMAL SEMICOLON
| IDENTIFIER ASSIGNMENT_OP block SEMICOLON
| IDENTIFIER ASSIGNMENT_OP QUOTED_STRING SEMICOLON
| IDENTIFIER ASSIGNMENT_OP CONSTANT SEMICOLON;
functionCall : IDENTIFIER LPAREN (parameter)*? RPAREN SEMICOLON;
parameter : DECIMAL
| QUOTED_STRING;
statement : assignment
| functionCall;
/* Lexer rules */
IF : 'if' ;
THEN : 'then';
AND : 'and' ;
OR : 'or' ;
TRUE : 'true' ;
FALSE : 'false' ;
MULT : '*' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '==' ;
ASSIGNMENT_OP : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
SEMICOLON : ';' ;
// DECIMAL, IDENTIFIER, COMMENTS, WS are set using regular expressions
DECIMAL : '-'?[0-9]+('.'[0-9]+)? ;
IDENTIFIER : [a-zA-Z_][a-zA-Z_0-9]* ;
Value: STR_EXT | QUOTED_STRING | SINGLE_QUOTED
;
STR_EXT
:
[a-zA-Z0-9_/\.,\-:=~+!?$&^*\[\]#|]+;
Comment
:
'#' ~[\r\n]*;
CONSTANT : StringCharacters;
QUOTED_STRING
:
'"' StringCharacters? '"'
;
fragment
StringCharacters
: (~["\\] | EscapeSequence)+
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]?
;
SINGLE_QUOTED
:
'\'' ~['\\]* '\'';
// COMMENT and WS are stripped from the output token stream by sending
// to a different channel 'skip'
COMMENT : '//' .+? ('\n'|EOF) -> skip ;
WS : [ \r\t\u000C\n]+ -> skip ;
This grammar compiles fine in ANTLR, but when it comes to trying to use the parser, I get the following error:
line 1:0 mismatched input 'task = { priority = 10; return = AND; };' expecting 'task'
org.antlr.v4.runtime.InputMismatchException
It looks like the parser isn't recognising the block part of the definition, but I can't quite see why. The block parse rule definition should match as far as I can tell. I would expect to have a TaskContext, with a child BlockContext containing a single AssignmentContext. I get the TaskContext, but it has the above exception.
Am I missing something here? This is my first attempt at using ANTLR, so I may be getting confused between lexer and parser rules...

Your STR_EXT consumes the entire input. That rule has to go: ANTLR's lexer will always try to match as many characters as possible.
I also see that CONSTANT might consume the entire input. It has to go too, or at least be changed to consume fewer characters.
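The "match as many characters as possible" behaviour can be tried out without ANTLR. The following is a rough Python sketch, not ANTLR's implementation: the `tokenize` helper, the rule tables and the simplified CONSTANT pattern (escape sequences left out) are assumptions made for illustration. At each position every rule is tried, the longest match wins, and a tie goes to the rule listed first.

```python
import re

# A subset of the lexer rules from the question, in priority order.
BASE_RULES = [
    ("TASK", re.compile(r"task")),
    ("ASSIGNMENT_OP", re.compile(r"=")),
    ("LBRACE", re.compile(r"\{")),
    ("RBRACE", re.compile(r"\}")),
    ("SEMICOLON", re.compile(r";")),
    ("DECIMAL", re.compile(r"-?[0-9]+(\.[0-9]+)?")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("WS", re.compile(r"[ \r\t\f\n]+")),
]

# Simplified stand-in for CONSTANT: anything except '"' and '\'.
GREEDY_RULES = BASE_RULES + [("CONSTANT", re.compile(r'[^"\\]+'))]


def tokenize(text, rules):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # WS is skipped
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


source = "task = { priority = 10; };"
print(tokenize(source, GREEDY_RULES))  # a single CONSTANT token
print(tokenize(source, BASE_RULES))    # the tokens the parser expects
```

With the catch-all rule present, the whole line comes back as one token, which is the shape of the "mismatched input ... expecting 'task'" error in the question.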

How to skip single line sql comment -- in antlr4

How do I skip sql single line comments in antlr4 grammar?
This is the input which I have given:
--
-- $INPUT.sql$
--
CREATE TABLE table_one ( customer_number integer, address character varying(30));
create table table_two ( id integer, city character varying(50));

Like this:
SINGLE_LINE_COMMENT
: '--' ~[\r\n]* -> skip
;
If I parse your example input with the following grammar:
grammar Hello;
parse
: .*? EOF
;
INTEGER
: [0-9]+
;
IDENTIFIER
: [a-zA-Z_]+
;
SINGLE_LINE_COMMENT
: '--' ~[\r\n]* -> skip
;
SPACES
: [ \t\r\n]+ -> skip
;
OTHER
: .
;
and let ANTLRWorks2 print the tokens, I see the following:
[#0,23:28='CREATE',<2>,6:0]
[#1,30:34='TABLE',<2>,6:7]
[#2,36:44='table_one',<2>,6:13]
[#3,46:46='(',<5>,6:23]
[#4,48:62='customer_number',<2>,6:25]
[#5,64:70='integer',<2>,6:41]
[#6,71:71=',',<5>,6:48]
[#7,73:79='address',<2>,6:50]
[#8,81:89='character',<2>,6:58]
[#9,91:97='varying',<2>,6:68]
[#10,98:98='(',<5>,6:75]
[#11,99:100='30',<1>,6:76]
[#12,101:101=')',<5>,6:78]
[#13,102:102=')',<5>,6:79]
[#14,103:103=';',<5>,6:80]
[#15,106:111='create',<2>,8:0]
[#16,113:117='table',<2>,8:7]
[#17,119:127='table_two',<2>,8:13]
[#18,129:129='(',<5>,8:23]
[#19,131:132='id',<2>,8:25]
[#20,134:140='integer',<2>,8:28]
[#21,141:141=',',<5>,8:35]
[#22,143:146='city',<2>,8:37]
[#23,148:156='character',<2>,8:42]
[#24,158:164='varying',<2>,8:52]
[#25,165:165='(',<5>,8:59]
[#26,166:167='50',<1>,8:60]
[#27,168:168=')',<5>,8:62]
[#28,169:169=')',<5>,8:63]
[#29,170:170=';',<5>,8:64]
[#30,172:171='<EOF>',<-1>,9:0]
I.e., the line comments are discarded properly. If that does not happen in your case, something else is going wrong.
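If you want to check the pattern itself outside ANTLR, here is a small Python sketch that mirrors the five rules of the grammar above with one regular expression. It is only an approximation: the `visible_tokens` helper is hypothetical, and Python's alternation picks the first alternative that matches rather than the longest one, so the comment rule is listed first on purpose.

```python
import re

# One named group per lexer rule of the example grammar.
TOKEN_RE = re.compile(r"""
    (?P<SINGLE_LINE_COMMENT>--[^\r\n]*)
  | (?P<SPACES>[ \t\r\n]+)
  | (?P<INTEGER>[0-9]+)
  | (?P<IDENTIFIER>[a-zA-Z_]+)
  | (?P<OTHER>.)
""", re.VERBOSE)

# Rules that carry '-> skip' in the grammar.
SKIPPED = {"SINGLE_LINE_COMMENT", "SPACES"}


def visible_tokens(sql):
    """Return (rule name, text) pairs for everything that is not skipped."""
    return [
        (match.lastgroup, match.group())
        for match in TOKEN_RE.finditer(sql)
        if match.lastgroup not in SKIPPED
    ]


sql = "--\n-- $INPUT.sql$\n--\nCREATE TABLE table_one ( id integer);\n"
for token in visible_tokens(sql):
    print(token)
```

The comment pattern stops in front of the line break, so the newline is left for the SPACES rule, just as in the grammar.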

Try adding EOF at the end of the rule as one of the alternatives, not just \r\n, like:
LINE_COMMENT
: '//' ~[\r\n]* (EOF|'\r'? '\n') -> channel(HIDDEN)
;
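The difference only shows up for a comment on the very last line of the input. A quick Python sketch of the two variants, using regular expressions as stand-ins for the lexer rules (the pattern names and the `comment_at_start` helper are made up, and `\Z` plays the role of EOF):

```python
import re

# Variant that insists on a line terminator after the comment.
NEWLINE_REQUIRED = re.compile(r"//[^\r\n]*\r?\n")
# Variant that also accepts the end of the input.
NEWLINE_OR_EOF = re.compile(r"//[^\r\n]*(?:\r?\n|\Z)")


def comment_at_start(pattern, text):
    """Return the matched comment text, or None if the rule does not match."""
    match = pattern.match(text)
    return match.group() if match else None


last_line = "// trailing comment without a newline"
print(comment_at_start(NEWLINE_REQUIRED, last_line))  # no match
print(comment_at_start(NEWLINE_OR_EOF, last_line))    # whole comment
```

A rule that does not consume the line terminator at all, such as '//' ~[\r\n]*, avoids the problem in the same way.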

ANTLR4 token image concatenation with comments in the mix

I'm trying to write an ANTLR4 lexer for some language. I've got a working one, but I'm not entirely satisfied with it.
keyword "my:little:uri" + /* my comment here */ ':it:is'
// nasty comment
+ ":mehmeh"; // single line comment
keyword + {}
This is an example of statements in the language. It's simply a bunch of keywords followed by string arguments and terminated by a semicolon or a block of sub-statements. Strings may be unquoted, single-quoted or double-quoted. The quoted strings may be concatenated as in the example above. An unquoted string containing a plus sign (+) is valid.
What I find problematic are the comments. I'd like to recognize whatever follows a keyword as a single string token, sans the comments (and whitespace). I'd usually use the more lexer command but I don't think it's applicable for the example above. Is there a pattern that would allow me to achieve something like this?
My current lexer grammar:
lexer grammar test;
@members {
public static final int CHANNEL_COMMENTS = 1;
}
WHITESPACE : (' ' | '\t' | '\n' | '\r' | '\f') -> skip;
SINGLE_LINE_COMMENT : '//' (~[\n\r])* ('\n' | '\r' | '\r\n')? -> channel(CHANNEL_COMMENTS);
MULTI_LINE_COMMENT : '/*' .*? '*/' -> channel(CHANNEL_COMMENTS);
KEYWORD : 'keyword' -> pushMode(IN_STRING_KEYWORD);
LBRACE : '{';
RBRACE : '}';
SEMICOLON : ';';
mode IN_STRING_KEYWORD;
STRING_WHITESPACE : WHITESPACE -> skip;
STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
STRING : ((QUOTED_STRING ('+' QUOTED_STRING)*) | UNQUOTED_STRING);
fragment QUOTED_STRING : (SINGLEQUOTED_STRING | DOUBLEQUOTED_STRING);
fragment UNQUOTED_STRING : (~[ \t;{}/*'"\n\r] | '/' ~[/*] | '*' ~['/'])+;
fragment SINGLEQUOTED_STRING : '\'' (~['])* '\'';
fragment DOUBLEQUOTED_STRING :
'"'
(
(~["\\]) |
('\\' [nt"\\])
)*
'"'
;
Am I perhaps trying to do too much inside the lexer and should just feed what I currently have to the parser and let it handle the above mess?
Edit01
Thanks to 280Z28, I decided to fix the above lexer grammar by getting rid of my STRING token and simply settling for QUOTED_STRING, UNQUOTED_STRING and the operator CONCAT. The rest will be handled in the parser. I also added an additional lexer mode in order to distinguish between CONCAT and UNQUOTED_STRING.
lexer grammar test;
@members {
public static final int CHANNEL_COMMENTS = 2;
}
WHITESPACE : (' ' | '\t' | '\n' | '\r' | '\f') -> skip;
SINGLE_LINE_COMMENT : '//' (~[\n\r])* -> channel(CHANNEL_COMMENTS);
MULTI_LINE_COMMENT : '/*' .*? '*/' -> channel(CHANNEL_COMMENTS);
KEYWORD : 'keyword' -> pushMode(IN_STRING_KEYWORD);
LBRACE : '{';
RBRACE : '}';
SEMICOLON : ';';
mode IN_STRING_KEYWORD;
STRING_WHITESPACE : WHITESPACE -> skip;
STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
QUOTED_STRING : (SINGLEQUOTED_STRING | DOUBLEQUOTED_STRING) -> mode(IN_QUOTED_STRING);
UNQUOTED_STRING : (~[ \t;{}/*'"\n\r] | '/' ~[/*] | '*' ~[/])+;
fragment SINGLEQUOTED_STRING : '\'' (~['])* '\'';
fragment DOUBLEQUOTED_STRING :
'"'
(
(~["\\]) |
('\\' [nt"\\])
)*
'"'
;
mode IN_QUOTED_STRING;
QUOTED_STRING_WHITESPACE : WHITESPACE -> skip;
QUOTED_STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
QUOTED_STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
QUOTED_STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
QUOTED_STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
QUOTED_STRING2 : QUOTED_STRING -> type(QUOTED_STRING);
CONCAT : '+';

Don't perform string concatenation in the lexer. Send the + operator to the parser as an operator. This will make it much easier to eliminate the whitespace and/or comments appearing between strings and the operator.
CONCAT : '+';
STRING : QUOTED_STRING | UNQUOTED_STRING;
You should be aware that ANTLR 4 changed the predefined HIDDEN channel from 99 to 1, so HIDDEN and CHANNEL_COMMENTS are the same in your grammar.
Don't include the line terminator at the end of the SINGLE_LINE_COMMENT rule.
SINGLE_LINE_COMMENT
: '//' (~[\n\r])*
-> channel(CHANNEL_COMMENTS)
;
Your UNQUOTED_STRING token currently contains the set ['/']. If you meant to exclude ' characters, the second ' in the set is redundant so you can use ['/]. If you only meant to exclude /, then you can use either the syntax [/] or '/'.
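To illustrate why concatenating in the parser is easier, here is a rough Python sketch of a parser-side rule of the shape KEYWORD QUOTED_STRING (CONCAT QUOTED_STRING)* SEMICOLON. Everything in it is an assumption for illustration: the token tuples are written by hand rather than produced by ANTLR, `parse_statement` is a hypothetical helper, and quotes are stripped naively without handling escapes. Comments sit on their own channel, so the parser never has to mention them.

```python
DEFAULT_CHANNEL = 0
CHANNEL_COMMENTS = 2  # same value as in the edited grammar of the question

# Hand-written tokens for the first statement of the example input:
# (type, text, channel)
TOKENS = [
    ("KEYWORD", "keyword", DEFAULT_CHANNEL),
    ("QUOTED_STRING", '"my:little:uri"', DEFAULT_CHANNEL),
    ("CONCAT", "+", DEFAULT_CHANNEL),
    ("MULTI_LINE_COMMENT", "/* my comment here */", CHANNEL_COMMENTS),
    ("QUOTED_STRING", "':it:is'", DEFAULT_CHANNEL),
    ("SINGLE_LINE_COMMENT", "// nasty comment", CHANNEL_COMMENTS),
    ("CONCAT", "+", DEFAULT_CHANNEL),
    ("QUOTED_STRING", '":mehmeh"', DEFAULT_CHANNEL),
    ("SEMICOLON", ";", DEFAULT_CHANNEL),
]


def parse_statement(tokens):
    """Parse one statement and return (keyword, concatenated argument)."""
    # Only default-channel tokens are visible, as in ANTLR's token stream.
    visible = [(kind, text) for kind, text, channel in tokens
               if channel == DEFAULT_CHANNEL]
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos >= len(visible) or visible[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        text = visible[pos][1]
        pos += 1
        return text

    keyword = expect("KEYWORD")
    parts = [expect("QUOTED_STRING")[1:-1]]  # drop the surrounding quotes
    while pos < len(visible) and visible[pos][0] == "CONCAT":
        expect("CONCAT")
        parts.append(expect("QUOTED_STRING")[1:-1])
    expect("SEMICOLON")
    return keyword, "".join(parts)


print(parse_statement(TOKENS))
```

The comments between the strings and the + operator simply drop out, and no extra lexer mode is needed to glue the pieces together.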
