Here is the beginning of my lexer rules:
F_TEXT_START
: {! matchingFText}? 'f"' {matchingFText = true;}
;
F_TEXT_PH_ESCAPE
: {matchingFText && ! matchingFTextPh}? '{=/'
;
F_TEXT_PH_START
: {matchingFText && ! matchingFTextPh}? '{=' {matchingFTextPh = true;}
;
F_TEXT_PH_END
: {matchingFText && matchingFTextPh}? '}' {matchingFTextPh = false;}
;
F_TEXT_CHAR
: {matchingFText && ! matchingFTextPh}? (~('"' | '{')+ | '""' | '{' ~'=')
;
F_TEXT_END
: {matchingFText && ! matchingFTextPh}? '"' {matchingFText = false;}
;
IF
: {! matchingFText || matchingFTextPh}? 'if'
;
ELIF
: {! matchingFText || matchingFTextPh}? 'elif'
;
// Lots of other keywords
fragment LETTER
: ('A' .. 'Z' | 'a' .. 'z' | '_')
;
VARIABLE
: {! matchingFText || matchingFTextPh}? LETTER (LETTER | DIGIT)*
;
What I am doing is treating formatted text not as a single normal text token with an f in front, but adding its parts to my parse tree, so that I can tell whether there are errors while parsing (with just parser.start()). A formatted text starts with f", ends with ", any " inside it must be doubled to "", and it can contain placeholders that start with {= and end with }; if you want to actually write {=, you have to replace it with {=/.
The problem is that inside normal formatted text content (not a placeholder), the lexer started to match not only F_TEXT_CHAR but other lexer rules too, like variables. What I did seems pretty dumb: I just put semantic predicates on every other rule to prevent them from being matched inside formatted text content (but still allow them in a placeholder).
Isn't there a better way?
I'd use a lexical mode for this. To use lexical modes, you'll have to define separate lexer and parser grammars. Here's a quick demo:
lexer grammar TestLexer;
F_TEXT_START
: 'f"' -> pushMode(F_TEXT)
;
VARIABLE
: LETTER (LETTER | DIGIT)*
;
F_TEXT_PH_ESCAPE
: '{=/'
;
F_TEXT_PH_END
: '}' -> popMode
;
SPACES
: [ \t\r\n]+ -> skip
;
fragment LETTER
: [a-zA-Z_]
;
fragment DIGIT
: [0-9]
;
mode F_TEXT;
F_TEXT_CHAR
: ~["{]+ | '""' | '{' ~'='
;
F_TEXT_PH_START
: '{=' -> pushMode(DEFAULT_MODE)
;
F_TEXT_END
: '"' -> popMode
;
Use the lexer in your parser like this:
parser grammar TestParser;
options {
tokenVocab=TestLexer;
}
// ...
If you now tokenise the input f"mu {=mu}" mu, you'd get the following tokens:
F_TEXT_START `f"`
F_TEXT_CHAR `mu `
F_TEXT_PH_START `{=`
VARIABLE `mu`
F_TEXT_PH_END `}`
F_TEXT_END `"`
VARIABLE `mu`
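For reference, here is a small Java driver that produces a token dump like the one above; it is a minimal sketch that only uses the standard ANTLR 4 runtime (the class name TokenDump is my own, TestLexer is the lexer generated from the grammar above):
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

public class TokenDump {
    public static void main(String[] args) {
        TestLexer lexer = new TestLexer(CharStreams.fromString("f\"mu {=mu}\" mu"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill(); // force the lexer to tokenise the whole input
        for (Token t : tokens.getTokens()) {
            System.out.printf("%-16s `%s`%n",
                    TestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }
}
(The loop also prints the final EOF token, which the list above omits.)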
I have a problem with my ANTLR grammar (or lexer). In my case I need to parse a string with custom text and find functions in it, in the format $foo($bar(3),'strArg'). I found a solution in this post ANTLR Nested Functions and improved it a little bit for my needs. But while testing different cases I found one that breaks the parser: $foo($3,'strArg'). This throws an IncorectSyntax exception. I tried many variants (for example not skipping the $ and including it in the parse tree), but all these attempts were unsuccessful.
Lexer
lexer grammar TLexer;
TEXT
: ~[$]
;
FUNCTION_START
: '$' -> pushMode(IN_FUNCTION), skip
;
mode IN_FUNCTION;
FUNTION_NESTED : '$' -> pushMode(IN_FUNCTION), skip;
ID : [a-zA-Z_]+;
PAR_OPEN : '(';
PAR_CLOSE : ')' -> popMode;
NUMBER : [0-9]+;
STRING : '\'' ( ~'\'' | '\'\'' )* '\'';
COMMA : ',';
SPACE : [ \t\r\n] -> skip;
Parser
options {
tokenVocab=TLexer;
}
parse
: atom* EOF
;
atom
: text
| function
;
text
: TEXT+
;
function
: ID params
;
params
: PAR_OPEN ( param ( COMMA param )* )? PAR_CLOSE
;
param
: NUMBER
| STRING
| function
;
The parser does not fail on $foo($3,'strArg'), because when it encounters the second $ it is already in IN_FUNCTION mode and it is expecting a parameter. It skips the character and reads a NUMBER.
If you want it to fail you need to unskip the dollar signs in the Lexer:
FUNCTION_START : '$' -> pushMode(IN_FUNCTION);
mode IN_FUNCTION;
// the nested '$' gets the same token type, so nested calls still match the function rule
FUNCTION_NESTED : '$' -> pushMode(IN_FUNCTION), type(FUNCTION_START);
and modify the function rule:
function : FUNCTION_START ID params;
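To verify the fix, you can count syntax errors after a parse: a nested call should still be accepted, while $foo($3,'strArg') should now be rejected. A minimal sketch, assuming the parser grammar is named TParser (the question does not show its header) and using a hypothetical helper parses():
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

public class FunctionTest {
    // Hypothetical helper: true if the input parses without syntax errors.
    static boolean parses(String input) {
        TLexer lexer = new TLexer(CharStreams.fromString(input));
        TParser parser = new TParser(new CommonTokenStream(lexer));
        parser.parse(); // entry rule from the parser grammar
        return parser.getNumberOfSyntaxErrors() == 0;
    }

    public static void main(String[] args) {
        System.out.println(parses("$foo($bar(3),'strArg')")); // true: nested call still works
        System.out.println(parses("$foo($3,'strArg')"));      // false: '$' is no longer skipped
    }
}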
In my (simplified) grammar
grammar test;
prog: stat+;
stat:
sourceDef ';'
;
sourceDef:
SRC COLON ID
;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
LINE_COMMENT
: '//' (~('\n'|'\r'))* -> skip;
WS : [ \t\n\r]+ -> skip;
//SP : ' ' -> skip;
COMMENT : '/*' .*? '*/' -> skip;
LE: '<';
MINUS: '-';
GR: '>';
COLON: ':' ;
HASH: '#';
EQ: '=';
SEMI: ';';
COMMA: ',';
AND: [Aa][Nn][Dd];
SRC: [Ss][Rr][Cc];
NUMBER: [0-9];
ID: [a-zA-Z][a-zA-Z0-9]+;
DAY: ('0'[1-9]|[12][0-9]|'3'[01]);
MONTH: ('0' [1-9]|'1'[012]);
YEAR: [0-2] [890] NUMBER NUMBER;
DATE: DAY [- /.] MONTH [- /.] YEAR;
the code
src : xxx;
shows a syntax error:
extraneous input ' ' expecting ':'
The code
src:xxx;
resolves fine.
The modified version with
WS : [\t\n\r]+ -> skip;
SP : ' ' -> skip;
works fine with both syntax versions (with and without spaces).
So the spaces only seem to be skipped if they are defined in a separate rule.
Is something wrong with this
WS : [ \t\n\r]+ -> skip;
definition?
Or what else could cause this (to me) unexpected behavior?
I assume that you have already found the solution, but for the record:
Your whitespace lexer rule should be:
WS : (' '|'\r'|'\n'|'\t') -> channel(HIDDEN);
In your grammar the space character simply is not specified, that is all.
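As a side note, both -> skip and -> channel(HIDDEN) make the parser ignore the whitespace; the difference is that hidden tokens stay in the token stream and can be recovered later (useful for pretty-printing or error messages). A minimal sketch, assuming the combined grammar above is called test, so ANTLR generates a lexer class named testLexer:
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

public class WsCheck {
    public static void main(String[] args) {
        testLexer lexer = new testLexer(CharStreams.fromString("src : xxx;"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        // With `WS ... -> skip` this prints null (the space never reaches the
        // token stream); with `-> channel(HIDDEN)` it prints the space token
        // that follows `src`, which the parser still ignores.
        System.out.println(tokens.getHiddenTokensToRight(0));
    }
}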
I am writing a lexer and parser for my own language to process operations on lists. I started with this:
list_Declaration : L_LIST L_ID ASSIGN LBRACE NUMBER (COMA NUMBER)* RBRACE SEMI;
NUMBER : [0-9]+;
L_BOOLEAN_LITERAL
: 'true'
| 'false'
;
L_ID : [a-z]+;
L_IF : 'if';
L_ELSE : 'else';
L_THEN : 'then';
L_FOREACH : 'foreach';
L_VAR : 'var';
L_IN : 'in';
L_LIST : 'list';
L_NUMBER : 'number';
L_RETURN : 'return';
ASSIGN : '=';
LPAREN : '(';
RPAREN : ')';
LBRACE : '{';
RBRACE : '}';
COMA : ',';
SEMI : ';';
WS: [ \t\n\r]+ ->skip;
And when I try to parse the example text:
list a = {2,3};
It says:
line 1:0 token recognition error at: ''
line 1:1 missing 'list' at 'list'
line 1:6 extraneous input 'a' expecting '='
What am I doing wrong?
I usually use this way of defining lexer rules, and it always works:
fragment I : [iI];
fragment L : [lL];
fragment S : [sS];
fragment T : [tT];
L_List : L I S T;
Your 'list' has been matched as L_ID, hence the extraneous input 'a' expecting '='.
Put your L_ID rule below almost all of the other lexer rules.
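A quick way to confirm this diagnosis is to print the token type the lexer assigns to list. A minimal sketch, with ListLexer standing in as a hypothetical name for your generated lexer class:
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.Token;

public class KeywordCheck {
    public static void main(String[] args) {
        // ListLexer is a hypothetical name for the generated lexer class.
        ListLexer lexer = new ListLexer(CharStreams.fromString("list a = {2,3};"));
        Token first = lexer.nextToken();
        // Prints L_ID with L_ID defined above the keyword rules;
        // prints L_LIST once L_ID is moved below them.
        System.out.println(ListLexer.VOCABULARY.getSymbolicName(first.getType()));
    }
}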
How do I skip sql single line comments in antlr4 grammar?
This is the input which I have given:
--
-- $INPUT.sql$
--
CREATE TABLE table_one ( customer_number integer, address character varying(30));
create table table_two ( id integer, city character varying(50));
Like this:
SINGLE_LINE_COMMENT
: '--' ~[\r\n]* -> skip
;
If I parse your example input with the following grammar:
grammar Hello;
parse
: .*? EOF
;
INTEGER
: [0-9]+
;
IDENTIFIER
: [a-zA-Z_]+
;
SINGLE_LINE_COMMENT
: '--' ~[\r\n]* -> skip
;
SPACES
: [ \t\r\n]+ -> skip
;
OTHER
: .
;
and let ANTLRWorks2 print the tokens, I see the following:
[#0,23:28='CREATE',<2>,6:0]
[#1,30:34='TABLE',<2>,6:7]
[#2,36:44='table_one',<2>,6:13]
[#3,46:46='(',<5>,6:23]
[#4,48:62='customer_number',<2>,6:25]
[#5,64:70='integer',<2>,6:41]
[#6,71:71=',',<5>,6:48]
[#7,73:79='address',<2>,6:50]
[#8,81:89='character',<2>,6:58]
[#9,91:97='varying',<2>,6:68]
[#10,98:98='(',<5>,6:75]
[#11,99:100='30',<1>,6:76]
[#12,101:101=')',<5>,6:78]
[#13,102:102=')',<5>,6:79]
[#14,103:103=';',<5>,6:80]
[#15,106:111='create',<2>,8:0]
[#16,113:117='table',<2>,8:7]
[#17,119:127='table_two',<2>,8:13]
[#18,129:129='(',<5>,8:23]
[#19,131:132='id',<2>,8:25]
[#20,134:140='integer',<2>,8:28]
[#21,141:141=',',<5>,8:35]
[#22,143:146='city',<2>,8:37]
[#23,148:156='character',<2>,8:42]
[#24,158:164='varying',<2>,8:52]
[#25,165:165='(',<5>,8:59]
[#26,166:167='50',<1>,8:60]
[#27,168:168=')',<5>,8:62]
[#28,169:169=')',<5>,8:63]
[#29,170:170=';',<5>,8:64]
[#30,172:171='<EOF>',<-1>,9:0]
I.e. the line comments are discarded properly. If that does not happen in your case, something else is going wrong.
Try EOF at the end of the rule as one of the options, not just \r\n, like:
LINE_COMMENT
: '//' ~[\r\n]* (EOF|'\r'? '\n') -> channel(HIDDEN)
;