ANTLR4 token image concatenation with comments in the mix - java
I'm trying to write an ANTLR4 lexer for some language. I've got a working one, but I'm not entirely satisfied with it.
keyword "my:little:uri" + /* my comment here */ ':it:is'
// nasty comment
+ ":mehmeh"; // single line comment
keyword + {}
This is an example of statements in the language. It's simply a bunch of keywords followed by string arguments and terminated by a semicolon or a block of sub-statements. Strings may be unquoted, single-quoted or double-quoted. The quoted strings may be concatenated as in the example above. An unquoted string containing a plus sign (+) is valid.
What I find problematic are the comments. I'd like to recognize whatever follows a keyword as a single string token, sans the comments (and whitespace). I'd usually use the more lexer command, but I don't think it's applicable to the example above. Is there a pattern that would allow me to achieve something like this?
My current lexer grammar:
lexer grammar test;
@members {
public static final int CHANNEL_COMMENTS = 1;
}
WHITESPACE : (' ' | '\t' | '\n' | '\r' | '\f') -> skip;
SINGLE_LINE_COMMENT : '//' (~[\n\r])* ('\n' | '\r' | '\r\n')? -> channel(CHANNEL_COMMENTS);
MULTI_LINE_COMMENT : '/*' .*? '*/' -> channel(CHANNEL_COMMENTS);
KEYWORD : 'keyword' -> pushMode(IN_STRING_KEYWORD);
LBRACE : '{';
RBRACE : '}';
SEMICOLON : ';';
mode IN_STRING_KEYWORD;
STRING_WHITESPACE : WHITESPACE -> skip;
STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
STRING : ((QUOTED_STRING ('+' QUOTED_STRING)*) | UNQUOTED_STRING);
fragment QUOTED_STRING : (SINGLEQUOTED_STRING | DOUBLEQUOTED_STRING);
fragment UNQUOTED_STRING : (~[ \t;{}/*'"\n\r] | '/' ~[/*] | '*' ~['/'])+;
fragment SINGLEQUOTED_STRING : '\'' (~['])* '\'';
fragment DOUBLEQUOTED_STRING :
'"'
(
(~["\\]) |
('\\' [nt"\\])
)*
'"'
;
Am I perhaps trying to do too much inside the lexer and should just feed what I currently have to the parser and let it handle the above mess?
Edit01
Thanks to 280Z28, I decided to fix the above lexer grammar by getting rid of my STRING token and simply settling for QUOTED_STRING, UNQUOTED_STRING and the operator CONCAT. The rest will be handled in the parser. I also added an additional lexer mode in order to distinguish between CONCAT and UNQUOTED_STRING.
lexer grammar test;
@members {
public static final int CHANNEL_COMMENTS = 2;
}
WHITESPACE : (' ' | '\t' | '\n' | '\r' | '\f') -> skip;
SINGLE_LINE_COMMENT : '//' (~[\n\r])* -> channel(CHANNEL_COMMENTS);
MULTI_LINE_COMMENT : '/*' .*? '*/' -> channel(CHANNEL_COMMENTS);
KEYWORD : 'keyword' -> pushMode(IN_STRING_KEYWORD);
LBRACE : '{';
RBRACE : '}';
SEMICOLON : ';';
mode IN_STRING_KEYWORD;
STRING_WHITESPACE : WHITESPACE -> skip;
STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
QUOTED_STRING : (SINGLEQUOTED_STRING | DOUBLEQUOTED_STRING) -> mode(IN_QUOTED_STRING);
UNQUOTED_STRING : (~[ \t;{}/*'"\n\r] | '/' ~[/*] | '*' ~[/])+;
fragment SINGLEQUOTED_STRING : '\'' (~['])* '\'';
fragment DOUBLEQUOTED_STRING :
'"'
(
(~["\\]) |
('\\' [nt"\\])
)*
'"'
;
mode IN_QUOTED_STRING;
QUOTED_STRING_WHITESPACE : WHITESPACE -> skip;
QUOTED_STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
QUOTED_STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
QUOTED_STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
QUOTED_STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
QUOTED_STRING2 : QUOTED_STRING -> type(QUOTED_STRING);
CONCAT : '+';
Don't perform string concatenation in the lexer. Send the + operator to the parser as an operator. This will make it much easier to eliminate the whitespace and/or comments appearing between strings and the operator.
CONCAT : '+';
STRING : QUOTED_STRING | UNQUOTED_STRING;
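Once the lexer emits the strings and CONCAT as separate tokens, joining the pieces becomes a job for the parser side (for example a listener or visitor). Here is a minimal Java sketch of that step, assuming the operand images have already been collected in order and the + tokens were consumed by the parser rule. StringConcatSketch, unquote and concat are hypothetical names for illustration, not part of the ANTLR runtime, and escape sequences are deliberately left untouched.

```java
import java.util.List;

// Hypothetical helper, not ANTLR API: joins the images of the string
// tokens that a parser rule matched around CONCAT operators.
final class StringConcatSketch {

    // Strips matching surrounding quotes from a quoted token image.
    // Unquoted images are returned unchanged; escapes are not decoded.
    static String unquote(String image) {
        if (image.length() >= 2) {
            char first = image.charAt(0);
            char last = image.charAt(image.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                return image.substring(1, image.length() - 1);
            }
        }
        return image;
    }

    // Concatenates the operand images in order. Comments and whitespace
    // never reach this point because they are off the default channel.
    static String concat(List<String> operandImages) {
        StringBuilder sb = new StringBuilder();
        for (String image : operandImages) {
            sb.append(unquote(image));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> operands =
            List.of("\"my:little:uri\"", "':it:is'", "\":mehmeh\"");
        System.out.println(concat(operands));
    }
}
```

Because comments sit on their own channel, the parser only ever sees the three string tokens and the two operators from the example statement, so no special handling for comments is needed here.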
You should be aware that ANTLR 4 changed the predefined HIDDEN channel from 99 to 1, so HIDDEN and CHANNEL_COMMENTS are the same in your grammar.
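To see why the collision matters, here is a small Java sketch that models tokens as plain (text, channel) pairs. It assumes whitespace is routed to HIDDEN (channel 1) rather than skipped; the Tok class and the helper methods are hypothetical stand-ins, not the ANTLR runtime classes.

```java
import java.util.ArrayList;
import java.util.List;

final class ChannelSketch {
    static final int HIDDEN = 1; // value of the predefined hidden channel in ANTLR 4

    // Minimal stand-in for a token: just its text and channel number.
    static final class Tok {
        final String text;
        final int channel;
        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // Builds a tiny token list in which whitespace goes to HIDDEN and the
    // comment goes to the given comments channel.
    static List<Tok> sample(int commentsChannel) {
        List<Tok> toks = new ArrayList<>();
        toks.add(new Tok("keyword", 0));
        toks.add(new Tok(" ", HIDDEN));
        toks.add(new Tok("/* my comment here */", commentsChannel));
        toks.add(new Tok(";", 0));
        return toks;
    }

    // Returns the texts of all tokens on one channel.
    static List<String> textsOn(List<Tok> toks, int channel) {
        List<String> out = new ArrayList<>();
        for (Tok t : toks) {
            if (t.channel == channel) {
                out.add(t.text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // CHANNEL_COMMENTS = 1: whitespace and comments are mixed together.
        System.out.println(textsOn(sample(1), 1).size());
        // CHANNEL_COMMENTS = 2: only the comment is on that channel.
        System.out.println(textsOn(sample(2), 2).size());
    }
}
```

Picking 2 or higher for a custom channel keeps the comment tokens separable from anything sent to HIDDEN.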
Don't include the line terminator at the end of the SINGLE_LINE_COMMENT rule.
SINGLE_LINE_COMMENT
: '//' (~[\n\r])*
-> channel(CHANNEL_COMMENTS)
;
Your UNQUOTED_STRING token currently contains the set ['/']. If you meant to exclude ' characters, the second ' in the set is redundant so you can use ['/]. If you only meant to exclude /, then you can use either the syntax [/] or '/'.
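The point about the redundant quote can be checked by analogy with java.util.regex character classes, which also treat a repeated character in a set as a no-op. This is only an analogy: Java regex syntax is not ANTLR set syntax, and CharSetSketch and sameSet are hypothetical names.

```java
import java.util.regex.Pattern;

final class CharSetSketch {

    // Returns true when both character classes accept exactly the same
    // printable ASCII characters.
    static boolean sameSet(String classA, String classB) {
        Pattern a = Pattern.compile(classA);
        Pattern b = Pattern.compile(classB);
        for (char c = 32; c < 127; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The repeated quote changes nothing.
        System.out.println(sameSet("[^'/']", "[^'/]"));
        // Dropping the quote entirely does change the set.
        System.out.println(sameSet("[^'/]", "[^/]"));
    }
}
```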
Related
Why can my ANTLR4 grammar not parse this text?
I want to be able to parse the following text using ANTLR4:
six-buffers() {
    evil-window-split();
    evil-window-vsplit();
    evil-window-vsplit();
    evil-window-down(1);
    evil-window-vsplit();
    evil-window-vsplit();
};
six-buffers();
First I define a function, then I call it. To do so, I defined the following grammar:
grammar Deplorable;
script: statement*;
statement: (methodCall | functionDeclaration) ';' (WHITESPACE|NEW_LINE);
// General stuff
deplorableString: '"' DEPLORABLE_STRING* '"';
deplorableInteger: DEPLORABLE_NUMBER;
// Method call definition
methodCall: methodName LPAREN (methodArgument COMMA?)* RPAREN;
methodName: DEPLORABLE_IDENTIFIER;
methodArgument: (deplorableString | deplorableInteger);
// Function Declaration
functionStatement: methodCall ';' (WHITESPACE|NEW_LINE);
functionDeclaration: methodName LPAREN RPAREN functionBody;
functionBody: CURLY_BRACE_LEFT functionStatement* CURLY_BRACE_RIGHT;
// Lexer stuff
LPAREN: '(';
RPAREN: ')';
DEPLORABLE_IDENTIFIER: (LOWERCASE_LATIN_LETTER | UPPERCASE_LATIN_LETTER | UNDERSCORE | DASH)+;
DEPLORABLE_STRING: (LOWERCASE_LATIN_LETTER | UPPERCASE_LATIN_LETTER | UNDERSCORE | WHITESPACE | EXCLAMATION_POINT)+;
CURLY_BRACE_LEFT: '{';
CURLY_BRACE_RIGHT: '}';
NEW_LINE: ('\r\n'|'\n'|'\r');
DEPLORABLE_NUMBER: DIGIT+;
fragment COMMA: ',';
fragment DASH: '-';
fragment LOWERCASE_LATIN_LETTER: 'a'..'z';
fragment UPPERCASE_LATIN_LETTER: 'A'..'Z';
fragment UNDERSCORE: '_';
fragment WHITESPACE: ' ';
fragment EXCLAMATION_POINT: '!';
fragment DIGIT: '0'..'9';
I compile this grammar using mvn clean antlr4:antlr4 install (with disabled tests). Here is my pom.xml file.
However, when I try to parse the above text in a test, I am getting the error
line 1:13 no viable alternative at input 'six-buffers() '
I tried to add void in front of a function declaration so that the parser can distinguish between function declarations and function calls, but this did not help.
How can I fix this error, i.e. make sure that the parser correctly recognizes a function declaration and does not mistake it for a function call?
Update 1: This version of the grammar (inspired by Mike Cargal) seems to work for now:
grammar Deplorable;
script: statement*;
statement: (methodCall | functionDeclaration) ';';
// General stuff
// Method call definition
methodCall: methodName LPAREN (methodArgument COMMA?)* RPAREN;
methodName: DEPLORABLE_IDENTIFIER;
methodArgument: (DEPLORABLE_STRING | DEPLORABLE_NUMBER);
// Function Declaration
functionStatement: methodCall ';';
functionDeclaration: methodName LPAREN RPAREN functionBody;
functionBody: CURLY_BRACE_LEFT functionStatement* CURLY_BRACE_RIGHT;
// Lexer stuff
LPAREN: '(';
RPAREN: ')';
DEPLORABLE_IDENTIFIER: ( LOWERCASE_LATIN_LETTER | UPPERCASE_LATIN_LETTER | UNDERSCORE | DASH )+;
DEPLORABLE_STRING: '"' ( LOWERCASE_LATIN_LETTER | UPPERCASE_LATIN_LETTER | UNDERSCORE | WHITESPACE | EXCLAMATION_POINT )+ '"';
CURLY_BRACE_LEFT: '{';
CURLY_BRACE_RIGHT: '}';
NEW_LINE: ( '\r' '\n'? | '\n' ) -> skip;
DEPLORABLE_NUMBER: DIGIT+;
fragment COMMA: ',';
fragment DASH: '-';
fragment LOWERCASE_LATIN_LETTER: 'a'..'z';
fragment UPPERCASE_LATIN_LETTER: 'A'..'Z';
fragment UNDERSCORE: '_';
WHITESPACE: [ \t]+ -> skip;
fragment EXCLAMATION_POINT: '!';
fragment DIGIT: '0'..'9';
@sepp2k is pointing you in the right direction. Your lexer rules (particularly DEPLORABLE_STRING) are causing your pain.
More specifically, this looks like the misconception a lot of people have early in using ANTLR: that a parser rule can have anything to do with tokenization. In the ANTLR pipeline, your input is first tokenized into a stream of tokens using the lexer rules, so dumping out your stream of tokens is frequently very helpful. In your case, the stream looks like this:
[@0,0:10='six-buffers',<DEPLORABLE_IDENTIFIER>,1:0]
[@1,11:11='(',<'('>,1:11]
[@2,12:12=')',<')'>,1:12]
[@3,13:13=' ',<DEPLORABLE_STRING>,1:13]
[@4,14:14='{',<'{'>,1:14]
[@5,15:15='\n',<NEW_LINE>,1:15]
[@6,16:23='    evil',<DEPLORABLE_STRING>,2:0]
[@7,24:36='-window-split',<DEPLORABLE_IDENTIFIER>,2:8]
[@8,37:37='(',<'('>,2:21]
[@9,38:38=')',<')'>,2:22]
[@10,39:39=';',<';'>,2:23]
[@11,40:40='\n',<NEW_LINE>,2:24]
[@12,41:48='    evil',<DEPLORABLE_STRING>,3:0]
[@13,49:62='-window-vsplit',<DEPLORABLE_IDENTIFIER>,3:8]
[@14,63:63='(',<'('>,3:22]
[@15,64:64=')',<')'>,3:23]
[@16,65:65=';',<';'>,3:24]
[@17,66:66='\n',<NEW_LINE>,3:25]
[@18,67:74='    evil',<DEPLORABLE_STRING>,4:0]
[@19,75:88='-window-vsplit',<DEPLORABLE_IDENTIFIER>,4:8]
[@20,89:89='(',<'('>,4:22]
[@21,90:90=')',<')'>,4:23]
[@22,91:91=';',<';'>,4:24]
[@23,92:92='\n',<NEW_LINE>,4:25]
[@24,93:100='    evil',<DEPLORABLE_STRING>,5:0]
[@25,101:112='-window-down',<DEPLORABLE_IDENTIFIER>,5:8]
[@26,113:113='(',<'('>,5:20]
[@27,114:114='1',<DEPLORABLE_NUMBER>,5:21]
[@28,115:115=')',<')'>,5:22]
[@29,116:116=';',<';'>,5:23]
[@30,117:117='\n',<NEW_LINE>,5:24]
[@31,118:125='    evil',<DEPLORABLE_STRING>,6:0]
[@32,126:139='-window-vsplit',<DEPLORABLE_IDENTIFIER>,6:8]
[@33,140:140='(',<'('>,6:22]
[@34,141:141=')',<')'>,6:23]
[@35,142:142=';',<';'>,6:24]
[@36,143:143='\n',<NEW_LINE>,6:25]
[@37,144:151='    evil',<DEPLORABLE_STRING>,7:0]
[@38,152:165='-window-vsplit',<DEPLORABLE_IDENTIFIER>,7:8]
[@39,166:166='(',<'('>,7:22]
[@40,167:167=')',<')'>,7:23]
[@41,168:168=';',<';'>,7:24]
[@42,169:169='\n',<NEW_LINE>,7:25]
[@43,170:170='}',<'}'>,8:0]
[@44,171:171=';',<';'>,8:1]
[@45,172:172='\n',<NEW_LINE>,8:2]
[@46,173:183='six-buffers',<DEPLORABLE_IDENTIFIER>,9:0]
[@47,184:184='(',<'('>,9:11]
[@48,185:185=')',<')'>,9:12]
[@49,186:186=';',<';'>,9:13]
[@50,187:186='<EOF>',<EOF>,9:14]
You'll notice that token @3 (13:13), a single ' ', is being tokenized as a DEPLORABLE_STRING.
You'll need to incorporate the quotation marks into your DEPLORABLE_STRING rule. I also suggest you skip WHITESPACE, and probably NEW_LINE as well (most grammars treat NEW_LINEs as WHITESPACE).
Something like this should get you "unstuck":
grammar Deplorable;
script: statement*;
statement: (methodCall | functionDeclaration) ';' ( WHITESPACE | NEW_LINE );
// General stuff
deplorableString: '"' DEPLORABLE_STRING* '"';
deplorableInteger: DEPLORABLE_NUMBER;
// Method call definition
methodCall: methodName LPAREN (methodArgument COMMA?)* RPAREN;
methodName: DEPLORABLE_IDENTIFIER;
methodArgument: (DEPLORABLE_STRING | DEPLORABLE_NUMBER);
// Function Declaration
functionStatement: methodCall ';' (WHITESPACE | NEW_LINE);
functionDeclaration: methodName LPAREN RPAREN functionBody;
functionBody: CURLY_BRACE_LEFT functionStatement* CURLY_BRACE_RIGHT;
// Lexer stuff
LPAREN: '(';
RPAREN: ')';
DEPLORABLE_IDENTIFIER: ( LOWERCASE_LATIN_LETTER | UPPERCASE_LATIN_LETTER | UNDERSCORE | DASH )+;
DEPLORABLE_STRING: '"' ( LOWERCASE_LATIN_LETTER | UPPERCASE_LATIN_LETTER | UNDERSCORE | WHITESPACE | EXCLAMATION_POINT )+ '"';
CURLY_BRACE_LEFT: '{';
CURLY_BRACE_RIGHT: '}';
NEW_LINE: ('\r\n' | '\n' | '\r');
DEPLORABLE_NUMBER: DIGIT+;
fragment COMMA: ',';
fragment DASH: '-';
fragment LOWERCASE_LATIN_LETTER: 'a' ..'z';
fragment UPPERCASE_LATIN_LETTER: 'A' ..'Z';
fragment UNDERSCORE: '_';
fragment WHITESPACE: ' ' -> skip;
fragment EXCLAMATION_POINT: '!';
fragment DIGIT: '0' ..'9';
That's still tripping on an extraneous \n (hence my comment re: WS and NL handling). Not sure of your intention, but take a look at how other grammars handle it. It is usually MUCH easier to skip them than to account for everywhere in the parser rules where they might occur.
Most importantly: get your mental model right about the ANTLR process of turning your stream of characters into a stream of tokens (using lexer rules) and then using parser rules to process the stream of tokens. You'll be in for a lot of pain until that's clear to you.
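The lexer-first behaviour described above can be imitated with a small longest-match tokenizer. The Java sketch below approximates the lexer rules of the original question's grammar with regular expressions: the longest match wins, and ties go to the rule listed first. It is a simplified model and not how ANTLR's lexer is implemented; LongestMatchSketch, RULES and tokenize are hypothetical names, and the ';' entry stands in for the implicit token created by the parser rules.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatchSketch {
    // Rules in grammar order; insertion order is the tie-breaker.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("';'", Pattern.compile(";"));
        RULES.put("LPAREN", Pattern.compile("\\("));
        RULES.put("RPAREN", Pattern.compile("\\)"));
        RULES.put("DEPLORABLE_IDENTIFIER", Pattern.compile("[a-zA-Z_-]+"));
        RULES.put("DEPLORABLE_STRING", Pattern.compile("[a-zA-Z_ !]+"));
        RULES.put("CURLY_BRACE_LEFT", Pattern.compile("\\{"));
        RULES.put("CURLY_BRACE_RIGHT", Pattern.compile("\\}"));
        RULES.put("NEW_LINE", Pattern.compile("\\r\\n|\\n|\\r"));
        RULES.put("DEPLORABLE_NUMBER", Pattern.compile("[0-9]+"));
    }

    // Splits the input into "TYPE:'text'" entries without consulting any
    // parser rule, which is the point being illustrated.
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly longer matches replace the current best, so an
                // earlier rule keeps priority on equal lengths.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = rule.getKey();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            out.add(bestName + ":'" + input.substring(pos, pos + bestLen) + "'");
            pos += bestLen;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : tokenize("six-buffers() {")) {
            System.out.println(token);
        }
    }
}
```

Running it on the first line of the input shows the lone space coming out as a DEPLORABLE_STRING, which is the token the parser then chokes on at 1:13.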
ANTLR4 - arguments in nested functions
I have a problem with my ANTLR grammar (or lexer). In my case I need to parse a string with custom text and find functions in it. The format of a function is $foo($bar(3),'strArg'). I found a solution in the post "ANTLR Nested Functions" and improved it a little for my needs. But while testing different cases I found one that breaks the parser: $foo($3,'strArg'). This throws an IncorectSyntax exception. I tried many variants (for example, not skipping $ and including it in the parse tree), but all these attempts were unsuccessful.
Lexer
lexer grammar TLexer;
TEXT : ~[$] ;
FUNCTION_START : '$' -> pushMode(IN_FUNCTION), skip ;
mode IN_FUNCTION;
FUNTION_NESTED : '$' -> pushMode(IN_FUNCTION), skip;
ID : [a-zA-Z_]+;
PAR_OPEN : '(';
PAR_CLOSE : ')' -> popMode;
NUMBER : [0-9]+;
STRING : '\'' ( ~'\'' | '\'\'' )* '\'';
COMMA : ',';
SPACE : [ \t\r\n]-> skip;
Parser
options { tokenVocab=TLexer; }
parse : atom* EOF ;
atom : text | function ;
text : TEXT+ ;
function : ID params ;
params : PAR_OPEN ( param ( COMMA param )* )? PAR_CLOSE ;
param : NUMBER | STRING | function ;
The parser does not fail on $foo($3,'strArg'), because when it encounters the second $ it is already in IN_FUNCTION mode and is expecting a parameter. It skips the character and reads a NUMBER. If you want it to fail, you need to unskip the dollar signs in the lexer:
FUNCTION_START : '$' -> pushMode(IN_FUNCTION);
mode IN_FUNCTION;
FUNTION_START : '$' -> pushMode(IN_FUNCTION);
and modify the function rule:
function : FUNCTION_START ID params;
Antlr - Why it expect FunctionCall but PrintCommand gave
My ANTLR grammar expects a FunctionCall, but in the example code for the compiler built with ANTLR I wrote a print command. Does anyone know why, and how to fix it? The print command is named RetroBox.show(); and it should be recognised by going from blockstatements to blockstatement to statement to localFunctionCall to printCommand.
Here is my ANTLR grammar:
grammar Mars;
// ******************************LEXER BEGIN*****************************************
// Keywords
FUNC: 'func';
ENTRY: 'entry';
VARI: 'vari';
VARF: 'varf';
VARC: 'varc';
VARS: 'vars';
LET: 'let';
INCREMENTS: 'increments';
RETROBOX: 'retrobox';
SHOW: 'show';
// Literals
DECIMAL_LITERAL: ('0' | [1-9] (Digits? | '_'+ Digits)) [lL]?;
FLOAT_LITERAL: (Digits '.' Digits? | '.' Digits) ExponentPart? [fFdD]?
    | Digits (ExponentPart [fFdD]? | [fFdD])
    ;
CHAR_LITERAL: '\'' (~['\\\r\n] | EscapeSequence) '\'';
STRING_LITERAL: '"' (~["\\\r\n] | EscapeSequence)* '"';
// Seperators
ORBRACKET: '(';
CRBRACKET: ')';
OEBRACKET: '{';
CEBRACKET: '}';
SEMI: ';';
POINT: '.';
// Operators
ASSIGN: '=';
// Whitespace and comments
WS: [ \t\r\n\u000C]+ -> channel(HIDDEN);
COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN);
// Identifiers
IDENTIFIER: Letter LetterOrDigit*;
// Fragment rules
fragment ExponentPart
    : [eE] [+-]? Digits
    ;
fragment EscapeSequence
    : '\\' [btnfr"'\\]
    | '\\' ([0-3]? [0-7])? [0-7]
    | '\\' 'u'+ HexDigit HexDigit HexDigit HexDigit
    ;
fragment HexDigits
    : HexDigit ((HexDigit | '_')* HexDigit)?
    ;
fragment HexDigit
    : [0-9a-fA-F]
    ;
fragment Digits
    : [0-9] ([0-9_]* [0-9])?
    ;
fragment LetterOrDigit
    : Letter
    | [0-9]
    ;
fragment Letter
    : [a-zA-Z$_] // these are the "java letters" below 0x7F
    | ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
    | [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
    ;
// *******************************LEXER END****************************************
// *****************************PARSER BEGIN*****************************************
program
    : mainfunction #Programm
    | /*EMPTY*/ #Garnichts
    ;
mainfunction
    : FUNC VARI ENTRY ORBRACKET CRBRACKET block #NormaleHauptmethode
    ;
block
    : '{' blockStatement '}' #CodeBlock
    | /*EMPTY*/ #EmptyCodeBlock
    ;
blockStatement
    : statement* #Befehl
    ;
statement
    : localVariableDeclaration
    | localVariableInitialization
    | localFunctionImplementation
    | localFunctionCall
    ;
expression
    : left=expression op='%'
    | left=expression op=('*' | '/') right=expression
    | left=expression op=('+' | '-') right=expression
    | neg='-' right=expression
    | number
    | IDENTIFIER
    | '(' expression ')'
    ;
number
    : DECIMAL_LITERAL
    | FLOAT_LITERAL
    ;
localFunctionImplementation
    : FUNC primitiveType IDENTIFIER ORBRACKET CRBRACKET block #Methodenimplementierung
    ;
localFunctionCall
    : IDENTIFIER ORBRACKET CRBRACKET SEMI #Methodenaufruf
    | printCommand #RetroBoxShowCommand
    ;
printCommand
    : RETROBOX POINT SHOW ORBRACKET params=primitiveLiteral CRBRACKET SEMI #PrintCommandWP
    ;
localVariableDeclaration
    : varTypeDek=primitiveType IDENTIFIER SEMI #Variablendeklaration
    ;
localVariableInitialization
    : varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=primitiveLiteral SEMI #VariableninitKonst
    | varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=expression SEMI #VariableninitExpr
    ;
primitiveLiteral
    : DECIMAL_LITERAL
    | FLOAT_LITERAL
    | STRING_LITERAL
    | CHAR_LITERAL
    ;
primitiveType
    : VARI
    | VARC
    | VARF
    | VARS
    ;
// ******************************PARSER END****************************************
Here is my example code:
func vari entry()
{
    RetroBox.show("Hallo"); //Should be recognised as print-command
}
And here is an AST printed by ANTLR: AST from Compiler
The problem is that your RETROBOX keyword is 'retrobox' but your example code has it typed as 'RetroBox'. ANTLR lexes 'RetroBox' as an identifier, so the following '.' is unexpected. ANTLR should emit an error: "line 3:12 mismatched input '.' expecting '('". Then it attempts to recover and continue parsing. It tries single token deletion (just ignoring the '.') and finds that this works, except that the rule it now matches is #Methodenaufruf instead of #RetroBoxShowCommand.
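The case mismatch can be illustrated with a tiny Java sketch of a case-sensitive keyword lookup. This is a simplified model that assumes the word has already been split off from the input; KeywordCaseSketch and classify are hypothetical names, and only three of the grammar's keywords are listed.

```java
import java.util.Map;

final class KeywordCaseSketch {
    // Keyword literals exactly as spelled in the grammar (all lowercase).
    static final Map<String, String> KEYWORDS = Map.of(
        "func", "FUNC",
        "retrobox", "RETROBOX",
        "show", "SHOW");

    // Exact, case-sensitive comparison: anything that is not spelled
    // precisely like a keyword falls back to IDENTIFIER.
    static String classify(String word) {
        return KEYWORDS.getOrDefault(word, "IDENTIFIER");
    }

    public static void main(String[] args) {
        System.out.println(classify("RetroBox"));
        System.out.println(classify("retrobox"));
    }
}
```

Under that model, either writing retrobox.show("Hallo"); in the source or changing the RETROBOX literal to 'RetroBox' would make the keyword token appear.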
ANTLR4: Whitespace and Space lexical handling
In my (simplified) grammar
grammar test;
prog: stat+;
stat: sourceDef ';' ;
sourceDef: SRC COLON ID ;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
LINE_COMMENT : '//' (~('\n'|'\r'))* -> skip;
WS : [ \t\n\r]+ -> skip;
//SP : ' ' -> skip;
COMMENT : '/*' .*? '*/' -> skip;
LE: '<';
MINUS: '-';
GR: '>';
COLON: ':' ;
HASH: '#';
EQ: '=';
SEMI: ';';
COMMA: ',';
AND: [Aa][Nn][Dd];
SRC: [Ss][Rr][Cc];
NUMBER: [0-9];
ID: [a-zA-Z][a-zA-z0-9]+;
DAY: ('0'[1-9]|[12][0-9]|'3'[01]);
MONTH: ('0' [1-9]|'1'[012]);
YEAR: [0-2] [890] NUMBER NUMBER;
DATE: DAY [- /.] MONTH [- /.] YEAR;
the code
src : xxx;
shows a syntax error:
extraneous input ' ' expecting ':'
The code
src:xxx;
resolves fine. The modified version with
WS : [\t\n\r]+ -> skip;
SP : ' ' -> skip;
works fine with both syntax versions (with and without spaces). So the spaces seem to be skipped only if they are defined in a separate rule.
Is something wrong with this definition?
WS : [ \t\n\r]+ -> skip;
Or what else could cause this (to me) unexpected behavior?
I assume that you have already found the solution, but for the record, your whitespace lexer rule should be:
WS : (' '|'\r'|'\n'|'\t') -> channel(HIDDEN);
In your grammar the space character is simply not specified; that is all.
lexer that takes "not" but not "not like"
I need a small trick to get my parser completely working. I use ANTLR to parse boolean queries. A query is composed of elements, linked together by ands, ors and nots. So I can have something like:
"(P or not Q or R) or (( not A and B) or C)"
The thing is, an element can be long, and is generally of the form:
a an_operator b
for example:
"New-York matches NY"
The catch: one of the an_operator values is "not like". So I would like to modify my lexer so that the not checks that there is no like after it, to avoid parsing elements containing "not like" operators.
My current grammar is here:
// save it in a file called Logic.g
grammar Logic;
options {
  output=AST;
}
// parser/production rules start with a lower case letter
parse
  : expression EOF! // omit the EOF token
  ;
expression
  : orexp
  ;
orexp
  : andexp ('or'^ andexp)* // make `or` the root
  ;
andexp
  : notexp ('and'^ notexp)* // make `and` the root
  ;
notexp
  : 'not'^ atom // make `not` the root
  | atom
  ;
atom
  : ID
  | '('! expression ')'! // omit both `(` and `)`
  ;
// lexer/terminal rules start with an upper case letter
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
Any help would be appreciated. Thanks!
Here's a possible solution:
grammar Logic;
options {
  output=AST;
}
tokens {
  NOT_LIKE;
}
parse
  : expression EOF!
  ;
expression
  : orexp
  ;
orexp
  : andexp (Or^ andexp)*
  ;
andexp
  : fuzzyexp (And^ fuzzyexp)*
  ;
fuzzyexp
  : (notexp -> notexp) ( Matches e=notexp -> ^(Matches $fuzzyexp $e)
                       | Not Like e=notexp -> ^(NOT_LIKE $fuzzyexp $e)
                       | Like e=notexp -> ^(Like $fuzzyexp $e)
                       )?
  ;
notexp
  : Not^ atom
  | atom
  ;
atom
  : ID
  | '('! expression ')'!
  ;
And : 'and';
Or : 'or';
Not : 'not';
Like : 'like';
Matches : 'matches';
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
which will parse the input "A not like B or C like D and (E or not F) and G matches H" into the following AST: