Antlr4: Prevent rule and token conflicts


Given the following grammar:
grammar minimal;
rule: '(' rule_name body ')';
rule_name : NAME;
body : '(' AT NAME ')';
AT : 'at';
NAME: LETTER ANY_CHAR*;
fragment LETTER: 'a' .. 'z' | 'A' .. 'Z';
fragment ANY_CHAR: LETTER | '0' .. '9' | '-' | '_';
WHITESPACE: ( ' ' | '\t' | '\r' | '\n' )+ -> skip;
How can I match (at (at bar)), with at as a valid function name, without conflicting with the AT token used in body and without rearranging the grammar?

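To remove the conflict, let the parser rule accept both token types:
rule_name : NAME | AT ;
The lexer still emits at as an AT token, because AT is defined before NAME and both match the same text. rule_name simply accepts that token as a name. Note that -> type(NAME) is a lexer command, so it cannot be attached to a parser rule such as rule_name.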
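The tie-break behind this conflict can be sketched in plain Java. This is not ANTLR code and not the generated lexer: the class RuleNameSketch and its methods are hypothetical names, and the sketch assumes only the rule that a keyword defined before NAME wins when both match the same text. It shows that a rule_name check accepting either NAME or AT lets (at (at bar)) through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the generated lexer and for rule_name.
// It only mimics the tie-break between AT and NAME.
final class RuleNameSketch {
    // Group 1: a parenthesis. Group 2: a word shaped like NAME.
    private static final Pattern TOKEN =
        Pattern.compile("\\s*(?:([()])|([A-Za-z][A-Za-z0-9_-]*))");

    // Returns the token types for the input, skipping whitespace.
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.lookingAt()) {
            if (m.group(1) != null) {
                types.add(m.group(1).equals("(") ? "LPAREN" : "RPAREN");
            } else {
                // "at" matches both AT and NAME with the same length;
                // AT is listed first in the grammar, so AT wins.
                types.add(m.group(2).equals("at") ? "AT" : "NAME");
            }
            int end = m.end();
            m.region(end, input.length()); // continue after this token
        }
        return types;
    }

    // Mirrors a parser rule of the form  rule_name : NAME | AT ;
    static boolean isRuleName(String tokenType) {
        return tokenType.equals("NAME") || tokenType.equals("AT");
    }

    public static void main(String[] args) {
        List<String> types = tokenTypes("(at (at bar))");
        System.out.println(types);
        System.out.println("second token usable as rule_name: "
            + isRuleName(types.get(1)));
    }
}
```

For the input (at (at bar)) the second token comes out as AT, not NAME, which is why the name rule has to accept both types.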

Related

My grammar identifies keywords as identifiers

Almost every word is recognized as an identifier, and the lexer never gets to the more complex rules. For example, 'program' is recognized as a Condition, and 'integer a,b;' is not recognized as a Decl_list; only the 'integer' part is recognized, as a Decl.
Do you have any idea why?
I'm using this code for testing:
program test1
declare
integer a, b, c;
integer result;
begin
read (a);
read (c);
b := 10;
result := (a * c)/(b + 5) ;
write(result);
end
lexer grammar MiniLexer;
Program: 'program' Identifier Body;
Body: ('declare' Decl_list) 'begin' Stmt_list 'end';
Decl_list: Decl ';' (Decl ';')?;
Decl: Type Ident_list;
fragment
Ident_list: (Identifier ','?)*;
Type: 'integer' | 'decimal';
Stmt_list: Stmt ';' ((Stmt ';')*)?;
Stmt: Assign_stmt | If_stmt | While_stmt| Read_stmt | Write_stmt;
Assign_stmt: Identifier ':=' Simple_expr;
If_stmt: 'if' Condition 'then' Stmt_list 'end' | 'if' Condition 'then' Stmt_list 'else' Stmt_list 'end';
Condition: Expression;
For_stmt: 'for' Assign_stmt 'to' Condition 'do' Stmt_list 'end';
While_stmt: 'while' Condition 'do' Stmt_list 'end';
Read_stmt: 'read' '(' Identifier ')';
Write_stmt: 'write' '(' Writable ')';
Writable: Simple_expr | Literal;
Expression: Simple_expr | Simple_expr Relop Simple_expr;
Simple_expr: Term | Term Addop Term| '(' Term ')' ? Term ':' Term;
Term: Factor_a | Factor_a Mulop Factor_a;
Factor_a: Factor | 'not' Factor | '-' Factor;
Factor: Identifier | Constant | '(' Expression ')';
Relop: '=' | '>' | '>=' | '<' | '<=' | '<>';
Addop: '+' | '-' | 'or';
Mulop: '*' | '/' | 'mod' | 'and';
Shiftop: '<<' | '>>' | '<<<' | '>>>';
COMENTARIO: '%' ~('\n'|'\r')* '\r'? '\n' {skip();};
WS : ( ' '| '\t'| '\r'| '\n') {skip();};
Constant: ('0'..'9') (('0'..'9'))*;
Literal: '"' ('\u0000'..'\uFFFE')* '"';
Identifier: ('a'..'z'|'A'..'Z') (('a'..'z'|'A'..'Z') | ('0'..'9'))*;

Your grammar is a lexer grammar, meaning it produces only tokens. Learn the difference between lexer, parser and combined grammars here: https://github.com/antlr/antlr4/blob/master/doc/grammars.md
In short, remove the word lexer from your grammar and change some rules into parser rules (these start with a lower case letter):
grammar Mini;
program: 'program' Identifier body EOF;
body: ('declare' decl_list) 'begin' stmt_list 'end';
decl_list: decl ';' (decl ';')?;
decl: type ident_list;
ident_list: (Identifier ','?)*;
type: 'integer' | 'decimal';
stmt_list: stmt ';' (stmt ';')*;
stmt: assign_stmt | if_stmt | while_stmt| read_stmt | write_stmt | for_stmt;
assign_stmt: Identifier ':=' simple_expr;
if_stmt: 'if' condition 'then' stmt_list 'end' | 'if' condition 'then' stmt_list 'else' stmt_list 'end';
condition: expression;
for_stmt: 'for' assign_stmt 'to' condition 'do' stmt_list 'end';
while_stmt: 'while' condition 'do' stmt_list 'end';
read_stmt: 'read' '(' Identifier ')';
write_stmt: 'write' '(' writable ')';
writable: simple_expr | Literal;
expression: simple_expr | simple_expr Relop simple_expr;
simple_expr: term | term Addop term| '(' term ')' ? term ':' term;
term: factor_a | factor_a Mulop factor_a;
factor_a: factor | 'not' factor | '-' factor;
factor: Identifier | Constant | '(' expression ')';
Relop: '=' | '>' | '>=' | '<' | '<=' | '<>';
Addop: '+' | '-' | 'or';
Mulop: '*' | '/' | 'mod' | 'and';
Shiftop: '<<' | '>>' | '<<<' | '>>>';
COMENTARIO: '%' ~('\n'|'\r')* '\r'? '\n' -> skip;
Constant: ('0'..'9') (('0'..'9'))*;
Literal: '"' ('\u0000'..'\uFFFE')* '"';
Identifier: ('a'..'z'|'A'..'Z') (('a'..'z'|'A'..'Z') | ('0'..'9'))*;
Space: [ \t\r\n] -> skip;
Note that {skip();} is old v3 syntax; use -> skip instead.
Constant: ('0'..'9') (('0'..'9'))*; is also old v3 syntax (although still valid in v4). The preferred way to write it is:
Constant: [0-9] (([0-9]))*;
which can simply be written as:
Constant: [0-9]+;

Antlr - Why it expect FunctionCall but PrintCommand gave

My ANTLR grammar expects a function call, but in the example code for the compiler built with ANTLR I wrote a print command. Does anyone know why, and how to fix it? The print command is RetroBox.show(); it should be recognised from blockStatement to statement to localFunctionCall to printCommand.
Here is my ANTLR grammar:
grammar Mars;
// ******************************LEXER BEGIN*****************************************
// Keywords
FUNC: 'func';
ENTRY: 'entry';
VARI: 'vari';
VARF: 'varf';
VARC: 'varc';
VARS: 'vars';
LET: 'let';
INCREMENTS: 'increments';
RETROBOX: 'retrobox';
SHOW: 'show';
// Literals
DECIMAL_LITERAL: ('0' | [1-9] (Digits? | '_'+ Digits)) [lL]?;
FLOAT_LITERAL: (Digits '.' Digits? | '.' Digits) ExponentPart? [fFdD]?
| Digits (ExponentPart [fFdD]? | [fFdD])
;
CHAR_LITERAL: '\'' (~['\\\r\n] | EscapeSequence) '\'';
STRING_LITERAL: '"' (~["\\\r\n] | EscapeSequence)* '"';
// Separators
ORBRACKET: '(';
CRBRACKET: ')';
OEBRACKET: '{';
CEBRACKET: '}';
SEMI: ';';
POINT: '.';
// Operators
ASSIGN: '=';
// Whitespace and comments
WS: [ \t\r\n\u000C]+ -> channel(HIDDEN);
COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN);
// Identifiers
IDENTIFIER: Letter LetterOrDigit*;
// Fragment rules
fragment ExponentPart
: [eE] [+-]? Digits
;
fragment EscapeSequence
: '\\' [btnfr"'\\]
| '\\' ([0-3]? [0-7])? [0-7]
| '\\' 'u'+ HexDigit HexDigit HexDigit HexDigit
;
fragment HexDigits
: HexDigit ((HexDigit | '_')* HexDigit)?
;
fragment HexDigit
: [0-9a-fA-F]
;
fragment Digits
: [0-9] ([0-9_]* [0-9])?
;
fragment LetterOrDigit
: Letter
| [0-9]
;
fragment Letter
: [a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
// *******************************LEXER END****************************************
// *****************************PARSER BEGIN*****************************************
program
: mainfunction #Programm
| /*EMPTY*/ #Garnichts
;
mainfunction
: FUNC VARI ENTRY ORBRACKET CRBRACKET block #NormaleHauptmethode
;
block
: '{' blockStatement '}' #CodeBlock
| /*EMPTY*/ #EmptyCodeBlock
;
blockStatement
: statement* #Befehl
;
statement
: localVariableDeclaration
| localVariableInitialization
| localFunctionImplementation
| localFunctionCall
;
expression
: left=expression op='%'
| left=expression op=('*' | '/') right=expression
| left=expression op=('+' | '-') right=expression
| neg='-' right=expression
| number
| IDENTIFIER
| '(' expression ')'
;
number
: DECIMAL_LITERAL
| FLOAT_LITERAL
;
localFunctionImplementation
: FUNC primitiveType IDENTIFIER ORBRACKET CRBRACKET block #Methodenimplementierung
;
localFunctionCall
: IDENTIFIER ORBRACKET CRBRACKET SEMI #Methodenaufruf
| printCommand #RetroBoxShowCommand
;
printCommand
: RETROBOX POINT SHOW ORBRACKET params=primitiveLiteral CRBRACKET SEMI #PrintCommandWP
;
localVariableDeclaration
: varTypeDek=primitiveType IDENTIFIER SEMI #Variablendeklaration
;
localVariableInitialization
: varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=primitiveLiteral SEMI #VariableninitKonst
| varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=expression SEMI #VariableninitExpr
;
primitiveLiteral
: DECIMAL_LITERAL
| FLOAT_LITERAL
| STRING_LITERAL
| CHAR_LITERAL
;
primitiveType
: VARI
| VARC
| VARF
| VARS
;
// ******************************PARSER END****************************************
Here is my example code:
func vari entry()
{
RetroBox.show("Hallo"); //Should be recognised as print-command
}
And here is the AST printed by ANTLR:
[Image: AST from compiler]

The problem is that your RETROBOX keyword is 'retrobox', but your example code spells it 'RetroBox'. Lexer literals are case-sensitive, so the lexer emits 'RetroBox' as an IDENTIFIER, and the following '.' is unexpected.
ANTLR should report an error: "line 3:12 mismatched input '.' expecting '('".
It then attempts to recover and continue parsing. It tries single-token deletion (ignoring the '.') and finds that this works, except that the alternative it now matches is #Methodenaufruf instead of #RetroBoxShowCommand.
To fix it, either write retrobox.show("Hallo"); in the input or change the keyword definition to RETROBOX: 'RetroBox';.
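The case mismatch can be illustrated with a small plain-Java sketch. It is not ANTLR code: KeywordCaseSketch and classify are hypothetical names, and the map only stands in for the two keyword rules RETROBOX and SHOW from the grammar, with IDENTIFIER as the fallback.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the keyword rules in the Mars grammar.
// Keyword literals are compared exactly, including letter case.
final class KeywordCaseSketch {
    private static final Map<String, String> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("retrobox", "RETROBOX"); // RETROBOX: 'retrobox';
        KEYWORDS.put("show", "SHOW");         // SHOW: 'show';
    }

    // Returns the token type a word would get: a keyword type on an
    // exact match, otherwise IDENTIFIER.
    static String classify(String word) {
        return KEYWORDS.getOrDefault(word, "IDENTIFIER");
    }

    public static void main(String[] args) {
        System.out.println("RetroBox -> " + classify("RetroBox"));
        System.out.println("retrobox -> " + classify("retrobox"));
    }
}
```

Here RetroBox falls through to IDENTIFIER while retrobox is classified as RETROBOX, which matches the difference between the #Methodenaufruf and #RetroBoxShowCommand alternatives.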

Antlr 4 TEXT and DIGIT together

I am parsing a SQL-like language and I have problems with strings that start with a number:
SELECT 90userN is parsed as SELECT 90 AS userN
Since I skip whitespace, it somehow takes the digits as the name and the rest of the string as the alias.
I don't even know where to start.
Grammar:
result_column : '*'
| table_name '.' '*'
| table_name '.' any_name
| expr
any_name : keyword
| IDENTIFIER
| STRING_LITERAL
| '(' any_name ')'
;
expr: literal_value;
literal_value :
NUMERIC_LITERAL
| STRING_LITERAL
| DATE_LITERAL
| IDENTIFIER
| NULL
;
IDENTIFIER :
'"' (~'"' | '""')* '"'
| '`' (~'`' | '``')* '`'
| '[' ~']'* ']'
| [a-zA-Z_] [a-zA-Z_0-9]*;
STRING_LITERAL : '\'' ( ~'\'' | '\'\'' )* '\'' ;
NUMERIC_LITERAL :
DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )? ;
DATE_LITERAL: DIGIT DIGIT DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT;

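Identifiers in SQL cannot start with a digit, and the last alternative of your IDENTIFIER rule says exactly that: [a-zA-Z_] [a-zA-Z_0-9]*. The lexer therefore splits 90userN into the NUMERIC_LITERAL 90 followed by the IDENTIFIER userN, and the parser reads the second token as an alias.
I think you are already using it, but refer to the SQLite4 grammar example.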
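The split of 90userN into two tokens can be reproduced with a plain-Java sketch. It is not the generated ANTLR lexer: DigitPrefixSketch and tokenize are hypothetical names, and the regular expressions only transcribe the NUMERIC_LITERAL rule and the last alternative of IDENTIFIER from the question. The alternatives start with different characters, so ordered regex alternation gives the same result as longest-match lexing here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical mini-lexer covering only NUMERIC_LITERAL, the unquoted
// form of IDENTIFIER, and whitespace.
final class DigitPrefixSketch {
    private static final Pattern TOKEN = Pattern.compile(
        "(?<NUM>\\d+(?:\\.\\d*)?(?:[eE][-+]?\\d+)?|\\.\\d+(?:[eE][-+]?\\d+)?)"
        + "|(?<ID>[a-zA-Z_][a-zA-Z_0-9]*)"
        + "|(?<WS>\\s+)");

    // Returns the tokens as TYPE(text) strings; whitespace is skipped.
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.lookingAt()) {
            if (m.group("NUM") != null) {
                out.add("NUMERIC_LITERAL(" + m.group("NUM") + ")");
            } else if (m.group("ID") != null) {
                out.add("IDENTIFIER(" + m.group("ID") + ")");
            }
            int end = m.end();
            m.region(end, input.length()); // continue after this token
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("90userN"));
        System.out.println(tokenize("user90"));
    }
}
```

The digits are consumed first as a complete number, and the identifier rule only starts at the first letter. A name like user90 stays a single token because the digit is not in first position.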

ANTLR4 token image concatenation with comments in the mix

I'm trying to write an ANTLR4 lexer for some language. I've got a working one, but I'm not entirely satisfied with it.
keyword "my:little:uri" + /* my comment here */ ':it:is'
// nasty comment
+ ":mehmeh"; // single line comment
keyword + {}
This is an example of statements in the language. It's simply a bunch of keywords followed by string arguments and terminated by a semicolon or a block of sub-statements. Strings may be unquoted, single-quoted or double-quoted. The quoted strings may be concatenated as in the example above. An unquoted string containing a plus sign (+) is valid.
What I find problematic are the comments. I'd like to recognize whatever follows a keyword as a single string token, without the comments and whitespace. I'd usually use the more lexer command, but I don't think it applies to the example above. Is there a pattern that would allow me to achieve something like this?
My current lexer grammar:
lexer grammar test;
@members {
public static final int CHANNEL_COMMENTS = 1;
}
WHITESPACE : (' ' | '\t' | '\n' | '\r' | '\f') -> skip;
SINGLE_LINE_COMMENT : '//' (~[\n\r])* ('\n' | '\r' | '\r\n')? -> channel(CHANNEL_COMMENTS);
MULTI_LINE_COMMENT : '/*' .*? '*/' -> channel(CHANNEL_COMMENTS);
KEYWORD : 'keyword' -> pushMode(IN_STRING_KEYWORD);
LBRACE : '{';
RBRACE : '}';
SEMICOLON : ';';
mode IN_STRING_KEYWORD;
STRING_WHITESPACE : WHITESPACE -> skip;
STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
STRING : ((QUOTED_STRING ('+' QUOTED_STRING)*) | UNQUOTED_STRING);
fragment QUOTED_STRING : (SINGLEQUOTED_STRING | DOUBLEQUOTED_STRING);
fragment UNQUOTED_STRING : (~[ \t;{}/*'"\n\r] | '/' ~[/*] | '*' ~['/'])+;
fragment SINGLEQUOTED_STRING : '\'' (~['])* '\'';
fragment DOUBLEQUOTED_STRING :
'"'
(
(~["\\]) |
('\\' [nt"\\])
)*
'"'
;
Am I perhaps trying to do too much inside the lexer and should just feed what I currently have to the parser and let it handle the above mess?
Edit01
Thanks to 280Z28, I decided to fix the above lexer grammar by getting rid of my STRING token and simply settling for QUOTED_STRING, UNQUOTED_STRING and the operator CONCAT. The rest will be handled in the parser. I also added an additional lexer mode in order to distinguish between CONCAT and UNQUOTED_STRING.
lexer grammar test;
@members {
public static final int CHANNEL_COMMENTS = 2;
}
WHITESPACE : (' ' | '\t' | '\n' | '\r' | '\f') -> skip;
SINGLE_LINE_COMMENT : '//' (~[\n\r])* -> channel(CHANNEL_COMMENTS);
MULTI_LINE_COMMENT : '/*' .*? '*/' -> channel(CHANNEL_COMMENTS);
KEYWORD : 'keyword' -> pushMode(IN_STRING_KEYWORD);
LBRACE : '{';
RBRACE : '}';
SEMICOLON : ';';
mode IN_STRING_KEYWORD;
STRING_WHITESPACE : WHITESPACE -> skip;
STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
QUOTED_STRING : (SINGLEQUOTED_STRING | DOUBLEQUOTED_STRING) -> mode(IN_QUOTED_STRING);
UNQUOTED_STRING : (~[ \t;{}/*'"\n\r] | '/' ~[/*] | '*' ~[/])+;
fragment SINGLEQUOTED_STRING : '\'' (~['])* '\'';
fragment DOUBLEQUOTED_STRING :
'"'
(
(~["\\]) |
('\\' [nt"\\])
)*
'"'
;
mode IN_QUOTED_STRING;
QUOTED_STRING_WHITESPACE : WHITESPACE -> skip;
QUOTED_STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
QUOTED_STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
QUOTED_STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
QUOTED_STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
QUOTED_STRING2 : QUOTED_STRING -> type(QUOTED_STRING);
CONCAT : '+';

Don't perform string concatenation in the lexer. Send the + operator to the parser as an operator. This will make it much easier to eliminate the whitespace and/or comments appearing between strings and the operator.
CONCAT : '+';
STRING : QUOTED_STRING | UNQUOTED_STRING;
You should be aware that ANTLR 4 changed the predefined HIDDEN channel from 99 to 1, so HIDDEN and CHANNEL_COMMENTS are the same in your grammar.
Don't include the line terminator at the end of the SINGLE_LINE_COMMENT rule.
SINGLE_LINE_COMMENT
: '//' (~[\n\r])*
-> channel(CHANNEL_COMMENTS)
;
Your UNQUOTED_STRING token currently contains the set ['/']. If you meant to exclude ' characters, the second ' in the set is redundant so you can use ['/]. If you only meant to exclude /, then you can use either the syntax [/] or '/'.
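Once + reaches the parser as its own token, joining the pieces is straightforward. The plain-Java sketch below shows the idea without the ANTLR runtime: ConcatSketch, Tok and concatArgument are hypothetical names, tokens are modelled as simple objects, and comments are assumed to sit on a non-default channel (2, as in the edited grammar). Unquoting just strips the first and last character and does not process escapes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of what the parser side does with
// QUOTED_STRING (CONCAT QUOTED_STRING)* once comments are off-channel.
final class ConcatSketch {
    static final int DEFAULT_CHANNEL = 0;
    static final int COMMENT_CHANNEL = 2; // assumed, as in the edited grammar

    static final class Tok {
        final String type;
        final String text;
        final int channel;
        Tok(String type, String text, int channel) {
            this.type = type;
            this.text = text;
            this.channel = channel;
        }
    }

    // Joins the quoted strings of one argument, ignoring off-channel tokens.
    static String concatArgument(List<Tok> tokens) {
        StringBuilder value = new StringBuilder();
        boolean expectString = true;
        for (Tok t : tokens) {
            if (t.channel != DEFAULT_CHANNEL) {
                continue; // comments never reach the parser rule
            }
            if (expectString) {
                if (!t.type.equals("QUOTED_STRING")) {
                    throw new IllegalArgumentException("expected a string, got " + t.type);
                }
                value.append(t.text, 1, t.text.length() - 1); // strip quotes
            } else if (!t.type.equals("CONCAT")) {
                throw new IllegalArgumentException("expected '+', got " + t.type);
            }
            expectString = !expectString;
        }
        if (expectString) {
            throw new IllegalArgumentException("argument must end with a string");
        }
        return value.toString();
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("QUOTED_STRING", "\"my:little:uri\"", DEFAULT_CHANNEL));
        tokens.add(new Tok("CONCAT", "+", DEFAULT_CHANNEL));
        tokens.add(new Tok("MULTI_LINE_COMMENT", "/* my comment here */", COMMENT_CHANNEL));
        tokens.add(new Tok("QUOTED_STRING", "':it:is'", DEFAULT_CHANNEL));
        System.out.println(concatArgument(tokens));
    }
}
```

Because the comment token is on another channel, the joining logic never has to know that comments can appear between a string and the operator.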

Remove extra symbol from the repetitive ANTLR rule

Consider the following simple grammar.
grammar test;
options {
language = Java;
output = AST;
}
//imaginary tokens
tokens{
}
parse
: declaration
;
declaration
: forall
;
forall
:'forall' '('rule1')' '[' (( '(' rule2 ')' '|' )* ) ']'
;
rule1
: INT
;
rule2
: ID
;
ID
: ('a'..'z' | 'A'..'Z'|'_')('a'..'z' | 'A'..'Z'|'0'..'9'|'_')*
;
INT
: ('0'..'9')+
;
WHITESPACE
: ('\t' | ' ' | '\r' | '\n' | '\u000C')+ {$channel = HIDDEN;}
;
and here is the input
forall (1) [(first) | (second) | (third) | (fourth) | (fifth) |]
The grammar works fine for the above input but I want to get rid of the extra pipe symbol (2nd last character in the input) from the input.
Any thoughts/ideas?

My antlr syntax is a bit rusty but you should try something like this:
forall
:'forall' '('rule1')' '[' ('(' rule2 ')' ('|' '(' rule2 ')' )* )? ']'
;
That is, instead of (r|)* write (r(|r)*)?. The latter allows zero, one or many rules, with pipes only in between them.
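The (r(|r)*)? shape can be checked quickly with a regular expression in plain Java. This is only an illustration of the pattern, not ANTLR code: SeparatedListSketch and accepts are hypothetical names, and the item is simplified to a parenthesised identifier like rule2 in the grammar.

```java
import java.util.regex.Pattern;

// Hypothetical check for the bracketed list:
//   '[' ( item ( '|' item )* )? ']'   with   item = '(' ID ')'
final class SeparatedListSketch {
    private static final String ITEM =
        "\\(\\s*[A-Za-z_][A-Za-z0-9_]*\\s*\\)";

    private static final Pattern LIST = Pattern.compile(
        "\\[\\s*(?:" + ITEM + "(?:\\s*\\|\\s*" + ITEM + ")*)?\\s*\\]");

    // True if the whole string is a well-formed list without a trailing pipe.
    static boolean accepts(String s) {
        return LIST.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("[(first) | (second) | (third)]"));
        System.out.println(accepts("[(first) | (second) |]"));
        System.out.println(accepts("[]"));
    }
}
```

The separator is only ever consumed together with the item that follows it, so a list ending in a pipe is rejected while the empty list and the single-item list are still accepted.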
