Antlr Extraneous Input

I have a grammar file BoardFile.g4 that has (relevant parts only):
grammar Board;
//Tokens
GADGET : 'squareBumper' | 'circleBumper' | 'triangleBumper' | 'leftFlipper' | 'rightFlipper' | 'absorber' | 'portal' ;
NAME : [A-Za-z_][A-Za-z_0-9]* ;
INT : [0-9]+ ;
FLOAT : '-'?[0-9]+('.'[0-9]+)? ;
COMMENT : '#' ~( '\r' | '\n' )*;
WHITESPACE : [ \t\r\n]+ -> skip ;
KEY : [a-z] | [0-9] | 'shift' | 'ctrl' | 'alt' | 'meta' | 'space' | 'left' | 'right' | 'up' | 'down' | 'minus' | 'equals' | 'backspace' | 'openbracket' | 'closebracket' | 'backslash' | 'semicolon' | 'quote' | 'enter' | 'comma' | 'period' | 'slash' ;
KEYPRESS : 'keyup' | 'keydown' ;
//Rules
file : define+ EOF ;
define : board | ball | gadget | fire | COMMENT | key ;
board : 'board' 'name' '=' name ('gravity' '=' gravity)? ('friction1' '=' friction1)? ('friction2' '=' friction2)? ;
ball : 'ball' 'name' '=' name 'x' '=' xfloat 'y' '=' yfloat 'xVelocity' '=' xvel 'yVelocity' '=' yvel ;
gadget : gadgettype 'name' '=' name 'x' '=' xint 'y' '=' yint ('width' '=' width 'height' '=' height)? ('orientation' '=' orientation)? ('otherBoard' '=' name 'otherPortal' '=' name)? ;
fire : 'fire' 'trigger' '=' trigger 'action' '=' action ;
key : keytype 'key' '=' KEY 'action' '=' name ;
name : NAME ;
gadgettype : GADGET ;
keytype : KEYPRESS ;
gravity : FLOAT ;
friction1 : FLOAT ;
friction2 : FLOAT ;
trigger : NAME ;
action : NAME ;
yfloat : FLOAT ;
xfloat : FLOAT ;
yint : INT ;
xint : INT ;
xvel : FLOAT ;
yvel : FLOAT ;
orientation : INT ;
width : INT ;
height : INT ;
This generates the lexer and parser fine. However, when I use it against the following file, it gives the following error:
line 12:0 extraneous input 'keyup' expecting {<EOF>, KEYPRESS}
File to Parse:
board name=keysBoard gravity=5.0 friction1=0.0 friction2=0.0
# define a ball
ball name=Ball x=0.5 y=0.5 xVelocity=2.5 yVelocity=2.5
# add some flippers
leftFlipper name=FlipL1 x=16 y=2 orientation=0
leftFlipper name=FlipL2 x=16 y=9 orientation=0
# add keys. lots of keys.
keyup key=space action=apple
keydown key=a action=ball
keyup key=backslash action=cat
keydown key=period action=dog
I went through other questions about this error on SO, but none of them helped. I cannot figure out what is going wrong. Why am I getting this error?

The string "keyup" is being tokenized as a NAME token: that is the problem.
You must realize that the lexer operates independently from the parser. If the parser is trying to match a KEYPRESS token, the lexer does not "listen" to it, but just constructs a token following the rules:
match the rule that consumes the most characters
if several rules match the same number of characters, choose the one that is defined first
Taking these rules into account, and the order of your rules:
NAME : [A-Za-z_][A-Za-z_0-9]* ;
INT : [0-9]+ ;
KEY : [a-z] | [0-9] | 'shift' | 'ctrl' | 'alt' | 'meta' | 'space' | 'left' | 'right' | 'up' | 'down' | 'minus' | 'equals' | 'backspace' | 'openbracket' | 'closebracket' | 'backslash' | 'semicolon' | 'quote' | 'enter' | 'comma' | 'period' | 'slash' ;
KEYPRESS : 'keyup' | 'keydown' ;
the lexer creates a NAME token for every input that matches one of the letter-based KEY alternatives or one of the KEYPRESS alternatives, because NAME matches the same characters and is defined first.
And since INT matches one or more digits and is defined before KEY, which also has a single-digit alternative, a single digit becomes an INT. As a result, the lexer will never produce a KEY or KEYPRESS token.
If you move the NAME and INT rules below the KEY and KEYPRESS rules, my guess is that most of the tokens will be constructed as you expect.
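The two tie-breaking rules can be tried out without ANTLR. The following is only a sketch in plain Java: LexerOrderDemo and winningRule are hypothetical names, java.util.regex stands in for the real lexer, and the KEY rule is abbreviated.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a lexer; it imitates only the two tie-breaking rules.
class LexerOrderDemo {

    // Abbreviated KEY rule. The word alternatives come before [a-z] because
    // java.util.regex takes the first alternative that matches.
    static final String KEY = "shift|ctrl|alt|meta|space|[a-z]";
    static final String KEYPRESS = "keyup|keydown";
    static final String NAME = "[A-Za-z_][A-Za-z_0-9]*";
    static final String INT = "[0-9]+";

    // Returns the name of the rule that wins at the start of the input:
    // the longest match wins, and on a tie the rule listed first wins.
    static String winningRule(Map<String, String> rulesInOrder, String input) {
        String bestRule = null;
        int bestLength = 0;
        for (Map.Entry<String, String> rule : rulesInOrder.entrySet()) {
            Matcher m = Pattern.compile(rule.getValue()).matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The strict '>' keeps the earlier rule when the lengths are equal.
            if (m.lookingAt() && m.end() > bestLength) {
                bestRule = rule.getKey();
                bestLength = m.end();
            }
        }
        return bestRule;
    }

    // Rule order as in the question: NAME and INT come first.
    static Map<String, String> originalOrder() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("NAME", NAME);
        rules.put("INT", INT);
        rules.put("KEY", KEY);
        rules.put("KEYPRESS", KEYPRESS);
        return rules;
    }

    // Rule order after moving NAME and INT below KEY and KEYPRESS.
    static Map<String, String> reorderedOrder() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("KEY", KEY);
        rules.put("KEYPRESS", KEYPRESS);
        rules.put("NAME", NAME);
        rules.put("INT", INT);
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(winningRule(originalOrder(), "keyup"));  // NAME
        System.out.println(winningRule(reorderedOrder(), "keyup")); // KEYPRESS
        System.out.println(winningRule(reorderedOrder(), "space")); // KEY
    }
}
```

With the original order, NAME ties with KEYPRESS on "keyup" and wins because it is listed first, which is why the parser never sees a KEYPRESS token.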
EDIT
A possible solution would look like:
KEY : [a-z] | 'shift' | 'ctrl' | 'alt' | 'meta' | 'space' | 'left' | 'right' | 'up' | 'down' | 'minus' | 'equals' | 'backspace' | 'openbracket' | 'closebracket' | 'backslash' | 'semicolon' | 'quote' | 'enter' | 'comma' | 'period' | 'slash' ;
KEYPRESS : 'keyup' | 'keydown' ;
NAME : [A-Za-z_][A-Za-z_0-9]* ;
SINGLE_DIGIT : [0-9] ;
INT : [0-9]+ ;
I.e. I removed the [0-9] alternative from KEY and introduced a SINGLE_DIGIT rule (which is placed before the INT rule!).
Now create some extra parser rules:
integer : INT | SINGLE_DIGIT ;
key : KEY | SINGLE_DIGIT ;
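and change all occurrences of INT inside parser rules to integer (don't call your rule int: it is a reserved word) and change all KEY to key. Note that your grammar already has a parser rule named key, so one of the two needs a different name.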
And you might also want to do something similar to NAME and the [a-z] alternative in KEY (i.e. a single lowercase char would now never be tokenized as a NAME, always as a KEY).

Related

Antlr - Why it expect FunctionCall but PrintCommand gave

My ANTLR grammar expects a function call, but in the example code for the compiler built with ANTLR I wrote a print command. Does anyone know why, and how to fix it? The print command is named RetroBox.show(); It should be recognised by going from blockStatement to statement to localFunctionCall to printCommand.
Here is my ANTLR grammar:
grammar Mars;
// ******************************LEXER BEGIN*****************************************
// Keywords
FUNC: 'func';
ENTRY: 'entry';
VARI: 'vari';
VARF: 'varf';
VARC: 'varc';
VARS: 'vars';
LET: 'let';
INCREMENTS: 'increments';
RETROBOX: 'retrobox';
SHOW: 'show';
// Literals
DECIMAL_LITERAL: ('0' | [1-9] (Digits? | '_'+ Digits)) [lL]?;
FLOAT_LITERAL: (Digits '.' Digits? | '.' Digits) ExponentPart? [fFdD]?
| Digits (ExponentPart [fFdD]? | [fFdD])
;
CHAR_LITERAL: '\'' (~['\\\r\n] | EscapeSequence) '\'';
STRING_LITERAL: '"' (~["\\\r\n] | EscapeSequence)* '"';
// Seperators
ORBRACKET: '(';
CRBRACKET: ')';
OEBRACKET: '{';
CEBRACKET: '}';
SEMI: ';';
POINT: '.';
// Operators
ASSIGN: '=';
// Whitespace and comments
WS: [ \t\r\n\u000C]+ -> channel(HIDDEN);
COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN);
// Identifiers
IDENTIFIER: Letter LetterOrDigit*;
// Fragment rules
fragment ExponentPart
: [eE] [+-]? Digits
;
fragment EscapeSequence
: '\\' [btnfr"'\\]
| '\\' ([0-3]? [0-7])? [0-7]
| '\\' 'u'+ HexDigit HexDigit HexDigit HexDigit
;
fragment HexDigits
: HexDigit ((HexDigit | '_')* HexDigit)?
;
fragment HexDigit
: [0-9a-fA-F]
;
fragment Digits
: [0-9] ([0-9_]* [0-9])?
;
fragment LetterOrDigit
: Letter
| [0-9]
;
fragment Letter
: [a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
// *******************************LEXER END****************************************
// *****************************PARSER BEGIN*****************************************
program
: mainfunction #Programm
| /*EMPTY*/ #Garnichts
;
mainfunction
: FUNC VARI ENTRY ORBRACKET CRBRACKET block #NormaleHauptmethode
;
block
: '{' blockStatement '}' #CodeBlock
| /*EMPTY*/ #EmptyCodeBlock
;
blockStatement
: statement* #Befehl
;
statement
: localVariableDeclaration
| localVariableInitialization
| localFunctionImplementation
| localFunctionCall
;
expression
: left=expression op='%'
| left=expression op=('*' | '/') right=expression
| left=expression op=('+' | '-') right=expression
| neg='-' right=expression
| number
| IDENTIFIER
| '(' expression ')'
;
number
: DECIMAL_LITERAL
| FLOAT_LITERAL
;
localFunctionImplementation
: FUNC primitiveType IDENTIFIER ORBRACKET CRBRACKET block #Methodenimplementierung
;
localFunctionCall
: IDENTIFIER ORBRACKET CRBRACKET SEMI #Methodenaufruf
| printCommand #RetroBoxShowCommand
;
printCommand
: RETROBOX POINT SHOW ORBRACKET params=primitiveLiteral CRBRACKET SEMI #PrintCommandWP
;
localVariableDeclaration
: varTypeDek=primitiveType IDENTIFIER SEMI #Variablendeklaration
;
localVariableInitialization
: varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=primitiveLiteral SEMI #VariableninitKonst
| varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=expression SEMI #VariableninitExpr
;
primitiveLiteral
: DECIMAL_LITERAL
| FLOAT_LITERAL
| STRING_LITERAL
| CHAR_LITERAL
;
primitiveType
: VARI
| VARC
| VARF
| VARS
;
// ******************************PARSER END****************************************
Here is my example code:
func vari entry()
{
RetroBox.show("Hallo"); //Should be recognised as print-command
}
And here is the AST printed by ANTLR:
[Image: AST from Compiler]

The problem is that your RETROBOX keyword is 'retrobox' but your example code has it typed as 'RetroBox'. Antlr parses 'RetroBox' as an identifier so the following '.' is unexpected.
Antlr should emit an error: "line 3:12 mismatched input '.' expecting '('".
Then it attempts to recover and continue parsing. It tries single token deletion (just ignoring the '.') and finds that that works... except the rule it now matches is #Methodenaufruf instead of #RetroBoxShowCommand.
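The case sensitivity can be illustrated with a few lines of plain Java. This is only a sketch: KeywordCaseDemo and tokenType are hypothetical names, only three of the grammar's keywords are listed, and the input is assumed to be an identifier-shaped word.

```java
import java.util.Map;

// Hypothetical helper that mimics how a word is classified by the lexer rules above.
class KeywordCaseDemo {

    // Keyword spellings taken from the grammar; the lookup is exact, so case matters.
    private static final Map<String, String> KEYWORDS = Map.of(
            "retrobox", "RETROBOX",
            "show", "SHOW",
            "func", "FUNC");

    // Any word that is not spelled exactly like a keyword falls through to IDENTIFIER.
    static String tokenType(String word) {
        return KEYWORDS.getOrDefault(word, "IDENTIFIER");
    }

    public static void main(String[] args) {
        System.out.println(tokenType("retrobox")); // RETROBOX
        System.out.println(tokenType("RetroBox")); // IDENTIFIER
    }
}
```

So either the example code has to use the lowercase spelling, or the keyword rule has to be written with the spelling used in the code.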

Antlr3 report java.lang.OutOfMemoryError when parse expression

I am trying to match the string "match 'match content'" and, at the same time, extract the match content that lies within the single quotes. But the following exception is thrown:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.antlr.runtime.Lexer.emit(Lexer.java:160)
at org.antlr.runtime.Lexer.nextToken(Lexer.java:91)
at org.antlr.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:133)
at org.antlr.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:127)
at org.antlr.runtime.CommonTokenStream.consume(CommonTokenStream.java:70)
at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:106)
I don't know why the OOM exception is thrown, and I cannot find the error in my .g file.
My .g file:
grammar Contains;
options {
language=Java;
output=AST;
ASTLabelType=CommonTree;
backtrack=false;
k=3;
}
match
:
KW_MATCH SINGLE_QUOTE ( ~(SINGLE_QUOTE|'\\') | ('\\' .) )+ SINGLE_QUOTE
;
regexp
:
KW_REGEXP SINGLE_QUOTE RegexComponent+ SINGLE_QUOTE
;
range
:
KW_RANGE SINGLE_QUOTE left=(LPAREN | LSQUARE) start=Number COMMA end = Number right=(RPAREN | RSQUARE) SINGLE_QUOTE
;
DOT : '.'; // generated as a part of Number rule
COLON : ':' ;
COMMA : ',' ;
LPAREN : '(' ;
RPAREN : ')' ;
LSQUARE : '[' ;
RSQUARE : ']' ;
LCURLY : '{';
RCURLY : '}';
PLUS : '+';
MINUS : '-';
STAR : '*';
BITWISEOR : '|';
BITWISEXOR : '^';
QUESTION : '?';
DOLLAR : '$';
KW_RANGE : 'RANGE';
KW_REGEXP : 'REGEXP';
KW_MATCH : 'MATCH';
DOUBLE_QUOTE : '\"';
SINGLE_QUOTE : '\'';
fragment
Digit
:
'0'..'9'
;
fragment
Exponent
:
('e' | 'E') ( PLUS|MINUS )? (Digit)+
;
fragment
RegexComponent
: 'a'..'z' | 'A'..'Z' | '0'..'9' | '_'
| PLUS | STAR | QUESTION | MINUS | DOT
| LPAREN | RPAREN | LSQUARE | RSQUARE | LCURLY | RCURLY
| BITWISEXOR | BITWISEOR | DOLLAR | '\u0080'..'\u00FF' | '\u0400'..'\u04FF'
| '\u0600'..'\u06FF' | '\u0900'..'\u09FF' | '\u4E00'..'\u9FFF' | '\u0A00'..'\u0A7F'
;
Number
:
(Digit)+ ( DOT (Digit)* (Exponent)? | Exponent)?
;
WS : (' '|'\r'|'\t'|'\n'|'\u000C')* {$channel=HIDDEN;}
;

You could start by changing:
WS : (' '|'\r'|'\t'|'\n'|'\u000C')* {$channel=HIDDEN;}
;
to:
WS : (' '|'\r'|'\t'|'\n'|'\u000C')+ {$channel=HIDDEN;}
;
Your version matches an empty string, so the lexer can keep producing tokens without consuming any input, which might result in an infinite number of tokens (and eventually an OutOfMemoryError).
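The effect of a rule that can match nothing is easy to reproduce outside ANTLR. The sketch below is plain Java with hypothetical names (EmptyTokenDemo, countTokens); it models a lexer that has only the whitespace rule and stops after a fixed number of tokens so that the demonstration terminates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical miniature lexer with a single whitespace rule.
class EmptyTokenDemo {

    // Counts the tokens emitted before the input is used up, no rule matches,
    // or the safety limit is reached.
    static int countTokens(String wsRegex, String input, int limit) {
        Pattern ws = Pattern.compile(wsRegex);
        int pos = 0;
        int emitted = 0;
        while (pos < input.length() && emitted < limit) {
            Matcher m = ws.matcher(input);
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                break; // nothing matches here; a real lexer would report an error
            }
            pos = m.end(); // an empty match leaves pos where it was
            emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // 'x' stands for any character that no rule consumes.
        System.out.println(countTokens("[ \\t\\r\\n\\f]*", "  x", 1000)); // 1000 (hits the limit)
        System.out.println(countTokens("[ \\t\\r\\n\\f]+", "  x", 1000)); // 1
    }
}
```

With the * version the position never moves past 'x', so only the artificial limit ends the loop; without such a limit the token list would keep growing until memory runs out.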

Antlr 4 TEXT and DIGIT together

I am parsing a SQL-like language and I have problems with strings that start with a number:
SELECT 90userN is parsed to SELECT 90 AS userN
Since I skip whitespace, it somehow takes the digits as the name and the letters as the alias.
I don't even know where to start.
Grammar:
result_column : '*'
| table_name '.' '*'
| table_name '.' any_name
| expr
;
any_name : keyword
| IDENTIFIER
| STRING_LITERAL
| '(' any_name ')'
;
expr: literal_value;
literal_value :
NUMERIC_LITERAL
| STRING_LITERAL
| DATE_LITERAL
| IDENTIFIER
| NULL
;
IDENTIFIER :
'"' (~'"' | '""')* '"'
| '`' (~'`' | '``')* '`'
| '[' ~']'* ']'
| [a-zA-Z_] [a-zA-Z_0-9]*;
STRING_LITERAL : '\'' ( ~'\'' | '\'\'' )* '\'' ;
NUMERIC_LITERAL :
DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )? ;
DATE_LITERAL: DIGIT DIGIT DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT;

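Unquoted identifiers in SQL cannot start with a digit, and the last alternative of your IDENTIFIER rule says so explicitly: [a-zA-Z_] [a-zA-Z_0-9]*; As a result, 90userN is split into two tokens: the NUMERIC_LITERAL 90 followed by the IDENTIFIER userN.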
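How the lexer cuts up such input can be imitated with a small regex-based tokenizer. This is a sketch in plain Java, not ANTLR: DigitPrefixDemo and tokenize are hypothetical names, only simplified versions of NUMERIC_LITERAL and the unquoted IDENTIFIER alternative are modelled, and keywords are not distinguished from identifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer covering only numbers, unquoted identifiers and whitespace.
class DigitPrefixDemo {

    private static final Pattern TOKEN = Pattern.compile(
            "(?<num>[0-9]+(?:\\.[0-9]*)?)|(?<id>[a-zA-Z_][a-zA-Z_0-9]*)|(?<ws>\\s+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        // Each token must start exactly where the previous one ended.
        while (pos < input.length() && m.find(pos) && m.start() == pos) {
            if (m.group("num") != null) {
                tokens.add("NUMERIC_LITERAL(" + m.group("num") + ")");
            } else if (m.group("id") != null) {
                tokens.add("IDENTIFIER(" + m.group("id") + ")");
            } // whitespace is skipped
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("90userN")); // [NUMERIC_LITERAL(90), IDENTIFIER(userN)]
        System.out.println(tokenize("userN90")); // [IDENTIFIER(userN90)]
    }
}
```

A name that only ends in digits stays a single identifier, whereas leading digits are split off as a number. If such names have to be supported, they need to be quoted or the IDENTIFIER rule has to be changed.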
I think you are already using it, but refer to the SQLite4 grammar example

write a grammar rule name in unicode [ANTLR 4]

I am still a beginner in ANTLR 4 and I was wondering if there is a way to write a grammar rule name in unicode. For example, the following rule is fine:
atomExp returns [double value]
: n=Number {$value = Double.parseDouble($n.text);}
| '(' exp=additionExp ')' {$value = $exp.value;}
;
However, let's say I want to write the same rule but instead of writing its name as "atomExp" , I want to write the name as an Arabic word "تعبير"
تعبير returns [double value]
: n=Number {$value = Double.parseDouble($n.text);}
| '(' exp=additionExp ')' {$value = $exp.value;}
;
but when I try to write it that way I get "no viable alternative" error. Can someone solve my problem please. Thanks in advance

When looking at the lexer grammar for ANTLR4, you can see that lexer and parser names support certain Unicode chars:
/** Allow unicode rule/token names */
ID : NameStartChar NameChar*;
fragment
NameChar
: NameStartChar
| '0'..'9'
| '_'
| '\u00B7'
| '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
fragment
NameStartChar
: 'A'..'Z'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
; // ignores | ['\u10000-'\uEFFFF] ;
INT : [0-9]+
;
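Going by these ranges alone, though, every letter of تعبير (U+0628 to U+064A) falls inside '\u037F'..'\u1FFF', so the name should satisfy the ID rule. If the error persists, the cause may lie elsewhere, for example in the encoding used to read the grammar file (an untested guess).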
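One way to test a candidate name against the ranges quoted above is a small stand-alone check. This is a sketch in plain Java: RuleNameCheck and matchesId are hypothetical names, the ranges are copied from the quoted fragments, and the Arabic name is written with \u escapes so that the result does not depend on how the source file is encoded.

```java
// Hypothetical checker that mirrors the quoted ID, NameStartChar and NameChar rules.
class RuleNameCheck {

    // Inclusive ranges copied from the NameStartChar fragment.
    private static final int[][] NAME_START_CHAR = {
            {'A', 'Z'}, {'a', 'z'},
            {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x02FF},
            {0x0370, 0x037D}, {0x037F, 0x1FFF}, {0x200C, 0x200D},
            {0x2070, 0x218F}, {0x2C00, 0x2FEF}, {0x3001, 0xD7FF},
            {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}};

    // Characters that NameChar allows in addition to NameStartChar.
    private static final int[][] EXTRA_NAME_CHAR = {
            {'0', '9'}, {'_', '_'}, {0x00B7, 0x00B7},
            {0x0300, 0x036F}, {0x203F, 0x2040}};

    private static boolean inRanges(char c, int[][] ranges) {
        for (int[] range : ranges) {
            if (c >= range[0] && c <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // True if the name has the shape NameStartChar NameChar*.
    static boolean matchesId(String name) {
        if (name.isEmpty() || !inRanges(name.charAt(0), NAME_START_CHAR)) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            char c = name.charAt(i);
            if (!inRanges(c, NAME_START_CHAR) && !inRanges(c, EXTRA_NAME_CHAR)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesId("atomExp"));
        System.out.println(matchesId("9lives"));
        // The Arabic rule name from the question, written as escapes.
        System.out.println(matchesId("\u062A\u0639\u0628\u064A\u0631"));
    }
}
```

The check covers only the character ranges. Whether ANTLR accepts a name also depends on things the sketch does not model, such as how the tool decodes the grammar file.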

ANTLR replace tokens in a recursive manner

I have the following grammar:
rule: q=QualifiedName {System.out.println($q.text);};
QualifiedName
:
i=Identifier { $i.setText($i.text + "_");}
('[' (QualifiedName+ | Integer)? ']')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
fragment
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
and the code from Java:
ANTLRStringStream stream = new ANTLRStringStream("array1[array2[array3[index]]]");
TestLexer lexer = new TestLexer(stream);
CommonTokenStream tokens = new TokenRewriteStream(lexer);
TestParser parser = new TestParser(tokens);
try {
parser.rule();
} catch (RecognitionException e) {
e.printStackTrace();
}
For the input array1[array2[array3[index]]], I want to modify each identifier. I was expecting to see the output array1_[array2_[array3_[index_]]], but the output was the same as the input.
So the question is: why doesn't the setText() method work here?
EDIT:
I modified Bart's answer in the following way:
rule: q=qualifiedName {System.out.println($q.modified);};
qualifiedName returns [String modified]
:
Identifier
('[' (qualifiedName+ | Integer)? ']')*
{
$modified = $text + "_";
}
;
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
I want to modify each token matched by the rule qualifiedName. I tried the code above, and for the input array1[array2[array3[index]]] I was expecting to see the output array1[array2[array3[index_]_]_]_, but instead only the last token was modified: array1[array2[array3[index]]]_.
How can I solve this?

You can only use setText(...) once a token is created. You're recursively calling this token and setting some other text, which won't work. You'll need to create a parser rule out of QualifiedName instead of a lexer rule, and remove the fragment before Identifier.
rule: q=qualifiedName {System.out.println($q.text);};
qualifiedName
:
i=Identifier { $i.setText($i.text + "_");}
('[' (qualifiedName+ | Integer)? ']')*
;
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
Now, it will print: array1_[array2_[array3_[index_]]] on the console.
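The same token-by-token rewrite can be imitated in plain Java to check the expected result. This is only a sketch with hypothetical names (SuffixIdentifiersDemo, suffixIdentifiers); a regex with the same shape as the Identifier rule stands in for the lexer, and no ANTLR classes are involved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that appends "_" to every identifier-shaped token.
class SuffixIdentifiersDemo {

    // Same shape as the Identifier lexer rule above.
    private static final Pattern IDENTIFIER =
            Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*");

    static String suffixIdentifiers(String input) {
        Matcher m = IDENTIFIER.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start()); // brackets and digits are copied unchanged
            out.append(m.group()).append('_');  // each identifier is rewritten exactly once
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // prints array1_[array2_[array3_[index_]]]
        System.out.println(suffixIdentifiers("array1[array2[array3[index]]]"));
    }
}
```

Each identifier is a separate token here, which is what turning QualifiedName into a parser rule achieves in the grammar.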
EDIT
I have no idea why you'd want to do that, but it seems you're simply trying to rewrite ] into ]_, which can be done in the same way as I showed above:
qualifiedName
:
Identifier
('[' (qualifiedName+ | Integer)? t=']' {$t.setText("]_");} )*
;
