write a grammar rule name in unicode [ANTLR 4] - java

I am still a beginner in ANTLR 4 and I was wondering if there is a way to write a grammar rule name in unicode. For example, the following rule is fine:
atomExp returns [double value]
: n=Number {$value = Double.parseDouble($n.text);}
| '(' exp=additionExp ')' {$value = $exp.value;}
;
However, let's say I want to write the same rule, but instead of naming it "atomExp", I want to name it with the Arabic word "تعبير":
تعبير returns [double value]
: n=Number {$value = Double.parseDouble($n.text);}
| '(' exp=additionExp ')' {$value = $exp.value;}
;
But when I write it that way, I get a "no viable alternative" error. Can someone help me solve this? Thanks in advance.

If you look at the lexer grammar for ANTLR 4, you can see that lexer and parser rule names support certain Unicode characters:
/** Allow unicode rule/token names */
ID : NameStartChar NameChar*;
fragment
NameChar
: NameStartChar
| '0'..'9'
| '_'
| '\u00B7'
| '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
fragment
NameStartChar
: 'A'..'Z'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
; // ignores | ['\u10000-'\uEFFFF] ;
INT : [0-9]+
;
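Note, however, that the letters of your ID تعبير (U+0628 through U+064A) all fall inside the '\u037F'..'\u1FFF' range of NameStartChar, so the ID rule as listed should accept the name. If you still get "no viable alternative", it is worth checking that the grammar file is saved as UTF-8 and that the ANTLR tool reads it with that encoding (the tool has an -encoding option).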
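To check a candidate rule name against the ranges quoted above, the two fragments can be transcribed into plain Java. This is only a sketch of the ID rule as listed here, not ANTLR's own code; the class name RuleNameCheck and its methods are made up for the illustration, and the Arabic word is written with Unicode escapes so that the encoding of the source file does not matter.

```java
final class RuleNameCheck {
    // Inclusive ranges transcribed from the NameStartChar fragment quoted above.
    private static final int[][] START_RANGES = {
        {'A', 'Z'}, {'a', 'z'},
        {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x02FF},
        {0x0370, 0x037D}, {0x037F, 0x1FFF}, {0x200C, 0x200D},
        {0x2070, 0x218F}, {0x2C00, 0x2FEF}, {0x3001, 0xD7FF},
        {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}
    };

    // What NameChar allows in addition to NameStartChar.
    private static final int[][] EXTRA_RANGES = {
        {'0', '9'}, {'_', '_'}, {0x00B7, 0x00B7},
        {0x0300, 0x036F}, {0x203F, 0x2040}
    };

    private static boolean inAny(int[][] ranges, int codePoint) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    static boolean isNameStartChar(int codePoint) {
        return inAny(START_RANGES, codePoint);
    }

    static boolean isNameChar(int codePoint) {
        return isNameStartChar(codePoint) || inAny(EXTRA_RANGES, codePoint);
    }

    /** True if the name has the shape NameStartChar NameChar*. */
    static boolean matchesId(String name) {
        int[] codePoints = name.codePoints().toArray();
        if (codePoints.length == 0 || !isNameStartChar(codePoints[0])) {
            return false;
        }
        for (int i = 1; i < codePoints.length; i++) {
            if (!isNameChar(codePoints[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The second entry is the Arabic word from the question, written with escapes.
        String[] names = {"atomExp", "\u062A\u0639\u0628\u064A\u0631", "1abc"};
        for (String name : names) {
            System.out.println(name + " -> " + matchesId(name));
        }
    }
}
```

Running main prints, for each name, whether every character is covered by the ranges; a name that fails here cannot be an ID according to the rule as quoted.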

Related

ANTLR - Why does it expect a FunctionCall when a PrintCommand was given

My ANTLR grammar expects a FunctionCall, but in the example code for the compiler built with ANTLR I wrote a print command. Does someone know why, and how to fix it? The print command is written as RetroBox.show(); and should be recognised by going from blockStatement to statement to localFunctionCall to printCommand.
Here is my ANTLR grammar:
grammar Mars;
// ******************************LEXER BEGIN*****************************************
// Keywords
FUNC: 'func';
ENTRY: 'entry';
VARI: 'vari';
VARF: 'varf';
VARC: 'varc';
VARS: 'vars';
LET: 'let';
INCREMENTS: 'increments';
RETROBOX: 'retrobox';
SHOW: 'show';
// Literals
DECIMAL_LITERAL: ('0' | [1-9] (Digits? | '_'+ Digits)) [lL]?;
FLOAT_LITERAL: (Digits '.' Digits? | '.' Digits) ExponentPart? [fFdD]?
| Digits (ExponentPart [fFdD]? | [fFdD])
;
CHAR_LITERAL: '\'' (~['\\\r\n] | EscapeSequence) '\'';
STRING_LITERAL: '"' (~["\\\r\n] | EscapeSequence)* '"';
// Seperators
ORBRACKET: '(';
CRBRACKET: ')';
OEBRACKET: '{';
CEBRACKET: '}';
SEMI: ';';
POINT: '.';
// Operators
ASSIGN: '=';
// Whitespace and comments
WS: [ \t\r\n\u000C]+ -> channel(HIDDEN);
COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN);
// Identifiers
IDENTIFIER: Letter LetterOrDigit*;
// Fragment rules
fragment ExponentPart
: [eE] [+-]? Digits
;
fragment EscapeSequence
: '\\' [btnfr"'\\]
| '\\' ([0-3]? [0-7])? [0-7]
| '\\' 'u'+ HexDigit HexDigit HexDigit HexDigit
;
fragment HexDigits
: HexDigit ((HexDigit | '_')* HexDigit)?
;
fragment HexDigit
: [0-9a-fA-F]
;
fragment Digits
: [0-9] ([0-9_]* [0-9])?
;
fragment LetterOrDigit
: Letter
| [0-9]
;
fragment Letter
: [a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
// *******************************LEXER END****************************************
// *****************************PARSER BEGIN*****************************************
program
: mainfunction #Programm
| /*EMPTY*/ #Garnichts
;
mainfunction
: FUNC VARI ENTRY ORBRACKET CRBRACKET block #NormaleHauptmethode
;
block
: '{' blockStatement '}' #CodeBlock
| /*EMPTY*/ #EmptyCodeBlock
;
blockStatement
: statement* #Befehl
;
statement
: localVariableDeclaration
| localVariableInitialization
| localFunctionImplementation
| localFunctionCall
;
expression
: left=expression op='%'
| left=expression op=('*' | '/') right=expression
| left=expression op=('+' | '-') right=expression
| neg='-' right=expression
| number
| IDENTIFIER
| '(' expression ')'
;
number
: DECIMAL_LITERAL
| FLOAT_LITERAL
;
localFunctionImplementation
: FUNC primitiveType IDENTIFIER ORBRACKET CRBRACKET block #Methodenimplementierung
;
localFunctionCall
: IDENTIFIER ORBRACKET CRBRACKET SEMI #Methodenaufruf
| printCommand #RetroBoxShowCommand
;
printCommand
: RETROBOX POINT SHOW ORBRACKET params=primitiveLiteral CRBRACKET SEMI #PrintCommandWP
;
localVariableDeclaration
: varTypeDek=primitiveType IDENTIFIER SEMI #Variablendeklaration
;
localVariableInitialization
: varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=primitiveLiteral SEMI #VariableninitKonst
| varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=expression SEMI #VariableninitExpr
;
primitiveLiteral
: DECIMAL_LITERAL
| FLOAT_LITERAL
| STRING_LITERAL
| CHAR_LITERAL
;
primitiveType
: VARI
| VARC
| VARF
| VARS
;
// ******************************PARSER END****************************************
Here is my example code:
func vari entry()
{
RetroBox.show("Hallo"); //Should be recognised as print-command
}
And here is the AST printed by ANTLR:
[image: AST from compiler]
The problem is that your RETROBOX keyword is 'retrobox', but your example code spells it 'RetroBox'. Lexer literals are case-sensitive, so ANTLR tokenizes 'RetroBox' as an IDENTIFIER, and the '.' that follows is unexpected. Either write retrobox.show("Hallo"); in the input or change the RETROBOX literal to 'RetroBox'.
ANTLR should emit an error: "line 3:12 mismatched input '.' expecting '('".
It then attempts to recover and continue parsing. It tries single-token deletion (just ignoring the '.') and finds that this works, except that the rule it now matches is #Methodenaufruf instead of #RetroBoxShowCommand.
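The case-sensitivity point can be seen in a few lines of plain Java. This is a toy tokenizer written for this answer, not ANTLR's lexer; the class KeywordCaseDemo is hypothetical and only knows the two keywords and three separators needed for the example line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeywordCaseDemo {
    // Keyword literals exactly as spelled in the grammar; the lookup is case-sensitive.
    private static final Map<String, String> KEYWORDS = Map.of(
            "retrobox", "RETROBOX",
            "show", "SHOW");

    // A word (keyword or identifier) or one of the separators used in the example.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*|[.()]");

    /** Returns the token type names for the words and separators found in the input. */
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            String text = matcher.group();
            switch (text) {
                case ".":
                    types.add("POINT");
                    break;
                case "(":
                    types.add("ORBRACKET");
                    break;
                case ")":
                    types.add("CRBRACKET");
                    break;
                default:
                    // Anything that is not an exact keyword spelling is an identifier.
                    types.add(KEYWORDS.getOrDefault(text, "IDENTIFIER"));
            }
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(tokenTypes("RetroBox.show("));  // [IDENTIFIER, POINT, SHOW, ORBRACKET]
        System.out.println(tokenTypes("retrobox.show("));  // [RETROBOX, POINT, SHOW, ORBRACKET]
    }
}
```

Only the second spelling starts with a RETROBOX token, which is what printCommand needs as its first token.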

Why is ANTLR omitting the final token *and* not producing an error?

I have a grammar like this (anything which looks convoluted is a result of it being a subset of the actual grammar which contains more red herrings):
grammar Query;
startExpression
: WS? expression WS? EOF
;
expression
: maybeDefaultBooleanExpression
;
maybeDefaultBooleanExpression
: defaultBooleanExpression
| queryFragment
;
defaultBooleanExpression
: nested += queryFragment (WS nested += queryFragment)+
;
queryFragment
: unquotedQuery
| quotedQuery
;
unquotedQuery
: UNQUOTED
;
quotedQuery
: QUOTED
;
UNQUOTED
: UnquotedStartChar
UnquotedChar*
;
fragment
UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\' | ':'
| '"' | '\u201C' | '\u201D' // DoubleQuote
| '\'' | '\u2018' | '\u2019' // SingleQuote
| '(' | ')' | '[' | ']' | '{' | '}' | '~'
| '&' | '|' | '!' | '^' | '?' | '*' | '/' | '+' | '-' | '$' )
;
fragment
UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\' | ':'
| '"' | '\u201C' | '\u201D' // DoubleQuote
| '\'' | '\u2018' | '\u2019' // SingleQuote
| '(' | ')' | '[' | ']' | '{' | '}' | '~'
| '&' | '|' | '!' | '^' | '?' | '*' )
;
QUOTED
: '"'
QuotedChar*
'"'
;
fragment
QuotedChar
: ~( '\\'
| '"' | '\u201C' | '\u201D' // DoubleQuote
| '\r' | '\n' | '?' | '*' )
;
WS : ( ' ' | '\r' | '\t' | '\u000C' | '\n' )+;
If I call the lexer myself directly:
CharStream input = CharStreams.fromString("A \"");
QueryLexer lexer = new QueryLexer(input);
lexer.removeErrorListeners();
CommonTokenStream tokens = new CommonTokenStream(lexer);
System.out.println(tokens.LT(0));
System.out.println(tokens.LT(1));
System.out.println(tokens.LT(2));
System.out.println(tokens.LT(3));
I get:
java.lang.StringIndexOutOfBoundsException: String index out of range: 4
at java.lang.String.checkBounds(String.java:385)
at java.lang.String.<init>(String.java:462)
at org.antlr.v4.runtime.CodePointCharStream$CodePoint8BitCharStream.getText(CodePointCharStream.java:160)
at org.antlr.v4.runtime.Lexer.notifyListeners(Lexer.java:360)
at org.antlr.v4.runtime.Lexer.nextToken(Lexer.java:144)
at org.antlr.v4.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:169)
at org.antlr.v4.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:152)
at org.antlr.v4.runtime.CommonTokenStream.LT(CommonTokenStream.java:100)
This makes some kind of sense, though I think a proper ANTLR exception might have been better.
What I really don't get, though, is that when I feed this through the complete parser:
QueryParser parser = new QueryParser(tokens);
parser.removeErrorListeners();
parser.addErrorListener(LoggingErrorListener.get());
parser.setErrorHandler(new BailErrorStrategy());
// Performance hack as per the ANTLR v4 FAQ
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
ParseTree expression;
try
{
expression = parser.startExpression();
}
catch (Exception e)
{
// It catches a StringIndexOutOfBoundsException here.
parser.reset();
parser.getInterpreter().setPredictionMode(PredictionMode.LL);
expression = parser.startExpression();
}
I get:
tokens = {org.antlr.v4.runtime.CommonTokenStream#1811}
channel = 0
tokenSource = {MyQueryLexer#1810}
tokens = {java.util.ArrayList#1816} size = 3
0 = {org.antlr.v4.runtime.CommonToken#1818} "[#0,0:0='A',<13>,1:0]"
1 = {org.antlr.v4.runtime.CommonToken#1819} "[#1,1:1=' ',<32>,1:1]"
2 = {org.antlr.v4.runtime.CommonToken#1820} "[#2,3:2='<EOF>',<-1>,1:3]"
p = 2
fetchedEOF = true
expression = {MyQueryParser$StartExpressionContext#1813} "[]"
children = {java.util.ArrayList#1827} size = 3
0 = {MyQueryParser$ExpressionContext#1831} "[87]"
1 = {org.antlr.v4.runtime.tree.TerminalNodeImpl#1832} " "
2 = {org.antlr.v4.runtime.tree.TerminalNodeImpl#1833} "<EOF>"
Here I would have expected to get a RecognitionException, but somehow the parsing succeeds, and is missing the invalid bit of the token data at the end.
Questions are:
(1) Is this by design?
(2) If so, how can I detect this and have it treated as a syntax error?
Further investigation
When I went looking for whoever was catching the StringIndexOutOfBoundsException and eating it, it turned out that it comes all the way out to our catch block. So I guess ANTLR never got a chance to finish building that last invalid token?
I'm not entirely sure what's supposed to happen, but I guess I expected that ANTLR would have caught it, created an invalid token and continued.
I then drilled further in and found that TokenSource#nextToken() was throwing an exception, and the docs made it seem like that wasn't supposed to happen, so I ended up filing a ticket about that.
Until very recent builds, ANTLR 4's adaptive mechanism had the "feature" of being able to recover from single-missing-token and single-extra-token errors if there was only one viable alternative at that point in the token stream. That behavior has apparently changed recently, so if you're using an older build, as I am, you'll still see the adaptive recovery. Maybe Parr and Harwell will fix that.
Like you, I needed a perfect input stream and zero parse errors, "overlooked" or not. To create a "strict parser", follow these steps:
Make a class called, say, "StrictErrorStrategy" that extends DefaultErrorStrategy. You need to override the Recover, RecoverInline, and Sync methods. The bottom line is that we throw an exception for anything that goes wrong and make no attempt to re-sync after an extra or missing token. Here's my C# code; your Java will look very similar:
public class StrictErrorStrategy : DefaultErrorStrategy
{
public override void Recover(Parser recognizer, RecognitionException e)
{
IToken token = recognizer.CurrentToken;
string message = string.Format("parse error at line {0}, position {1} right before {2} ", token.Line, token.Column, GetTokenErrorDisplay(token));
throw new Exception(message, e);
}
public override IToken RecoverInline(Parser recognizer)
{
IToken token = recognizer.CurrentToken;
string message = string.Format("parse error at line {0}, position {1} right before {2} ", token.Line, token.Column, GetTokenErrorDisplay(token));
throw new Exception(message, new InputMismatchException(recognizer));
}
public override void Sync(Parser recognizer) { /* do nothing to resync */}
}
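Make a new lexer that adds a constructor and overrides a single method:
public class StrictLexer : <YourGeneratedLexerNameHere>
{
// C# does not inherit constructors, so forward the input stream to the generated lexer.
public StrictLexer(ICharStream input) : base(input) { }
public override void Recover(LexerNoViableAltException e)
{
// _lasttoken: the previously emitted token (a field defined elsewhere in this project).
string message = string.Format("lex error after token {0} at position {1}", _lasttoken.Text, e.StartIndex);
// Abort instead of skipping the offending character and carrying on.
throw new ParseCanceledException(message, e);
}
}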
Use your lexer and strategy:
AntlrInputStream inputStream = new AntlrInputStream(stream);
StrictLexer lexer = new StrictLexer(inputStream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
LISBASICParser parser = new LISBASICParser(tokenStream);
parser.RemoveErrorListeners();
parser.ErrorHandler = new StrictErrorStrategy();
This works great; it is actual code from one of my projects, which has a "zero-tolerance rule" about syntax errors. I got the code and ideas from Terence Parr's great book on ANTLR 4.

Understanding the context data structure in Antlr4

I'm trying to write a code translator in Java with the help of Antlr4 and had great success with the grammar part so far. However I'm now banging my head against a wall wrapping my mind around the parse tree data structure that I need to work on after my input has been parsed.
I'm trying to use the visitor template to go over my parse tree. I'll show you an example to illustrate the points of my confusion.
My grammar:
grammar pqlc;
// Lexer
//Schlüsselwörter
EXISTS: 'exists';
REDUCE: 'reduce';
QUERY: 'query';
INT: 'int';
DOUBLE: 'double';
CONST: 'const';
STDVECTOR: 'std::vector';
STDMAP: 'std::map';
STDSET: 'std::set';
C_EXPR: 'c_expr';
INTEGER_LITERAL : (DIGIT)+ ;
fragment DIGIT: '0'..'9';
DOUBLE_LITERAL : DIGIT '.' DIGIT+;
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
DOT : '.';
EQUAL : '==';
LE : '<=';
GE : '>=';
GT : '>';
LT : '<';
ADD : '+';
MUL : '*';
AND : '&&';
COLON : ':';
IDENTIFIER : JavaLetter JavaLetterOrDigit*;
fragment JavaLetter : [a-zA-Z$_]; // these are the "java letters" below 0xFF
fragment JavaLetterOrDigit : [a-zA-Z0-9$_]; // these are the "java letters or digits" below 0xFF
WS
: [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Parser
//start_rule: query;
query :
quant_expr
| qexpr+
| IDENTIFIER // order IDENTIFIER and qexpr+?
| numeral
| c_expr //TODO
;
c_type : INT | DOUBLE | CONST;
bin_op: AND | ADD | MUL | EQUAL | LT | GT | LE| GE;
qexpr:
LPAREN query RPAREN bin_op_query?
// query bin_op query
| IDENTIFIER bin_op_query? // copied from query to resolve left recursion problem
| numeral bin_op_query? // ^
| quant_expr bin_op_query? // ^
|c_expr bin_op_query?
// query.find(query)
| IDENTIFIER find_query? // copied from query to resolve left recursion problem
| numeral find_query? // ^
| quant_expr find_query?
|c_expr find_query?
// query[query]
| IDENTIFIER array_query? // copied from query to resolve left recursion problem
| numeral array_query? // ^
| quant_expr array_query?
|c_expr array_query?
// | qexpr bin_op_query // bad, resolved by quexpr+ in query
;
bin_op_query: bin_op query bin_op_query?; // resolve left recursion of query bin_op query
find_query: '.''find' LPAREN query RPAREN;
array_query: LBRACK query RBRACK;
quant_expr:
quant id ':' query
| QUERY LPAREN match RPAREN ':' query
| REDUCE LPAREN IDENTIFIER RPAREN id ':' query
;
match:
STDVECTOR LBRACK id RBRACK EQUAL cm
| STDMAP '.''find' LPAREN cm RPAREN EQUAL cm
| STDSET '.''find' LPAREN cm RPAREN
;
cm:
IDENTIFIER
| numeral
| c_expr //TODO
;
quant :
EXISTS;
id :
c_type IDENTIFIER
| IDENTIFIER // Nach Seite 2 aber nicht der Übersicht. Laut übersicht id -> aber dann wäre Regel 1 ohne +
;
numeral :
INTEGER_LITERAL
| DOUBLE_LITERAL
;
c_expr:
C_EXPR
;
Now let's parse the following string:
double x: x >= c_expr
Visually I'll get this tree:
Let's say my visitor is in the visitQexpr(@NotNull pqlcParser.QexprContext ctx) routine when it hits the branch Qexpr(x bin_op_query).
My question is: how can I tell that the left child ("x" in the tree) is a terminal node, or more specifically an IDENTIFIER? There are no visit methods for terminal nodes, since they aren't rules.
ctx.getChild(0) has no RuleIndex. I guess I could use that to check if I'm in a terminal or not, but that still wouldn't tell me if I was in IDENTIFIER or another kind of terminal token. I need to be able to tell the difference somehow.
I had more questions but in the time it took me to write the explanation I forgot them :<
Thanks in advance.
You can add labels to tokens and access them/check if they exist in the surrounding context:
id :
c_type labelA = IDENTIFIER
| labelB = IDENTIFIER
;
You could also do this to create different visits:
id :
c_type IDENTIFIER #idType1 //choose more appropriate names!
| IDENTIFIER #idType2
;
This will create separate visit methods for the two alternatives, and I suppose (I have not verified this) that the visit method for id will not be called.
I prefer the following approach though:
id :
typeDef
| otherId
;
typeDef: c_type IDENTIFIER;
otherId : IDENTIFIER ;
This is a more heavily typed system. But you can very specifically visit nodes. Some rules of thumb I use:
Use | only when all alternatives are parser rules.
Wrap each Token in a parser rule (like otherId) to give them "more meaning".
It's ok to mix parser rules and tokens, if the tokens are not really important (like ;) and therefore not needed in the parse tree.

Antlr Extraneous Input

I have a grammar file BoardFile.g4 that has (relevant parts only):
grammar Board;
//Tokens
GADGET : 'squareBumper' | 'circleBumper' | 'triangleBumper' | 'leftFlipper' | 'rightFlipper' | 'absorber' | 'portal' ;
NAME : [A-Za-z_][A-Za-z_0-9]* ;
INT : [0-9]+ ;
FLOAT : '-'?[0-9]+('.'[0-9]+)? ;
COMMENT : '#' ~( '\r' | '\n' )*;
WHITESPACE : [ \t\r\n]+ -> skip ;
KEY : [a-z] | [0-9] | 'shift' | 'ctrl' | 'alt' | 'meta' | 'space' | 'left' | 'right' | 'up' | 'down' | 'minus' | 'equals' | 'backspace' | 'openbracket' | 'closebracket' | 'backslash' | 'semicolon' | 'quote' | 'enter' | 'comma' | 'period' | 'slash' ;
KEYPRESS : 'keyup' | 'keydown' ;
//Rules
file : define+ EOF ;
define : board | ball | gadget | fire | COMMENT | key ;
board : 'board' 'name' '=' name ('gravity' '=' gravity)? ('friction1' '=' friction1)? ('friction2' '=' friction2)? ;
ball : 'ball' 'name' '=' name 'x' '=' xfloat 'y' '=' yfloat 'xVelocity' '=' xvel 'yVelocity' '=' yvel ;
gadget : gadgettype 'name' '=' name 'x' '=' xint 'y' '=' yint ('width' '=' width 'height' '=' height)? ('orientation' '=' orientation)? ('otherBoard' '=' name 'otherPortal' '=' name)? ;
fire : 'fire' 'trigger' '=' trigger 'action' '=' action ;
key : keytype 'key' '=' KEY 'action' '=' name ;
name : NAME ;
gadgettype : GADGET ;
keytype : KEYPRESS ;
gravity : FLOAT ;
friction1 : FLOAT ;
friction2 : FLOAT ;
trigger : NAME ;
action : NAME ;
yfloat : FLOAT ;
xfloat : FLOAT ;
yint : INT ;
xint : INT ;
xvel : FLOAT ;
yvel : FLOAT ;
orientation : INT ;
width : INT ;
height : INT ;
This generates the lexer and parser fine. However, when I use it against the following file, it gives the following error:
line 12:0 extraneous input 'keyup' expecting {<EOF>, KEYPRESS}
File to Parse:
board name=keysBoard gravity=5.0 friction1=0.0 friction2=0.0
# define a ball
ball name=Ball x=0.5 y=0.5 xVelocity=2.5 yVelocity=2.5
# add some flippers
leftFlipper name=FlipL1 x=16 y=2 orientation=0
leftFlipper name=FlipL2 x=16 y=9 orientation=0
# add keys. lots of keys.
keyup key=space action=apple
keydown key=a action=ball
keyup key=backslash action=cat
keydown key=period action=dog
I went through other questions about this error in SO but none helped me. I cannot figure out what's going wrong. Why am I getting this error?
The string "keyup" is being tokenized as a NAME token: that is the problem.
You must realize that the lexer operates independently from the parser. If the parser is trying to match a KEYPRESS token, the lexer does not "listen" to it, but just constructs a token following the rules:
match the rule that consumes the most characters
if there are more rules that match the same amount of characters, choose the one that is defined first
Taking these rules into account, and the order of your rules:
NAME : [A-Za-z_][A-Za-z_0-9]* ;
INT : [0-9]+ ;
KEY : [a-z] | [0-9] | 'shift' | 'ctrl' | 'alt' | 'meta' | 'space' | 'left' | 'right' | 'up' | 'down' | 'minus' | 'equals' | 'backspace' | 'openbracket' | 'closebracket' | 'backslash' | 'semicolon' | 'quote' | 'enter' | 'comma' | 'period' | 'slash' ;
KEYPRESS : 'keyup' | 'keydown' ;
the lexer will create a NAME token for input that matches most of the KEY alternatives, and for input that matches any of the KEYPRESS alternatives, because NAME matches the same characters and is defined first.
And since INT matches one or more digits and is defined before KEY, which also has a single-digit alternative, it is clear that the lexer will never produce a KEY or KEYPRESS token.
If you move the NAME and INT rules below the KEY and KEYPRESS rules, then most of the tokens will be constructed as you expect, I would guess.
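The two lexer rules (longest match first, then first-defined on a tie) can be simulated with java.util.regex to see the effect of the rule order. This is a rough sketch, not how ANTLR's lexer is implemented; the class LexerRuleOrderDemo and its methods are hypothetical, and the KEY rule is shortened to a handful of its alternatives.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class LexerRuleOrderDemo {
    private static final String NAME = "[A-Za-z_][A-Za-z_0-9]*";
    private static final String INT = "[0-9]+";
    // Shortened for the demo: only some of the KEY alternatives from the grammar.
    private static final String KEY = "[a-z]|[0-9]|shift|ctrl|alt|space|backslash|period";
    private static final String KEYPRESS = "keyup|keydown";

    /** Rule order as in the question: NAME and INT come before KEY and KEYPRESS. */
    static Map<String, String> originalOrder() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("NAME", NAME);
        rules.put("INT", INT);
        rules.put("KEY", KEY);
        rules.put("KEYPRESS", KEYPRESS);
        return rules;
    }

    /** Rule order with KEY and KEYPRESS moved above NAME and INT. */
    static Map<String, String> keywordsFirst() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("KEY", KEY);
        rules.put("KEYPRESS", KEYPRESS);
        rules.put("NAME", NAME);
        rules.put("INT", INT);
        return rules;
    }

    /** Length of the longest prefix of input that the whole regex matches, or 0. */
    static int longestPrefix(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);
        for (int len = input.length(); len > 0; len--) {
            if (pattern.matcher(input.substring(0, len)).matches()) {
                return len;
            }
        }
        return 0;
    }

    /** Picks the rule with the longest match; on a tie the rule defined first wins. */
    static String tokenTypeAtStart(Map<String, String> rulesInOrder, String input) {
        String winner = null;
        int best = 0;
        for (Map.Entry<String, String> rule : rulesInOrder.entrySet()) {
            int len = longestPrefix(rule.getValue(), input);
            if (len > best) { // strictly longer only, so earlier rules keep ties
                best = len;
                winner = rule.getKey();
            }
        }
        return winner; // null if no rule matches at all
    }

    public static void main(String[] args) {
        for (String word : new String[] {"keyup", "space", "apple", "5"}) {
            System.out.println(word + ": " + tokenTypeAtStart(originalOrder(), word)
                    + " -> " + tokenTypeAtStart(keywordsFirst(), word));
        }
    }
}
```

With the original order, keyup and space both come out as NAME; with the keyword rules first they come out as KEYPRESS and KEY, while apple stays a NAME. A single digit such as 5 becomes a KEY in the reordered version, which is why only "most of" the tokens come out as expected.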
EDIT
A possible solution would look like:
KEY : [a-z] | 'shift' | 'ctrl' | 'alt' | 'meta' | 'space' | 'left' | 'right' | 'up' | 'down' | 'minus' | 'equals' | 'backspace' | 'openbracket' | 'closebracket' | 'backslash' | 'semicolon' | 'quote' | 'enter' | 'comma' | 'period' | 'slash' ;
KEYPRESS : 'keyup' | 'keydown' ;
NAME : [A-Za-z_][A-Za-z_0-9]* ;
SINGLE_DIGIT : [0-9] ;
INT : [0-9]+ ;
I.e. I removed the [0-9] alternative from KEY and introduced a SINGLE_DIGIT rule (which is placed before the INT rule!).
Now create some extra parser rules:
integer : INT | SINGLE_DIGIT ;
keyname : KEY | SINGLE_DIGIT ;
and change all occurrences of INT inside parser rules to integer (don't call your rule int: it is a reserved word) and all occurrences of KEY to keyname (your grammar already has a parser rule called key, so the new rule needs a different name).
And you might also want to do something similar to NAME and the [a-z] alternative in KEY (i.e. a single lowercase char would now never be tokenized as a NAME, always as a KEY).

ANTLR replace tokens in a recursive manner

I have the following grammar:
rule: q=QualifiedName {System.out.println($q.text);};
QualifiedName
:
i=Identifier { $i.setText($i.text + "_");}
('[' (QualifiedName+ | Integer)? ']')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
fragment
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
and the code from Java:
ANTLRStringStream stream = new ANTLRStringStream("array1[array2[array3[index]]]");
TestLexer lexer = new TestLexer(stream);
CommonTokenStream tokens = new TokenRewriteStream(lexer);
TestParser parser = new TestParser(tokens);
try {
parser.rule();
} catch (RecognitionException e) {
e.printStackTrace();
}
For the input array1[array2[array3[index]]], I want to modify each identifier. I was expecting to see the output array1_[array2_[array3_[index_]]], but the output was the same as the input.
So the question is: why doesn't the setText() method work here?
EDIT:
I modified Bart's answer in the following way:
rule: q=qualifiedName {System.out.println($q.modified);};
qualifiedName returns [String modified]
:
Identifier
('[' (qualifiedName+ | Integer)? ']')*
{
$modified = $text + "_";
}
;
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
I want to modify each token matched by the rule qualifiedName. I tried the code above, and for the input array1[array2[array3[index]]] i was expecting to see the output array1[array2[array3[index_]_]_]_, but instead only the last token was modified: array1[array2[array3[index]]]_.
How can i solve this?
You can only use setText(...) once a token has been created. Here you're recursively invoking the lexer rule itself and setting some other text while the token is still being built, which won't work. You'll need to make QualifiedName a parser rule instead of a lexer rule, and remove the fragment keyword before Identifier.
rule: q=qualifiedName {System.out.println($q.text);};
qualifiedName
:
i=Identifier { $i.setText($i.text + "_");}
('[' (qualifiedName+ | Integer)? ']')*
;
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
Now, it will print: array1_[array2_[array3_[index_]]] on the console.
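If you want to sanity-check that expected output outside ANTLR, a plain regex replacement gives the same string for this input. Note that this swaps in java.util.regex for the grammar; it is only a quick check, the class name IdentifierSuffixDemo is made up, and the pattern is my transcription of the Identifier rule.

```java
import java.util.regex.Pattern;

final class IdentifierSuffixDemo {
    // Same shape as the Identifier rule: '_', '$' or a letter,
    // followed by any number of letters, digits, '_' or '$'.
    private static final Pattern IDENTIFIER =
            Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*");

    /** Appends '_' to every identifier-shaped run of characters in the input. */
    static String appendUnderscore(String input) {
        // In the replacement string, $0 refers to the whole match.
        return IDENTIFIER.matcher(input).replaceAll("$0_");
    }

    public static void main(String[] args) {
        System.out.println(appendUnderscore("array1[array2[array3[index]]]"));
        // prints array1_[array2_[array3_[index_]]]
    }
}
```

Unlike the parser rule, the regex does not check that the brackets are balanced, so it is useful for comparing outputs but not as a replacement for the grammar.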
EDIT
I have no idea why you'd want to do that, but it seems you're simply trying to rewrite ] into ]_, which can be done in the same way as I showed above:
qualifiedName
:
Identifier
('[' (qualifiedName+ | Integer)? t=']' {$t.setText("]_");} )*
;
