I was trying to parse a System.out.println() statement as an OutputStatement for a Java grammar. Here's the production rule in EBNF:
Statement ::= ( LabeledStatement | AssertStatement | Block | EmptyStatement | StatementExpression | SwitchStatement | IfStatement | WhileStatement | DoStatement | ForStatement | BreakStatement | ContinueStatement | ReturnStatement | ThrowStatement | SynchronizedStatement | TryStatement | OutputStatement )
OutputStatement ::= "System.out.print" [ "ln" ] "(" Arguments ")" ";"
This is strictly according to the Java grammar shipped with JavaCC in C:\javacc-6.0\examples\JavaGrammars\Java 1.0.2.jj
When I coded the production rule in JavaCC, it came out as:
OutputStmt OutputStatement():
{
Token tk;
Expression args;
boolean ln=false;
int line;
int column;
}
{
{line=token.beginLine;column=token.beginColumn;args=null;ln=false;}
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("System")})
"."
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("out")})
"."
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("print")})
[
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("ln")})
{
ln=true;
}
]
"("
args=Expression()
")" ";"
{
return new OutputStmt(line,column,token.endLine,token.endColumn,ln,args);
}
}
Now this throws LOOKAHEAD warnings, and the generated parser has errors. Can anyone please help?
EDIT: The main problem, it seems, is that JavaCC generates lookahead methods that do not initialize Token tk, which gives me a "tk cannot be resolved" error.
The following will work.
OutputStmt OutputStatement() :
{
Token tk;
Expression args;
boolean ln;
int line;
int column;
}
{
{line=token.beginLine;column=token.beginColumn;args=null;ln=false;}
LOOKAHEAD({getToken(1).image.equals("System")})
<ID>
"."
LOOKAHEAD({getToken(1).image.equals("out")})
<ID>
"."
LOOKAHEAD({getToken(1).image.equals("println") || getToken(1).image.equals("print") })
tk=<ID> { ln = tk.image.equals("println" ) ; }
"("
args=Expression()
")" ";"
{ return new OutputStmt(line,column,token.endLine,token.endColumn,ln,args); }
}
Note that I changed STRING_LITERAL to the more traditional ID.
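For reference, an identifier token of the usual shape could be declared along these lines. This is a sketch and an assumption, since the grammar's token section isn't shown; use whatever identifier token your grammar already declares:
TOKEN : {
  < ID : ["a"-"z","A"-"Z","_","$"] ( ["a"-"z","A"-"Z","0"-"9","_","$"] )* >
}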
I want to create a recursive descent parser in Java for the following grammar (I have managed to create the tokens). This is the relevant part of the grammar:
expression ::= numeric_expression | identifier | "null"
identifier ::= "a..z,$,_"
numeric_expression ::= ( ( "-" | "++" | "--" ) expression )
| ( expression ( "++" | "--" ) )
| ( expression ( "+" | "+=" | "-" | "-=" | "*" | "*=" | "/" | "/=" | "%" | "%=" ) expression )
arglist ::= expression { "," expression }
I have written code for parsing numeric_expression (returning null on an invalid token):
NumericAST<? extends OpAST> parseNumericExpr() {
OpAST op;
if (token.getCodes() == Lexer.CODES.UNARY_OP) { //Check for unary operator like "++" or "--" etc
op = new UnaryOpAST(token.getValue());
token = getNextToken();
AST expr = parseExpr(); // Method that returns expression node.
if (expr == null) {
op = null;
return null;
} else {
if (checkSemi()) {
System.out.println("UNARY AST CREATED");
return new NumericAST<OpAST>(expr, op, false);
}
else {
return null;
}
}
} else { // Binary operation like "a+b", where a,b ->expression
AST expr = parseExpr();
if (expr == null) {
return null;
} else {
token = getNextToken();
if (token.getCodes() == Lexer.CODES.UNARY_OP) {
op = new UnaryOpAST(token.getValue());
return new NumericAST<OpAST>(expr, op, true);
} else if (token.getCodes() == Lexer.CODES.BIN_OP) {
op = new BinaryOpAST(token.getValue());
token = getNextToken();
AST expr2 = parseExpr();
if (expr2 == null) {
op = null;
expr = null;
return null;
} else {
if (checkSemi()) {
System.out.println("BINARY AST CREATED");
return new NumericAST<OpAST>(expr, op, expr2);
}
else {
return null;
}
}
} else {
expr = null;
return null;
}
}
}
}
Now, if I get a unary operator like ++, I can call this method directly, but I don't know how to recognize the other rules that start with the same production, such as arglist and numeric_expression, which both begin with expression.
My question is:
How do I decide whether to call parseNumericExpr() or parseArgList() (method not shown above) when I encounter an expression token?
In order to write a recursive descent parser, you need an appropriate top-down grammar, normally an LL(1) grammar, although it's common to write the grammar using EBNF operators, as shown in the example grammar on Wikipedia's page on recursive descent grammars.
Unfortunately, your grammar is not LL(1), and the question you raise is a consequence of that fact. An LL(1) grammar has the property that the parser can always determine which production to use by examining only the next input token, which puts some severe constraints on the grammar, including:
No two productions for the same non-terminal can start with the same symbol.
No production can be left-recursive (i.e. the first symbol on the right-hand side is the defining non-terminal); see the sketch after this list.
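To see why the second constraint matters, consider what a naive method for a left-recursive rule would do. This sketch is purely illustrative; parseTerm, expect, and makeBinary are hypothetical helpers, not part of the asker's code:
// expression ::= expression "+" term | term   (left-recursive)
AST parseExpression() {
    // Recurses before consuming a single token, so every nested call
    // sees exactly the same input: guaranteed StackOverflowError.
    AST left = parseExpression();
    expect("+");                          // hypothetical helper
    AST right = parseTerm();
    return makeBinary(left, "+", right);  // hypothetical factory; never reached
}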
Here's a small rearrangement of your grammar which will work:
-- I added number here in order to be explicit.
atom ::= identifier | number | "null" | "(" expression ")"
-- I added function calls here, but it's arguable that this syntax accepts
-- a lot of invalid expressions
primary ::= atom { "++" | "--" | "(" [ arglist ] ")" }
factor ::= [ "-" | "++" | "--" ] primary
term ::= factor { ( "*" | "/" | "%" ) factor }
value ::= term { ( "+" | "-" ) term }
-- This adds the ordinary "=" assignment to the list in case it was
-- omitted by accident. Also, see the note below.
expression ::= { value ( "=" | "+=" | "-=" | "*=" | "/=" | "%=" ) } value
arglist ::= expression { "," expression }
The last expression rule is an attempt to capture the usual syntax of assignment operators (which associate to the right, not to the left), but it suffers from a classic problem addressed by this highly related question. I don't think I have a better answer to this issue than the one I wrote three years ago, so I hope it is still useful.
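To make the EBNF-to-code mapping concrete, here is how one of the iterative rules could turn into a method. This is a sketch that reuses the asker's token helpers and AST classes, whose exact signatures are assumptions:
// term ::= factor { ( "*" | "/" | "%" ) factor }
// An iterative loop replaces left recursion; each pass consumes at least one token.
AST parseTerm() {
    AST left = parseFactor();
    if (left == null) return null;
    while (token.getCodes() == Lexer.CODES.BIN_OP
            && ("*".equals(token.getValue())
                || "/".equals(token.getValue())
                || "%".equals(token.getValue()))) {
        OpAST op = new BinaryOpAST(token.getValue());
        token = getNextToken();
        AST right = parseFactor();
        if (right == null) return null;
        left = new NumericAST<OpAST>(left, op, right);
    }
    return left;
}
With the grammar factored this way, the original question disappears: arglist simply calls an expression-parsing method in a loop, consuming "," between calls, and never has to guess which rule it is in.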
I have a grammar like this (anything which looks convoluted is a result of it being a subset of the actual grammar, which contains more red herrings):
grammar Query;
startExpression
: WS? expression WS? EOF
;
expression
: maybeDefaultBooleanExpression
;
maybeDefaultBooleanExpression
: defaultBooleanExpression
| queryFragment
;
defaultBooleanExpression
: nested += queryFragment (WS nested += queryFragment)+
;
queryFragment
: unquotedQuery
| quotedQuery
;
unquotedQuery
: UNQUOTED
;
quotedQuery
: QUOTED
;
UNQUOTED
: UnquotedStartChar
UnquotedChar*
;
fragment
UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\' | ':'
| '"' | '\u201C' | '\u201D' // DoubleQuote
| '\'' | '\u2018' | '\u2019' // SingleQuote
| '(' | ')' | '[' | ']' | '{' | '}' | '~'
| '&' | '|' | '!' | '^' | '?' | '*' | '/' | '+' | '-' | '$' )
;
fragment
UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\' | ':'
| '"' | '\u201C' | '\u201D' // DoubleQuote
| '\'' | '\u2018' | '\u2019' // SingleQuote
| '(' | ')' | '[' | ']' | '{' | '}' | '~'
| '&' | '|' | '!' | '^' | '?' | '*' )
;
QUOTED
: '"'
QuotedChar*
'"'
;
fragment
QuotedChar
: ~( '\\'
| '"' | '\u201C' | '\u201D' // DoubleQuote
| '\r' | '\n' | '?' | '*' )
;
WS : ( ' ' | '\r' | '\t' | '\u000C' | '\n' )+;
If I call the lexer myself directly:
CharStream input = CharStreams.fromString("A \"");
QueryLexer lexer = new QueryLexer(input);
lexer.removeErrorListeners();
CommonTokenStream tokens = new CommonTokenStream(lexer);
System.out.println(tokens.LT(0));
System.out.println(tokens.LT(1));
System.out.println(tokens.LT(2));
System.out.println(tokens.LT(3));
I get:
java.lang.StringIndexOutOfBoundsException: String index out of range: 4
at java.lang.String.checkBounds(String.java:385)
at java.lang.String.<init>(String.java:462)
at org.antlr.v4.runtime.CodePointCharStream$CodePoint8BitCharStream.getText(CodePointCharStream.java:160)
at org.antlr.v4.runtime.Lexer.notifyListeners(Lexer.java:360)
at org.antlr.v4.runtime.Lexer.nextToken(Lexer.java:144)
at org.antlr.v4.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:169)
at org.antlr.v4.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:152)
at org.antlr.v4.runtime.CommonTokenStream.LT(CommonTokenStream.java:100)
This makes some kind of sense, though I think a proper ANTLR exception might have been better.
What I really don't get, though, is that when I feed this through the complete parser:
QueryParser parser = new QueryParser(tokens);
parser.removeErrorListeners();
parser.addErrorListener(LoggingErrorListener.get());
parser.setErrorHandler(new BailErrorStrategy());
// Performance hack as per the ANTLR v4 FAQ
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
ParseTree expression;
try
{
expression = parser.startExpression();
}
catch (Exception e)
{
// It catches a StringIndexOutOfBoundsException here.
parser.reset();
parser.getInterpreter().setPredictionMode(PredictionMode.LL);
expression = parser.startExpression();
}
I get:
tokens = {org.antlr.v4.runtime.CommonTokenStream#1811}
channel = 0
tokenSource = {MyQueryLexer#1810}
tokens = {java.util.ArrayList#1816} size = 3
0 = {org.antlr.v4.runtime.CommonToken#1818} "[#0,0:0='A',<13>,1:0]"
1 = {org.antlr.v4.runtime.CommonToken#1819} "[#1,1:1=' ',<32>,1:1]"
2 = {org.antlr.v4.runtime.CommonToken#1820} "[#2,3:2='<EOF>',<-1>,1:3]"
p = 2
fetchedEOF = true
expression = {MyQueryParser$StartExpressionContext#1813} "[]"
children = {java.util.ArrayList#1827} size = 3
0 = {MyQueryParser$ExpressionContext#1831} "[87]"
1 = {org.antlr.v4.runtime.tree.TerminalNodeImpl#1832} " "
2 = {org.antlr.v4.runtime.tree.TerminalNodeImpl#1833} "<EOF>"
Here I would have expected to get a RecognitionException, but somehow the parsing succeeds, and is missing the invalid bit of the token data at the end.
Questions are:
(1) Is this by design?
(2) If so, how can I detect this and have it treated as a syntax error?
Further investigation
When I went looking for the culprit that was catching the StringIndexOutOfBoundsException and eating it, it turned out that it comes all the way out to our catch block. So I guess ANTLR never got a chance to finish building that last invalid token...?
I'm not entirely sure what's supposed to happen, but I guess I expected that ANTLR would have caught it, created an invalid token and continued.
I then drilled further in and found that Lexer#nextToken() was throwing the exception, and the docs made it seem like that wasn't supposed to happen, so I ended up filing a ticket about that.
Until very recent builds, ANTLR4's adaptive mechanism had the "feature" of being able to recover from single-token-missing and single-extra-token parses if there was only one viable alternative in that part of the token stream. Recently, that behavior has apparently changed. So if you're using an older build, as I am, you'll still see the adaptive parsing. Maybe Parr and Harwell will fix that.
Like you, I recognized the need for a perfect input stream and zero parse errors, "overlooked" or not. To create a "strict parser", follow these steps:
Make a class, called perhaps "StrictErrorStrategy", that inherits from/extends DefaultErrorStrategy. You need to override the Recover, RecoverInline, and Sync methods. The bottom line here is that we throw exceptions for anything that goes wrong, and make no attempt to re-sync the code after an extra/missing token. Here's my C# code; your Java will look very similar:
public class StrictErrorStrategy : DefaultErrorStrategy
{
public override void Recover(Parser recognizer, RecognitionException e)
{
IToken token = recognizer.CurrentToken;
string message = string.Format("parse error at line {0}, position {1} right before {2} ", token.Line, token.Column, GetTokenErrorDisplay(token));
throw new Exception(message, e);
}
public override IToken RecoverInline(Parser recognizer)
{
IToken token = recognizer.CurrentToken;
string message = string.Format("parse error at line {0}, position {1} right before {2} ", token.Line, token.Column, GetTokenErrorDisplay(token));
throw new Exception(message, new InputMismatchException(recognizer));
}
public override void Sync(Parser recognizer) { /* do nothing to resync */}
}
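Since the question is in Java, a direct translation might look like the following. This is a sketch against the ANTLR4 Java runtime (overriding DefaultErrorStrategy's recover, recoverInline, and sync); treat it as untested:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.ParseCancellationException;

public class StrictErrorStrategy extends DefaultErrorStrategy {
    @Override
    public void recover(Parser recognizer, RecognitionException e) {
        Token token = recognizer.getCurrentToken();
        String message = String.format("parse error at line %d, position %d right before %s",
                token.getLine(), token.getCharPositionInLine(), getTokenErrorDisplay(token));
        throw new ParseCancellationException(message, e);
    }

    @Override
    public Token recoverInline(Parser recognizer) {
        Token token = recognizer.getCurrentToken();
        String message = String.format("parse error at line %d, position %d right before %s",
                token.getLine(), token.getCharPositionInLine(), getTokenErrorDisplay(token));
        throw new ParseCancellationException(message, new InputMismatchException(recognizer));
    }

    @Override
    public void sync(Parser recognizer) { /* do nothing to resync */ }
}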
Make a new lexer that implements a single method:
public class StrictLexer : <YourGeneratedLexerNameHere>
{
public override void Recover(LexerNoViableAltException e)
{
string message = string.Format("lex error after token {0} at position {1}", _lasttoken.Text, e.StartIndex);
throw new ParseCanceledException(BasicEnvironment.SyntaxError);
}
}
Use your lexer and strategy:
AntlrInputStream inputStream = new AntlrInputStream(stream);
StrictLexer lexer = new StrictLexer(inputStream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
LISBASICParser parser = new LISBASICParser(tokenStream);
parser.RemoveErrorListeners();
parser.ErrorHandler = new StrictErrorStrategy();
This works great; it's actual code from one of my projects that has a "zero-tolerance rule" about syntax errors. I got the code and ideas from Terence Parr's great book on ANTLR4.
I'm trying to change a grammar in the JSqlParser project, which deals with a JavaCC grammar file (.jj) specifying standard SQL syntax. I had difficulty getting one section to work, so I narrowed it down to the following, much simplified grammar.
Basically I have a definition of Column: [ table "." ] field,
but table itself could also contain the "." char, which causes confusion.
I think intuitively the following grammar should accept all the following sentences:
select mytable.myfield
select myfield
select mydb.mytable.myfield
but in practice it only accepts the 2nd and 3rd of the above. Whenever it sees the ".", it commits to the two-dot version of table (i.e. the first derivation rule for Table).
how can I make this grammar work?
Thanks a lot
Yang
options{
IGNORE_CASE=true ;
STATIC=false;
DEBUG_PARSER=true;
DEBUG_LOOKAHEAD=true;
DEBUG_TOKEN_MANAGER=false;
// FORCE_LA_CHECK=true;
UNICODE_INPUT=true;
}
PARSER_BEGIN(TT)
import java.util.*;
public class TT {
}
PARSER_END(TT)
///////////////////////////////////////////// main stuff concerned
void Statement() :
{ }
{
<K_SELECT> Column()
}
void Column():
{
}
{
[LOOKAHEAD(3) Table() "." ]
//[
//LOOKAHEAD(2) (
// LOOKAHEAD(5) <S_IDENTIFIER> "." <S_IDENTIFIER>
// |
// LOOKAHEAD(3) <S_IDENTIFIER>
//)
//
//
//
//]
Field()
}
void Field():
{}{
<S_IDENTIFIER>
}
void Table():
{}{
LOOKAHEAD(5) <S_IDENTIFIER> "." <S_IDENTIFIER>
|
LOOKAHEAD(3) <S_IDENTIFIER>
}
////////////////////////////////////////////////////////
SKIP:
{
" "
| "\t"
| "\r"
| "\n"
}
TOKEN: /* SQL Keywords. prefixed with K_ to avoid name clashes */
{
<K_CREATE: "CREATE">
|
<K_SELECT: "SELECT">
}
TOKEN : /* Numeric Constants */
{
< S_DOUBLE: ((<S_LONG>)? "." <S_LONG> ( ["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> "." (["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> ["e","E"] (["+", "-"])? <S_LONG>
)>
| < S_LONG: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}
TOKEN:
{
< S_IDENTIFIER: ( <LETTER> | <ADDITIONAL_LETTERS> )+ ( <DIGIT> | <LETTER> | <ADDITIONAL_LETTERS> | <SPECIAL_CHARS>)* >
| < #LETTER: ["a"-"z", "A"-"Z", "_", "$"] >
| < #SPECIAL_CHARS: "$" | "_" | "#" | "@">
| < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")*>
| < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])+ "\"" | ("`" (~["\n","\r","`"])+ "`") | ( "[" ~["0"-"9","]"] (~["\n","\r","]"])* "]" ) >
/*
To deal with database names (columns, tables) using not only latin base characters, one
can expand the following rule to accept additional letters. Here is the addition of german umlauts.
There seems to be no way to recognize letters by an external function to allow
a configurable addition. One must rebuild JSqlParser with this new "Letterset".
*/
| < #ADDITIONAL_LETTERS: ["ä","ö","ü","Ä","Ö","Ü","ß"] >
}
You could rewrite your grammar like this
Statement --> "select" Column
Column --> Prefix <ID>
Prefix --> (<ID> ".")*
Now the only choice is whether to iterate or not. Assuming a "." can't follow a Column, this is easily done with a lookahead of 2:
Statement --> "select" Column
Column --> Prefix <ID>
Prefix --> (LOOKAHEAD( <ID> ".") <ID> ".")*
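In JavaCC notation that could look like the following sketch, reusing the K_SELECT and S_IDENTIFIER tokens from the question's grammar:
void Statement() : {} {
  <K_SELECT> Column()
}
void Column() : {} {
  Prefix() <S_IDENTIFIER>
}
void Prefix() : {} {
  ( LOOKAHEAD(<S_IDENTIFIER> ".") <S_IDENTIFIER> "." )*
}
The syntactic lookahead succeeds only when an identifier is followed by a ".", so the loop stops with the final identifier left for Column to consume.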
Indeed, the following grammar in flex+bison (an LR parser) works fine, recognizing all the following sentences correctly:
create mydb.mytable
create mytable
select mydb.mytable.myfield
select mytable.myfield
select myfield
So it is indeed due to a limitation of the LL parser:
%%
statement:
create_sentence
|
select_sentence
;
create_sentence: CREATE table
;
select_sentence: SELECT table '.' ID
|
SELECT ID
;
table : table '.' ID
|
ID
;
%%
If you need Table to be its own nonterminal, you can do this by using a boolean parameter that says whether the table is expected to be followed by a dot.
void Statement():{}{
"select" Column() | "create" "table" Table(false) }
void Column():{}{
[LOOKAHEAD(<ID> ".") Table(true) "."] <ID> }
void Table(boolean expectDot):{}{
<ID> MoreTable(expectDot) }
void MoreTable(boolean expectDot):{}{
LOOKAHEAD("." <ID> ".", {expectDot}) "." <ID> MoreTable(expectDot)
|
LOOKAHEAD(".", {!expectDot}) "." <ID> MoreTable(expectDot)
|
{}
}
Doing it this way precludes using Table in any syntactic lookahead specifications either directly or indirectly. E.g. you shouldn't have LOOKAHEAD( Table()) anywhere in your grammar, because semantic lookahead is not used during syntactic lookahead. See the FAQ for more information on that.
Your examples are parsed perfectly well using JSqlParser V0.9.x (https://github.com/JSQLParser/JSqlParser)
CCJSqlParserUtil.parse("SELECT mycolumn");
CCJSqlParserUtil.parse("SELECT mytable.mycolumn");
CCJSqlParserUtil.parse("SELECT mydatabase.mytable.mycolumn");
Here is a short JavaCC grammar:
PARSER_BEGIN(TestParser)
public class TestParser
{
}
PARSER_END(TestParser)
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}
TOKEN : /* LITERALS */
{
<VOID: "void">
| <LPAR: "("> | <RPAR: ")">
| <LBRAC: "{"> | <RBRAC: "}">
| <COMMA: ",">
| <DATATYPE: "int">
| <#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <DOUBLE_QUOTE_LITERAL: "\"" (~["\""])*"\"" >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
| <VARIABLE: "$"<IDENTIFIER> >
}
public void input():{} { (statement())+ <EOF> }
private void statement():{}
{
<VOID> <IDENTIFIER> <LPAR> (<DATATYPE> <IDENTIFIER> (<COMMA> <DATATYPE> <IDENTIFIER>)*)? <RPAR>
<LBRAC>
<RBRAC>
}
I'd like this parser to handle the following kind of input with a "grammar-free" section (the character '}' would be the end of the section):
void fun(int i, int j)
{
Hello world the value of i is ${i}
and j=${j}.
}
the grammar-free section would return a
java.util.List<String_or_VariableReference>
How should I modify my JavaCC parser to handle this section?
Thanks.
If I understand the question correctly, you want to allow essentially arbitrary input for a while and then switch back to your language. If you can decide when to make the switch based purely on tokens, then this is easy to do using two lexical states. Use the DEFAULT state for your programming language. When a "{" is seen in the DEFAULT state, switch to the other state:
TOKEN: { <LBRACE : "{" > : FREE }
In the FREE state, when a "}" is seen, switch back to the DEFAULT state; when any other character is seen, pass it on to the parser.
<FREE> TOKEN { <RBRACE : "}" > : DEFAULT }
<FREE> TOKEN { <OTHER : ~["}"] > : FREE }
In the parser you can have
void freeSection() : {} { <LBRACE> (<OTHER>)* <RBRACE> }
If you want to do something with all those OTHER characters, see question 5.2 in the FAQ. http://www.engr.mun.ca/~theo/JavaCC-FAQ
If you want to capture variable references such as "${i}" in the FREE state, you can do that too. Add
<FREE> TOKEN { <VARREF : "${" (["a"-"z"]|["A"-"Z"])* "}" > }
I created a simple grammar in ANTLRWorks. Then I generated code and I have two files: grammarLexer.java and grammarParser.java. My goal is to map my grammar to the Java language. What should I do next to achieve it?
Here is my grammar:
grammar grammar;
prog : ((FOR | WHILE | IF | PRINT | DECLARATION | ENTER | (WS* FUNCTION) | VARIABLE) | FUNCTION_DEC)+;
FOR : WS* 'for' WS+ VARIABLE WS+ DIGIT+ WS+ DIGIT+ WS* ENTER ( FOR | WHILE | IF | PRINT | DECLARATION | ENTER | (WS* FUNCTION) | INC_DEC )* WS* 'end' WS* ENTER;
WHILE : WS* 'while' WS+ (VARIABLE | DIGIT+) WS* EQ_OPERATOR WS* (VARIABLE | DIGIT+) WS* ENTER (FOR | WHILE | IF | PRINT | DECLARATION | ENTER | (WS* FUNCTION) | (WS* INC_DEC))* WS* 'end' WS* ENTER;
IF : WS* 'if' WS+ ( FUNCTION | VARIABLE | DIGIT+) WS* EQ_OPERATOR WS* (VARIABLE | DIGIT+) WS* ENTER (FOR | WHILE | IF | PRINT | DECLARATION | ENTER | (WS* FUNCTION) | INC_DEC)* ( WS* 'else' ENTER (FOR | WHILE | IF | PRINT | DECLARATION | ENTER | (WS* FUNCTION) | (WS* INC_DEC))*)? WS* 'end' WS* ENTER;
CHAR : ('a'..'z'|'A'..'Z')+;
EQ_OPERATOR : ('<' | '>' | '==' | '>=' | '<=' | '!=');
DIGIT : '0'..'9'+;
ENTER : '\n';
WS : ' ' | '\t';
PRINT_TEMPLATE : WS+ (('"' (CHAR | DIGIT | WS)* '"') | VARIABLE | DIGIT+ | FUNCTION | INC_DEC);
PRINT : WS* 'print' PRINT_TEMPLATE (',' PRINT_TEMPLATE)* WS* ENTER;
VARIABLE : CHAR(CHAR|DIGIT)*;
FUN_TEMPLATE : WS* (VARIABLE | DIGIT+ | '"' (CHAR | DIGIT | WS)* '"');
FUNCTION : VARIABLE '(' (FUN_TEMPLATE (WS* ',' FUN_TEMPLATE)*)? ')' WS* ENTER*;
DECLARATION : WS* VARIABLE WS* ('=' WS* (DIGIT+ | '"' (CHAR | DIGIT | WS)* '"' | VARIABLE)) WS* ENTER;
FUNCTION_DEC : WS*'def' WS* FUNCTION ( FOR | WHILE | IF | PRINT | DECLARATION | ENTER | (WS* FUNCTION) | INC_DEC )* WS* 'end' WS* ENTER*;
INC_DEC : VARIABLE ('--' | '++') WS* ENTER*;
Here is my Main class for parser:
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonToken;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.Parser;
public class Main {
public static void main(String[] args) throws Exception {
// the input source
String source =
"for i 1 3\n " +
"printHi()\n " +
"end\n " +
"if fun(y, z) == 0\n " +
"end\n ";
// create an instance of the lexer
grammarLexer lexer = new grammarLexer(new ANTLRStringStream(source));
// wrap a token-stream around the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// traverse the tokens and print them to see if the correct tokens are created
int n = 1;
for(Object o : tokens.getTokens()) {
CommonToken token = (CommonToken)o;
System.out.println("token(" + n + ") = " + token.getText().replace("\n", "\\n"));
n++;
}
grammarParser parser = new grammarParser(tokens);
parser.prog();
}
}
As I already mentioned in comments: your overuse of lexer rules is wrong. Look at lexer rules as being the fundamental building blocks of your language. Much like how you'd describe water in chemistry. You would not describe water like this:
WATER
: 'HHO'
;
I.e.: as a single element. Water should be described as 3 separate elements:
water
: Hydrogen Hydrogen Oxygen
;
Hydrogen : 'H';
Oxygen : 'O';
where Hydrogen and Oxygen are the fundamental building blocks (lexer rules) and water is the compound (the parser rule).
A good rule of thumb is that if you're creating lexer rules that consist of several other lexer rules, chances are there's something fishy in your grammar. This is not always the case, of course.
Let's say you want to parse the following input:
for i 1 3
print(i)
end
if fun(y, z) == 0
print('foo')
end
A grammar could look like this:
grammar T;
options {
output=AST;
}
tokens {
BLOCK;
CALL;
PARAMS;
}
// parser rules
parse
: block EOF!
;
block
: stat* -> ^(BLOCK stat*)
;
stat
: for_stat
| if_stat
| func_call
;
for_stat
: FOR^ ID expr expr block END!
;
if_stat
: IF^ expr block END!
;
expr
: eq_expr
;
eq_expr
: atom (('==' | '!=')^ atom)*
;
atom
: func_call
| INT
| ID
| STR
;
func_call
: ID '(' params ')' -> ^(CALL ID params)
;
params
: (expr (',' expr)*)? -> ^(PARAMS expr*)
;
// lexer rules
FOR : 'for';
END : 'end';
IF : 'if';
ID : ('a'..'z' | 'A'..'Z')+;
INT : '0'..'9'+;
STR : '\'' ~('\'')* '\'';
SP : (' ' | '\t' | '\r' | '\n')+ {skip();};
And if you now run this test class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src =
"for i 1 3 \n" +
" print(i) \n" +
"end \n" +
" \n" +
"if fun(y, z) == 0 \n" +
" print('foo') \n" +
"end \n";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
you'll see some output being printed to the console which corresponds to the following AST: