Antlr4 - Get token name - java

I have following (reduced) grammar:
grammar Test;
IDENTIFIER: [a-z]+ [a-zA-Z0-9]*;
WS: [ \t\n] -> skip;
compilationUnit:
field* EOF;
field:
type IDENTIFIER;
type:
(builtinType|complexType) ('[' ']')*;
builtinType:
'bool' | 'u8';
complexType:
IDENTIFIER;
And following program:
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
public class Main{
public static void main(String[] args){
TestLexer tl=new TestLexer(CharStreams.fromString("u8 foo bool bar complex baz complex[][] baz2"));
TestParser tp=new TestParser(new CommonTokenStream(tl));
TestParser.CompilationUnitContext cuc=tp.compilationUnit();
System.out.println("CompilationUnit:"+cuc);
for(var field:cuc.field()){
System.out.println("Field: "+field);
System.out.println("Field.type: "+ field.type());
System.out.println("Field.type.builtinType: "+field.type().builtinType());
System.out.println("Field.type.complexType: "+field.type().complexType());
if(field.type().complexType()!=null)
System.out.println("Field.type.complexType.IDENTIFIER: "+field.type().complexType().IDENTIFIER());
}
}
}
To differentiate complexType and builtinType, I can look, which is not-null.
But, if I want to distinguish between bool and u8, how can I do that?
This question would answer my question, but it is for Antlr3.

Either use alternative labels:
builtinType
: 'bool' #builtinTypeBool
| 'u8' #builtinTypeU8
;
and/or define these tokens in the lexer:
builtinType
: BOOL
| U8
;
BOOL : 'bool';
U8 : 'u8';
so that you can more easily inspect the token's type in a visitor/listener:
YourParser.BuiltinTypeContext ctx = ...
if (ctx.start.getType() == YourLexer.BOOL) {
// it's a BOOL token
}

Related

Antlr3: building parse tree for qualified names

I couldn't find a question/answer that comes close to helping with my issue. Therefore, I am posting this question here.
I am trying to build a parse tree for qualified names. The below example shows an example.
E.g.,
foo_boo.aaa.ccc1_c
Here I have dot separated words. I am using antlr3 and below is my grammer.
parse
: expr
;
list_expr : <I removed the grammar here>
SimpleType : ('a'..'z'|'A'..'Z'|'_')('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
QualifiedType : SimpleType | SimpleType ('\.' SimpleType)+;
expr : list_expr
| QualifiedType
| union_expr;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
Here, SympleType represents grammar for a word. My requirement is to build the grammar for the QualifiedType. The current grammar given in above is not working as expected (QualifiedType : SimpleType | SimpleType ('\.'SimpleType)+;). How to write correct grammar for Qualified names (Dot separated words)?
Make QualifiedType a parser rule instead of a lexer rule:
qualifiedType : SimpleType ('.' SimpleType)*;
Also, '\.' does not need an escape: '.' is OK.
EDIT
You'll have to set the output to AST and apply some tree rewrite rules to make it work properly. Here's a quick demo:
grammar T;
options {
output=AST;
}
tokens {
Root;
QualifiedName;
}
parse
: qualifiedType EOF -> ^(Root qualifiedType)
;
qualifiedType
: SimpleType ('.' SimpleType)* -> ^(QualifiedName SimpleType+)
;
SimpleType
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*
;
And if you now run the code:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.CommonTree;
import org.antlr.runtime.tree.DOTTreeGenerator;
import org.antlr.stringtemplate.StringTemplate;
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRStringStream("foo_boo.aaa.ccc1_c"));
TParser parser = new TParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
you'll get some DOT output, which corresponds to the following AST:

ANTLR grammar returning java method name which can be executed using java compiler

I have the below ANTLR grammar:
tree grammar findApp;
options {
tokenVocab=TSRTemplate;
ASTLabelType=CommonTree;
backtrack=true;
}
#header {
}
#members {
}
prog returns [String value]
: ^(ROOT f=formula) {$value = $f.value; }
;
formula returns [String value]
: ^(DEFINITION INTEGER t=calc_expr) {$value = $t.value;}
;
calculation_expr returns [String value]
#init {
$value = null;
}
: calculation_unit {$value = $calculation_unit.value;}
| ^(PLUS n1=calculation_expr n2=calculation_expr) {
$value="addApproach(<calculation_expr1>, <calculation_expr2>)";
}
;
calc_unit returns [String value]
: cf_expr {$value = $cf_expr.value;}
;
cf_expr returns [String value]
: ^(GET_APP string_expr)
{
$value = "getApp("+$STRING.text + ")";
}
;
What I need to achieve is as below:
For example, the below grammar:
cf_expr returns [String value]
: ^(GET_APP string_expr)
{
$value = "getApp("+$STRING.text + ")";
}
;
will return getApp(<some parameter>), which is a method name.
This method is implemented in a java class (Method.java),
I want my grammar to return this method name and compile this java class (Method.java) and execute the method implementation in this class.
Any suggestions on how to integrate java class compiling with ANTLR will be helpful.

How to implement JJTree on grammar

I have an assignment to use JavaCC to make a Top-Down Parser with Semantic Analysis for a language supplied by the lecturer. I have the production rules written out and no errors.
I'm completely stuck on how to use JJTree for my code and my hours of scouring the internet for tutorials hasn't gotten me anywhere.
Just wondering could anyone take some time out to explain how to implement JJTree in the code?
Or if there's a hidden step-by-step tutorial out there somewhere that would be a great help!
Here are some of my production rules in case they help.
Thanks in advance!
void program() : {}
{
(decl())* (function())* main_prog()
}
void decl() #void : {}
{
(
var_decl() | const_decl()
)
}
void var_decl() #void : {}
{
<VAR> ident_list() <COLON> type()
(<COMMA> ident_list() <COLON> type())* <SEMIC>
}
void const_decl() #void : {}
{
<CONSTANT> identifier() <COLON> type() <EQUAL> expression()
( <COMMA> identifier() <COLON> type() <EQUAL > expression())* <SEMIC>
}
void function() #void : {}
{
type() identifier() <LBR> param_list() <RBR>
<CBL>
(decl())*
(statement() <SEMIC> )*
returnRule() (expression() | {} )<SEMIC>
<CBR>
}
Creating an AST using JavaCC looks a lot like creating a "normal" parser (defined in a jj file). If you already have a working grammar, it's (relatively) easy :)
Here are the steps needed to create an AST:
rename your jj grammar file to jjt
decorate it with root-labels (the italic words are my own terminology...)
invoke jjtree on your jjt grammar, which will generate a jj file for you
invoke javacc on your generated jj grammar
compile the generated java source files
test it
Here's a quick step-by-step tutorial, assuming you're using MacOS or *nix, have the javacc.jar file in the same directory as your grammar file(s) and java and javac are on your system's PATH:
1
Assuming your jj grammar file is called TestParser.jj, rename it:
mv TestParser.jj TestParser.jjt
2
Now the tricky part: decorating your grammar so that the proper AST structure is created. You decorate an AST (or node, or production rule (all the same)) by adding a # followed by an identifier after it (and before the :). In your original question, you have a lot of #void in different productions, meaning you're creating the same type of AST's for different production rules: this is not what you want.
If you don't decorate your production, the name of the production is used as the type of the node (so, you can remove the #void):
void decl() :
{}
{
var_decl()
| const_decl()
}
Now the rule simply returns whatever AST the rule var_decl() or const_decl() returned.
Let's now have a look at the (simplified) var_decl rule:
void var_decl() #VAR :
{}
{
<VAR> id() <COL> id() <EQ> expr() <SCOL>
}
void id() #ID :
{}
{
<ID>
}
void expr() #EXPR :
{}
{
<ID>
}
which I decorated with the #VAR type. This now means that this rule will return the following tree structure:
VAR
/ | \
/ | \
ID ID EXPR
As you can see, the terminals are discarded from the AST! This also means that the id and expr rules loose the text their <ID> terminal matched. Of course, this is not what you want. For the rules that need to keep the inner text the terminal matched, you need to explicitly set the .value of the tree to the .image of the matched terminal:
void id() #ID :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void expr() #EXPR :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
causing the input "var x : int = i;" to look like this:
VAR
|
.---+------.
/ | \
/ | \
ID["x"] ID["int"] EXPR["i"]
This is how you create a proper structure for your AST. Below follows a small grammar that is a very simple version of your own grammar including a small main method to test it all:
// TestParser.jjt
PARSER_BEGIN(TestParser)
public class TestParser {
public static void main(String[] args) throws ParseException {
TestParser parser = new TestParser(new java.io.StringReader(args[0]));
SimpleNode root = parser.program();
root.dump("");
}
}
PARSER_END(TestParser)
TOKEN :
{
< OPAR : "(" >
| < CPAR : ")" >
| < OBR : "{" >
| < CBR : "}" >
| < COL : ":" >
| < SCOL : ";" >
| < COMMA : "," >
| < VAR : "var" >
| < EQ : "=" >
| < CONST : "const" >
| < ID : ("_" | <LETTER>) ("_" | <ALPHANUM>)* >
}
TOKEN :
{
< #DIGIT : ["0"-"9"] >
| < #LETTER : ["a"-"z","A"-"Z"] >
| < #ALPHANUM : <LETTER> | <DIGIT> >
}
SKIP : { " " | "\t" | "\r" | "\n" }
SimpleNode program() #PROGRAM :
{}
{
(decl())* (function())* <EOF> {return jjtThis;}
}
void decl() :
{}
{
var_decl()
| const_decl()
}
void var_decl() #VAR :
{}
{
<VAR> id() <COL> id() <EQ> expr() <SCOL>
}
void const_decl() #CONST :
{}
{
<CONST> id() <COL> id() <EQ> expr() <SCOL>
}
void function() #FUNCTION :
{}
{
type() id() <OPAR> params() <CPAR> <OBR> /* ... */ <CBR>
}
void type() #TYPE :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void id() #ID :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void params() #PARAMS :
{}
{
(param() (<COMMA> param())*)?
}
void param() #PARAM :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void expr() #EXPR :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
3
Let the jjtree class (included in javacc.jar) create a jj file for you:
java -cp javacc.jar jjtree TestParser.jjt
4
The previous step has created the file TestParser.jj (if everything went okay). Let javacc (also present in javacc.jar) process it:
java -cp javacc.jar javacc TestParser.jj
5
To compile all source files, do:
javac -cp .:javacc.jar *.java
(on Windows, do: javac -cp .;javacc.jar *.java)
6
The moment of truth has arrived: let's see if everything actually works! To let the parser process the input:
var n : int = I;
const x : bool = B;
double f(a,b,c)
{
}
execute the following:
java -cp . TestParser "var n : int = I; const x : bool = B; double f(a,b,c) { }"
and you should see the following being printed on your console:
PROGRAM
decl
VAR
ID
ID
EXPR
decl
CONST
ID
ID
EXPR
FUNCTION
TYPE
ID
PARAMS
PARAM
PARAM
PARAM
Note that you don't see the text the ID's matched, but believe me, they're there. The method dump() simply does not show it.
HTH
EDIT
For a working grammar including expressions, you could have a look at the following expression evaluator of mine: https://github.com/bkiers/Curta (the grammar is in src/grammar). You might want to have a look at how to create root-nodes in case of binary expressions.
Here is an example that uses JJTree
http://anandsekar.github.io/writing-an-interpretter-using-javacc/

In antlr, is there a way to get parsed text of a CommonTree in AST mode?

A simple example:
(grammar):
stat: ID '=' expr NEWLINE -> ^('=' ID expr)
expr: atom '+' atom -> ^(+ atom atom)
atom: INT | ID
...
(input text): a = 3 + 5
The corresponding CommonTree for '3 + 5' contains a '+' token and two children (3, 5).
At this point, what is the best way to recover the original input text that parsed into this tree ('3 + 5')?
I've got the text, position and the line number of individual tokens in the CommonTree object, so theoretically it's possible to make sure only white space tokens are discarded and piece them together using this information, but it looks error prone.
Is there a better way to do this?
Is there a better way to do this?
Better, I don't know. There is another way, of course. You decide what's better.
Another option would be to create a custom AST node class (and corresponding node-adapter) and add the matched text to this AST node during parsing. The trick here is to not use skip(), which discards the token from the lexer, but to put it on the HIDDEN channel. This is effectively the same, however, the text these (hidden) tokens match are still available in the parser.
A quick demo: put all these 3 file in a directory named demo:
demo/T.g
grammar T;
options {
output=AST;
ASTLabelType=XTree;
}
#parser::header {
package demo;
import demo.*;
}
#lexer::header {
package demo;
import demo.*;
}
parse
: expr EOF -> expr
;
expr
#after{$expr.tree.matched = $expr.text;}
: Int '+' Int ';' -> ^('+' Int Int)
;
Int
: '0'..'9'+
;
Space
: ' ' {$channel=HIDDEN;}
;
demo/XTree.java
package demo;
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
public class XTree extends CommonTree {
protected String matched;
public XTree(Token t) {
super(t);
matched = null;
}
}
demo/Main.java
package demo;
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "12 + 42 ;";
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.setTreeAdaptor(new CommonTreeAdaptor(){
#Override
public Object create(Token t) {
return new XTree(t);
}
});
XTree root = (XTree)parser.parse().getTree();
System.out.println("tree : " + root.toStringTree());
System.out.println("matched : " + root.matched);
}
}
You can run this demo by opening a shell and cd-ing to the directory that holds the demo directory and execute the following:
java -cp demo/antlr-3.3.jar org.antlr.Tool demo/T.g
javac -cp demo/antlr-3.3.jar demo/*.java
java -cp .:demo/antlr-3.3.jar demo.Main
which will produce the following output:
tree : (+ 12 42)
matched : 12 + 42 ;
Another possibility is to use TokenRewriteStream which has several toString() methods.
To borrow from #Bart Kiers' example Demo/Main.java
TokenRewriteStream tokens = new TokenRewriteStream(lexer)
TParser parser = new TParser(tokens);
...
tokens.toString(n.getTokenStartIndex(), n.getTokenStopIndex() + 1).trim()
So given any node 'n' of your parse tree, calling toString() on it as above will produce the string that "generated" this node.

Fetching expressions from a string using ANTLR in JAVA

Given a String like..
(a+(a+b)), (d*e) :- (e-f)
Note: (d*e) and (e-f) are different expressions. How can I fetch the expressions from this string. I have the grammar defined as..
parse returns [String value]
: addExp {$value=$addExp.value;} EOF
;
addExp returns [String value]
: multExp {$value=$multExp.value;} (('+' | '-' | '*') multExp{$value+= '+' + $multExp.value;})*
;
multExp returns [String value]
: atom {$value=$atom.value;} (('*' | '/') atom {$value+=$atom.value;)*
;
atom returns [String value]
: x=ID {$value=$x.text;}
| '(' addExp ')' {$value='('+$addExp.value+')';}
;
ID : 'a'..'z' | 'A'..'Z';
I tried..
ANTLRStringStream a=new ANTLRStringStream("(a+(a+b)), (d*e) :- (e-f)");
SLexer l=new SLexer(a);
CommonTokenStream c=new CommonTokenStream(l);
SParser p=new Sparser(c);
String exp;
while(exp = p.parse())
{
System.out.println(exp);
}
I'm thinking of something like hasNext() and then fetching.
Your lexer rules TEXT possibly matches an empty string, causing the lexer to create an infinite amount of tokens. Also, you don't need all those return statements after your rule: you can simply grab what a parser (or lexer) rule matched by adding .text after it.
You could let your parser return a List<String>, or let it return a single String repeatedly invoke that parser rule until EOF is encountered.
A little demo:
grammar T;
#parser::members {
public static void main(String[] args) throws Exception {
String src = "likes(a, b) :- likes(a, X), likes(X, b). hates(a, b) " +
":- hates(a,X), hates(X,b). likes(a,b) :- says(god, likes(a,b)).";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
List<String> statements = parser.parse();
for(String s : statements) {
System.out.println(s);
}
}
}
parse returns [List<String> statements]
#init{$statements = new ArrayList<String>();}
: (statement {$statements.add($statement.text);} ~TEXT+)+ EOF
;
statement
: TEXT OPAR params CPAR
;
params
: (param (COMMA param)*)?
;
param
: TEXT
| statement
;
COMMA : ',';
OPAR : '(';
CPAR : ')';
TEXT : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t') {$channel=HIDDEN;};
OTHER : . ;
Note that ~TEXT+ in the parse rule matches one or more tokens other than TEXT.
If you now create a lexer and parser and run the TParser class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar TParser
or
Windows
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar TParser
you will see the following being printed to your console:
likes(a, b)
likes(a, X)
likes(X, b)
hates(a, b)
hates(a,X)
hates(X,b)
likes(a,b)
says(god, likes(a,b))
EDIT
And here's how to return a single String opposed to a List<String>:
#parser::members {
public static void main(String[] args) throws Exception {
String src = "likes(a, b) :- likes(a, X), likes(X, b). hates(a, b) " +
":- hates(a,X), hates(X,b). likes(a,b) :- says(god, likes(a,b)).";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
String s;
while((s = parser.parse()) != null) {
System.out.println(s);
}
}
}
parse returns [String s]
: statement ~(TEXT| EOF)* {$s = $statement.text;}
| EOF {$s = null;}
;
You should just be able to call sentence() repeatedly until you hit the end of input.

Categories

Resources