I have an assignment to use JavaCC to make a Top-Down Parser with Semantic Analysis for a language supplied by the lecturer. I have the production rules written out and no errors.
I'm completely stuck on how to use JJTree for my code and my hours of scouring the internet for tutorials hasn't gotten me anywhere.
Just wondering could anyone take some time out to explain how to implement JJTree in the code?
Or if there's a hidden step-by-step tutorial out there somewhere that would be a great help!
Here are some of my production rules in case they help.
Thanks in advance!
void program() : {}
{
(decl())* (function())* main_prog()
}
void decl() #void : {}
{
(
var_decl() | const_decl()
)
}
void var_decl() #void : {}
{
<VAR> ident_list() <COLON> type()
(<COMMA> ident_list() <COLON> type())* <SEMIC>
}
void const_decl() #void : {}
{
<CONSTANT> identifier() <COLON> type() <EQUAL> expression()
( <COMMA> identifier() <COLON> type() <EQUAL > expression())* <SEMIC>
}
void function() #void : {}
{
type() identifier() <LBR> param_list() <RBR>
<CBL>
(decl())*
(statement() <SEMIC> )*
returnRule() (expression() | {} )<SEMIC>
<CBR>
}
Creating an AST using JavaCC looks a lot like creating a "normal" parser (defined in a jj file). If you already have a working grammar, it's (relatively) easy :)
Here are the steps needed to create an AST:
rename your jj grammar file to jjt
decorate it with root-labels (the italic words are my own terminology...)
invoke jjtree on your jjt grammar, which will generate a jj file for you
invoke javacc on your generated jj grammar
compile the generated java source files
test it
Here's a quick step-by-step tutorial, assuming you're using MacOS or *nix, have the javacc.jar file in the same directory as your grammar file(s) and java and javac are on your system's PATH:
1
Assuming your jj grammar file is called TestParser.jj, rename it:
mv TestParser.jj TestParser.jjt
2
Now the tricky part: decorating your grammar so that the proper AST structure is created. You decorate an AST (or node, or production rule (all the same)) by adding a # followed by an identifier after it (and before the :). In your original question, you have a lot of #void in different productions, meaning you're creating the same type of AST's for different production rules: this is not what you want.
If you don't decorate your production, the name of the production is used as the type of the node (so, you can remove the #void):
void decl() :
{}
{
var_decl()
| const_decl()
}
Now the rule simply returns whatever AST the rule var_decl() or const_decl() returned.
Let's now have a look at the (simplified) var_decl rule:
void var_decl() #VAR :
{}
{
<VAR> id() <COL> id() <EQ> expr() <SCOL>
}
void id() #ID :
{}
{
<ID>
}
void expr() #EXPR :
{}
{
<ID>
}
which I decorated with the #VAR type. This now means that this rule will return the following tree structure:
VAR
/ | \
/ | \
ID ID EXPR
As you can see, the terminals are discarded from the AST! This also means that the id and expr rules loose the text their <ID> terminal matched. Of course, this is not what you want. For the rules that need to keep the inner text the terminal matched, you need to explicitly set the .value of the tree to the .image of the matched terminal:
void id() #ID :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void expr() #EXPR :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
causing the input "var x : int = i;" to look like this:
VAR
|
.---+------.
/ | \
/ | \
ID["x"] ID["int"] EXPR["i"]
This is how you create a proper structure for your AST. Below follows a small grammar that is a very simple version of your own grammar including a small main method to test it all:
// TestParser.jjt
PARSER_BEGIN(TestParser)
public class TestParser {
public static void main(String[] args) throws ParseException {
TestParser parser = new TestParser(new java.io.StringReader(args[0]));
SimpleNode root = parser.program();
root.dump("");
}
}
PARSER_END(TestParser)
TOKEN :
{
< OPAR : "(" >
| < CPAR : ")" >
| < OBR : "{" >
| < CBR : "}" >
| < COL : ":" >
| < SCOL : ";" >
| < COMMA : "," >
| < VAR : "var" >
| < EQ : "=" >
| < CONST : "const" >
| < ID : ("_" | <LETTER>) ("_" | <ALPHANUM>)* >
}
TOKEN :
{
< #DIGIT : ["0"-"9"] >
| < #LETTER : ["a"-"z","A"-"Z"] >
| < #ALPHANUM : <LETTER> | <DIGIT> >
}
SKIP : { " " | "\t" | "\r" | "\n" }
SimpleNode program() #PROGRAM :
{}
{
(decl())* (function())* <EOF> {return jjtThis;}
}
void decl() :
{}
{
var_decl()
| const_decl()
}
void var_decl() #VAR :
{}
{
<VAR> id() <COL> id() <EQ> expr() <SCOL>
}
void const_decl() #CONST :
{}
{
<CONST> id() <COL> id() <EQ> expr() <SCOL>
}
void function() #FUNCTION :
{}
{
type() id() <OPAR> params() <CPAR> <OBR> /* ... */ <CBR>
}
void type() #TYPE :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void id() #ID :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void params() #PARAMS :
{}
{
(param() (<COMMA> param())*)?
}
void param() #PARAM :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void expr() #EXPR :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
3
Let the jjtree class (included in javacc.jar) create a jj file for you:
java -cp javacc.jar jjtree TestParser.jjt
4
The previous step has created the file TestParser.jj (if everything went okay). Let javacc (also present in javacc.jar) process it:
java -cp javacc.jar javacc TestParser.jj
5
To compile all source files, do:
javac -cp .:javacc.jar *.java
(on Windows, do: javac -cp .;javacc.jar *.java)
6
The moment of truth has arrived: let's see if everything actually works! To let the parser process the input:
var n : int = I;
const x : bool = B;
double f(a,b,c)
{
}
execute the following:
java -cp . TestParser "var n : int = I; const x : bool = B; double f(a,b,c) { }"
and you should see the following being printed on your console:
PROGRAM
decl
VAR
ID
ID
EXPR
decl
CONST
ID
ID
EXPR
FUNCTION
TYPE
ID
PARAMS
PARAM
PARAM
PARAM
Note that you don't see the text the ID's matched, but believe me, they're there. The method dump() simply does not show it.
HTH
EDIT
For a working grammar including expressions, you could have a look at the following expression evaluator of mine: https://github.com/bkiers/Curta (the grammar is in src/grammar). You might want to have a look at how to create root-nodes in case of binary expressions.
Here is an example that uses JJTree
http://anandsekar.github.io/writing-an-interpretter-using-javacc/
Related
I have following (reduced) grammar:
grammar Test;
IDENTIFIER: [a-z]+ [a-zA-Z0-9]*;
WS: [ \t\n] -> skip;
compilationUnit:
field* EOF;
field:
type IDENTIFIER;
type:
(builtinType|complexType) ('[' ']')*;
builtinType:
'bool' | 'u8';
complexType:
IDENTIFIER;
And following program:
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
public class Main{
public static void main(String[] args){
TestLexer tl=new TestLexer(CharStreams.fromString("u8 foo bool bar complex baz complex[][] baz2"));
TestParser tp=new TestParser(new CommonTokenStream(tl));
TestParser.CompilationUnitContext cuc=tp.compilationUnit();
System.out.println("CompilationUnit:"+cuc);
for(var field:cuc.field()){
System.out.println("Field: "+field);
System.out.println("Field.type: "+ field.type());
System.out.println("Field.type.builtinType: "+field.type().builtinType());
System.out.println("Field.type.complexType: "+field.type().complexType());
if(field.type().complexType()!=null)
System.out.println("Field.type.complexType.IDENTIFIER: "+field.type().complexType().IDENTIFIER());
}
}
}
To differentiate complexType and builtinType, I can look, which is not-null.
But, if I want to distinguish between bool and u8, how can I do that?
This question would answer my question, but it is for Antlr3.
Either use alternative labels:
builtinType
: 'bool' #builtinTypeBool
| 'u8' #builtinTypeU8
;
and/or define these tokens in the lexer:
builtinType
: BOOL
| U8
;
BOOL : 'bool';
U8 : 'u8';
so that you can more easily inspect the token's type in a visitor/listener:
YourParser.BuiltinTypeContext ctx = ...
if (ctx.start.getType() == YourLexer.BOOL) {
// it's a BOOL token
}
I am new to JavaCC, and have read multiple lookahead tutorials. However when testing lookahead on a simple grammar file have left me puzzled. In this grammar file I just made two parsing rules, 1->double, 2->integers.
The program is supposed to choose on of them, if the input suits the context.
options
{
STATIC =false;
debug_parser = true;
debug_lookahead = true;
}
PARSER_BEGIN(testin)
public class testin
{
public void parse() throws ParseException
{
testin b = new testin(System.in);
b.go();
}
}
PARSER_END(testin)
TOKEN:
{
//testing lookahead
<NUMBER : (["0"-"9"])+>|
<DOT : ".">
}
void go()
:
{}
{
LOOKAHEAD(2) doub() | number()
}
void doub()
:
{
Token bool;
}
{
bool = <NUMBER><DOT><NUMBER>
{
System.out.println(Double.parseDouble(bool.image));
}
}
void number()
:
{
Token mo;
}
{
mo = <NUMBER>
{
System.out.println(Integer.parseInt(mo.image));
}
}
After testing with this code when typing with decimals the input, it works, but when typing with an integer it doesn't and it doesn't output anything. Here is the debug output:
Call: go
Call: doub(LOOKING AHEAD...)
75
Visited token: <<NUMBER>: "75" at line 1 column 1>; Expected token:<<NUMBER>>
Try to specify the whitespace characters:
SKIP : {
< " " | "\t" | "\r" | "\n" >
}
Also, try to specify where the <EOF> is expected:
void go()
:
{}
{
(LOOKAHEAD(2) doub() | number())
<EOF>
}
Here is a short javaCC code:
PARSER_BEGIN(TestParser)
public class TestParser
{
}
PARSER_END(TestParser)
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}
TOKEN : /* LITERALS */
{
<VOID: "void">
| <LPAR: "("> | <RPAR: ")">
| <LBRAC: "{"> | <RBRAC: "}">
| <COMMA: ",">
| <DATATYPE: "int">
| <#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <DOUBLE_QUOTE_LITERAL: "\"" (~["\""])*"\"" >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
| <VARIABLE: "$"<IDENTIFIER> >
}
public void input():{} { (statement())+ <EOF> }
private void statement():{}
{
<VOID> <IDENTIFIER> <LPAR> (<DATATYPE> <IDENTIFIER> (<COMMA> <DATATYPE> <IDENTIFIER>)*)? <RPAR>
<LBRAC>
<RBRAC>
}
I'd like this parser to handle the following kind of input with a "grammar-free" section (character '}' would be the end of the section ):
void fun(int i, int j)
{
Hello world the value of i is ${i}
and j=${j}.
}
the grammar-free section would return a
java.util.List<String_or_VariableReference>
How should I modify my javacc parser to handle this section ?
Thanks.
If I understand the question correctly, you want to allow essentially arbitrary input for a while and then switch back to your language. If you can decide when to make the switch based purely on tokens, then this is easy to do using two lexical states. Use the default state for your programming language. When a "{" is seen in the DEFAULT state, switch to the other state
TOKEN: { <LBRACE : "{" > : FREE }
In the FREE state, when a "}" is seen, switch back to the DEFAULT state; when any other character is seen, pass it on to the parser.
<FREE> TOKEN { <RBRACE : "}" > : DEFAULT }
<FREE> TOKEN { <OTHER : ~["}"] > : FREE }
In the parser you can have
void freeSection() : {} { <LBRACE> (<OTHER>)* <RBRACE> }
If you want to do something with all those OTHER characters, see question 5.2 in the FAQ. http://www.engr.mun.ca/~theo/JavaCC-FAQ
If you want to capture variable references such as "${i}" in the FREE state, you can to that too. Add
<FREE> TOKEN { <VARREF : "${" (["a"-"Z"]|["A"-"Z"])* "}" > }
Given a String like..
(a+(a+b)), (d*e) :- (e-f)
Note: (d*e) and (e-f) are different expressions. How can I fetch the expressions from this string. I have the grammar defined as..
parse returns [String value]
: addExp {$value=$addExp.value;} EOF
;
addExp returns [String value]
: multExp {$value=$multExp.value;} (('+' | '-' | '*') multExp{$value+= '+' + $multExp.value;})*
;
multExp returns [String value]
: atom {$value=$atom.value;} (('*' | '/') atom {$value+=$atom.value;)*
;
atom returns [String value]
: x=ID {$value=$x.text;}
| '(' addExp ')' {$value='('+$addExp.value+')';}
;
ID : 'a'..'z' | 'A'..'Z';
I tried..
ANTLRStringStream a=new ANTLRStringStream("(a+(a+b)), (d*e) :- (e-f)");
SLexer l=new SLexer(a);
CommonTokenStream c=new CommonTokenStream(l);
SParser p=new Sparser(c);
String exp;
while(exp = p.parse())
{
System.out.println(exp);
}
I'm thinking of something like hasNext() and then fetching.
Your lexer rules TEXT possibly matches an empty string, causing the lexer to create an infinite amount of tokens. Also, you don't need all those return statements after your rule: you can simply grab what a parser (or lexer) rule matched by adding .text after it.
You could let your parser return a List<String>, or let it return a single String repeatedly invoke that parser rule until EOF is encountered.
A little demo:
grammar T;
#parser::members {
public static void main(String[] args) throws Exception {
String src = "likes(a, b) :- likes(a, X), likes(X, b). hates(a, b) " +
":- hates(a,X), hates(X,b). likes(a,b) :- says(god, likes(a,b)).";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
List<String> statements = parser.parse();
for(String s : statements) {
System.out.println(s);
}
}
}
parse returns [List<String> statements]
#init{$statements = new ArrayList<String>();}
: (statement {$statements.add($statement.text);} ~TEXT+)+ EOF
;
statement
: TEXT OPAR params CPAR
;
params
: (param (COMMA param)*)?
;
param
: TEXT
| statement
;
COMMA : ',';
OPAR : '(';
CPAR : ')';
TEXT : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t') {$channel=HIDDEN;};
OTHER : . ;
Note that ~TEXT+ in the parse rule matches one or more tokens other than TEXT.
If you now create a lexer and parser and run the TParser class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar TParser
or
Windows
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar TParser
you will see the following being printed to your console:
likes(a, b)
likes(a, X)
likes(X, b)
hates(a, b)
hates(a,X)
hates(X,b)
likes(a,b)
says(god, likes(a,b))
EDIT
And here's how to return a single String opposed to a List<String>:
#parser::members {
public static void main(String[] args) throws Exception {
String src = "likes(a, b) :- likes(a, X), likes(X, b). hates(a, b) " +
":- hates(a,X), hates(X,b). likes(a,b) :- says(god, likes(a,b)).";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
String s;
while((s = parser.parse()) != null) {
System.out.println(s);
}
}
}
parse returns [String s]
: statement ~(TEXT| EOF)* {$s = $statement.text;}
| EOF {$s = null;}
;
You should just be able to call sentence() repeatedly until you hit the end of input.
I am parsing an InputStream for certain patterns to extract values from it, e.g. I would have something like
<span class="filename">foo
I don't want to use a full fledged html parser as I am not interested in the document structure but only in some well defined bits of information. (Only their order is important)
Currently I am using a very simple approach, I have an Object for each Pattern that contains a char[] of the opening and closing 'tag' (in the example opening would be <span class="filename"><a href="and closing " to get the url) and a position marker. For each character read by of the InputStream, I iterate over all Patterns and call the match(char) function that returns true once the opening pattern does match, from then on I collect the following chars in a StringBuilder until the now active pattern does match() again. I then call a function with the ID of the Pattern and the String read, to process it further.
While this works fine in most cases, I wanted to include regular expressions in the pattern, so I could also match something like
<span class="filename" id="234217">foo
At this point I was sure I would reinvent the wheel as this most certainly would have been done before, and I don't really want to write my own regex parser to begin with. However, I could not find anything that would do what I was looking for.
Unfortunately the Scanner class only matches one pattern, not a list of patterns, what alternatives could I use? It should not be heavy and work with Android.
You mean you want to match any <span> element with a given class attribute, irrespective of other attributes it may have? That's easy enough:
Scanner sc = new Scanner(new File("test.txt"), "UTF-8");
Pattern p = Pattern.compile(
"<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
);
while (sc.findWithinHorizon(p, 0) != null)
{
MatchResult m = sc.match();
System.out.println(m.group(1));
}
The file "test.txt" contains the text of your question, and the output is:
http://example.com/foo
and closing
http://example.com/foo
the Scanner.useDelimiter(Pattern) API seems to be what you're looking for. You would have to use an OR (|) separated pattern string.
This pattern can get really complicated really quickly though.
You are right to think this has all been done before :) What you are talking about is a problem of tokenizing and parsing and I therefore suggest you consider JavaCC.
There is something of a learning curve with JavaCC as you learn to understand it's grammar, so below is an implementation to get you started.
The grammar is a chopped down version of the standard JavaCC grammar for HTML. You can add more productions for matching other patterns.
options {
JDK_VERSION = "1.5";
static = false;
}
PARSER_BEGIN(eg1)
import java.util.*;
public class eg1 {
private String currentTag;
private String currentSpanClass;
private String currentHref;
public static void main(String args []) throws ParseException {
System.out.println("Starting parse");
eg1 parser = new eg1(System.in);
parser.parse();
System.out.println("Finishing parse");
}
}
PARSER_END(eg1)
SKIP :
{
< ( " " | "\t" | "\n" | "\r" )+ >
| < "<!" ( ~[">"] )* ">" >
}
TOKEN :
{
<STAGO: "<" > : TAG
| <ETAGO: "</" > : TAG
| <PCDATA: ( ~["<"] )+ >
}
<TAG> TOKEN [IGNORE_CASE] :
{
<A: "a" > : ATTLIST
| <SPAN: "span" > : ATTLIST
| <DONT_CARE: (["a"-"z"] | ["0"-"9"])+ > : ATTLIST
}
<ATTLIST> SKIP :
{
< " " | "\t" | "\n" | "\r" >
| < "--" > : ATTCOMM
}
<ATTLIST> TOKEN :
{
<TAGC: ">" > : DEFAULT
| <A_EQ: "=" > : ATTRVAL
| <#ALPHA: ["a"-"z","A"-"Z","_","-","."] >
| <#NUM: ["0"-"9"] >
| <#ALPHANUM: <ALPHA> | <NUM> >
| <A_NAME: <ALPHA> ( <ALPHANUM> )* >
}
<ATTRVAL> TOKEN :
{
<CDATA: "'" ( ~["'"] )* "'"
| "\"" ( ~["\""] )* "\""
| ( ~[">", "\"", "'", " ", "\t", "\n", "\r"] )+
> : ATTLIST
}
<ATTCOMM> SKIP :
{
< ( ~["-"] )+ >
| < "-" ( ~["-"] )+ >
| < "--" > : ATTLIST
}
void attribute(Map<String,String> attrs) :
{
Token n, v = null;
}
{
n=<A_NAME> [ <A_EQ> v=<CDATA> ]
{
String attval;
if (v == null) {
attval = "#DEFAULT";
} else {
attval = v.image;
if( attval.startsWith("\"") && attval.endsWith("\"") ) {
attval = attval.substring(1,attval.length()-1);
} else if( attval.startsWith("'") && attval.endsWith("'") ) {
attval = attval.substring(1,attval.length()-1);
}
}
if( attrs!=null ) attrs.put(n.image.toLowerCase(),attval);
}
}
void attList(Map<String,String> attrs) : {}
{
( attribute(attrs) )+
}
void tagAStart() : {
Map<String,String> attrs = new HashMap<String,String>();
}
{
<STAGO> <A> [ attList(attrs) ] <TAGC>
{
currentHref=attrs.get("href");
if( currentHref != null && "filename".equals(currentSpanClass) )
{
System.out.println("Found URL: "+currentHref);
}
}
}
void tagAEnd() : {}
{
<ETAGO> <A> <TAGC>
{
currentHref=null;
}
}
void tagSpanStart() : {
Map<String,String> attrs = new HashMap<String,String>();
}
{
<STAGO> <SPAN> [ attList(attrs) ] <TAGC>
{
currentSpanClass=attrs.get("class");
}
}
void tagSpanEnd() : {}
{
<ETAGO> <SPAN> <TAGC>
{
currentSpanClass=null;
}
}
void tagDontCareStart() : {}
{
<STAGO> <DONT_CARE> [ attList(null) ] <TAGC>
}
void tagDontCareEnd() : {}
{
<ETAGO> <DONT_CARE> <TAGC>
}
void parse() : {}
{
(
LOOKAHEAD(2) tagAStart() |
LOOKAHEAD(2) tagAEnd() |
LOOKAHEAD(2) tagSpanStart() |
LOOKAHEAD(2) tagSpanEnd() |
LOOKAHEAD(2) tagDontCareStart() |
LOOKAHEAD(2) tagDontCareEnd() |
<PCDATA>
)*
}