I'm trying to learn to use ANTLR, but I cannot figure out what's wrong with my code in this case. I hope this will be really easy for anyone with some experience with it. This is the grammar (really short).
grammar SmallTest;
@header {
package parseTest;
import java.util.ArrayList;
}
prog returns [ArrayList<ArrayList<String>> all]
:(stat { if ($all == null)
$all = new ArrayList<ArrayList<String>>();
$all.add($stat.res);
} )+
;
stat returns [ArrayList<String> res]
:(element { if ($res == null)
$res = new ArrayList<String>();
$res.add($element.text);
} )+ NEWLINE
| NEWLINE
;
element: ('a'..'z'|'A'..'Z')+ ;
NEWLINE:'\r'? '\n' ;
The problem is that when I generate the Java code, some if statements end up with empty bodies, and the compiler reports an error because of that. I could edit the generated code manually, but that would probably be much worse. I guess something is wrong in this.
Sorry for asking; this has to be really simple, but my example is so similar to those on the site that I cannot see how to narrow down the differences any further.
Thank you very much.
You should put the initialization of your lists inside the @init { ... } block of the rules, which gets executed before anything in the rule is matched.
Also, your element rule should not be a parser rule, but a lexer rule instead (it should start with a capital!).
And the entry point of your parser, the prog rule, should end with the EOF token otherwise the parser might stop before all tokens are handled properly.
Finally, the @header { ... } section only applies to the parser (it is a shorthand for @parser::header { ... }); you need to add the package declaration to the lexer as well.
A working demo:
SmallTest.g
grammar SmallTest;
@header {
package parseTest;
import java.util.ArrayList;
}
@lexer::header {
package parseTest;
}
prog returns [ArrayList<ArrayList<String>> all]
@init {$all = new ArrayList<ArrayList<String>>();}
: (stat {$all.add($stat.res);})+ EOF
;
stat returns [ArrayList<String> res]
@init {$res = new ArrayList<String>();}
: (ELEMENT {$res.add($ELEMENT.text);})* NEWLINE
;
ELEMENT : ('a'..'z'|'A'..'Z')+ ;
NEWLINE : '\r'? '\n' ;
SPACE : ' ' {skip();};
Main.java
package parseTest;
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
SmallTestLexer lexer = new SmallTestLexer(new ANTLRStringStream("a bb ccc\ndddd eeeee\n"));
SmallTestParser parser = new SmallTestParser(new CommonTokenStream(lexer));
System.out.println(parser.prog());
}
}
And to run it all, do:
java -cp antlr-3.3.jar org.antlr.Tool parseTest/SmallTest.g
javac -cp .:antlr-3.3.jar parseTest/*.java
java -cp .:antlr-3.3.jar parseTest.Main
which yields:
[[a, bb, ccc], [dddd, eeeee]]
Try converting element into a token
ELEMENT: ('a'..'z'|'A'..'Z')+ ;
I'm trying to create a grammar which parses a file line by line.
grammar Comp;
options
{
language = Java;
}
@header {
package analyseur;
import java.util.*;
import component.*;
}
@parser::members {
/** Line to write in the new java file */
public String line;
}
start
: objectRule {System.out.println("OBJ"); line = $objectRule.text;}
| anyString {System.out.println("ANY"); line = $anyString.text;}
;
objectRule : ObjectKeyword ID ;
anyString : ANY_STRING ;
ObjectKeyword : 'Object' ;
ID : [a-zA-Z]+ ;
ANY_STRING : (~'\n')+ ;
WhiteSpace : (' '|'\t') -> skip;
When I send the lexeme 'Object o' to the grammar, the output is ANY instead of OBJ.
'Object o' => 'ANY' // I would like OBJ
I know ANY_STRING matches a longer string, but I listed the lexer tokens in order. What is the problem?
Thank you very much for your help ! ;)
For lexer rules, the rule with the longest match wins, independent of rule ordering. If the match length is the same, then the first listed rule wins.
To make rule order meaningful, reduce the possible match length of the ANY_STRING rule to be the same or less than any key word or id:
ANY_STRING: ~( ' ' | '\n' | '\t' ) ; // also?: '\r' | '\f' | '_'
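For intuition, the longest-match behaviour can be simulated outside ANTLR with a few lines of plain Java. This is a toy sketch, not ANTLR itself: the rule set mirrors the grammar above, and winner is a made-up helper that reports which rule matches the most characters from the start of the input.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MaximalMunch {

    // Report "ruleName:length" for whichever rule matches the most characters
    // from the start of the input; ties go to the rule listed first.
    static String winner(String input) {
        Map<String, Integer> rules = new LinkedHashMap<>();
        // ObjectKeyword : 'Object'
        rules.put("ObjectKeyword", input.startsWith("Object") ? 6 : 0);
        // ID : [a-zA-Z]+
        int id = 0;
        while (id < input.length() && Character.isLetter(input.charAt(id))) id++;
        rules.put("ID", id);
        // ANY_STRING : (~'\n')+
        int any = 0;
        while (any < input.length() && input.charAt(any) != '\n') any++;
        rules.put("ANY_STRING", any);

        String best = "none";
        int len = 0;
        for (Map.Entry<String, Integer> e : rules.entrySet()) {
            if (e.getValue() > len) { len = e.getValue(); best = e.getKey(); }
        }
        return best + ":" + len;
    }

    public static void main(String[] args) {
        System.out.println(winner("Object o")); // ANY_STRING:8 -- the whole line wins
        System.out.println(winner("Object"));   // ObjectKeyword:6 -- tie broken by rule order
    }
}
```

On "Object o", ANY_STRING consumes all 8 characters while ObjectKeyword and ID stop at 6, so ANY_STRING wins regardless of where it appears in the grammar; on the bare "Object", all three tie at 6 and rule order decides.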
Update
To see what the lexer is actually doing, dump the token stream.
I cannot get JavaCC to properly disambiguate tokens by their place in a grammar. I have the following JJTree file (I'll call it bug.jjt):
options
{
LOOKAHEAD = 3;
CHOICE_AMBIGUITY_CHECK = 2;
OTHER_AMBIGUITY_CHECK = 1;
SANITY_CHECK = true;
FORCE_LA_CHECK = true;
}
PARSER_BEGIN(MyParser)
import java.util.*;
public class MyParser {
public static void main(String[] args) throws ParseException {
MyParser parser = new MyParser(new java.io.StringReader(args[0]));
SimpleNode root = parser.production();
root.dump("");
}
}
PARSER_END(MyParser)
SKIP:
{
" "
}
TOKEN:
{
<STATE: ("state")>
|<PROD_NAME: (["a"-"z"])+ >
}
SimpleNode production():
{}
{
(
<PROD_NAME>
<STATE>
<EOF>
)
{return jjtThis;}
}
Generate the parser code with the following:
java -cp C:\path\to\javacc.jar jjtree bug.jjt
java -cp C:\path\to\javacc.jar javacc bug.jj
Now, after compiling this, you can run MyParser from the command line with a string to parse as the argument. It prints production if successful and spews an error if it fails.
I tried two simple inputs: foo state and state state. The first one parses, but the second one does not, since both state strings are tokenized as <STATE>. As I set LOOKAHEAD to 3, I expected it to use the grammar and see that one string state must be <STATE> and the other must be <PROD_NAME>. However, no such luck. I have tried changing the various lookahead parameters to no avail. I am also not able to use tokenizer states (where you define different tokens allowable in different states), as this example is part of a more complicated system that will probably have a lot of these types of ambiguities.
Can anyone tell me how to make JavaCC properly disambiguate these tokens, without using tokenizer states?
This is covered in the FAQ under question 4.19.
There are three strategies outlined there
Putting choices in the grammar. See Bart Kiers's answer.
Using semantic look ahead. For this approach you get rid of the production defining STATE and write your grammar like this
SimpleNode production():
{}
{
(
<PROD_NAME>
( LOOKAHEAD({getToken(1).kind == PROD_NAME && getToken(1).image.equals("state")})
<PROD_NAME>
...
|
...other choices...
)
)
{return jjtThis;}
}
If there are no other choices, then
SimpleNode production():
{}
{
(
<PROD_NAME>
( LOOKAHEAD({getToken(1).kind == PROD_NAME && getToken(1).image.equals("state")})
<PROD_NAME>
...
|
{ int[][] expTokSeqs = new int[][] { new int[] {STATE } } ;
throw new ParseException(token, expTokSeqs, tokenImage) ; }
)
)
{return jjtThis;}
}
But, in this case, you still need a token definition for STATE, as it is mentioned in the initialization of expTokSeqs. So you need a production like
< DUMMY > TOKEN : { < STATE : "state" > }
where DUMMY is a lexical state that is never entered.
Using lexical states. The title of the OP's question suggests he doesn't want to do this, but not why. It can be done if the state switching can be contained in the token manager. Suppose a file is a sequence of productions and each production looks like this.
name state : "a" | "b" name ;
That is, it starts with a name, then the keyword "state", a colon, some tokens, and finally a semicolon. (I'm just making this up, as I have no idea what sort of language the OP is trying to parse.) Then you can use three lexical states: DEFAULT, S0, and S1.
In the DEFAULT any sequence of letters (including "state") is a PROD_NAME. In DEFAULT, recognizing a PROD_NAME switches the state to S0.
In S0 any sequence of letters except "state" is a PROD_NAME and "state" is a STATE. In S0, recognizing a STATE token causes the tokenizer to switch to state S1.
In S1, any sequence of letters (including "state") is a PROD_NAME. In S1, recognizing a SEMICOLON switches the state to DEFAULT.
So our example is tokenized like this
name state : "a" | "b" name ;
|__||______||_________________||_________
DEF-   S0            S1         DEFAULT
AULT
The productions are written like this
<*> SKIP: { " " }
<S0> TOKEN: { <STATE: "state"> : S1 }
<DEFAULT> TOKEN:{ <PROD_NAME: (["a"-"z"])+ > : S0 }
<S0,S1> TOKEN:{ <PROD_NAME: (["a"-"z"])+ > }
<S1> TOKEN: { <SEMICOLON : ";" > : DEFAULT }
<S0, DEFAULT> TOKEN : { <SEMICOLON : ";" > }
<*> TOKEN : {
<COLON : ":">
| ...etc...
}
It is possible for the parser to send state-switching commands back to the tokenizer, but it is tricky to get right and fragile. See question 3.12 of the FAQ.
Lookahead does not concern the lexer while it composes characters to tokens. It is used by the parser when it matches non-terminals as composed from terminals (tokens).
If you define "state" to result in a token STATE, well, then that's what it is.
I agree with you, that tokenizer states aren't a good solution for permitting keywords to be used as identifiers. Is this really necessary? There's a good reason for HLL's not to permit this.
OTOH, if you can rewrite your grammar using just <PROD_NAME>s, you might postpone recognition of the keywords until semantic analysis.
The LOOKAHEAD option only applies to the parser (production rules). The tokenizer is not affected by this: it will produce tokens without worrying what a production rule is trying to match. The input "state" will always be tokenized as a STATE, even if the parser is trying to match a PROD_NAME.
You could do something like this (untested, pseudo-ish grammar code ahead!):
SimpleNode production():
{}
{
(
prod_name_or_state()
<STATE>
<EOF>
)
{return jjtThis;}
}
SimpleNode prod_name_or_state():
{}
{
(
<PROD_NAME>
| <STATE>
)
{return jjtThis;}
}
which would match both "foo state" and "state state".
Or the equivalent, but more compact:
SimpleNode production():
{}
{
(
( <PROD_NAME> | <STATE> )
<STATE>
<EOF>
)
{return jjtThis;}
}
I'm missing some basic knowledge. I started playing around with ANTLR today and cannot find any source telling me how to do the following:
I'd like to parse a configuration file a program of mine currently reads in a very ugly way. Basically it looks like:
A [Data] [Data]
B [Data] [Data] [Data]
where A/B/... are objects with their associated data following (dynamic amount, only simple digits).
A grammar should not be that hard but how to use ANTLR now?
lexer only: A/B are tokens and I ask for the tokens it read. How do I ask for this, and how do I detect malformed input?
lexer & parser: A/B are parser rules and... how do I know the parser successfully processed A/B? The same object could appear multiple times in the file, and I need to consider every single one. It's more like listing instances in the config file.
Edit:
My problem is not the grammar but how to get informed by the parser/lexer about what they actually found/parsed. Best would be to invoke a function upon recognition of a rule, like in a recursive-descent parser.
ANTLR production rules can have return value(s) you can use to get the contents of your configuration file.
Here's a quick demo:
grammar T;
parse returns [java.util.Map<String, List<Integer>> map]
@init{$map = new java.util.HashMap<String, List<Integer>>();}
: (line {$map.put($line.key, $line.values);} )+ EOF
;
line returns [String key, List<Integer> values]
: Id numbers (NL | EOF)
{
$key = $Id.text;
$values = $numbers.list;
}
;
numbers returns [List<Integer> list]
@init{$list = new ArrayList<Integer>();}
: (Num {$list.add(Integer.parseInt($Num.text));} )+
;
Num : '0'..'9'+;
Id : ('a'..'z' | 'A'..'Z')+;
NL : '\r'? '\n' | '\r';
Space : (' ' | '\t')+ {skip();};
If you run the class below:
import org.antlr.runtime.*;
import java.util.*;
public class Main {
public static void main(String[] args) throws Exception {
String input = "A 12 34\n" +
"B 5 6 7 8\n" +
"C 9";
TLexer lexer = new TLexer(new ANTLRStringStream(input));
TParser parser = new TParser(new CommonTokenStream(lexer));
Map<String, List<Integer>> values = parser.parse();
System.out.println(values);
}
}
the following will be printed to the console:
{A=[12, 34], B=[5, 6, 7, 8], C=[9]}
The grammar should be something like this (it's pseudocode not ANTLR):
FILE ::= STATEMENT ('\n' STATEMENT)*
STATEMENT ::= NAME ITEM*
ITEM = '[' \d+ ']'
NAME = \w+
If you are looking for a way to execute code when something is parsed, you should either use actions or an AST (look them up in the documentation).
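If pulling in a parser generator feels heavy for such a simple line-oriented format, the same result can also be obtained with plain Java string handling. This is a minimal, dependency-free sketch; parse is a hypothetical helper, not part of any answer above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfigParse {

    // Parse "NAME num num ..." lines into an ordered map of name -> numbers.
    static Map<String, List<Integer>> parse(String input) {
        Map<String, List<Integer>> map = new LinkedHashMap<>();
        for (String line : input.split("\\R")) {
            if (line.trim().isEmpty()) continue;       // skip blank lines
            String[] parts = line.trim().split("\\s+");
            List<Integer> nums = new ArrayList<>();
            for (int i = 1; i < parts.length; i++) {
                nums.add(Integer.parseInt(parts[i]));  // only simple digits, per the question
            }
            map.put(parts[0], nums);
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(parse("A 12 34\nB 5 6 7 8\nC 9"));
        // prints: {A=[12, 34], B=[5, 6, 7, 8], C=[9]}
    }
}
```

The trade-off is that a hand-rolled split gives no real error reporting; the ANTLR grammar above flags malformed input for free.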
I'm starting with ANTLR, but I get some errors and I really don't understand why.
Here you have my really simple grammar
grammar Expr;
options {backtrack=true;}
@header {}
@members {}
expr returns [String s]
: (LETTER SPACE DIGIT | TKDC) {$s = $DIGIT.text + $TKDC.text;}
;
// TOKENS
SPACE : ' ' ;
LETTER : 'd' ;
DIGIT : '0'..'9' ;
TKDC returns [String s] : 'd' SPACE 'C' {$s = "d C";} ;
This is the JAVA source, where I only ask for the "expr" result:
import org.antlr.runtime.*;
class Testantlr {
public static void main(String[] args) throws Exception {
ExprLexer lex = new ExprLexer(new ANTLRFileStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream(lex);
ExprParser parser = new ExprParser(tokens);
try {
System.out.println(parser.expr());
} catch (RecognitionException e) {
e.printStackTrace();
}
}
}
The problem comes when my input file has the following content: d 9.
I get the following error:
x line 1:2 mismatched character '9' expecting 'C'
x line 1:3 no viable alternative at input '<EOF>'
Does anyone know what the problem is here?
There are a few things wrong with your grammar:
lexer rules can only return Tokens, so returns [String s] is ignored after TKDC;
backtrack=true in your options section does not apply to lexer rules, that is why you get mismatched character '9' expecting 'C' (no backtracking there!);
the contents of your expr rule: (LETTER SPACE DIGIT | TKDC) {$s = $DIGIT.text + $TKDC.text;} doesn't make much sense (to me). You either want to match LETTER SPACE DIGIT or TKDC, yet you're trying to grab the text of both choices: $DIGIT.text and $TKDC.text.
It looks to me TKDC needs to be "promoted" to a parser rule instead.
I think you dumbed down your example a bit too much to illustrate the problem you were facing. Perhaps it's a better idea to explain your actual problem instead: what are you trying to parse exactly?
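For illustration, one possible restructuring along those lines, with TKDC promoted to a parser rule so it can carry a return value. This is an untested sketch, not necessarily what the poster ultimately needs:

```
grammar Expr;

expr returns [String s]
  : LETTER SPACE DIGIT {$s = $LETTER.text + $DIGIT.text;}
  | tkdc               {$s = $tkdc.s;}
  ;

// promoted from a lexer rule: parser rules may declare return values
tkdc returns [String s]
  : LETTER SPACE UPPER {$s = "d C";}
  ;

SPACE  : ' ' ;
LETTER : 'd' ;
UPPER  : 'C' ;
DIGIT  : '0'..'9' ;
```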
I'm writing a Perl script and I've come to a point where I need to parse a Java source file line by line checking for references to a fully qualified Java class name. I know the class I'm looking for up front; also the fully qualified name of the source file that is being searched (based on its path).
For example find all valid references to foo.bar.Baz inside the com/bob/is/YourUncle.java file.
At this moment the cases I can think of that it needs to account for are:
The file being parsed is in the same package as the search class.
find foo.bar.Baz references in foo/bar/Boing.java
It should ignore comments.
// this is a comment saying this method returns a foo.bar.Baz or Baz instance
// it shouldn't count
/* a multiline comment as well
this shouldn't count
if I put foo.bar.Baz or Baz in here either */
In-line fully qualified references.
foo.bar.Baz fb = new foo.bar.Baz();
References based off an import statement.
import foo.bar.Baz;
...
Baz b = new Baz();
What would be the most efficient way to do this in Perl 5.8? Some fancy regex perhaps?
open F, $File::Find::name or die;
# these three things are already known
# $classToFind looking for references of this class
# $pkgToFind the package of the class you're finding references of
# $currentPkg package name of the file being parsed
while(<F>){
# ... do work here
}
close F;
# the results are availble here in some form
You also need to skip quoted strings (you can't even skip comments correctly if you don't also deal with quoted strings).
I'd probably write a fairly simple, efficient, and incomplete tokenizer very similar to the one I wrote in node 566467.
Based on that code I'd probably just dig through the non-comment/non-string chunks looking for \bimport\b and \b\Q$toFind\E\b matches. Perhaps similar to:
if( m[
\G
(?:
[^'"/]+
| /(?![/*])
)+
]xgc
) {
my $code = substr( $_, $-[0], $+[0] - $-[0] );
my $imported = 0;
while( $code =~ /\b(import\s+)?\Q$package\E\b/g ) {
if( $1 ) {
... # Found importing of package
while( $code =~ /\b\Q$class\E\b/g ) {
... # Found mention of imported class
}
last;
}
... # Found a package reference
}
} elsif( m[ \G ' (?: [^'\\]+ | \\. )* ' ]xgc
|| m[ \G " (?: [^"\\]+ | \\. )* " ]xgc
) {
# skip quoted strings
} elsif( m[\G//.*]gc ) {
# skip C++ comments
A Regex is probably the best solution for this, although I did find the following module in CPAN that you might be able to use
Java::JVM::Classfile - Parses compiled class files and returns info about them. You would have to compile the files before you could use this.
Also, remember that it can be tricky to catch all possible variants of a multi-line comment with a regex.
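To see why quoted strings trip up comment-stripping regexes, compare a naive block-comment stripper with one that matches string literals first. This is a Java illustration (since Java source is what's being scanned); naive and aware are made-up helpers and the input line is fabricated.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentStrip {

    // Naive: delete /* ... */ with no regard for string literals.
    static String naive(String src) {
        return src.replaceAll("(?s)/\\*.*?\\*/", "");
    }

    // String-aware: match string literals first and keep them verbatim,
    // so a "/*" inside a string cannot open a comment.
    static String aware(String src) {
        Pattern p = Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|(?s:/\\*.*?\\*/)|//.*");
        Matcher m = p.matcher(src);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String g = m.group();
            // keep strings, drop comments
            m.appendReplacement(sb, Matcher.quoteReplacement(g.startsWith("\"") ? g : ""));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String src = "String s = \"not /* a comment\"; int x = 1; /* real comment */";
        System.out.println(naive(src)); // eats everything from the "/*" inside the string onward
        System.out.println(aware(src)); // only the real comment is removed
    }
}
```

The naive version treats the "/*" inside the string literal as a comment opener and deletes the live code after it; the aware version consumes the string literal as a unit before looking for comments.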
This is really just a straight grep for Baz (or for /(foo.bar.| )Baz/ if you're concerned about false positives from some.other.Baz), but ignoring comments, isn't it?
If so, I'd knock together a state engine to track whether you're in a multiline comment or not. The regexes needed aren't anything special. Something along the lines of (untested code):
my $in_comment;
my %matches;
my $line_num = 0;
my $full_target = 'foo.bar.Baz';
my $short_target = (split /\./, $full_target)[-1]; # segment after last . (Baz)
while (my $line = <F>) {
$line_num++;
if ($in_comment) {
next unless $line =~ m|\*/|; # ignore line unless it ends the comment
$line =~ s|.*\*/||; # delete everything prior to end of comment
} elsif ($line =~ m|/\*|) {
if ($line =~ m|\*/|) { # catch /* and */ on same line
$line =~ s|/\*.*\*/||;
} else {
$in_comment = 1;
$line =~ s|/\*.*||; # clear from start of comment to end of line
}
}
$line =~ s|//.*||; # remove single-line comments
$matches{$line_num} = $line if $line =~ /$full_target| $short_target/;
}
for my $key (sort keys %matches) {
print $key, ': ', $matches{$key}, "\n";
}
It's not perfect and the in/out of comment state can be messed up by nested multiline comments or if there are multiple multiline comments on the same line, but that's probably good enough for most real-world cases.
To do it without the state engine, you'd need to slurp the file into a single string, delete the /*...*/ comments, split it back into separate lines, and grep those for non-//-comment hits. But you wouldn't be able to include line numbers in the output that way.
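Actually, the slurp approach can keep line numbers if each block comment is replaced by just the newlines it contained, so every surviving line stays at its original position. A hypothetical helper sketching the idea (in Java, the language being scanned):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineSafeStrip {

    // Replace each /* ... */ comment with the newlines it contained,
    // so every remaining line keeps its original line number.
    static String stripBlockComments(String src) {
        Matcher m = Pattern.compile("(?s)/\\*.*?\\*/").matcher(src);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, m.group().replaceAll("[^\n]", ""));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String in = "a\n/* x\ny */\nb";
        System.out.println(stripBlockComments(in).equals("a\n\n\nb")); // true: 4 lines in, 4 lines out
    }
}
```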
This is what I came up with that works for all the different cases I've thrown at it. I'm still a Perl noob and it's probably not the fastest thing in the world, but it should work for what I need. Thanks for all the answers; they helped me look at this in different ways.
my $className = 'Baz';
my $searchPkg = 'foo.bar';
my @potentialRefs, my @confirmedRefs;
my $samePkg = 0;
my $imported = 0;
my $currentPkg = 'com.bob';
$currentPkg =~ s/\//\./g;
if($currentPkg eq $searchPkg){
$samePkg = 1;
}
my $inMultiLineComment = 0;
open F, $_ or die;
my $lineNum = 0;
while(<F>){
$lineNum++;
if($inMultiLineComment){
if(m|^.*?\*/|){
s|^.*?\*/||; #get rid of the closing part of the multiline comment we're in
$inMultiLineComment = 0;
}else{
next;
}
}
if(length($_) > 0){
s|"([^"\\]*(\\.[^"\\]*)*)"||g; #remove strings first since java cannot have multiline string literals
s|/\*.*?\*/||g; #remove any multiline comments that start and end on the same line
s|//.*$||; #remove the // comments from what's left
if (m|/\*.*$|){
$inMultiLineComment = 1; # now if you have any occurrence of /* then at least some of the next line is in the multiline comment
s|/\*.*$||g;
}
}else{
next; #no sense continuing to process a blank string
}
if (/^\s*(import )?($searchPkg)?(.*)?\b$className\b/){
if($imported || $samePkg){
push(@confirmedRefs, $lineNum);
}else {
push(@potentialRefs, $lineNum);
}
if($1){
$imported = 1;
} elsif($2){
push(@confirmedRefs, $lineNum);
}
}
}
close F;
if($imported){
push(@confirmedRefs, @potentialRefs);
}
for (@confirmedRefs){
print "$_\n";
}
If you are feeling adventurous enough you could have a look at Parse::RecDescent.