ANTLR: parse configuration file - java

I'm missing some basic knowledge. I started playing around with ANTLR today and could not find any source telling me how to do the following:
I'd like to parse a configuration file a program of mine currently reads in a very ugly way. Basically it looks like:
A [Data] [Data]
B [Data] [Data] [Data]
where A/B/... are objects with their associated data following (dynamic amount, only simple digits).
A grammar should not be that hard to write, but how do I use ANTLR from there?
lexer only: A/B are tokens and I ask for the tokens it read. How do I ask for them, and how do I detect malformed input?
lexer & parser: A/B are parser rules and... how do I know the parser successfully processed A/B? The same object could appear multiple times in the file and I need to consider every single one. It's more like listing instances in the config file.
Edit:
My problem is not the grammar but how to be informed by the parser/lexer of what they actually found/parsed. Ideally, a function would be invoked upon recognition of a rule, as in recursive descent.

ANTLR production rules can have return value(s) you can use to get the contents of your configuration file.
Here's a quick demo:
grammar T;
parse returns [java.util.Map<String, List<Integer>> map]
@init{$map = new java.util.HashMap<String, List<Integer>>();}
  : (line {$map.put($line.key, $line.values);} )+ EOF
  ;
line returns [String key, List<Integer> values]
  : Id numbers (NL | EOF)
    {
      $key = $Id.text;
      $values = $numbers.list;
    }
  ;
numbers returns [List<Integer> list]
@init{$list = new ArrayList<Integer>();}
  : (Num {$list.add(Integer.parseInt($Num.text));} )+
  ;
Num : '0'..'9'+;
Id : ('a'..'z' | 'A'..'Z')+;
NL : '\r'? '\n' | '\r';
Space : (' ' | '\t')+ {skip();};
If you run the class below:
import org.antlr.runtime.*;
import java.util.*;
public class Main {
  public static void main(String[] args) throws Exception {
    String input = "A 12 34\n" +
                   "B 5 6 7 8\n" +
                   "C 9";
    TLexer lexer = new TLexer(new ANTLRStringStream(input));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    Map<String, List<Integer>> values = parser.parse();
    System.out.println(values);
  }
}
the following will be printed to the console:
{A=[12, 34], B=[5, 6, 7, 8], C=[9]}

The grammar should be something like this (it's pseudocode not ANTLR):
FILE ::= STATEMENT ('\n' STATEMENT)*
STATEMENT ::= NAME ITEM*
ITEM ::= '[' \d+ ']'
NAME ::= \w+

If you are looking for a way to execute code when something is parsed, you should use either actions or an AST (look them up in the documentation).
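The idea of running code whenever a rule is recognised can also be sketched without ANTLR. The following is a minimal hand-written Java sketch, not ANTLR-generated code: ConfigReader, parse and the onLine callback are hypothetical names, and the input format is assumed to be the one from the demo above (a name followed by whitespace-separated digits on each line). The callback fires once per recognised line, so an object that appears several times is reported several times, and malformed input raises an exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical hand-written reader; nothing here is ANTLR API.
class ConfigReader {
    // Invokes onLine once for every recognised line, passing (name, numbers).
    static void parse(String input, BiConsumer<String, List<Integer>> onLine) {
        String[] lines = input.split("\r?\n|\r");
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no object
            }
            String[] parts = line.split("[ \t]+");
            if (!parts[0].matches("[A-Za-z]+")) {
                throw new IllegalArgumentException(
                    "line " + (i + 1) + ": bad name '" + parts[0] + "'");
            }
            if (parts.length < 2) {
                throw new IllegalArgumentException(
                    "line " + (i + 1) + ": expected at least one number");
            }
            List<Integer> numbers = new ArrayList<>();
            for (int j = 1; j < parts.length; j++) {
                if (!parts[j].matches("[0-9]+")) {
                    throw new IllegalArgumentException(
                        "line " + (i + 1) + ": bad number '" + parts[j] + "'");
                }
                numbers.add(Integer.parseInt(parts[j]));
            }
            onLine.accept(parts[0], numbers); // the "rule recognised" hook
        }
    }
}
```

A caller could pass a lambda such as (name, numbers) -> registry.add(name, numbers), where registry is whatever structure the program already uses. Very long digit runs would still fail in Integer.parseInt; this sketch does not handle that case.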


Positive lookahead using token types in lexer rules

I am migrating a grammar that I initially wrote using GrammarKit (GrammarKit uses Flex as its lexer).
I am struggling to find the best way to write a positive lookahead using token types in lexer rules.
Below my first experiment with a (very) simplified version of my problem using lookahead based on characters in the stream:
grammar PossitiveLookAheadCharacters;
@header {
  package lookahead;
}
@lexer::members {
  private boolean isChar(int charPosition, char testChar) {
    return _input.LA(charPosition) == testChar;
  }
}
r : CONS | DOT | LEFT_PAR | RIGHT_PAR;
LEFT_PAR : '(';
RIGHT_PAR : ')';
CONS : DOT {isChar(1, '(')}? {isChar(2, ')')}?;
DOT : '.';
WS : [ \t\r\n]+ -> skip ;
This works ok because the lookahead is just based on character comparison.
If I test this using the test rig, I obtain the following expected output:
> grun lookahead.PossitiveLookAheadCharacters r -tokens
.()
[@0,0:0='.',<CONS>,1:0]
[@1,1:1='(',<'('>,1:1]
[@2,2:2=')',<')'>,1:2]
[@3,4:3='<EOF>',<EOF>,2:0]
But I cannot make this work correctly if I want to write the lookahead based on token types rather than on characters from the stream (as I can easily do in Flex). After some trial and error, this is the closest I have come:
grammar PossitiveLookAheadTokenType;
@header {
  package lookahead;
}
@lexer::members {
  private boolean isToken(int tokenPosition, int tokenId) {
    int tokenAtPosition = new UnbufferedTokenStream(this).LA(tokenPosition);
    System.out.println("LA(" + tokenPosition + ") = " + tokenAtPosition);
    return tokenAtPosition == tokenId;
  }
}
r : CONS | DOT | LEFT_PAR | RIGHT_PAR;
LEFT_PAR : '(';
RIGHT_PAR : ')';
CONS : DOT {isToken(1, LEFT_PAR)}? {isToken(2, RIGHT_PAR)}?;
DOT : '.';
WS : [ \t\r\n]+ -> skip ;
If I test this using the test rig, I see that the test expressions are evaluated correctly (in short, this predicate is true: LA(1) == LEFT_PAR && LA(2) == RIGHT_PAR). But the first recognized token is not [@0,0:0='.',<CONS>,1:0] as expected, but [@0,2:2=')',<')'>,1:2] instead. Below is the full output of my test:
> grun lookahead.PossitiveLookAheadTokenType r -tokens
.()
LA(1) = 1
LA(2) = 2
[@0,2:2=')',<')'>,1:2]
[@1,1:1='(',<'('>,1:1]
[@2,2:2=')',<')'>,1:2]
[@3,4:3='<EOF>',<EOF>,2:0]
I thought that the problem might be that the input stream is no longer in the right position, so I tried resetting its position, as this new version of the isToken method illustrates:
private boolean isToken(int tokenPosition, int tokenId) {
int streamPosition = _input.index();
int tokenAtPosition = new UnbufferedTokenStream(this).LA(tokenPosition);
_input.seek(streamPosition);
return tokenAtPosition == tokenId;
}
But this did not help.
So my ANTLR4 question is: what is the recommended way to write a positive lookahead in lexer rules using token types rather than plain characters?
In Flex this is fully possible and it is as simple as writing something like:
{MY_MATCH}/{TOKEN_TO_THE_RIGHT}
What I love about the Flex approach here is that it is fully declarative and based on token types, not chars. I am wondering if something similar is possible in ANTLR4.
This cannot work the way you imagine it because what you are trying to do is to use a token (which is the result of a lexer rule) in an ongoing lexer rule. That means the lexer is in the process of determining the current token and therefore cannot determine another token at the same time.
What you probably want is a parser rule. In this scenario the lexer has done all the work already and you can easily look around for other tokens.
@parser::members {
  private boolean isToken(int position, int tokenType) {
    return _input.LT(position).getType() == tokenType;
  }
}
cons: DOT {isToken(1, LEFT_PAR) && isToken(2, RIGHT_PAR)}?;
r : cons | DOT | LEFT_PAR | RIGHT_PAR;
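The reason this works at the parser level is that the predicate only inspects tokens the lexer has already produced. A minimal plain-Java sketch of that idea follows; it does not use the ANTLR runtime, and LookaheadSketch, TokenType, tokenize, lt and isCons are hypothetical names introduced for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the lexer/parser split; none of this is ANTLR API.
class LookaheadSketch {
    enum TokenType { DOT, LEFT_PAR, RIGHT_PAR, EOF }

    // Stand-in lexer: one token per character, whitespace skipped (like WS -> skip).
    static List<TokenType> tokenize(String input) {
        List<TokenType> tokens = new ArrayList<>();
        for (char c : input.toCharArray()) {
            switch (c) {
                case '.': tokens.add(TokenType.DOT); break;
                case '(': tokens.add(TokenType.LEFT_PAR); break;
                case ')': tokens.add(TokenType.RIGHT_PAR); break;
                case ' ': case '\t': case '\r': case '\n': break;
                default:
                    throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        tokens.add(TokenType.EOF);
        return tokens;
    }

    // Type of the k-th token counted from index pos (k = 1 is the token at pos).
    static TokenType lt(List<TokenType> tokens, int pos, int k) {
        int i = pos + k - 1;
        return i < tokens.size() ? tokens.get(i) : TokenType.EOF;
    }

    // True if the tokens starting at pos are DOT LEFT_PAR RIGHT_PAR.
    static boolean isCons(List<TokenType> tokens, int pos) {
        return lt(tokens, pos, 1) == TokenType.DOT
            && lt(tokens, pos, 2) == TokenType.LEFT_PAR
            && lt(tokens, pos, 3) == TokenType.RIGHT_PAR;
    }
}
```

Because whitespace is dropped before the lookahead runs, ". ( )" is treated the same as ".()" in this sketch, which differs from the character-based lexer predicate in the question.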

ANTLR4 JAVA -Is it possible to extract fragments from the lexer at the Parser Listener point?

I have a Lexer Rule as follows:
PREFIX : [abcd]'_';
EXTRA : ('xyz' | 'XYZ' );
SUFFIX : [ab];
TCHAN : PREFIX EXTRA? DIGIT+ SUFFIX?;
and a parser rule:
tpin : TCHAN
;
In the exit_tpin() listener method, is there a syntax with which I can extract the DIGIT component of the token? Right now I can get the ctx.TCHAN() element, but this is a string. I just want the digit portion of TCHAN.
Or should I remove TCHAN as a token and move that rule into tpin, i.e.
tpin : PREFIX EXTRA? DIGIT+ SUFFIX?
Where I know how to extract DIGIT from the listener.
My guess is that by the time the token is presented to the parser it is too late to deconstruct it... but I was wondering if some ANTLR gurus out there knew of a technique.
If I rewrite my tokenizer, there is a possibility that TCHAN tokens will be mistaken for INT/ID tokens (I think that's why I ended up parsing as I do).
I can always do some regexp work in the listener method... but that seemed like bad form, as I had the individual components earlier. I'm just lazy, and was wondering if a technique other than refactoring the parsing grammar was possible.
In The Definitive ANTLR Reference you can find examples of complex lexers in which much of the work is done. But when learning ANTLR, I would advise using the lexer mostly to split the input stream into small tokens, and doing the big work in the parser. In the present case I would do:
grammar Question;
/* extract digit */
question
: tpin EOF
;
tpin
// : PREFIX EXTRA? DIGIT+ SUFFIX?
// {System.out.println("The only useful information is " + $DIGIT.text);}
: PREFIX EXTRA? number SUFFIX?
{System.out.println("The only useful information is " + $number.text);}
;
number
: DIGIT+
;
PREFIX : [abcd]'_';
EXTRA : ('xyz' | 'XYZ' );
DIGIT : [0-9] ;
SUFFIX : [ab];
WS : [ \t\r\n]+ -> skip ;
Say the input is d_xyz123456b. With the first version
: PREFIX EXTRA? DIGIT+ SUFFIX?
you get
$ grun Question question -tokens data.txt
[@0,0:1='d_',<PREFIX>,1:0]
[@1,2:4='xyz',<EXTRA>,1:2]
[@2,5:5='1',<DIGIT>,1:5]
[@3,6:6='2',<DIGIT>,1:6]
[@4,7:7='3',<DIGIT>,1:7]
[@5,8:8='4',<DIGIT>,1:8]
[@6,9:9='5',<DIGIT>,1:9]
[@7,10:10='6',<DIGIT>,1:10]
[@8,11:11='b',<SUFFIX>,1:11]
[@9,13:12='<EOF>',<EOF>,2:0]
The only useful information is 6
This is because DIGIT+ is translated into a loop that reuses the same DIGIT field:
setState(12);
_errHandler.sync(this);
_la = _input.LA(1);
do {
{
{
setState(11);
((TpinContext)_localctx).DIGIT = match(DIGIT);
}
}
setState(14);
_errHandler.sync(this);
_la = _input.LA(1);
} while ( _la==DIGIT );
Since $DIGIT.text translates to ((TpinContext)_localctx).DIGIT.getText(), only the last digit is retained. That's why I define a subrule number:
: PREFIX EXTRA? number SUFFIX?
which makes it easy to capture the value:
[@0,0:1='d_',<PREFIX>,1:0]
[@1,2:4='xyz',<EXTRA>,1:2]
[@2,5:5='1',<DIGIT>,1:5]
[@3,6:6='2',<DIGIT>,1:6]
[@4,7:7='3',<DIGIT>,1:7]
[@5,8:8='4',<DIGIT>,1:8]
[@6,9:9='5',<DIGIT>,1:9]
[@7,10:10='6',<DIGIT>,1:10]
[@8,11:11='b',<SUFFIX>,1:11]
[@9,13:12='<EOF>',<EOF>,2:0]
The only useful information is 123456
You can even make it simpler:
tpin
: PREFIX EXTRA? INT SUFFIX?
{System.out.println("The only useful information is " + $INT.text);}
;
PREFIX : [abcd]'_';
EXTRA : ('xyz' | 'XYZ' );
INT : [0-9]+ ;
SUFFIX : [ab];
WS : [ \t\r\n]+ -> skip ;
$ grun Question question -tokens data.txt
[@0,0:1='d_',<PREFIX>,1:0]
[@1,2:4='xyz',<EXTRA>,1:2]
[@2,5:10='123456',<INT>,1:5]
[@3,11:11='b',<SUFFIX>,1:11]
[@4,13:12='<EOF>',<EOF>,2:0]
The only useful information is 123456
In the listener you have direct access to these values through the rule context TpinContext:
public static class TpinContext extends ParserRuleContext {
public Token INT;
public TerminalNode PREFIX() { return getToken(QuestionParser.PREFIX, 0); }
public TerminalNode INT() { return getToken(QuestionParser.INT, 0); }
public TerminalNode EXTRA() { return getToken(QuestionParser.EXTRA, 0); }
public TerminalNode SUFFIX() { return getToken(QuestionParser.SUFFIX, 0); }
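If TCHAN is kept as a single token instead, the regexp fallback mentioned in the question could look roughly like the sketch below. It is only an illustration: TchanParts and digitsOf are hypothetical names, and the pattern simply mirrors the lexer rule PREFIX EXTRA? DIGIT+ SUFFIX? from the question, assuming DIGIT is [0-9].

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; it re-parses the token text that the lexer already matched.
class TchanParts {
    // Mirrors PREFIX EXTRA? DIGIT+ SUFFIX?; group 1 captures the digit run.
    private static final Pattern TCHAN =
        Pattern.compile("[abcd]_(?:xyz|XYZ)?([0-9]+)[ab]?");

    // Returns the digit portion, or empty if the text is not a TCHAN.
    static Optional<String> digitsOf(String tokenText) {
        Matcher m = TCHAN.matcher(tokenText);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Inside the listener, the argument would be the token text, for example ctx.TCHAN().getText(). The drawback is the one the question already hints at: the pattern duplicates the lexer rule, so the two have to be kept in sync by hand.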

Antlr4 doesn't recognize identifiers

I'm trying to create a grammar which parses a file line by line.
grammar Comp;
options
{
language = Java;
}
@header {
  package analyseur;
  import java.util.*;
  import component.*;
}
@parser::members {
  /** Line to write in the new java file */
  public String line;
}
start
: objectRule {System.out.println("OBJ"); line = $objectRule.text;}
| anyString {System.out.println("ANY"); line = $anyString.text;}
;
objectRule : ObjectKeyword ID ;
anyString : ANY_STRING ;
ObjectKeyword : 'Object' ;
ID : [a-zA-Z]+ ;
ANY_STRING : (~'\n')+ ;
WhiteSpace : (' '|'\t') -> skip;
When I send the input 'Object o' to the grammar, the output is ANY instead of OBJ.
'Object o' => 'ANY' // I would like OBJ
I know ANY_STRING matches more text, but I listed the lexer rules in this order on purpose. What is the problem?
Thank you very much for your help! ;)
For lexer rules, the rule with the longest match wins, independent of rule ordering. If the match length is the same, then the first listed rule wins.
To make rule order meaningful, reduce the possible match length of the ANY_STRING rule so that it is no longer than any keyword or ID:
ANY_STRING: ~( ' ' | '\n' | '\t' ) ; // also?: '\r' | '\f' | '_'
Update
To see what the lexer is actually doing, dump the token stream.
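The longest-match rule and its tie-break can be imitated in a few lines of plain Java. This is only a sketch under stated assumptions, not how ANTLR is implemented: LongestMatchSketch, winner and rules are hypothetical names, and java.util.regex patterns stand in for the three lexer rules (for these simple patterns a greedy regex match is also the longest one).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Imitates "longest match wins, first listed rule wins ties" with regexes.
class LongestMatchSketch {
    // Name of the rule that wins at the start of input, or null if none matches.
    static String winner(Map<String, Pattern> rules, String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strictly greater: on equal length the earlier rule is kept.
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    // The three rules from the question, in grammar order (LinkedHashMap keeps order).
    static Map<String, Pattern> rules(String anyStringRegex) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("ObjectKeyword", Pattern.compile("Object"));
        rules.put("ID", Pattern.compile("[a-zA-Z]+"));
        rules.put("ANY_STRING", Pattern.compile(anyStringRegex));
        return rules;
    }
}
```

With the original rule, rules("[^\\n]+"), the input "Object o" is claimed entirely by ANY_STRING because it matches 8 characters against 6. With the single-character version, rules("[^ \\n\\t]"), ObjectKeyword and ID both match 6 characters and ObjectKeyword wins because it is listed first.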

How to get line number in ANTLR3 tree-parser @init action

In ANTLR version 3, how can the line number be obtained in the @init action of a high-level tree-parser rule?
For example, in the @init action below, I'd like to push the line number along with the sentence text.
sentence
@init { myNodeVisitor.pushScriptContext( new MyScriptContext( $sentence.text )); }
: assignCommand
| actionCommand;
finally {
m_nodeVisitor.popScriptContext();
}
I need to push the context before the execution of the actions associated with symbols in the rules.
Some things that don't work:
Using $sentence.line -- it's not defined, even though $sentence.text is.
Moving the paraphrase push into the rule actions. Placed before the rule, no token in the rule is available. Placed after the rule, the action happens after actions associated with the rule symbols.
Using this expression in the @init action, which compiles but returns the value 0: getTreeNodeStream().getTreeAdaptor().getToken( $sentence.start ).getLine(). EDIT: Actually, this does work if $sentence.start is either a real token or an imaginary token with a reference; see Bart Kiers' answer below.
It seems that if I can easily get the matched text and the first matched token in the @init action, there should be an easy way to get the line number as well.
You can look 1 step ahead in the token/tree-stream of a tree grammar using the following: CommonTree ahead = (CommonTree)input.LT(1), which you can place in the @init section.
Every CommonTree (the default Tree implementation in ANTLR) has a getToken() method which returns the Token associated with this tree. And each Token has a getLine() method, which, not surprisingly, returns the line number of this token.
So, if you do the following:
sentence
@init {
CommonTree ahead = (CommonTree)input.LT(1);
int line = ahead.getToken().getLine();
System.out.println("line=" + line);
}
: assignCommand
| actionCommand
;
you should be able to see some correct line numbers being printed. I say some, because this won't go as planned in all cases. Let me demonstrate using a simple example grammar:
grammar ASTDemo;
options {
output=AST;
}
tokens {
ROOT;
ACTION;
}
parse
: sentence+ EOF -> ^(ROOT sentence+)
;
sentence
: assignCommand
| actionCommand
;
assignCommand
: ID ASSIGN NUMBER -> ^(ASSIGN ID NUMBER)
;
actionCommand
: action ID -> ^(ACTION action ID)
;
action
: START
| STOP
;
ASSIGN : '=';
START : 'start';
STOP : 'stop';
ID : ('a'..'z' | 'A'..'Z')+;
NUMBER : '0'..'9'+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
whose tree grammar looks like:
tree grammar ASTDemoWalker;
options {
output=AST;
tokenVocab=ASTDemo;
ASTLabelType=CommonTree;
}
walk
: ^(ROOT sentence+)
;
sentence
@init {
CommonTree ahead = (CommonTree)input.LT(1);
int line = ahead.getToken().getLine();
System.out.println("line=" + line);
}
: assignCommand
| actionCommand
;
assignCommand
: ^(ASSIGN ID NUMBER)
;
actionCommand
: ^(ACTION action ID)
;
action
: START
| STOP
;
And if you run the following test class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "\n\n\nABC = 123\n\nstart ABC";
ASTDemoLexer lexer = new ASTDemoLexer(new ANTLRStringStream(src));
ASTDemoParser parser = new ASTDemoParser(new CommonTokenStream(lexer));
CommonTree root = (CommonTree)parser.parse().getTree();
ASTDemoWalker walker = new ASTDemoWalker(new CommonTreeNodeStream(root));
walker.walk();
}
}
you will see the following being printed:
line=4
line=0
As you can see, "ABC = 123" produced the expected output (line 4), but "start ABC" didn't (line 0). This is because the root of the tree built by the actionCommand rule is an ACTION token, and this token is never defined in the lexer, only in the tokens{...} block. Because it doesn't really exist in the input, line 0 is attached to it by default. If you want to change the line number, you need to provide a "reference" token as a parameter to this so-called imaginary ACTION token, from which it copies attributes into itself.
So, if you change the actionCommand rule in the combined grammar into:
actionCommand
: ref=action ID -> ^(ACTION[$ref.start] action ID)
;
the line number would be as expected (line 6).
Note that every parser rule has a start and a stop attribute, which reference its first and last token, respectively. If action were a lexer rule (say FOO), then you could omit the .start:
actionCommand
: ref=FOO ID -> ^(ACTION[$ref] $ref ID)
;
Now the ACTION token has copied all attributes from whatever $ref is pointing to, except the type of the token, which is of course int ACTION. But this also means that it copied the text attribute, so in my example, the AST created by ref=action ID -> ^(ACTION[$ref.start] action ID) could look like:
           [text=START,type=ACTION]
                /          \
               /            \
              /              \
[text=START,type=START]    [text=ABC,type=ID]
Of course, it's a proper AST because the types of the nodes are unique, but it makes debugging confusing since ACTION and START share the same .text attribute.
You can copy all attributes to an imaginary token except the .text and .type by providing a second string parameter, like this:
actionCommand
: ref=action ID -> ^(ACTION[$ref.start, "Action"] action ID)
;
And if you now run the same test class again, you will see the following printed:
line=4
line=6
And if you inspect the tree that is generated, it'll look like this:
[type=ROOT, text='ROOT']
  [type=ASSIGN, text='=']
    [type=ID, text='ABC']
    [type=NUMBER, text='123']
  [type=ACTION, text='Action']
    [type=START, text='start']
    [type=ID, text='ABC']
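The three ways of creating the imaginary ACTION token described above can be summarised with a small plain-Java model. It is not ANTLR code: ImaginaryTokenSketch, Tok and the imaginary overloads are hypothetical names that only mirror the behaviour stated in this answer (line 0 without a reference, attributes copied from the reference except the type, and the text replaced when a second string argument is given).

```java
// Toy model of the behaviour described above; none of this is ANTLR API.
class ImaginaryTokenSketch {
    static final class Tok {
        final String type;
        final String text;
        final int line;

        Tok(String type, String text, int line) {
            this.type = type;
            this.text = text;
            this.line = line;
        }
    }

    // Like ACTION: no reference token, so there is no line to copy (stays 0).
    static Tok imaginary(String type) {
        return new Tok(type, type, 0);
    }

    // Like ACTION[$ref.start]: copies text and line from ref, keeps its own type.
    static Tok imaginary(String type, Tok ref) {
        return new Tok(type, ref.text, ref.line);
    }

    // Like ACTION[$ref.start, "Action"]: copies the line, uses the given text.
    static Tok imaginary(String type, Tok ref, String text) {
        return new Tok(type, text, ref.line);
    }
}
```

With a reference token new Tok("START", "start", 6), the first overload yields line 0, while the other two yield line 6, matching the line=0 and line=6 outputs shown earlier.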

Simple ANTLR error

I'm starting with ANTLR, but I get some errors and I really don't understand why.
Here you have my really simple grammar
grammar Expr;
options {backtrack=true;}
@header {}
@members {}
expr returns [String s]
: (LETTER SPACE DIGIT | TKDC) {$s = $DIGIT.text + $TKDC.text;}
;
// TOKENS
SPACE : ' ' ;
LETTER : 'd' ;
DIGIT : '0'..'9' ;
TKDC returns [String s] : 'd' SPACE 'C' {$s = "d C";} ;
This is the JAVA source, where I only ask for the "expr" result:
import org.antlr.runtime.*;
class Testantlr {
public static void main(String[] args) throws Exception {
ExprLexer lex = new ExprLexer(new ANTLRFileStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream(lex);
ExprParser parser = new ExprParser(tokens);
try {
System.out.println(parser.expr());
} catch (RecognitionException e) {
e.printStackTrace();
}
}
}
The problem occurs when my input file has the following content: d 9
I get the following error:
x line 1:2 mismatched character '9' expecting 'C'
x line 1:3 no viable alternative at input '<EOF>'
Does anyone know what the problem is here?
There are a few things wrong with your grammar:
lexer rules can only return Tokens, so returns [String s] is ignored after TKDC;
backtrack=true in your options section does not apply to lexer rules, that is why you get mismatched character '9' expecting 'C' (no backtracking there!);
the contents of your expr rule: (LETTER SPACE DIGIT | TKDC) {$s = $DIGIT.text + $TKDC.text;} doesn't make much sense (to me). You either want to match LETTER SPACE DIGIT or TKDC, yet you're trying to grab the text of both choices: $DIGIT.text and $TKDC.text.
It looks to me TKDC needs to be "promoted" to a parser rule instead.
I think you dumbed down your example a bit too much to illustrate the problem you were facing. Perhaps it's a better idea to explain your actual problem instead: what are you trying to parse exactly?
