Handling EOF, white spaces, and new lines in ANTLR - java

I'm trying to write a grammar to handle binary numbers and compute their values:
grammar T;
options
{
backtrack=true;
}
prog :
(b2 = binarynum NEWLINE)+ EOF {System.out.println($binarynum.value);}
|
b1 = binarynum EOF {System.out.println($binarynum.value);}
;
binarynum returns [double value] :
s1=string '.' s2=string
{$value = $s1.value + $s2.value/Math.pow(2.0,$s2.length);}
|
string
{$value = $string.value;}
;
string returns [double value, int length] :
bit s2=string
{$value = $bit.value*Math.pow(2.0,$s2.length)+$s2.value; $length = $s2.length+1; }
|
bit
{$value = $bit.value; $length = 1; }
;
bit returns [double value] :
'0'
{ $value = 0;}
|
'1'
{ $value = 1;}
;
NEWLINE: ('\r')? '\n' {skip();} ;
Java code:
import org.antlr.runtime.*;
public class TestT {
public static void main(String[] args) throws Exception {
// Create a TLexer that feeds from that stream
//TLexer lexer = new TLexer(new ANTLRInputStream(System.in));
TLexer lexer = new TLexer(new ANTLRFileStream("input.txt"));
// Create a stream of tokens fed by the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// Create a parser that feeds off the token stream
TParser parser = new TParser(tokens);
// Begin parsing at rule prog
parser.prog();
}
}
Input File ("input.txt") contains:
11111.111
1000
1000.1
Error: line 3:4 missing EOF at '.'
I first tested the code with a single input, using the following prog rule:
prog :
binarynum EOF {System.out.println($binarynum.value);}
;
Everything works fine with the above modification and a single input, but I can't get it to work with multiple inputs separated by new lines.
Can someone please help me out and tell me where I went wrong?
I also have another question, when should the EOF not be included in the grammar? When I tested the grammar for one input after removing the EOF from the grammar I received no errors and a correct output.

Can someone please help me out and tell me where I went wrong.
Your lexer is skipping line breaks while your parser uses them. Remove {skip();} from the lexer rule.
I also have another question, when should the EOF not be included in the grammar?
You'll usually have it at the end of your top level parser rule, which will force the parser to consume the entire input.
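As a cross-check for the values the grammar should print for input.txt, here is a minimal plain-Java sketch that applies the same formula as the binarynum rule (integer part plus fraction bits divided by 2 to the number of fraction bits). It does not use ANTLR at all; the class and method names (BinaryValue, valueOf, bits) are made up for illustration.

```java
import java.util.List;

class BinaryValue {
    // Value of a binary numeral such as "1000.1", using the same formula as
    // the grammar: integer part + fraction bits / 2^(number of fraction bits).
    static double valueOf(String numeral) {
        String[] parts = numeral.split("\\.", -1);
        if (parts.length > 2) {
            throw new IllegalArgumentException("more than one '.' in " + numeral);
        }
        double value = bits(parts[0]);
        if (parts.length == 2) {
            value += bits(parts[1]) / Math.pow(2.0, parts[1].length());
        }
        return value;
    }

    // Interprets a non-empty string of '0'/'1' characters as an unsigned integer.
    static double bits(String s) {
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty bit string");
        }
        double v = 0;
        for (char c : s.toCharArray()) {
            if (c != '0' && c != '1') {
                throw new IllegalArgumentException("not a bit: " + c);
            }
            v = v * 2 + (c - '0');
        }
        return v;
    }

    public static void main(String[] args) {
        // The three lines from input.txt in the question.
        for (String line : List.of("11111.111", "1000", "1000.1")) {
            System.out.println(line + " = " + valueOf(line));
        }
    }
}
```

With the three lines from the question this prints 31.875, 8.0 and 8.5, which is what the parser should report once NEWLINE tokens reach it.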

Related

Antlr NoViableAltException Thrown In Java With White Spaces

I have a simple grammar defined in Antlr 3 as shown below:
grammar StringProcessor;
options {
output=AST;
}
@header {
package com.processor;
}
@rulecatch {
// ANTLR does not generate its normal rule try/catch
catch(RecognitionException e) {
throw e;
}
}
truevalue : 'true';
falsevalue : 'false';
nullvalue : 'null';
simpleValue : truevalue | falsevalue | nullvalue | STRING | INTEGER | FLOAT;
INTEGER : '0'..'9'+;
FLOAT : INTEGER'.'INTEGER;
QUOTE : '"';
SPECIALCHAR : '-'|':'|';'|'('|')'|'£'|'&'|'#'|','|'!'|'['|']'|'{'|'}'|'#'|'^'|'*'|'+'|'='|'_'|'<'|'>'|'€'|'$'|'%'|'/'|'.'|'?'|'~'|'|';
STRING : QUOTE('a'..'z'|'A'..'Z'|INTEGER|SPECIALCHAR|WS)+QUOTE;
WS : (' '|'\t'|'\f'|'\n'|'\r')+ {skip();}; // handle white space between keywords
When I try the following STRING in the ANTLRWorks interpreter:
"5Java Developer"
This works. It includes the white space. But when I try to parse this from the Java program, it throws a NoViableAltException. I have seen other posts, but those solutions do not apply to my problem. The WS is part of the STRING. The problem is that the Java program does not parse anything containing white space, whereas the interpreter displays it correctly.
An example to show the Exception:
public static void main(String[] args) throws Exception {
String input = ("\"5Java Developer\"");
StringProcessorParser parser = buildParser(input);
CommonTree commonTree = (CommonTree) parser.simpleValue().getTree(); // exception thrown
}
public static StringProcessorParser buildParser(String query) {
CharStream cs = new ANTLRStringStream(query);
// the input needs to be lexed
StringProcessorLexer lexer = new StringProcessorLexer(cs);
CommonTokenStream tokens = new CommonTokenStream();
StringProcessorParser parser = new StringProcessorParser(tokens);
tokens.setTokenSource(lexer);
// use the ASTTreeAdaptor so that the grammar is aware to build tree in AST format
parser.setTreeAdaptor((TreeAdaptor) new ASTTreeAdaptor().getASTTreeAdaptor());
return parser;
}
Having:
input = new String("\"5JavaDeveloper\""); correctly parses.
Any idea why this is not working?
EDIT:
I have also tried adding $channel = HIDDEN;
but it still does not work:
WS : (' '|'\t'|'\f'|'\n'|'\r')+ { $channel = HIDDEN; skip();}; // handle white space between keywords
Removing the skip() has fixed my problem.

Regular Expressions - tree grammar Antlr Java

I'm trying to write a program in ANTLR (Java) for simplifying regular expressions. I have already written some code (grammar file contents below):
grammar Regexp_v7;
options{
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens{
DOT;
REPEAT;
RANGE;
NULL;
}
fragment
ZERO
: '0'
;
fragment
DIGIT
: '1'..'9'
;
fragment
EPSILON
: '#'
;
fragment
FI
: '%'
;
ID
: EPSILON
| FI
| 'a'..'z'
| 'A'..'Z'
;
NUMBER
: ZERO
| DIGIT (ZERO | DIGIT)*
;
WHITESPACE
: ('\r' | '\n' | ' ' | '\t' ) + {$channel = HIDDEN;}
;
list
: (reg_exp ';'!)*
;
term
: ID -> ID
| '('! reg_exp ')'!
;
repeat_exp
: term ('{' range_exp '}')+ -> ^(REPEAT term (range_exp)+)
| term -> term
;
range_exp
: NUMBER ',' NUMBER -> ^(RANGE NUMBER NUMBER)
| NUMBER (',') -> ^(RANGE NUMBER NULL)
| ',' NUMBER -> ^(RANGE NULL NUMBER)
| NUMBER -> ^(RANGE NUMBER NUMBER)
;
kleene_exp
: repeat_exp ('*'^)*
;
concat_exp
: kleene_exp (kleene_exp)+ -> ^(DOT kleene_exp (kleene_exp)+)
| kleene_exp -> kleene_exp
;
reg_exp
: concat_exp ('|'^ concat_exp)*
;
My next goal is to write a tree grammar that simplifies regular expressions (e.g. a|a -> a). I have done some coding (see below), but I have trouble defining a rule that treats nodes as subtrees, in order to simplify expressions such as (a|a)|(a|a) to a.
tree grammar Regexp_v7Walker;
options{
language = Java;
tokenVocab = Regexp_v7;
ASTLabelType = CommonTree;
output=AST;
backtrack = true;
}
tokens{
NULL;
}
bottomup
: ^('*' ^('*' e=.)) -> ^('*' $e) //a** -> a*
| ^('|' i=.* j=.* {$i.tree.toStringTree() == $j.tree.toStringTree()} )
-> $i // There are 3 errors while this line is up and running:
// 1. CommonTree cannot be resolved,
// 2. i.tree cannot be resolved or is not a field,
// 3. i cannot be resolved.
;
Small driver class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
public class Regexp_Test_v7 {
public static void main(String[] args) throws RecognitionException {
CharStream stream = new ANTLRStringStream("a***;a|a;(ab)****;ab|ab;ab|aa;");
Regexp_v7Lexer lexer = new Regexp_v7Lexer(stream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
Regexp_v7Parser parser = new Regexp_v7Parser(tokenStream);
// the return-scope class is nested inside the generated parser
Regexp_v7Parser.list_return list = parser.list();
CommonTree t = (CommonTree) list.getTree();
System.out.println("Original tree: " + t.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
Regexp_v7Walker s = new Regexp_v7Walker(nodes);
t = (CommonTree)s.downup(t);
System.out.println("Simplified tree: " + t.toStringTree());
}
}
Can anyone help me with solving this case?
Thanks in advance and regards.
Now, I'm no expert, but in your tree grammar:
add filter=true
change the second line of bottomup rule to:
^('|' i=. j=. {i.toStringTree().equals(j.toStringTree())}? ) -> $i
If I'm not mistaken, by using i=.* you allow i to be absent, and you'll get a NullPointerException when it is converted to a String.
Both i and j are of type CommonTree because you've set it up this way: ASTLabelType = CommonTree, so you should call i.toStringTree().
And since it's Java and you're comparing Strings, use equals().
Also, to make the expression in curly brackets a predicate, you need a question mark after the closing bracket.
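To illustrate the equals() point outside ANTLR, here is a small sketch in plain Java. The string "(| a a)" is only a stand-in for what a toStringTree() call might return, and the class and method names are invented for this example.

```java
class StringEquality {
    // Reference comparison: true only if both variables point to the same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Content comparison: true if both strings hold the same characters.
    static boolean sameContent(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Two distinct String objects with identical contents, much like the
        // results of two separate calls that each build a new string.
        String left = new StringBuilder("(| a a)").toString();
        String right = new StringBuilder("(| a a)").toString();

        System.out.println(sameReference(left, right)); // false
        System.out.println(sameContent(left, right));   // true
    }
}
```

The == test fails even though the text is identical, so a predicate written with == would never fire for two separately built tree strings.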

Match String to Lexer : 'expecting' error

I have a small problem with my grammar.
I am trying to detect whether my string is a date comparison or not.
But the DATE lexer rule I created does not seem to be recognized by ANTLR, and I get an error that I cannot resolve.
Here is my input expression :
"FDT > '2007/10/09 12:00:0.0'"
I simply expect such a tree as output :
COMP_OP
FDT my_DATE
Here is my grammar :
// Aiming at parsing a complete BQS formed Query
grammar Logic;
options {
output=AST;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
// precedence order is (low to high): or, and, not, [comp_op, geo_op, rel_geo_op, like, not like, exists], ()
parse
: expression EOF -> expression
; // omit the EOF token
expression
: query
;
query
: atom (COMP_OP^ DATE)*
;
//END BIG PART
atom
: ID
| '(' expression ')' -> expression
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// GENERAL OPERATORS:
DATE : '\'' YEAR '/' MONTH '/' DAY (' ' HOUR ':' MINUTE ':' SECOND)? '\'';
ID : (CHARACTER|DIGIT|','|'.'|'\''|':'|'/')+;
COMP_OP : '=' | '<' | '>' | '<>' | '<=' | '>=';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment YEAR : DIGIT DIGIT DIGIT DIGIT;
fragment MONTH : DIGIT DIGIT;
fragment DAY : DIGIT DIGIT;
fragment HOUR : DIGIT DIGIT;
fragment MINUTE : DIGIT DIGIT;
fragment SECOND : DIGIT DIGIT ('.' (DIGIT)+)?;
fragment DIGIT : '0'..'9' ;
fragment DIGIT_SEQ :(DIGIT)+;
fragment CHARACTER : ('a'..'z' | 'A'..'Z');
As an output error, I get :
line 1:25 mismatched character '.' expecting set null
line 1:27 mismatched input ''' expecting DATE
I also tried to remove the ' ' from my Date (thinking that it was perhaps the problem, as I remove them in the grammar.)
In this case, I get this error :
line 1:6 mismatched input ''2007/10/09' expecting DATE
Can anyone explain why I get this error, and how I could solve it?
This question is a subset of my complete task, where I have to differentiate lots of comparisons (dates, geographic, strings, ...). I would thus really need to be able to give 'tags' to my atoms.
Thank you very much !
As a complement, here is my current Java code :
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// the expression
String src = "FDT > '2007/10/09 12:00:0.0'";
// create a lexer & parser
//LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
//LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
// invoke the entry point of the parser (the parse() method) and get the AST
CommonTree tree = (CommonTree)parser.parse().getTree();
// print the DOT representation of the AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
I finally got it.
It seems that even though whitespace is hidden from the parser, I still have to include it explicitly in lexer rules whose tokens contain a space.
In addition, there was a small error in the SECOND definition: it required two digits, although the second digit is optional in my input.
The grammar is thus slightly modified :
fragment SECOND : DIGIT (DIGIT)? ('.' (DIGIT)+)?;
which gives this output :
I now hope this will still work in my more complete grammar :)
Hope it helps someone.
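The effect of the SECOND fix can be checked without ANTLR. The sketch below uses java.util.regex patterns as rough stand-ins for the DATE lexer rule before and after the change; the regexes and the class name are my own approximation, not something ANTLR generates.

```java
import java.util.regex.Pattern;

class DateTokenShape {
    // Stand-in for DATE with the original SECOND: DIGIT DIGIT ('.' DIGIT+)?
    static final Pattern ORIGINAL = Pattern.compile(
        "'\\d{4}/\\d{2}/\\d{2}( \\d{2}:\\d{2}:\\d{2}(\\.\\d+)?)?'");

    // Stand-in for DATE with the fixed SECOND: DIGIT (DIGIT)? ('.' DIGIT+)?
    static final Pattern FIXED = Pattern.compile(
        "'\\d{4}/\\d{2}/\\d{2}( \\d{2}:\\d{2}:\\d\\d?(\\.\\d+)?)?'");

    public static void main(String[] args) {
        String token = "'2007/10/09 12:00:0.0'";
        // The seconds field "0.0" has a single digit before the dot.
        System.out.println(ORIGINAL.matcher(token).matches()); // false
        System.out.println(FIXED.matcher(token).matches());    // true
    }
}
```

The original shape rejects the input because "0.0" has only one digit before the dot, which lines up with the "mismatched character '.'" error reported at position 1:25.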

Fetching expressions from a string using ANTLR in JAVA

Given a String like..
(a+(a+b)), (d*e) :- (e-f)
Note: (d*e) and (e-f) are different expressions. How can I fetch the expressions from this string. I have the grammar defined as..
parse returns [String value]
: addExp {$value=$addExp.value;} EOF
;
addExp returns [String value]
: multExp {$value=$multExp.value;} (('+' | '-' | '*') multExp{$value+= '+' + $multExp.value;})*
;
multExp returns [String value]
: atom {$value=$atom.value;} (('*' | '/') atom {$value+=$atom.value;})*
;
atom returns [String value]
: x=ID {$value=$x.text;}
| '(' addExp ')' {$value='('+$addExp.value+')';}
;
ID : 'a'..'z' | 'A'..'Z';
I tried..
ANTLRStringStream a=new ANTLRStringStream("(a+(a+b)), (d*e) :- (e-f)");
SLexer l=new SLexer(a);
CommonTokenStream c=new CommonTokenStream(l);
SParser p=new SParser(c);
String exp;
while((exp = p.parse()) != null)
{
System.out.println(exp);
}
I'm thinking of something like hasNext() and then fetching.
Your lexer rule TEXT can match an empty string, which makes the lexer produce an infinite number of tokens. Also, you don't need all those returns clauses on your rules: you can simply grab what a parser (or lexer) rule matched by appending .text to it.
You could let your parser return a List<String>, or let it return a single String and repeatedly invoke that parser rule until EOF is encountered.
A little demo:
grammar T;
@parser::members {
public static void main(String[] args) throws Exception {
String src = "likes(a, b) :- likes(a, X), likes(X, b). hates(a, b) " +
":- hates(a,X), hates(X,b). likes(a,b) :- says(god, likes(a,b)).";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
List<String> statements = parser.parse();
for(String s : statements) {
System.out.println(s);
}
}
}
parse returns [List<String> statements]
@init{$statements = new ArrayList<String>();}
: (statement {$statements.add($statement.text);} ~TEXT+)+ EOF
;
statement
: TEXT OPAR params CPAR
;
params
: (param (COMMA param)*)?
;
param
: TEXT
| statement
;
COMMA : ',';
OPAR : '(';
CPAR : ')';
TEXT : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t') {$channel=HIDDEN;};
OTHER : . ;
Note that ~TEXT+ in the parse rule matches one or more tokens other than TEXT.
If you now create a lexer and parser and run the TParser class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar TParser
or
Windows
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar TParser
you will see the following being printed to your console:
likes(a, b)
likes(a, X)
likes(X, b)
hates(a, b)
hates(a,X)
hates(X,b)
likes(a,b)
says(god, likes(a,b))
EDIT
And here's how to return a single String as opposed to a List<String>:
@parser::members {
public static void main(String[] args) throws Exception {
String src = "likes(a, b) :- likes(a, X), likes(X, b). hates(a, b) " +
":- hates(a,X), hates(X,b). likes(a,b) :- says(god, likes(a,b)).";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
String s;
while((s = parser.parse()) != null) {
System.out.println(s);
}
}
}
parse returns [String s]
: statement ~(TEXT| EOF)* {$s = $statement.text;}
| EOF {$s = null;}
;
You should just be able to call sentence() repeatedly until you hit the end of input.
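For the exact string in the question, the "fetch one expression at a time" idea can also be sketched without ANTLR. The following is a plain-Java scanner, not a parser: it only collects top-level parenthesised groups and does no grammar checking. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelGroups {
    // Collects every top-level "( ... )" group and ignores the text between groups.
    static List<String> extract(String src) {
        List<String> groups = new ArrayList<>();
        int depth = 0;   // current nesting level of parentheses
        int start = -1;  // index of the '(' that opened the current top-level group
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '(') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
                if (depth == 0) {
                    groups.add(src.substring(start, i + 1));
                }
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        for (String expr : extract("(a+(a+b)), (d*e) :- (e-f)")) {
            System.out.println(expr);
        }
    }
}
```

This prints (a+(a+b)), (d*e) and (e-f) on separate lines. Iterating over the returned list gives the hasNext()-style access asked about, but validating each expression against the grammar still needs the parser.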

How can I create a simple input validator by using ANTLR?

I wrote my grammar in ANTLRWorks and it worked pretty well and then I generated lexer and parser.
The code executes and there's no error.
But it drives me crazy that even with wrong input everything seems fine: parser.prog() executes without complaint. So where is the result I should be getting? I just want to check whether the input is a propositional logic statement or not.
I used the command below to generate the code, but it failed with an error saying it could not find the main class:
java antlr.jar org.antlr.Tool PropLogic.g
But this code worked :
java -cp antlr.jar org.antlr.Tool PropLogic.g
Here's the Grammar :
grammar PropLogic;
NOT : '!' ;
OR : '+' ;
AND : '.' ;
IMPLIES : '->' ;
SYMBOLS : ('a'..'z') | '~' ;
OP : '(' ;
CP : ')' ;
prog : formula ;
formula : NOT formula
| OP formula( AND formula CP | OR formula CP | IMPLIES formula CP)
| SYMBOLS ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
Here's my code:
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
public class Tableaux {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("a b c");
PropLogicLexer lexer = new PropLogicLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PropLogicParser parser = new PropLogicParser(tokens);
parser.prog();
}
}
Given the following test class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream(args[0]);
PropLogicLexer lexer = new PropLogicLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PropLogicParser parser = new PropLogicParser(tokens);
parser.prog();
}
}
which can be invoked on *nix/MacOS like this:
java -cp .:antlr-3.2.jar Main "a b c"
or on Windows
java -cp .;antlr-3.2.jar Main "a b c"
does not produce any errors because your parser and lexer are "content" with the input. The lexer tokenizes the input into the following 3 tokens a, b and c (spaces are ignored). And the parser rule:
prog
: formula
;
matches a single formula, which in its turn matches a SYMBOLS token. Note that although you named it SYMBOLS (plural), it only matches a single lower case letter, or tilde (~):
SYMBOLS : ('a'..'z') | '~' ;
So, in short, from the input source "a b c", only a is being parsed by your parser. You probably want your parser to consume the entire token stream, which can be done by adding the EOF (end of file) token after the entry point of your grammar:
prog
: formula EOF
;
If you run the test class again and provide "a b c" as input, the following error is produced:
line 1:2 missing EOF at 'b'
EDIT
I tested your grammar including the EOF token:
grammar PropLogic;
prog
: formula EOF
;
formula
: NOT formula
| OP formula (AND formula CP | OR formula CP | IMPLIES formula CP)
| SYMBOLS
;
NOT : '!' ;
OR : '+' ;
AND : '.' ;
IMPLIES : '->' ;
SYMBOLS : ('a'..'z') | '~' ;
OP : '(' ;
CP : ')' ;
WHITESPACE : ('\t' | ' ' | '\r' | '\n'| '\u000C')+ { $channel = HIDDEN; } ;
with the class including the ANTLRStringStream:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("a b c");
PropLogicLexer lexer = new PropLogicLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PropLogicParser parser = new PropLogicParser(tokens);
parser.prog();
}
}
with both ANTLR 3.2, and ANTLR 3.3:
java -cp antlr-3.2.jar org.antlr.Tool PropLogic.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
line 1:2 missing EOF at 'b'
java -cp antlr-3.3.jar org.antlr.Tool PropLogic.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
line 1:2 missing EOF at 'b'
And as you can see, both produce the error message:
line 1:2 missing EOF at 'b'
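The difference between the two prog rules can also be seen in a hand-written recognizer. The sketch below is plain Java that only approximates the PropLogic grammar (it treats any Java whitespace as hidden and works on characters rather than tokens); it is not what ANTLR generates, and all names in it are made up.

```java
class PropLogicCheck {
    private final String src;
    private int pos = 0;

    PropLogicCheck(String src) {
        this.src = src;
    }

    // Like "prog : formula ;": succeeds as soon as a leading formula matches.
    static boolean matchesPrefix(String input) {
        return new PropLogicCheck(input).formula();
    }

    // Like "prog : formula EOF ;": also requires that no input is left over.
    static boolean matchesAll(String input) {
        PropLogicCheck p = new PropLogicCheck(input);
        return p.formula() && p.atEnd();
    }

    private void skipWhitespace() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
    }

    private boolean atEnd() {
        skipWhitespace();
        return pos == src.length();
    }

    // Consumes the given literal if it is next in the input.
    private boolean eat(String literal) {
        skipWhitespace();
        if (src.startsWith(literal, pos)) {
            pos += literal.length();
            return true;
        }
        return false;
    }

    // SYMBOLS : ('a'..'z') | '~'
    private boolean symbol() {
        skipWhitespace();
        if (pos < src.length()) {
            char c = src.charAt(pos);
            if ((c >= 'a' && c <= 'z') || c == '~') {
                pos++;
                return true;
            }
        }
        return false;
    }

    // formula : NOT formula | OP formula (AND | OR | IMPLIES) formula CP | SYMBOLS
    private boolean formula() {
        if (eat("!")) {
            return formula();
        }
        if (eat("(")) {
            return formula() && (eat(".") || eat("+") || eat("->")) && formula() && eat(")");
        }
        return symbol();
    }

    public static void main(String[] args) {
        System.out.println(matchesPrefix("a b c"));   // true: only "a" is consumed
        System.out.println(matchesAll("a b c"));      // false: "b c" is left over
        System.out.println(matchesAll("(a -> !b)"));  // true
    }
}
```

Without the end-of-input check, "a b c" is accepted because the recognizer stops after the first symbol, which is the same behaviour as the parser without EOF.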
