I'm trying to write a program with ANTLR (Java target) for simplifying regular expressions. I have already written some code (grammar file contents below):
grammar Regexp_v7;
options{
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens{
DOT;
REPEAT;
RANGE;
NULL;
}
fragment
ZERO
: '0'
;
fragment
DIGIT
: '1'..'9'
;
fragment
EPSILON
: '#'
;
fragment
FI
: '%'
;
ID
: EPSILON
| FI
| 'a'..'z'
| 'A'..'Z'
;
NUMBER
: ZERO
| DIGIT (ZERO | DIGIT)*
;
WHITESPACE
: ('\r' | '\n' | ' ' | '\t' ) + {$channel = HIDDEN;}
;
list
: (reg_exp ';'!)*
;
term
: ID -> ID
| '('! reg_exp ')'!
;
repeat_exp
: term ('{' range_exp '}')+ -> ^(REPEAT term (range_exp)+)
| term -> term
;
range_exp
: NUMBER ',' NUMBER -> ^(RANGE NUMBER NUMBER)
| NUMBER (',') -> ^(RANGE NUMBER NULL)
| ',' NUMBER -> ^(RANGE NULL NUMBER)
| NUMBER -> ^(RANGE NUMBER NUMBER)
;
kleene_exp
: repeat_exp ('*'^)*
;
concat_exp
: kleene_exp (kleene_exp)+ -> ^(DOT kleene_exp (kleene_exp)+)
| kleene_exp -> kleene_exp
;
reg_exp
: concat_exp ('|'^ concat_exp)*
;
My next goal is to write a tree grammar that simplifies regular expressions (e.g. a|a -> a, etc.). I have done some coding (see below), but I have trouble defining a rule that treats nodes as subtrees, in order to simplify expressions such as (a|a)|(a|a) to a.
tree grammar Regexp_v7Walker;
options{
language = Java;
tokenVocab = Regexp_v7;
ASTLabelType = CommonTree;
output=AST;
backtrack = true;
}
tokens{
NULL;
}
bottomup
: ^('*' ^('*' e=.)) -> ^('*' $e) //a** -> a*
| ^('|' i=.* j=.* {$i.tree.toStringTree() == $j.tree.toStringTree()} )
-> $i // There are 3 errors while this line is up and running:
// 1. CommonTree cannot be resolved,
// 2. i.tree cannot be resolved or is not a field,
// 3. i cannot be resolved.
;
Small driver class:
public class Regexp_Test_v7 {
public static void main(String[] args) throws RecognitionException {
CharStream stream = new ANTLRStringStream("a***;a|a;(ab)****;ab|ab;ab|aa;");
Regexp_v7Lexer lexer = new Regexp_v7Lexer(stream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
Regexp_v7Parser parser = new Regexp_v7Parser(tokenStream);
Regexp_v7Parser.list_return list = parser.list();
CommonTree t = (CommonTree) list.getTree();
System.out.println("Original tree: " + t.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
Regexp_v7Walker s = new Regexp_v7Walker(nodes);
t = (CommonTree)s.downup(t);
System.out.println("Simplified tree: " + t.toStringTree());
Can anyone help me solve this case?
Thanks in advance and regards.
Now, I'm no expert, but in your tree grammar:
add filter=true
change the second alternative of the bottomup rule to:
^('|' i=. j=. {i.toStringTree().equals(j.toStringTree())}? ) -> $i
If I'm not mistaken, by using i=.* you're allowing i to match nothing, and you'll get a NullPointerException on conversion to a String.
Both i and j are of type CommonTree because you've set it up this way: ASTLabelType = CommonTree, so you should call i.toStringTree().
And since it's Java and you're comparing Strings, use equals().
Also, to make the expression in curly brackets a predicate, you need a question mark after the closing bracket.
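Put together, and with filter = true added to the options block as suggested above, the bottomup rule would look roughly like this (an untested sketch of the changes described here, nothing more):
bottomup
 : ^('*' ^('*' e=.)) -> ^('*' $e) // a** -> a*
 | ^('|' i=. j=. {i.toStringTree().equals(j.toStringTree())}? ) -> $i // a|a -> a
 ;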
I've been having trouble getting my generated parser to work in Java for ANTLR 4.8. There are other answers to this question, but it seems that ANTLR has changed things since 4.7 and all the other answers are before this change. My code is:
String formula = "(fm.a < fm.b) | (fm.a = fm.b)";
CharStream input = CharStreams.fromString(formula);
Antlr.LogicGrammerLexer lexer = new Antlr.LogicGrammerLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
Antlr.LogicGrammerParser parser = new Antlr.LogicGrammerParser(tokens);
ParseTree pt = new ParseTree(parser);
It appears to read the formula into the CharStream correctly, but nothing I try past that point works at all. For example, if I try to print out the parse tree, nothing is printed. The following line prints nothing:
System.out.println(lexer._input.getText(new Interval(0, 100)));
Any advice appreciated.
EDIT: added the grammar file:
grammar LogicGrammer;
logicalStmt: BOOL_EXPR | '('logicalStmt' '*LOGIC_SYMBOL' '*logicalStmt')';
BOOL_EXPR: '('IDENTIFIER' '*MATH_SYMBOL' '*IDENTIFIER')';
IDENTIFIER: CHAR+('.'CHAR*)*;
CHAR: 'a'..'z' | 'A'..'Z' | '1'..'9';
LOGIC_SYMBOL: '~' | '|' | '&';
MATH_SYMBOL: '<' | '≤' | '=' | '≥' | '>';
This line:
ParseTree pt = new ParseTree(parser);
is incorrect. You need to call the start-rule method on your parser object to get your parse tree:
Antlr.LogicGrammerParser parser = new Antlr.LogicGrammerParser(tokens);
ParseTree pt = parser.logicalStmt();
As far as printing out your input goes, fields starting with an _ (like _input) are generally not intended for external use. Though I suspect the failure may be that you don't have 100 characters in your input stream, so the Interval is invalid. (I haven't tried it to see the exact failure.)
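If you do want to echo the raw input for debugging, a safer sketch is to ask the stream for its actual length instead of hard-coding 100 (CharStream.size(), CharStream.getText(Interval) and Interval.of() are standard runtime API; the formula is just the one from the question):
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.misc.Interval;

CharStream input = CharStreams.fromString("(fm.a < fm.b) | (fm.a = fm.b)");
// size() is the number of characters actually in the stream, so the Interval
// can never run past the end of the input.
System.out.println(input.getText(Interval.of(0, input.size() - 1)));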
If you include your grammar, one of us could easily attempt to generate and compile it and, perhaps, be more specific.
Using your grammar, this works for me:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.Interval;
import org.antlr.v4.runtime.tree.ParseTree;
public class Logic {
public static void main(String... args) {
String formula = "(fm.a < fm.b) | (fm.a = fm.b)";
CharStream input = CharStreams.fromString(formula);
LogicGrammerLexer lexer = new LogicGrammerLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LogicGrammerParser parser = new LogicGrammerParser(tokens);
ParseTree pt = parser.logicalStmt();
System.out.println(pt.toStringTree());
System.out.println(input.getText(new Interval(1, 28)));
}
}
output:
([] (fm.a < fm.b))
fm.a < fm.b) | (fm.a = fm.b)
BTW, a couple of minor suggestions for your grammar:
set up a rule to skip whitespace WS: [ \t\r\n]+ -> skip;
change BOOL_EXPR to a parser rule (since it's made up of a composition of tokens from other lexer rules):
grammar LogicGrammer
;
logicalStmt
: boolExpr
| '(' logicalStmt LOGIC_SYMBOL logicalStmt ')'
;
boolExpr: '(' IDENTIFIER MATH_SYMBOL IDENTIFIER ')';
IDENTIFIER: CHAR+ ('.' CHAR*)*;
CHAR: 'a' ..'z' | 'A' ..'Z' | '1' ..'9';
LOGIC_SYMBOL: '~' | '|' | '&';
MATH_SYMBOL: '<' | '≤' | '=' | '≥' | '>';
WS: [ \t\r\n]+ -> skip;
The BOOL_EXPR shouldn't be a lexer rule. I suggest you do something like this instead:
grammar LogicGrammer;
parse
: logicalStmt EOF
;
logicalStmt
: logicalStmt LOGIC_SYMBOL logicalStmt
| logicalStmt MATH_SYMBOL logicalStmt
| '(' logicalStmt ')'
| IDENTIFIER
;
IDENTIFIER
: CHAR+ ( '.'CHAR+ )*
;
LOGIC_SYMBOL
: [~|&]
;
MATH_SYMBOL
: [<≤=≥>]
;
SPACE
: [ \t\r\n] -> skip
;
fragment CHAR
: [a-zA-Z1-9]
;
which can be tested by running the following code:
String formula = "(fm.a < fm.b) | (fm.a = fm.b)";
LogicGrammerLexer lexer = new LogicGrammerLexer(CharStreams.fromString(formula));
LogicGrammerParser parser = new LogicGrammerParser(new CommonTokenStream(lexer));
ParseTree root = parser.parse();
System.out.println(root.toStringTree(parser));
I want to create a recursive descent parser in Java for the following grammar (I have already managed to create the tokens). This is the relevant part of the grammar:
expression ::= numeric_expression | identifier | "null"
identifier ::= "a..z,$,_"
numeric_expression ::= ( ( "-" | "++" | "--" ) expression )
| ( expression ( "++" | "--" ) )
| ( expression ( "+" | "+=" | "-" | "-=" | "*" | "*=" | "/" | "/=" | "%" | "%=" ) expression )
arglist ::= expression { "," expression }
I have written code for parsing numeric_expression (assuming that on an invalid token it returns null):
NumericAST<? extends OpAST> parseNumericExpr() {
OpAST op;
if (token.getCodes() == Lexer.CODES.UNARY_OP) { //Check for unary operator like "++" or "--" etc
op = new UnaryOpAST(token.getValue());
token = getNextToken();
AST expr = parseExpr(); // Method that returns expression node.
if (expr == null) {
op = null;
return null;
} else {
if (checkSemi()) {
System.out.println("UNARY AST CREATED");
return new NumericAST<OpAST>(expr, op, false);
}
else {
return null;
}
}
} else { // Binary operation like "a+b", where a,b ->expression
AST expr = parseExpr();
if (expr == null) {
return null;
} else {
token = getNextToken();
if (token.getCodes() == Lexer.CODES.UNARY_OP) {
op = new UnaryOpAST(token.getValue());
return new NumericAST<OpAST>(expr, op, true);
} else if (token.getCodes() == Lexer.CODES.BIN_OP) {
op = new BinaryOpAST(token.getValue());
token = getNextToken();
AST expr2 = parseExpr();
if (expr2 == null) {
op = null;
expr = null;
return null;
} else {
if (checkSemi()) {
System.out.println("BINARY AST CREATED");
return new NumericAST<OpAST>(expr, op, expr2);
}
else {
return null;
}
}
} else {
expr = null;
return null;
}
}
}
}
Now, if I get a unary operator like ++ I can directly call this method, but I don't know how to handle other rules that start with the same production, like arglist and numeric_expression both having "expression" as their start production.
My question is:
How do I decide whether to call parseNumericExpr() or parseArgList() (method not shown above) when I get an expression token?
In order to write a recursive descent parser, you need an appropriate top-down grammar, normally an LL(1) grammar, although it's common to write the grammar using EBNF operators, as shown in the example grammar on Wikipedia's page on recursive descent grammars.
Unfortunately, your grammar is not LL(1), and the question you raise is a consequence of that fact. An LL(1) grammar has the property that the parser can always determine which production to use by examining only the next input token, which puts some severe constraints on the grammar, including:
No two productions for the same non-terminal can start with the same symbol.
No production can be left-recursive (i.e. the first symbol on the right-hand side is the defining non-terminal).
Here's a small rearrangement of your grammar which will work:
-- I added number here in order to be explicit.
atom ::= identifier | number | "null" | "(" expression ")"
-- I added function calls here, but it's arguable that this syntax accepts
-- a lot of invalid expressions
primary ::= atom { "++" | "--" | "(" [ arglist ] ")" }
factor ::= [ "-" | "++" | "--" ] primary
term ::= factor { ( "*" | "/" | "%" ) factor }
value ::= term { ( "+" | "-" ) term }
-- This adds the ordinary "=" assignment to the list in case it was
-- omitted by accident. Also, see the note below.
expression ::= { value ( "=" | "+=" | "-=" | "*=" | "/=" | "%=" ) } value
arglist ::= expression { "," expression }
The last expression rule is an attempt to capture the usual syntax of assignment operators (which associate to the right, not to the left), but it suffers from a classic problem addressed by this highly related question. I don't think I have a better answer to this issue than the one I wrote three years ago, so I hope it is still useful.
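To make the translation to code concrete, here is a sketch of how one of the EBNF rules above, term ::= factor { ("*" | "/" | "%") factor }, would typically become a recursive descent method. It reuses the asker's names (token, getNextToken(), Lexer.CODES, BinaryOpAST, NumericAST, OpAST, AST) and assumes a parseFactor() written in the same style, so treat it as an illustration of the loop-per-repetition pattern rather than drop-in code:
AST parseTerm() {
    AST left = parseFactor();
    if (left == null) {
        return null;
    }
    // The EBNF repetition { ... } becomes a loop driven by one token of lookahead.
    while (token.getCodes() == Lexer.CODES.BIN_OP
            && ("*".equals(token.getValue())
                || "/".equals(token.getValue())
                || "%".equals(token.getValue()))) {
        BinaryOpAST op = new BinaryOpAST(token.getValue());
        token = getNextToken();
        AST right = parseFactor();
        if (right == null) {
            return null;
        }
        // Fold operands left to right, mirroring the rule's left associativity.
        left = new NumericAST<OpAST>(left, op, right);
    }
    return left;
}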
I'm implementing a Python interpreter using ANTLR4 as lexer and parser generator. I used the grammar defined at this link:
https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4.
However, the indentation handling with the INDENT and DEDENT tokens inside lexer::members does not work when I define a compound statement.
For example, if I define the following statements:
x=10
while x>2 :
print("hello")
x=x-3
So on the line where I reassign the x variable I should get an indentation error, which I don't get in my current state.
Should I edit something in the lexer code, or what?
This is the grammar I'm using, with the lexer::members section and the NEWLINE rule defined in the link above.
grammar python;
tokens { INDENT, DEDENT }
@lexer::members {
// A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>();
// The stack that keeps track of the indentation level.
private java.util.Stack<Integer> indents = new java.util.Stack<>();
// The amount of opened braces, brackets and parenthesis.
private int opened = 0;
// The most recently produced token.
private Token lastToken = null;
@Override
public void emit(Token t) {
super.setToken(t);
tokens.offer(t);
}
@Override
public Token nextToken() {
// Check if the end-of-file is ahead and there are still some DEDENTS expected.
if (_input.LA(1) == EOF && !this.indents.isEmpty()) {
// Remove any trailing EOF tokens from our buffer.
for (int i = tokens.size() - 1; i >= 0; i--) {
if (tokens.get(i).getType() == EOF) {
tokens.remove(i);
}
}
// First emit an extra line break that serves as the end of the statement.
this.emit(commonToken(pythonParser.NEWLINE, "\n"));
// Now emit as much DEDENT tokens as needed.
while (!indents.isEmpty()) {
this.emit(createDedent());
indents.pop();
}
// Put the EOF back on the token stream.
this.emit(commonToken(pythonParser.EOF, "<EOF>"));
//throw new Exception("unexpected indentation at line " + this.getLine());
}
Token next = super.nextToken();
if (next.getChannel() == Token.DEFAULT_CHANNEL) {
// Keep track of the last token on the default channel.
this.lastToken = next;
}
return tokens.isEmpty() ? next : tokens.poll();
}
private Token createDedent() {
CommonToken dedent = commonToken(pythonParser.DEDENT, "");
dedent.setLine(this.lastToken.getLine());
return dedent;
}
private CommonToken commonToken(int type, String text) {
int stop = this.getCharIndex() - 1;
int start = text.isEmpty() ? stop : stop - text.length() + 1;
return new CommonToken(this._tokenFactorySourcePair, type, DEFAULT_TOKEN_CHANNEL, start, stop);
}
// Calculates the indentation of the provided spaces, taking the
// following rules into account:
//
// "Tabs are replaced (from left to right) by one to eight spaces
// such that the total number of characters up to and including
// the replacement is a multiple of eight [...]"
//
// -- https://docs.python.org/3.1/reference/lexical_analysis.html#indentation
static int getIndentationCount(String spaces) {
int count = 0;
for (char ch : spaces.toCharArray()) {
switch (ch) {
case '\t':
count += 8 - (count % 8);
break;
default:
// A normal space char.
count++;
}
}
return count;
}
boolean atStartOfInput() {
return super.getCharPositionInLine() == 0 && super.getLine() == 1;
}
}
parse
:( NEWLINE parse
| block ) EOF
;
block
: (statement NEWLINE?| functionDecl)*
;
statement
: assignment
| functionCall
| ifStatement
| forStatement
| whileStatement
| arithmetic_expression
;
assignment
: IDENTIFIER indexes? '=' expression
;
functionCall
: IDENTIFIER OPAREN exprList? CPAREN #identifierFunctionCall
| PRINT OPAREN? exprList? CPAREN? #printFunctionCall
;
arithmetic_expression
: expression
;
ifStatement
: ifStat elifStat* elseStat?
;
ifStat
: IF expression COLON NEWLINE INDENT block DEDENT
;
elifStat
: ELIF expression COLON NEWLINE INDENT block DEDENT
;
elseStat
: ELSE COLON NEWLINE INDENT block DEDENT
;
functionDecl
: DEF IDENTIFIER OPAREN idList? CPAREN COLON NEWLINE INDENT block DEDENT
;
forStatement
: FOR IDENTIFIER IN expression COLON NEWLINE INDENT block DEDENT elseStat?
;
whileStatement
: WHILE expression COLON NEWLINE INDENT block DEDENT elseStat?
;
idList
: IDENTIFIER (',' IDENTIFIER)*
;
exprList
: expression (COMMA expression)*
;
expression
: '-' expression #unaryMinusExpression
| '!' expression #notExpression
| expression '**' expression #powerExpression
| expression '*' expression #multiplyExpression
| expression '/' expression #divideExpression
| expression '%' expression #modulusExpression
| expression '+' expression #addExpression
| expression '-' expression #subtractExpression
| expression '>=' expression #gtEqExpression
| expression '<=' expression #ltEqExpression
| expression '>' expression #gtExpression
| expression '<' expression #ltExpression
| expression '==' expression #eqExpression
| expression '!=' expression #notEqExpression
| expression '&&' expression #andExpression
| expression '||' expression #orExpression
| expression '?' expression ':' expression #ternaryExpression
| expression IN expression #inExpression
| NUMBER #numberExpression
| BOOL #boolExpression
| NULL #nullExpression
| functionCall indexes? #functionCallExpression
| list indexes? #listExpression
| IDENTIFIER indexes? #identifierExpression
| STRING indexes? #stringExpression
| '(' expression ')' indexes? #expressionExpression
| INPUT '(' STRING? ')' #inputExpression
;
list
: '[' exprList? ']'
;
indexes
: ('[' expression ']')+
;
PRINT : 'print';
INPUT : 'input';
DEF : 'def';
IF : 'if';
ELSE : 'else';
ELIF : 'elif';
RETURN : 'return';
FOR : 'for';
WHILE : 'while';
IN : 'in';
NULL : 'null';
OR : '||';
AND : '&&';
EQUALS : '==';
NEQUALS : '!=';
GTEQUALS : '>=';
LTEQUALS : '<=';
POW : '**';
EXCL : '!';
GT : '>';
LT : '<';
ADD : '+';
SUBTRACT : '-';
MULTIPLY : '*';
DIVIDE : '/';
MODULE : '%';
OBRACE : '{' {opened++;};
CBRACE : '}' {opened--;};
OBRACKET : '[' {opened++;};
CBRACKET : ']' {opened--;};
OPAREN : '(' {opened++;};
CPAREN : ')' {opened--;};
SCOLON : ';';
ASSIGN : '=';
COMMA : ',';
QMARK : '?';
COLON : ':';
BOOL
: 'true'
| 'false'
;
NUMBER
: INT ('.' DIGIT*)?
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]*
;
STRING
: ["] (~["\r\n] | '\\\\' | '\\"')* ["]
| ['] (~['\r\n] | '\\\\' | '\\\'')* [']
;
SKIPS
: ( SPACES | COMMENT | LINE_JOINING ){firstLine();} -> skip
;
NEWLINE
: ( {atStartOfInput()}? SPACES
| ( '\r'? '\n' | '\r' | '\f' ) SPACES?
)
{
String newLine = getText().replaceAll("[^\r\n\f]+", "");
String spaces = getText().replaceAll("[\r\n\f]+", "");
int next = _input.LA(1);
if (opened > 0 || next == '\r' || next == '\n' || next == '\f' || next == '#') {
// If we're inside a list or on a blank line, ignore all indents,
// dedents and line breaks.
skip();
}
else {
emit(commonToken(NEWLINE, newLine));
int indent = getIndentationCount(spaces);
int previous = indents.isEmpty() ? 0 : indents.peek();
if (indent == previous) {
// skip indents of the same size as the present indent-size
skip();
}
else if (indent > previous) {
indents.push(indent);
emit(commonToken(pythonParser.INDENT, spaces));
}
else {
// Possibly emit more than 1 DEDENT token.
while(!indents.isEmpty() && indents.peek() > indent) {
this.emit(createDedent());
indents.pop();
}
}
}
}
;
fragment INT
: [1-9] DIGIT*
| '0'
;
fragment DIGIT
: [0-9]
;
fragment SPACES
: [ \t]+
;
fragment COMMENT
: '#' ~[\r\n\f]*
;
fragment LINE_JOINING
: '\\' SPACES? ( '\r'? '\n' | '\r' | '\f' )
;
No, this should not be handled in the grammar. The lexer should simply emit the (faulty) INDENT token. The parser should, at runtime, produce an error. Something like this:
String source = "x=10\n" +
"while x>2 :\n" +
" print(\"hello\")\n" +
" x=x-3\n";
Python3Lexer lexer = new Python3Lexer(CharStreams.fromString(source));
Python3Parser parser = new Python3Parser(new CommonTokenStream(lexer));
// Remove default error-handling
parser.removeErrorListeners();
// Add custom error-handling
parser.addErrorListener(new BaseErrorListener() {
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object o, int i, int i1, String s, RecognitionException e) {
CommonToken token = (CommonToken) o;
if (token.getType() == Python3Parser.INDENT) {
// The parser encountered an unexpected INDENT token
// TODO throw your exception
}
// TODO handle other errors
}
});
// Trigger the error
parser.file_input();
I am a complete newcomer to ANTLR.
I have the following ANTLR grammar:
grammar DrugEntityRecognition;
// Parser Rules
derSentence : ACTION (INT | FRACTION | RANGE) FORM TEXT;
// Lexer Rules
ACTION : 'TAKE' | 'INFUSE' | 'INJECT' | 'INHALE' | 'APPLY' | 'SPRAY' ;
INT : [0-9]+ ;
FRACTION : [1] '/' [1-9] ;
RANGE : INT '-' INT ;
FORM : ('TABLET' | 'TABLETS' | 'CAPSULE' | 'CAPSULES' | 'SYRINGE') ;
TEXT : ('A'..'Z' | WHITESPACE | ',')+ ;
WHITESPACE : ('\t' | ' ' | '\r' | '\n' | '\u000C')+ -> skip ;
And when I try to parse a sentence as follows:
String upperLine = line.toUpperCase();
org.antlr.v4.runtime.CharStream stream = new ANTLRInputStream(upperLine);
DrugEntityRecognitionLexer lexer = new DrugEntityRecognitionLexer(stream);
lexer.removeErrorListeners();
lexer.addErrorListener(ThrowingErrorListener.INSTANCE);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
DrugEntityRecognitionParser parser = new DrugEntityRecognitionParser(tokenStream);
try {
DrugEntityRecognitionParser.DerSentenceContext ctx = parser.derSentence();
StringBuilder sb = new StringBuilder();
sb.append("ACTION: ").append(ctx.ACTION());
sb.append(", ");
sb.append("FORM: ").append(ctx.FORM());
sb.append(", ");
sb.append("INT: ").append(ctx.INT());
sb.append(", ");
sb.append("FRACTION: ").append(ctx.FRACTION());
sb.append(", ");
sb.append("RANGE: ").append(ctx.RANGE());
System.out.println(upperLine);
System.out.println(sb.toString());
} catch (ParseCancellationException e) {
//e.printStackTrace();
}
An example of the input to the lexer:
take 10 Tablet (25MG) by oral route every week
In this case the ACTION node is not getting populated; take is recognized only as a TEXT node, not an ACTION node. 10 is being recognized as an INT node, however.
How can I modify this grammar so that the ACTION node is populated correctly (as well as FORM, which is not being populated either)?
There are several problems in your grammar:
Your TEXT rule only matches uppercase letters. Same for ACTION.
You shouldn't mix punctuation and text in a single text rule (here the comma), otherwise you cannot freely allow whitespaces between tokens.
You don't match parentheses at all, hence (25MG) is not valid input and the parser returns in an error state.
You did not check for any syntax errors, to learn what went wrong during recognition.
Also, when in doubt, always print your token sequence from the token source to see if the input has actually been tokenized as you expect. Start there to fix your grammar before you go to the parser.
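For example, in ANTLR 4 a quick way to dump the token stream before running any parser rule is something like this (a debugging sketch reusing the lexer from the question; fill(), getTokens() and the generated VOCABULARY field are standard ANTLR 4 members):
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill(); // force the whole input to be tokenized up front
for (Token t : tokenStream.getTokens()) {
    // Print each token's text next to its symbolic type name.
    System.out.println(t.getText() + " -> "
            + DrugEntityRecognitionLexer.VOCABULARY.getSymbolicName(t.getType()));
}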
About case sensitivity: typically (if your language is case-insensitive) you have rules like these:
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
...
to match a letter in either case and then define your keywords so:
ACTION : T A K E | I N F U S E | I N J E C T | I N H A L E | A P P L Y | S P R A Y;
I have a small problem with my grammar.
I am trying to detect whether my string is a date comparison or not.
But the DATE token I created does not seem to be recognized by ANTLR, and I get an error that I cannot solve.
Here is my input expression:
"FDT > '2007/10/09 12:00:0.0'"
I simply expect such a tree as output:
COMP_OP
FDT my_DATE
Here is my grammar:
// Aiming at parsing a complete BQS formed Query
grammar Logic;
options {
output=AST;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
// precedence order is (low to high): or, and, not, [comp_op, geo_op, rel_geo_op, like, not like, exists], ()
parse
: expression EOF -> expression
; // omit the EOF token
expression
: query
;
query
: atom (COMP_OP^ DATE)*
;
//END BIG PART
atom
: ID
| | '(' expression ')' -> expression
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// GENERAL OPERATORS:
DATE : '\'' YEAR '/' MONTH '/' DAY (' ' HOUR ':' MINUTE ':' SECOND)? '\'';
ID : (CHARACTER|DIGIT|','|'.'|'\''|':'|'/')+;
COMP_OP : '=' | '<' | '>' | '<>' | '<=' | '>=';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment YEAR : DIGIT DIGIT DIGIT DIGIT;
fragment MONTH : DIGIT DIGIT;
fragment DAY : DIGIT DIGIT;
fragment HOUR : DIGIT DIGIT;
fragment MINUTE : DIGIT DIGIT;
fragment SECOND : DIGIT DIGIT ('.' (DIGIT)+)?;
fragment DIGIT : '0'..'9' ;
fragment DIGIT_SEQ :(DIGIT)+;
fragment CHARACTER : ('a'..'z' | 'A'..'Z');
As error output, I get:
line 1:25 mismatched character '.' expecting set null
line 1:27 mismatched input ''' expecting DATE
I also tried to remove the ' ' from my DATE rule (thinking that it was perhaps the problem, since I remove whitespace in the grammar).
In this case, I get this error:
line 1:6 mismatched input ''2007/10/09' expecting DATE
Can anyone explain to me why I get such an error, and how I could solve it?
This question is a subset of my complete task, where I have to differentiate many kinds of comparisons (dates, geographic, strings, ...). I would thus really need to be able to give 'tags' to my atoms.
Thank you very much!
As a complement, here is my current Java code:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// the expression
String src = "FDT > '2007/10/09 12:00:0.0'";
// create a lexer & parser
//LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
//LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
// invoke the entry point of the parser (the parse() method) and get the AST
CommonTree tree = (CommonTree)parser.parse().getTree();
// print the DOT representation of the AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
I finally got it.
It seems that even though I hide whitespace, I still have to include the space explicitly in tokens that contain one.
In addition, there was a small error in the SECOND definition: the second digit should have been optional.
The grammar is thus slightly modified:
fragment SECOND : DIGIT (DIGIT)? ('.' (DIGIT)+)?;
which gives the expected output.
I now hope this will still work in my more complete grammar :)
Hope it helps someone.