I have a small problem with my grammar.
I am trying to detect whether my string is a date comparison or not.
But the DATE lexer I created seem not to be recognized by antlr, and I get an error that I cannot solve.
Here is my input expression :
"FDT > '2007/10/09 12:00:0.0'"
I simply expect such a tree as output :
COMP_OP
FDT my_DATE
Here is my grammar :
// Aiming at parsing a complete BQS formed Query
grammar Logic;
options {
output=AST;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
// precedence order is (low to high): or, and, not, [comp_op, geo_op, rel_geo_op, like, not like, exists], ()
parse
: expression EOF -> expression
; // ommit the EOF token
expression
: query
;
query
: atom (COMP_OP^ DATE)*
;
//END BIG PART
atom
: ID
| | '(' expression ')' -> expression
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// GENERAL OPERATORS:
DATE : '\'' YEAR '/' MONTH '/' DAY (' ' HOUR ':' MINUTE ':' SECOND)? '\'';
ID : (CHARACTER|DIGIT|','|'.'|'\''|':'|'/')+;
COMP_OP : '=' | '<' | '>' | '<>' | '<=' | '>=';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment YEAR : DIGIT DIGIT DIGIT DIGIT;
fragment MONTH : DIGIT DIGIT;
fragment DAY : DIGIT DIGIT;
fragment HOUR : DIGIT DIGIT;
fragment MINUTE : DIGIT DIGIT;
fragment SECOND : DIGIT DIGIT ('.' (DIGIT)+)?;
fragment DIGIT : '0'..'9' ;
fragment DIGIT_SEQ :(DIGIT)+;
fragment CHARACTER : ('a'..'z' | 'A'..'Z');
As an output error, I get :
line 1:25 mismatched character '.' expecting set null
line 1:27 mismatched input ''' expecting DATE
I also tried to remove the ' ' from my Date (thinking that it was perhaps the problem, as I remove them in the grammar.)
In this case, I get this error :
line 1:6 mismatched input ''2007/10/09' expecting DATE
Can anyone explain me why I get such an error, and how I could solve it ?
This question is q subset of my complete task, where I have to differentiate lots of omparisons (dates, geographic, strings, . . .). I would thus really need to be able to give 'tags' to my atoms.
Thank you very much !
As a complement, here is my current Java code :
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// the expression
String src = "FDT > '2007/10/09 12:00:0.0'";
// create a lexer & parser
//LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
//LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
// invoke the entry point of the parser (the parse() method) and get the AST
CommonTree tree = (CommonTree)parser.parse().getTree();
// print the DOT representation of the AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
I finally got it,
Seems like even though I remove whitespaces, I still have to include them in expressions that contain one.
In addition, there was a small error in second definition, as the second digit was optional.
The grammar is thus slightly modified :
fragment SECOND : DIGIT (DIGIT)? ('.' (DIGIT)+)?;
which gives this output :
I know hope this will still work in my more complete grammar :)
Hope it helps someone.
Related
I'm designing a language that allows you to make predicates on data. Here is my lexer.
lexer grammar Studylexer;
fragment LETTER : [A-Za-z];
fragment DIGIT : [0-9];
fragment TWODIGIT : DIGIT DIGIT;
fragment MONTH: ('0' [1-9] | '1' [0-2]);
fragment DAY: ('0' [1-9] | '1' [1-9] | '2' [1-9] | '3' [0-1]);
TIMESTAMP: TWODIGIT ':' TWODIGIT; // représentation de la timestamp
DATE : TWODIGIT TWODIGIT MONTH DAY; // représentation de la date
ID : LETTER+; // match identifiers
STRING : '"' ( ~ '"' )* '"' ; // match string content
NEWLINE:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace
LIST: ( LISTSTRING | LISTDATE | LISTTIMESTAMP ) ; // list of variabels;
// list of operators
GT: '>';
LT: '<';
GTEQ: '>=';
LTEQ:'<=';
EQ: '=';
IN: 'in';
fragment LISTSTRING: STRING ',' STRING (',' STRING)*; // list of strings
fragment LISTDATE : DATE ',' DATE (',' DATE)*; // list of dates
fragment LISTTIMESTAMP:TIMESTAMP ',' TIMESTAMP (',' TIMESTAMP )*; // list of timestamps
NAMES: 'filename' | 'timestamp' | 'tso' | 'region' | 'processType' | 'businessDate' | 'lastModificationDate'; // name of variables in the where block
KEY: ID '[' NAMES ']' | ID '.' NAMES; // predicat key
and here is a part of my grammar.
expr: KEY op = ('>' | '<') value = ( DATE | TIMESTAMP ) NEWLINE # exprGTORLT
| KEY op = ('>='| '<=') value = ( DATE | TIMESTAMP ) NEWLINE # exprGTEQORLTEQ
| KEY '=' value = ( STRING | DATE | TIMESTAMP ) NEWLINE # exprEQ
| KEY 'in' LIST NEWLINE #exprIn
When I make a predicate for example.
tab [key] in "value1", "value2"
ANTLR generates an error.
no viable alternative at input tab [key] in
What can I do to resolve this problem?
First tab [key] does not produce a KEY token like you want it to for two reasons:
It contains spaces and KEY doesn't allow any spaces. The best way to fix that would be to remove the KEY rule from your lexer and instead turn it into a parser rule (meaning you also need to turn [ and ] into their own tokens). Then the white space in your input would be between tokens and thus successfully skipped.
key is not actually one of the words listed in NAMES.
Then another issue is that in is recognized as an ID token, not an IN token. That's because both ID and IN would produce a match of the same length and in cases like that the rule that's listed first takes precedence. So you should define ID after all of the keywords because otherwise the keywords will never be matched.
I have a problem with my antlr grammar or(lexer). In my case I need to parse a string with custom text and find functions in it. Format of function $foo($bar(3),'strArg'). I found solution in this post ANTLR Nested Functions and little bit improved it for my needs. But while testing different cases I found one that brakes parser: $foo($3,'strArg'). This will throw IncorectSyntax exception. I tried many variants(for example not to skip $ and include it in parsing tree) but it all these attempts were unsuccessfully
Lexer
lexer grammar TLexer;
TEXT
: ~[$]
;
FUNCTION_START
: '$' -> pushMode(IN_FUNCTION), skip
;
mode IN_FUNCTION;
FUNTION_NESTED : '$' -> pushMode(IN_FUNCTION), skip;
ID : [a-zA-Z_]+;
PAR_OPEN : '(';
PAR_CLOSE : ')' -> popMode;
NUMBER : [0-9]+;
STRING : '\'' ( ~'\'' | '\'\'' )* '\'';
COMMA : ',';
SPACE : [ \t\r\n]-> skip;
Parser
options {
tokenVocab=TLexer;
}
parse
: atom* EOF
;
atom
: text
| function
;
text
: TEXT+
;
function
: ID params
;
params
: PAR_OPEN ( param ( COMMA param )* )? PAR_CLOSE
;
param
: NUMBER
| STRING
| function
;
The parser does not fail on $foo($3,'strArg'), because when it encounters the second $ it is already in IN_FUNCTION mode and it is expecting a parameter. It skips the character and reads a NUMBER.
If you want it to fail you need to unskip the dollar signs in the Lexer:
FUNCTION_START : '$' -> pushMode(IN_FUNCTION);
mode IN_FUNCTION;
FUNTION_START : '$' -> pushMode(IN_FUNCTION);
and modify the function rule:
function : FUNCTION_START ID params;
I'm using ANTLR 4 to try and parse task definitions. The task definitions look a little like the following:
task = { priority = 10; };
My grammar file then looks like the following:
grammar TaskGrammar;
/* Parser rules */
task : 'task' ASSIGNMENT_OP block EOF;
logical_entity : (TRUE | FALSE) # LogicalConst
| IDENTIFIER # LogicalVariable
;
numeric_entity : DECIMAL # NumericConst
| IDENTIFIER # NumericVariable
;
block : LBRACE (statement)* RBRACE SEMICOLON;
assignment : IDENTIFIER ASSIGNMENT_OP DECIMAL SEMICOLON
| IDENTIFIER ASSIGNMENT_OP block SEMICOLON
| IDENTIFIER ASSIGNMENT_OP QUOTED_STRING SEMICOLON
| IDENTIFIER ASSIGNMENT_OP CONSTANT SEMICOLON;
functionCall : IDENTIFIER LPAREN (parameter)*? RPAREN SEMICOLON;
parameter : DECIMAL
| QUOTED_STRING;
statement : assignment
| functionCall;
/* Lexxer rules */
IF : 'if' ;
THEN : 'then';
AND : 'and' ;
OR : 'or' ;
TRUE : 'true' ;
FALSE : 'false' ;
MULT : '*' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '==' ;
ASSIGNMENT_OP : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
SEMICOLON : ';' ;
// DECIMAL, IDENTIFIER, COMMENTS, WS are set using regular expressions
DECIMAL : '-'?[0-9]+('.'[0-9]+)? ;
IDENTIFIER : [a-zA-Z_][a-zA-Z_0-9]* ;
Value: STR_EXT | QUOTED_STRING | SINGLE_QUOTED
;
STR_EXT
:
[a-zA-Z0-9_/\.,\-:=~+!?$&^*\[\]#|]+;
Comment
:
'#' ~[\r\n]*;
CONSTANT : StringCharacters;
QUOTED_STRING
:
'"' StringCharacters? '"'
;
fragment
StringCharacters
: (~["\\] | EscapeSequence)+
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]?
;
SINGLE_QUOTED
:
'\'' ~['\\]* '\'';
// COMMENT and WS are stripped from the output token stream by sending
// to a different channel 'skip'
COMMENT : '//' .+? ('\n'|EOF) -> skip ;
WS : [ \r\t\u000C\n]+ -> skip ;
This grammar compiles fine in ANTLR, but when it comes to trying to use the parser, I get the following error:
line 1:0 mismatched input 'task = { priority = 10; return = AND; };' expecting 'task'
org.antlr.v4.runtime.InputMismatchException
It looks like the parser isn't recognising the block part of the definition, but I can't quite see why. The block parse rule definition should match as far as I can tell. I would expect to have a TaskContext, with a child BlockContext containing a single AssignmentContext. I get the TaskContext, but it has the above exception.
Am I missing something here? This is my first attempt at using Antler, so may be getting confused between Lexxer and Parser rules...
Your STR_EXT consumes the entire input. That rule has to go: ANTLR's lexer will always try to match as much characters as possible.
I also see that CONSTANT might consume that entire input. It has to go to, or at least be changed to consume less chars.
I'm trying to write a program in ANTLR (Java) concerning simplifying regular expression. I have already written some code (grammar file contents below)
grammar Regexp_v7;
options{
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens{
DOT;
REPEAT;
RANGE;
NULL;
}
fragment
ZERO
: '0'
;
fragment
DIGIT
: '1'..'9'
;
fragment
EPSILON
: '#'
;
fragment
FI
: '%'
;
ID
: EPSILON
| FI
| 'a'..'z'
| 'A'..'Z'
;
NUMBER
: ZERO
| DIGIT (ZERO | DIGIT)*
;
WHITESPACE
: ('\r' | '\n' | ' ' | '\t' ) + {$channel = HIDDEN;}
;
list
: (reg_exp ';'!)*
;
term
: ID -> ID
| '('! reg_exp ')'!
;
repeat_exp
: term ('{' range_exp '}')+ -> ^(REPEAT term (range_exp)+)
| term -> term
;
range_exp
: NUMBER ',' NUMBER -> ^(RANGE NUMBER NUMBER)
| NUMBER (',') -> ^(RANGE NUMBER NULL)
| ',' NUMBER -> ^(RANGE NULL NUMBER)
| NUMBER -> ^(RANGE NUMBER NUMBER)
;
kleene_exp
: repeat_exp ('*'^)*
;
concat_exp
: kleene_exp (kleene_exp)+ -> ^(DOT kleene_exp (kleene_exp)+)
| kleene_exp -> kleene_exp
;
reg_exp
: concat_exp ('|'^ concat_exp)*
;
My next goal is to write down tree grammar code, which is able to simplify regular expressions (e.g. a|a -> a , etc.). I have done some coding (see text below), but I have troubles with defining rule that treats nodes as subtrees (in order to simplify following kind of expressions e.g.: (a|a)|(a|a) to a, etc.)
tree grammar Regexp_v7Walker;
options{
language = Java;
tokenVocab = Regexp_v7;
ASTLabelType = CommonTree;
output=AST;
backtrack = true;
}
tokens{
NULL;
}
bottomup
: ^('*' ^('*' e=.)) -> ^('*' $e) //a** -> a*
| ^('|' i=.* j=.* {$i.tree.toStringTree() == $j.tree.toStringTree()} )
-> $i // There are 3 errors while this line is up and running:
// 1. CommonTree cannot be resolved,
// 2. i.tree cannot be resolved or is not a field,
// 3. i cannot be resolved.
;
Small driver class:
public class Regexp_Test_v7 {
public static void main(String[] args) throws RecognitionException {
CharStream stream = new ANTLRStringStream("a***;a|a;(ab)****;ab|ab;ab|aa;");
Regexp_v7Lexer lexer = new Regexp_v7Lexer(stream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
Regexp_v7Parser parser = new Regexp_v7Parser(tokenStream);
list_return list = parser.list();
CommonTree t = (CommonTree) list.getTree();
System.out.println("Original tree: " + t.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
Regexp_v7Walker s = new Regexp_v7Walker(nodes);
t = (CommonTree)s.downup(t);
System.out.println("Simplified tree: " + t.toStringTree());
Can anyone help me with solving this case?
Thanks in advance and regards.
Now, I'm no expert, but in your tree grammar:
add filter=true
change the second line of bottomup rule to:
^('|' i=. j=. {i.toStringTree().equals(j.toStringTree()) }? ) -> $i }
If I'm not mistaken by using i=.* you're allowing i to be non-existent and you'll get a NullPointerException on conversion to a String.
Both i and j are of type CommonTree because you've set it up this way: ASTLabelType = CommonTree, so you should call i.toStringTree().
And since it's Java and you're comparing Strings, use equals().
Also to make the expression in curly brackets a predicate, you need a question mark after the closing one.
I'm trying to create a very simple grammar to learn to use ANTLR but I get the following message:
"The following alternatives can never be reached: 2"
This is my grammar attempt:
grammar Robot;
file : command+;
command : ( delay|type|move|click|rclick) ;
delay : 'wait' number ';';
type : 'type' id ';';
move : 'move' number ',' number ';';
click : 'click' ;
rclick : 'rlick' ;
id : ('a'..'z'|'A'..'Z')+ ;
number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n' ) { skip();} ;
I'm using ANTLRWorks plugin for IDEA:
The .. (range) inside parser rules means something different than inside lexer rules. Inside lexer rules, it means: "from char X to char Y", and inside parser rule it matches "from token M to token N". And since you made number a parser rule, it does not do what you think it does (and are therefor receiving an obscure error message).
The solution: make number a lexer rule instead (so, capitalize it: Number):
grammar Robot;
file : command+;
command : (delay | type | move | Click | RClick) ;
delay : 'wait' Number ';';
type : 'type' Id ';';
move : 'move' Number ',' Number ';';
Click : 'click' ;
RClick : 'rlick' ;
Id : ('a'..'z'|'A'..'Z')+ ;
Number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n') { skip();} ;
And as you can see, I also made id, click and rclick lexer rules instead. If you're not sure what the difference is between parser- and lexer rules, please say so and I'll add an explanation to this answer.