I'm designing a language that allows you to make predicates on data. Here is my lexer.
lexer grammar Studylexer;
fragment LETTER : [A-Za-z];
fragment DIGIT : [0-9];
fragment TWODIGIT : DIGIT DIGIT;
fragment MONTH: ('0' [1-9] | '1' [0-2]);
fragment DAY: ('0' [1-9] | '1' [1-9] | '2' [1-9] | '3' [0-1]);
TIMESTAMP: TWODIGIT ':' TWODIGIT; // représentation de la timestamp
DATE : TWODIGIT TWODIGIT MONTH DAY; // représentation de la date
ID : LETTER+; // match identifiers
STRING : '"' ( ~ '"' )* '"' ; // match string content
NEWLINE:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace
LIST: ( LISTSTRING | LISTDATE | LISTTIMESTAMP ) ; // list of variabels;
// list of operators
GT: '>';
LT: '<';
GTEQ: '>=';
LTEQ:'<=';
EQ: '=';
IN: 'in';
fragment LISTSTRING: STRING ',' STRING (',' STRING)*; // list of strings
fragment LISTDATE : DATE ',' DATE (',' DATE)*; // list of dates
fragment LISTTIMESTAMP:TIMESTAMP ',' TIMESTAMP (',' TIMESTAMP )*; // list of timestamps
NAMES: 'filename' | 'timestamp' | 'tso' | 'region' | 'processType' | 'businessDate' | 'lastModificationDate'; // name of variables in the where block
KEY: ID '[' NAMES ']' | ID '.' NAMES; // predicat key
and here is a part of my grammar.
expr: KEY op = ('>' | '<') value = ( DATE | TIMESTAMP ) NEWLINE # exprGTORLT
| KEY op = ('>='| '<=') value = ( DATE | TIMESTAMP ) NEWLINE # exprGTEQORLTEQ
| KEY '=' value = ( STRING | DATE | TIMESTAMP ) NEWLINE # exprEQ
| KEY 'in' LIST NEWLINE #exprIn
When I make a predicate for example.
tab [key] in "value1", "value2"
ANTLR generates an error.
no viable alternative at input tab [key] in
What can I do to resolve this problem?
First tab [key] does not produce a KEY token like you want it to for two reasons:
It contains spaces and KEY doesn't allow any spaces. The best way to fix that would be to remove the KEY rule from your lexer and instead turn it into a parser rule (meaning you also need to turn [ and ] into their own tokens). Then the white space in your input would be between tokens and thus successfully skipped.
key is not actually one of the words listed in NAMES.
Then another issue is that in is recognized as an ID token, not an IN token. That's because both ID and IN would produce a match of the same length and in cases like that the rule that's listed first takes precedence. So you should define ID after all of the keywords because otherwise the keywords will never be matched.
Related
I'm using ANTLR 4 to parse a protocol's messages, let's name it 'X'. Before extracting a message's information , I have to check if it complies with X's rules.
Suppose we have to parse X's 'FOO' message that follows the following rules:
Message starts with the 'messageIdentifier' that consists of the 3-letter reserved word FOO.
Message contains 5 fields, of which the first 2 are mandatory (must be included) and the rest 3 are optional (can be not included).
Message's fields are separated by the character '/'. If there is no information in a field (that means that the field is optional and is omitted) the '/' character must be preserved. Optional fields and their associated filed separators '/' at the end of the message may be omitted where no further information within the message is reported.
A message can expand in multiple lines. Each line must have at least one non-empty field (mandatory or optional). Moreover, each line must start with a '/' character and end with a non-empty field following a '\n' character. Exception is the first line that always starts with the reserved word FOO.
Each message's field also has its own rules regarding the accepted tokens, which will be shown in the grammar below.
Sample examples of valid FOO messages:
FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
FOO/MANDATORY_1/MANDATORY2\n
/OPT 1\n
/HELLO\n
/100\n
FOO/MANDATORY_1/MANDATORY2\n
FOO/MANDATORY_1/MANDATORY2//HELLO/100\n
FOO/MANDATORY_1/MANDATORY2///100\n
FOO/MANDATORY_1/MANDATORY2/OPT 1\n
FOO/MANDATORY_1/MANDATORY2
///100\n
Sample examples of non-valid FOO messages:
FOO\n
/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
FOO/MANDATORY_1/\n
MANDATORY2/OPT 1/HELLO/100\n
FOO/MANDATORY_1/MANDATORY2/OPT 1//\n
FOO/MANDATORY_1/MANDATORY2/OPT 1/\n
/100\n
FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
FOO/MANDATORY_1/MANDATORY2/\n
FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100
Below follows the grammar for the above message:
grammar Foo_Message
/* Parser Rules */
startRule : 'FOO' mandatoryField_1 ;
mandatoryField_1 : '/' field_1 NL? mandatoryField_2 ;
mandatoryField_2 : '/' field_2 NL? optionalField_3 ;
optionalField_3 : '/' field_3 NL? optionalField_4
| '/' optionalField_4
| optionalField_4
;
optionalField_4 : '/' field_4 NL? optionalField_5
| '/' optionalField_5
| optionalField_5
;
optionalField_5 : '/' field_5 NL?
| NL
;
field_1 : (A | N | B | S)+ ;
field_2 : (A | N)+ ;
field_3 : (A | N | B)+ ;
field_4 : A+ ;
field_5 : N+ ;
/* Lexer Rules */
A : [A-Z]+ ;
N : [0-9]+ ;
B : ' ' -> skip ;
S : [*&##-_<>?!]+ ;
NL : '\r'? '\n' ;
The above grammar parses correctly any input that complies with FOO message's rules.
The problem resides in parsing a line that ends with the '/' character, which according to the protocol's FOO message's rules is an invalid input.
I understand that the second alternatives of rules 'optionalField_3', 'optionalField_4' and 'optionalField_5' lead to this behavior but I can't figure out how to make a rule for this.
Somehow I need the parser to remember that he came to 'optionalField_5' rule after seeing a non-omitted field in the previous rule, which if I am not mistaken can't be done in ANTLR as I can't check from which alternative of the previous rule I reached the current rule.
Is there a way to make the parser 'remember' this by some explicit option-rule? Or does my grammar need to be rearranged and if yes how?
This grammar accepts all examples, character for character copied/pasted from your post, and flags a parse error all "non-valid FOO messages".
grammar X;
file_ : s* EOF ;
s : FOO '/' f1 '/' f2 (
| NL? '/' f3
| NL? ('/' f3 NL? | '/' ) '/' f4
| NL? ('/' f3 NL? | '/' ) ('/' f4 NL? | '/') '/' f5
) NL;
f1 : (A | N | B | S)+ ;
f2 : (A | N | B)+ ;
f3 : (A | N | B)+ ;
f4 : A+ ;
f5 : N+ ;
FOO: 'FOO';
A : [A-Z]+ ;
N : [0-9]+ ;
B : ' ';
S : [*&##\-_<>?!]+ ;
NL : '\r'? '\n' ;
One can easily refactor this with folds and groupings.
In your previous grammar, lexer symbol B was marked as "skip". Skipped symbols do not appear on any token stream, and they should not be used directly on the right-hand side of a parser rule (see field_1 from your original grammar). It is innocuous because it is alted with other symbols, i.e. field_3:(A|N|B)+; will operate the same as field_3:(A|N)+;, but the rule field_3:(A|N|B)+; may be misleading to others because B will never appear in the parse tree. I felt that you wanted to include spaces in the fields, because perhaps you would want to compute the text for a field. Therefore, I changed the rule for B to appear as a token.
#5 from "non-valid FOO messages" is exactly the same character for character of #1 from "valid FOO messages", which you can see here:
#1: FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
#5: FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
I don't understand your comment "this allows the optional fields of the FOO message to come in any order". The grammar here and the previous grammar I mentioned in the comments force field3 to occur before field4, which occurs before field5. There is no way that field5 could occur before a field3: the requisite number of '/' must appear before field5. Fields can be empty (see #4 of "valid FOO messages"). To handle that, the field specified is a grouping, e.g., ('/' f3 NL? | '/' ). For this grouping, the only sentential forms are "/", "/f3", "/f3\n". Note, this grouping can only occur with a succeeding field, so it is impossible for two "\n" to be next to each other.
The other way to approach this is to use semantic predicates or evaluate the semantic equations after the entire parse.
If there are many more fields, then you will likely not want to add alts for f6, f7, ...., f10000. In that case, I would suggest that you allow an arbitrary type for each field in the parse:
s : FOO '/' f1 '/' f2 (
| NL? ('/' f NL? | '/' )* '/' f
) NL;
and validate the semantics afterwards.
Solution was to refactor my grammar to include rules for filledField and emptyField.
kaby76's post is marked as an answer as it helped towards the solution.
The refactored grammar:
grammar Foo_Message
/* Parser Rules */
startRule : 'FOO' mandatoryField_1 endRule ;
mandatoryField_1 : '/' field_1 NL? mandatoryField_2 ;
mandatoryField_2 : '/' field_2 NL? (filledOptionalField_3 | emptyOptionalField_3 )? ;
filledOptionalField_3 : '/' field_3 NL? (filledOptionalField_4 | emptyOptionalField_4)? ;
emptyOptionalField_3 : '/' (filledOptionalField_4 | emptyOptionalField_4) ;
filledOptionalField_4 : '/' field_4 NL? filledOptionalField_5? ;
emptyOptionalField_4 : '/' filledOptionalField_5 ;
filledOptionalField_5 : '/' field_5 ;
endRule : NL;
field_1 : (A | N | B | S)+ ;
field_2 : (A | N)+ ;
field_3 : (A | N | B)+ ;
field_4 : A+ ;
field_5 : N+ ;
/* Lexer Rules */
A : [A-Z]+ ;
N : [0-9]+ ;
B : ' ' -> skip ;
S : [*&##-_<>?!]+ ;
NL : '\r'? '\n' ;
I'm using ANTLR 4 to try and parse task definitions. The task definitions look a little like the following:
task = { priority = 10; };
My grammar file then looks like the following:
grammar TaskGrammar;
/* Parser rules */
task : 'task' ASSIGNMENT_OP block EOF;
logical_entity : (TRUE | FALSE) # LogicalConst
| IDENTIFIER # LogicalVariable
;
numeric_entity : DECIMAL # NumericConst
| IDENTIFIER # NumericVariable
;
block : LBRACE (statement)* RBRACE SEMICOLON;
assignment : IDENTIFIER ASSIGNMENT_OP DECIMAL SEMICOLON
| IDENTIFIER ASSIGNMENT_OP block SEMICOLON
| IDENTIFIER ASSIGNMENT_OP QUOTED_STRING SEMICOLON
| IDENTIFIER ASSIGNMENT_OP CONSTANT SEMICOLON;
functionCall : IDENTIFIER LPAREN (parameter)*? RPAREN SEMICOLON;
parameter : DECIMAL
| QUOTED_STRING;
statement : assignment
| functionCall;
/* Lexxer rules */
IF : 'if' ;
THEN : 'then';
AND : 'and' ;
OR : 'or' ;
TRUE : 'true' ;
FALSE : 'false' ;
MULT : '*' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '==' ;
ASSIGNMENT_OP : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
SEMICOLON : ';' ;
// DECIMAL, IDENTIFIER, COMMENTS, WS are set using regular expressions
DECIMAL : '-'?[0-9]+('.'[0-9]+)? ;
IDENTIFIER : [a-zA-Z_][a-zA-Z_0-9]* ;
Value: STR_EXT | QUOTED_STRING | SINGLE_QUOTED
;
STR_EXT
:
[a-zA-Z0-9_/\.,\-:=~+!?$&^*\[\]#|]+;
Comment
:
'#' ~[\r\n]*;
CONSTANT : StringCharacters;
QUOTED_STRING
:
'"' StringCharacters? '"'
;
fragment
StringCharacters
: (~["\\] | EscapeSequence)+
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]?
;
SINGLE_QUOTED
:
'\'' ~['\\]* '\'';
// COMMENT and WS are stripped from the output token stream by sending
// to a different channel 'skip'
COMMENT : '//' .+? ('\n'|EOF) -> skip ;
WS : [ \r\t\u000C\n]+ -> skip ;
This grammar compiles fine in ANTLR, but when it comes to trying to use the parser, I get the following error:
line 1:0 mismatched input 'task = { priority = 10; return = AND; };' expecting 'task'
org.antlr.v4.runtime.InputMismatchException
It looks like the parser isn't recognising the block part of the definition, but I can't quite see why. The block parse rule definition should match as far as I can tell. I would expect to have a TaskContext, with a child BlockContext containing a single AssignmentContext. I get the TaskContext, but it has the above exception.
Am I missing something here? This is my first attempt at using Antler, so may be getting confused between Lexxer and Parser rules...
Your STR_EXT consumes the entire input. That rule has to go: ANTLR's lexer will always try to match as much characters as possible.
I also see that CONSTANT might consume that entire input. It has to go to, or at least be changed to consume less chars.
I'm trying to write a code translator in Java with the help of Antlr4 and had great success with the grammar part so far. However I'm now banging my head against a wall wrapping my mind around the parse tree data structure that I need to work on after my input has been parsed.
I'm trying to use the visitor template to go over my parse tree. I'll show you an example to illustrate the points of my confusion.
My grammar:
grammar pqlc;
// Lexer
//Schlüsselwörter
EXISTS: 'exists';
REDUCE: 'reduce';
QUERY: 'query';
INT: 'int';
DOUBLE: 'double';
CONST: 'const';
STDVECTOR: 'std::vector';
STDMAP: 'std::map';
STDSET: 'std::set';
C_EXPR: 'c_expr';
INTEGER_LITERAL : (DIGIT)+ ;
fragment DIGIT: '0'..'9';
DOUBLE_LITERAL : DIGIT '.' DIGIT+;
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
DOT : '.';
EQUAL : '==';
LE : '<=';
GE : '>=';
GT : '>';
LT : '<';
ADD : '+';
MUL : '*';
AND : '&&';
COLON : ':';
IDENTIFIER : JavaLetter JavaLetterOrDigit*;
fragment JavaLetter : [a-zA-Z$_]; // these are the "java letters" below 0xFF
fragment JavaLetterOrDigit : [a-zA-Z0-9$_]; // these are the "java letters or digits" below 0xFF
WS
: [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Parser
//start_rule: query;
query :
quant_expr
| qexpr+
| IDENTIFIER // order IDENTIFIER and qexpr+?
| numeral
| c_expr //TODO
;
c_type : INT | DOUBLE | CONST;
bin_op: AND | ADD | MUL | EQUAL | LT | GT | LE| GE;
qexpr:
LPAREN query RPAREN bin_op_query?
// query bin_op query
| IDENTIFIER bin_op_query? // copied from query to resolve left recursion problem
| numeral bin_op_query? // ^
| quant_expr bin_op_query? // ^
|c_expr bin_op_query?
// query.find(query)
| IDENTIFIER find_query? // copied from query to resolve left recursion problem
| numeral find_query? // ^
| quant_expr find_query?
|c_expr find_query?
// query[query]
| IDENTIFIER array_query? // copied from query to resolve left recursion problem
| numeral array_query? // ^
| quant_expr array_query?
|c_expr array_query?
// | qexpr bin_op_query // bad, resolved by quexpr+ in query
;
bin_op_query: bin_op query bin_op_query?; // resolve left recursion of query bin_op query
find_query: '.''find' LPAREN query RPAREN;
array_query: LBRACK query RBRACK;
quant_expr:
quant id ':' query
| QUERY LPAREN match RPAREN ':' query
| REDUCE LPAREN IDENTIFIER RPAREN id ':' query
;
match:
STDVECTOR LBRACK id RBRACK EQUAL cm
| STDMAP '.''find' LPAREN cm RPAREN EQUAL cm
| STDSET '.''find' LPAREN cm RPAREN
;
cm:
IDENTIFIER
| numeral
| c_expr //TODO
;
quant :
EXISTS;
id :
c_type IDENTIFIER
| IDENTIFIER // Nach Seite 2 aber nicht der Übersicht. Laut übersicht id -> aber dann wäre Regel 1 ohne +
;
numeral :
INTEGER_LITERAL
| DOUBLE_LITERAL
;
c_expr:
C_EXPR
;
Now let's parse the following string:
double x: x >= c_expr
Visually I'll get this tree:
Let's say my visitor is in the visitQexpr(#NotNull pqlcParser.QexprContext ctx) routine when it hits the branch Qexpr(x bin_op_query).
My question is, how can I tell that the left children ("x" in the tree) is a terminal node, or more specifically an "IDENTIFIER"? There are no visiting rules for Terminal nodes since they aren't rules.
ctx.getChild(0) has no RuleIndex. I guess I could use that to check if I'm in a terminal or not, but that still wouldn't tell me if I was in IDENTIFIER or another kind of terminal token. I need to be able to tell the difference somehow.
I had more questions but in the time it took me to write the explanation I forgot them :<
Thanks in advance.
You can add labels to tokens and access them/check if they exist in the surrounding context:
id :
c_type labelA = IDENTIFIER
| labelB = IDENTIFIER
;
You could also do this to create different visits:
id :
c_type IDENTIFIER #idType1 //choose more appropriate names!
| IDENTIFIER #idType2
;
This will create different visitors for the two alternatives and I suppose (i.e. have not verified) that the visitor for id will not be called.
I prefer the following approach though:
id :
typeDef
| otherId
;
typeDef: c_type IDENTIFIER;
otherId : IDENTIFIER ;
This is a more heavily typed system. But you can very specifically visit nodes. Some rules of thumb I use:
Use | only when all alternatives are parser rules.
Wrap each Token in a parser rule (like otherId) to give them "more meaning".
It's ok to mix parser rules and tokens, if the tokens are not really important (like ;) and therefore not needed in the parse tree.
My ANTLR grammar looks like this.
grammar ProgCalc;
options {
language = Java;
ASTLabelType=CommonTree;
output=AST;
backtrack=true;
}
/* Parser rules */
eval
: exp=add;
add
: term ( PLUS^ term | MINUS^ term ) *;
term
: factor ( MULT^ factor | MOD^ factor )*;
factor
: number
| VARIABLE
| '('! add^ ')'!
;
number
: DEC | HEX | OCT;
/* Lexer Rules*/
VARIABLE: ('a'..'z' |'A'..'Z')('a'..'z'|'A'..'Z' | '0'..'9'|'_')* ;
DEC : ('1'..'9')('0'..'9')+;
HEX : '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+;
OCT : '0' ('0'..'7')*;
PLUS : '+';
MINUS : '-';
MULT : '*';
MOD : '%';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; };
When I compile, it was successful.
But as I parsed with an expression(eg. 5%3*5), I get an error.
line 1:1 required (...)+ loop did not match anything at character '%'
line 1:3 required (...)+ loop did not match anything at character '*'
line 1:5 required (...)+ loop did not match anything at character '<EOF>'
line 1:5 no viable alternative at input '<EOF>'
Could anyone please check my grammar and correct it?
Thank you very much.
Your DEC lexer rule requires at least 2 digits due to the + operator. I believe you meant to write:
DEC : ('1'..'9') ('0'..'9')*;
I have a small problem with my grammar.
I am trying to detect whether my string is a date comparison or not.
But the DATE lexer I created seem not to be recognized by antlr, and I get an error that I cannot solve.
Here is my input expression :
"FDT > '2007/10/09 12:00:0.0'"
I simply expect such a tree as output :
COMP_OP
FDT my_DATE
Here is my grammar :
// Aiming at parsing a complete BQS formed Query
grammar Logic;
options {
output=AST;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
// precedence order is (low to high): or, and, not, [comp_op, geo_op, rel_geo_op, like, not like, exists], ()
parse
: expression EOF -> expression
; // ommit the EOF token
expression
: query
;
query
: atom (COMP_OP^ DATE)*
;
//END BIG PART
atom
: ID
| | '(' expression ')' -> expression
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// GENERAL OPERATORS:
DATE : '\'' YEAR '/' MONTH '/' DAY (' ' HOUR ':' MINUTE ':' SECOND)? '\'';
ID : (CHARACTER|DIGIT|','|'.'|'\''|':'|'/')+;
COMP_OP : '=' | '<' | '>' | '<>' | '<=' | '>=';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment YEAR : DIGIT DIGIT DIGIT DIGIT;
fragment MONTH : DIGIT DIGIT;
fragment DAY : DIGIT DIGIT;
fragment HOUR : DIGIT DIGIT;
fragment MINUTE : DIGIT DIGIT;
fragment SECOND : DIGIT DIGIT ('.' (DIGIT)+)?;
fragment DIGIT : '0'..'9' ;
fragment DIGIT_SEQ :(DIGIT)+;
fragment CHARACTER : ('a'..'z' | 'A'..'Z');
As an output error, I get :
line 1:25 mismatched character '.' expecting set null
line 1:27 mismatched input ''' expecting DATE
I also tried to remove the ' ' from my Date (thinking that it was perhaps the problem, as I remove them in the grammar.)
In this case, I get this error :
line 1:6 mismatched input ''2007/10/09' expecting DATE
Can anyone explain me why I get such an error, and how I could solve it ?
This question is q subset of my complete task, where I have to differentiate lots of omparisons (dates, geographic, strings, . . .). I would thus really need to be able to give 'tags' to my atoms.
Thank you very much !
As a complement, here is my current Java code :
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// the expression
String src = "FDT > '2007/10/09 12:00:0.0'";
// create a lexer & parser
//LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
//LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
// invoke the entry point of the parser (the parse() method) and get the AST
CommonTree tree = (CommonTree)parser.parse().getTree();
// print the DOT representation of the AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
I finally got it,
Seems like even though I remove whitespaces, I still have to include them in expressions that contain one.
In addition, there was a small error in second definition, as the second digit was optional.
The grammar is thus slightly modified :
fragment SECOND : DIGIT (DIGIT)? ('.' (DIGIT)+)?;
which gives this output :
I know hope this will still work in my more complete grammar :)
Hope it helps someone.