I am trying to implement OOPath syntax in a Drools-like rule language, but I have issues with OOPaths that are not bound to a variable. For example, here is what I am trying to generate in the when block:
when $rt : string(Variables
$yr1t : /path1/F
$yr2t : /path2/F
$yr3t : path3.path4/PATH5
$yr4t : path3
$yr5t : /path3
Conditions
$yr4t == $yr5t + 3
$yr3t != $yr2t
//FROM HERE IS THE PROBLEM:
$yr3t == p/path/f
$yr3t == /g/t
/path2/F[g==$yr1t]
)
The problem is that my grammar doesn't support this format, and I don't know how to modify it so that it supports the last 3 statements.
Here is what I've tried so far:
Model:
declarations+=Declaration*
;
Temp:
elementType=ElementType
;
ElementType:
typeName=('string'|'int'|'boolean');
Declaration:
Rule
;
@Override
terminal ID: ('^'|'$')('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
terminal PATH: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
Rule:
'Filter'
'rule' ruleDescription=STRING
'#specification' QualifiedNameSpecification
'ruleflow-group' ruleflowDescription=STRING
'when' (name += ID ':' atribute += Temp '(' 'Variables'?
//(varName += ID ':' QualifiedNameVariabilePath)*
(variablesList += Variable)*
'Conditions'?
(exp += EvalExpression)*
')'
)*
;
QualifiedNameSpecification: '(' STRING ')';
QualifiedNameVariabilePath: (('/'|'.')? PATH)* ;
ExpressionsModel:
elements += AbstractElement*;
AbstractElement:
Variable | EvalExpression ;
Variable:
//'var'
name=ID ':' QualifiedNameVariabilePath; //expression=Expression;
EvalExpression:
//'eval'
expression=Expression;
Expression: Or;
Or returns Expression:
And ({Or.left=current} "||" right=And)*
;
And returns Expression:
Equality ({And.left=current} "&&" right=Equality)*
;
Equality returns Expression:
Comparison (
{Equality.left=current} op=("=="|"!=")
right=Comparison
)*
;
Comparison returns Expression:
PlusOrMinus (
{Comparison.left=current} op=(">="|"<="|">"|"<")
right=PlusOrMinus
)*
;
PlusOrMinus returns Expression:
MulOrDiv (
({Plus.left=current} '+' | {Minus.left=current} '-')
right=MulOrDiv
)*
;
MulOrDiv returns Expression:
Primary (
{MulOrDiv.left=current} op=('*'|'/')
right=Primary
)*
;
Primary returns Expression:
'(' Expression ')' |
{Not} "!" expression=Primary |
Atomic
;
Atomic returns Expression:
{IntConstant} value=INT |
{StringConstant} value=STRING |
{BoolConstant} value=('true'|'false') |
{VariableRef} variable=[Variable]
;
EDIT: To make it clearer, the question is how to modify my grammar so that it supports OOPath expressions that are not bound to a variable, something like a temporary object.
I'm using ANTLR 4 to parse a protocol's messages; let's call the protocol 'X'. Before extracting a message's information, I have to check whether it complies with X's rules.
Suppose we have to parse X's 'FOO' message that follows the following rules:
Message starts with the 'messageIdentifier' that consists of the 3-letter reserved word FOO.
Message contains 5 fields, of which the first 2 are mandatory (must be included) and the remaining 3 are optional (may be omitted).
Message's fields are separated by the character '/'. If a field carries no information (that is, the field is optional and is omitted), its '/' character must be preserved. Optional fields and their associated field separators '/' at the end of the message may be omitted when no further information is reported in the message.
A message can span multiple lines. Each line must have at least one non-empty field (mandatory or optional). Moreover, each line must start with a '/' character and end with a non-empty field followed by a '\n' character. The exception is the first line, which always starts with the reserved word FOO.
Each message's field also has its own rules regarding the accepted tokens, which will be shown in the grammar below.
Sample examples of valid FOO messages:
1. FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
2. FOO/MANDATORY_1/MANDATORY2\n
   /OPT 1\n
   /HELLO\n
   /100\n
3. FOO/MANDATORY_1/MANDATORY2\n
4. FOO/MANDATORY_1/MANDATORY2//HELLO/100\n
5. FOO/MANDATORY_1/MANDATORY2///100\n
6. FOO/MANDATORY_1/MANDATORY2/OPT 1\n
7. FOO/MANDATORY_1/MANDATORY2
   ///100\n
Sample examples of non-valid FOO messages:
1. FOO\n
   /MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
2. FOO/MANDATORY_1/\n
   MANDATORY2/OPT 1/HELLO/100\n
3. FOO/MANDATORY_1/MANDATORY2/OPT 1//\n
4. FOO/MANDATORY_1/MANDATORY2/OPT 1/\n
   /100\n
5. FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
6. FOO/MANDATORY_1/MANDATORY2/\n
7. FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100
Below follows the grammar for the above message:
grammar Foo_Message;
/* Parser Rules */
startRule : 'FOO' mandatoryField_1 ;
mandatoryField_1 : '/' field_1 NL? mandatoryField_2 ;
mandatoryField_2 : '/' field_2 NL? optionalField_3 ;
optionalField_3 : '/' field_3 NL? optionalField_4
| '/' optionalField_4
| optionalField_4
;
optionalField_4 : '/' field_4 NL? optionalField_5
| '/' optionalField_5
| optionalField_5
;
optionalField_5 : '/' field_5 NL?
| NL
;
field_1 : (A | N | B | S)+ ;
field_2 : (A | N)+ ;
field_3 : (A | N | B)+ ;
field_4 : A+ ;
field_5 : N+ ;
/* Lexer Rules */
A : [A-Z]+ ;
N : [0-9]+ ;
B : ' ' -> skip ;
S : [*&##\-_<>?!]+ ;
NL : '\r'? '\n' ;
The above grammar parses correctly any input that complies with FOO message's rules.
The problem is that it also accepts a line that ends with the '/' character, which according to the protocol's FOO message rules is invalid input.
I understand that the second alternatives of rules 'optionalField_3', 'optionalField_4' and 'optionalField_5' lead to this behavior, but I can't figure out how to write a rule that prevents it.
Somehow I need the parser to remember that it reached the 'optionalField_5' rule after seeing a non-omitted field in the previous rule. If I am not mistaken, that can't be done in ANTLR, because I can't check from which alternative of the previous rule I reached the current rule.
Is there a way to make the parser 'remember' this by some explicit option-rule? Or does my grammar need to be rearranged and if yes how?
This grammar accepts all the valid examples, copied and pasted character for character from your post, and flags a parse error for all the "non-valid FOO messages".
grammar X;
file_ : s* EOF ;
s : FOO '/' f1 '/' f2 (
| NL? '/' f3
| NL? ('/' f3 NL? | '/' ) '/' f4
| NL? ('/' f3 NL? | '/' ) ('/' f4 NL? | '/') '/' f5
) NL;
f1 : (A | N | B | S)+ ;
f2 : (A | N | B)+ ;
f3 : (A | N | B)+ ;
f4 : A+ ;
f5 : N+ ;
FOO: 'FOO';
A : [A-Z]+ ;
N : [0-9]+ ;
B : ' ';
S : [*&##\-_<>?!]+ ;
NL : '\r'? '\n' ;
One can easily refactor this with folds and groupings.
In your previous grammar, lexer symbol B was marked as "skip". Skipped symbols do not appear on any token stream, and they should not be used directly on the right-hand side of a parser rule (see field_1 from your original grammar). It is innocuous because it is alted with other symbols, i.e. field_3:(A|N|B)+; will operate the same as field_3:(A|N)+;, but the rule field_3:(A|N|B)+; may be misleading to others because B will never appear in the parse tree. I felt that you wanted to include spaces in the fields, because perhaps you would want to compute the text for a field. Therefore, I changed the rule for B to appear as a token.
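To make the effect of "skip" concrete, here is a minimal Python sketch of a tokenizer with a switch that mimics it. It is only an illustration under stated assumptions: the names RULES and tokenize are hypothetical, and it takes the first rule that matches instead of ANTLR's longest match, which makes no difference for these three non-overlapping rules.

```python
import re

# Hypothetical mirror of the lexer rules A, N and B from the grammar.
RULES = [
    ("A", re.compile(r"[A-Z]+")),
    ("N", re.compile(r"[0-9]+")),
    ("B", re.compile(r" ")),
]


def tokenize(text, skip_spaces):
    """Return (rule name, text) pairs; B is dropped when skip_spaces is True."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            if match:
                # A skipped rule still consumes input but emits no token,
                # so a parser rule can never see it.
                if not (name == "B" and skip_spaces):
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
    return tokens


print(tokenize("OPT 1", skip_spaces=True))
print(tokenize("OPT 1", skip_spaces=False))
```

With skipping enabled, the space never reaches the token list, which is why a reference to B in a parser rule can never match anything.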
#5 from "non-valid FOO messages" is exactly the same character for character of #1 from "valid FOO messages", which you can see here:
#1: FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
#5: FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
I don't understand your comment "this allows the optional fields of the FOO message to come in any order". The grammar here and the previous grammar I mentioned in the comments force field3 to occur before field4, which occurs before field5. There is no way that field5 could occur before a field3: the requisite number of '/' must appear before field5. Fields can be empty (see #4 of "valid FOO messages"). To handle that, the field specified is a grouping, e.g., ('/' f3 NL? | '/' ). For this grouping, the only sentential forms are "/", "/f3", "/f3\n". Note, this grouping can only occur with a succeeding field, so it is impossible for two "\n" to be next to each other.
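As a rough cross-check of the grouping described above, rule s can be hand-translated into a Python regular expression. This is a sketch, not a replacement for the parser: the names F1..F5, G3, G4 and MESSAGE are hypothetical, and the character classes are my transcription of the lexer rules with B treated as a real token.

```python
import re

NL = r"\r?\n"
F1 = r"[A-Z0-9 *&#\-_<>?!]+"  # (A | N | B | S)+
F2 = r"[A-Z0-9 ]+"            # (A | N | B)+
F3 = r"[A-Z0-9 ]+"            # (A | N | B)+
F4 = r"[A-Z]+"                # A+
F5 = r"[0-9]+"                # N+

# The grouping ('/' f3 NL? | '/') and its f4 counterpart.
G3 = rf"(?:/{F3}(?:{NL})?|/)"
G4 = rf"(?:/{F4}(?:{NL})?|/)"

# Rule s: the four alternatives in the same order as in the grammar.
MESSAGE = re.compile(
    rf"FOO/{F1}/{F2}"
    rf"(?:"
    rf"|(?:{NL})?/{F3}"
    rf"|(?:{NL})?{G3}/{F4}"
    rf"|(?:{NL})?{G3}{G4}/{F5}"
    rf"){NL}"
)

for candidate in ("/", "/OPT 1", "/OPT 1\n", "/OPT 1\n\n"):
    print(repr(candidate), bool(re.fullmatch(G3, candidate)))
```

The loop prints which strings the f3 grouping accepts on its own; MESSAGE.fullmatch(text) can be used in the same way on whole messages.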
The other way to approach this is to use semantic predicates or evaluate the semantic equations after the entire parse.
If there are many more fields, then you will likely not want to add alts for f6, f7, ...., f10000. In that case, I would suggest that you allow an arbitrary type for each field in the parse:
s : FOO '/' f1 '/' f2 (
| NL? ('/' f NL? | '/' )* '/' f
) NL;
and validate the semantics afterwards.
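The "parse generically, validate afterwards" idea can be sketched in plain Python, independent of ANTLR. Everything here is an assumption made for illustration: the function names split_fields and is_valid, the FIELD_PATTERNS table (my transcription of f1..f5), and the choice to check the line layout with string operations instead of a generated parser.

```python
import re

# One pattern per field position, transcribed from f1..f5.
FIELD_PATTERNS = [
    re.compile(r"[A-Z0-9 *&#\-_<>?!]+"),  # f1
    re.compile(r"[A-Z0-9 ]+"),            # f2
    re.compile(r"[A-Z0-9 ]+"),            # f3
    re.compile(r"[A-Z]+"),                # f4
    re.compile(r"[0-9]+"),                # f5
]
MANDATORY = 2  # the first two fields may not be empty


def split_fields(message):
    """Generic pass: return the raw fields, or None if the line layout is wrong."""
    message = message.replace("\r\n", "\n")
    if not message.endswith("\n"):
        return None
    lines = message[:-1].split("\n")
    if not lines[0].startswith("FOO/"):
        return None
    for index, line in enumerate(lines):
        if index > 0 and not line.startswith("/"):
            return None  # continuation lines must start with '/'
        if line.endswith("/"):
            return None  # every line must end with a non-empty field
    return "".join(lines)[len("FOO/"):].split("/")


def is_valid(message):
    """Semantic pass: check field count, mandatory fields and field types."""
    fields = split_fields(message)
    if fields is None or not MANDATORY <= len(fields) <= len(FIELD_PATTERNS):
        return False
    for index, (field, pattern) in enumerate(zip(fields, FIELD_PATTERNS)):
        if field == "":
            if index < MANDATORY:
                return False
            continue
        if not pattern.fullmatch(field):
            return False
    return True


print(is_valid("FOO/MANDATORY_1/MANDATORY2///100\n"))
print(is_valid("FOO/MANDATORY_1/MANDATORY2/\n"))
```

The same split applies when the generic rule is parsed by ANTLR: the grammar only guarantees the shape of the message, and a separate pass over the collected fields checks what each position may contain.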
Solution was to refactor my grammar to include rules for filledField and emptyField.
kaby76's post is marked as an answer as it helped towards the solution.
The refactored grammar:
grammar Foo_Message;
/* Parser Rules */
startRule : 'FOO' mandatoryField_1 endRule ;
mandatoryField_1 : '/' field_1 NL? mandatoryField_2 ;
mandatoryField_2 : '/' field_2 NL? (filledOptionalField_3 | emptyOptionalField_3 )? ;
filledOptionalField_3 : '/' field_3 NL? (filledOptionalField_4 | emptyOptionalField_4)? ;
emptyOptionalField_3 : '/' (filledOptionalField_4 | emptyOptionalField_4) ;
filledOptionalField_4 : '/' field_4 NL? filledOptionalField_5? ;
emptyOptionalField_4 : '/' filledOptionalField_5 ;
filledOptionalField_5 : '/' field_5 ;
endRule : NL;
field_1 : (A | N | B | S)+ ;
field_2 : (A | N)+ ;
field_3 : (A | N | B)+ ;
field_4 : A+ ;
field_5 : N+ ;
/* Lexer Rules */
A : [A-Z]+ ;
N : [0-9]+ ;
B : ' ' -> skip ;
S : [*&##\-_<>?!]+ ;
NL : '\r'? '\n' ;
Here is the beginning of my lexer rules:
F_TEXT_START
: {! matchingFText}? 'f"' {matchingFText = true;}
;
F_TEXT_PH_ESCAPE
: {matchingFText && ! matchingFTextPh}? '{=/'
;
F_TEXT_PH_START
: {matchingFText && ! matchingFTextPh}? '{=' {matchingFTextPh = true;}
;
F_TEXT_PH_END
: {matchingFText && matchingFTextPh}? '}' {matchingFTextPh = false;}
;
F_TEXT_CHAR
: {matchingFText && ! matchingFTextPh}? (~('"' | '{')+ | '""' | '{' ~'=')
;
F_TEXT_END
: {matchingFText && ! matchingFTextPh}? '"' {matchingFText = false;}
;
IF
: {! matchingFText || matchingFTextPh}? 'if'
;
ELIF
: {! matchingFText || matchingFTextPh}? 'elif'
;
// Lots of other keywords
fragment LETTER
: ('A' .. 'Z' | 'a' .. 'z' | '_')
;
VARIABLE
: {! matchingFText || matchingFTextPh}? LETTER (LETTER | DIGIT)*
;
My formatted text is not lexed as a single ordinary text token: it is prefixed with an f, and its parts are added to my parse tree so that I can tell whether it contains errors just by parsing (with parser.start()). A formatted text starts with f" and ends with "; any " inside it must be written as "". It can contain placeholders, which start with {= and end with }; to write a literal {=, you have to write {=/ instead.
The problem is that inside ordinary formatted-text content (outside a placeholder), the lexer started to match not only F_TEXT_CHAR but other lexer rules too, such as variables. What I did seems pretty dumb: I put a semantic predicate on every other rule to prevent it from matching inside a formatted text's content (while still allowing it inside a placeholder).
Isn't there a better way?
I'd use a lexical mode for this. To use lexical modes, you'll have to define separate lexer- and parser grammars. Here's a quick demo:
lexer grammar TestLexer;
F_TEXT_START
: 'f"' -> pushMode(F_TEXT)
;
VARIABLE
: LETTER (LETTER | DIGIT)*
;
F_TEXT_PH_END
: '}' -> popMode
;
SPACES
: [ \t\r\n]+ -> skip
;
fragment LETTER
: [a-zA-Z_]
;
fragment DIGIT
: [0-9]
;
mode F_TEXT;
// '{=/' escapes a literal '{=' inside the text, so it has to be matched in this mode
F_TEXT_PH_ESCAPE
: '{=/'
;
F_TEXT_CHAR
: ~["{]+ | '""' | '{' ~'='
;
F_TEXT_PH_START
: '{=' -> pushMode(DEFAULT_MODE)
;
F_TEXT_END
: '"' -> popMode
;
Use the lexer in your parser like this:
parser grammar TestParser;
options {
tokenVocab=TestLexer;
}
// ...
If you now tokenise the input f"mu {=mu}" mu, you'd get the following tokens:
F_TEXT_START `f"`
F_TEXT_CHAR `mu `
F_TEXT_PH_START `{=`
VARIABLE `mu`
F_TEXT_PH_END `}`
F_TEXT_END `"`
VARIABLE `mu`
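For readers who want to see what the mode stack does, here is a small Python simulation of the lexer above. It is a sketch under assumptions, not ANTLR's implementation: the names RULES and tokenize are hypothetical, the regexes are hand-translated from the rules, and the escape rule '{=/' is assumed to live in the F_TEXT mode. Like ANTLR, it takes the longest match and lets the earlier rule win a tie.

```python
import re

DEFAULT, F_TEXT = "DEFAULT_MODE", "F_TEXT"

# (token name, regex, action); action is None, "skip", "pop" or ("push", mode).
RULES = {
    DEFAULT: [
        ("F_TEXT_START", r'f"', ("push", F_TEXT)),
        ("VARIABLE", r"[A-Za-z_][A-Za-z_0-9]*", None),
        ("F_TEXT_PH_END", r"\}", "pop"),
        ("SPACES", r"[ \t\r\n]+", "skip"),
    ],
    F_TEXT: [
        ("F_TEXT_PH_ESCAPE", r"\{=/", None),
        ("F_TEXT_CHAR", r'[^"{]+|""|\{[^=]', None),
        ("F_TEXT_PH_START", r"\{=", ("push", DEFAULT)),
        ("F_TEXT_END", r'"', "pop"),
    ],
}


def tokenize(text):
    modes = [DEFAULT]  # the top of the stack is the active mode
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern, action in RULES[modes[-1]]:
            match = re.compile(pattern).match(text, pos)
            # Strictly longer wins, so on a tie the earlier rule is kept.
            if match and (best is None or match.end() > best[1].end()):
                best = (name, match, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, match, action = best
        pos = match.end()
        if action == "skip":
            continue
        tokens.append((name, match.group()))
        if action == "pop":
            modes.pop()
        elif isinstance(action, tuple):
            modes.append(action[1])
    return tokens


for token in tokenize('f"mu {=mu}" mu'):
    print(token)
```

Because only the rules of the active mode are tried, VARIABLE cannot fire inside the text content, which is what the semantic predicates in the question were emulating by hand.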
I'm using ANTLR 4 to try and parse task definitions. The task definitions look a little like the following:
task = { priority = 10; };
My grammar file then looks like the following:
grammar TaskGrammar;
/* Parser rules */
task : 'task' ASSIGNMENT_OP block EOF;
logical_entity : (TRUE | FALSE) # LogicalConst
| IDENTIFIER # LogicalVariable
;
numeric_entity : DECIMAL # NumericConst
| IDENTIFIER # NumericVariable
;
block : LBRACE (statement)* RBRACE SEMICOLON;
assignment : IDENTIFIER ASSIGNMENT_OP DECIMAL SEMICOLON
| IDENTIFIER ASSIGNMENT_OP block SEMICOLON
| IDENTIFIER ASSIGNMENT_OP QUOTED_STRING SEMICOLON
| IDENTIFIER ASSIGNMENT_OP CONSTANT SEMICOLON;
functionCall : IDENTIFIER LPAREN (parameter)*? RPAREN SEMICOLON;
parameter : DECIMAL
| QUOTED_STRING;
statement : assignment
| functionCall;
/* Lexer rules */
IF : 'if' ;
THEN : 'then';
AND : 'and' ;
OR : 'or' ;
TRUE : 'true' ;
FALSE : 'false' ;
MULT : '*' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '==' ;
ASSIGNMENT_OP : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
SEMICOLON : ';' ;
// DECIMAL, IDENTIFIER, COMMENTS, WS are set using regular expressions
DECIMAL : '-'?[0-9]+('.'[0-9]+)? ;
IDENTIFIER : [a-zA-Z_][a-zA-Z_0-9]* ;
Value: STR_EXT | QUOTED_STRING | SINGLE_QUOTED
;
STR_EXT
:
[a-zA-Z0-9_/\.,\-:=~+!?$&^*\[\]#|]+;
Comment
:
'#' ~[\r\n]*;
CONSTANT : StringCharacters;
QUOTED_STRING
:
'"' StringCharacters? '"'
;
fragment
StringCharacters
: (~["\\] | EscapeSequence)+
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]?
;
SINGLE_QUOTED
:
'\'' ~['\\]* '\'';
// COMMENT and WS are stripped from the output token stream by the
// 'skip' command (the tokens are discarded, not sent to another channel)
COMMENT : '//' .+? ('\n'|EOF) -> skip ;
WS : [ \r\t\u000C\n]+ -> skip ;
This grammar compiles fine in ANTLR, but when it comes to trying to use the parser, I get the following error:
line 1:0 mismatched input 'task = { priority = 10; return = AND; };' expecting 'task'
org.antlr.v4.runtime.InputMismatchException
It looks like the parser isn't recognising the block part of the definition, but I can't quite see why. The block parse rule definition should match as far as I can tell. I would expect to have a TaskContext, with a child BlockContext containing a single AssignmentContext. I get the TaskContext, but it has the above exception.
Am I missing something here? This is my first attempt at using ANTLR, so I may be getting confused between lexer and parser rules...
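Your lexer contains rules that match far more than you intend, and ANTLR's lexer always tries to match as many characters as possible. CONSTANT (through StringCharacters) matches any run of characters other than a double quote or a backslash, so it consumes the entire input as a single token, which is exactly the token reported in your error message. That rule has to go, or at least be changed to consume fewer characters.
STR_EXT is greedy in the same way: it overlaps with IDENTIFIER, DECIMAL and most of the operators, so an input such as a=b written without spaces is lexed as one token instead of three. It has to go too.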
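The longest-match behaviour can be reproduced outside ANTLR with a few lines of Python. This is only an illustrative sketch: RULES is a hypothetical subset of the lexer rules from the question, hand-translated into Python regexes, and first_token is a made-up helper that picks the rule with the longest match, letting the earlier rule win a tie.

```python
import re

# Subset of the question's lexer rules, in definition order.
RULES = [
    ("'task'", r"task"),
    ("ASSIGNMENT_OP", r"="),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("STR_EXT", r"[a-zA-Z0-9_/.,\-:=~+!?$&^*\[\]#|]+"),
    ("CONSTANT", r'(?:[^"\\]|\\[btnfr"\'\\]?)+'),
]


def first_token(text, rules):
    """Return (rule name, matched text) for the first token of text."""
    best_name, best_length = None, 0
    for name, pattern in rules:
        match = re.match(pattern, text)
        # Only a strictly longer match replaces the current best,
        # so the rule defined first wins when lengths are equal.
        if match and match.end() > best_length:
            best_name, best_length = name, match.end()
    return best_name, text[:best_length]


source = "task = { priority = 10; };"
print(first_token(source, RULES))      # with the greedy rules present
print(first_token(source, RULES[:3]))  # with STR_EXT and CONSTANT removed
```

The first call reports a single token spanning the whole line; the second reports the keyword the parser was expecting.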
Consider the following simple grammar.
grammar test;
options {
language = Java;
output = AST;
}
//imaginary tokens
tokens{
}
parse
: declaration
;
declaration
: forall
;
forall
:'forall' '('rule1')' '[' (( '(' rule2 ')' '|' )* ) ']'
;
rule1
: INT
;
rule2
: ID
;
ID
: ('a'..'z' | 'A'..'Z'|'_')('a'..'z' | 'A'..'Z'|'0'..'9'|'_')*
;
INT
: ('0'..'9')+
;
WHITESPACE
: ('\t' | ' ' | '\r' | '\n' | '\u000C')+ {$channel = HIDDEN;}
;
and here is the input
forall (1) [(first) | (second) | (third) | (fourth) | (fifth) |]
The grammar works fine for the above input but I want to get rid of the extra pipe symbol (2nd last character in the input) from the input.
Any thoughts/ideas?
My antlr syntax is a bit rusty but you should try something like this:
forall
:'forall' '('rule1')' '[' ('(' rule2 ')' ('|' '(' rule2 ')' )* )? ']'
;
That is, instead of (r|)* write (r(|r)*)?. You can see how the latter allows for zero, one or many rules with pipes in between.
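The difference between the two shapes can be checked quickly with Python regular expressions. This is a sketch for illustration only: ITEM, HEAD, TAIL, TRAILING and SEPARATED are hypothetical names, and the patterns approximate the forall rule with explicit whitespace handling.

```python
import re

ITEM = r"\(\s*[A-Za-z_][A-Za-z0-9_]*\s*\)"  # '(' rule2 ')'
HEAD = r"forall\s*\(\s*[0-9]+\s*\)\s*\[\s*"  # 'forall' '(' rule1 ')' '['
TAIL = r"\s*\]"

# (r|)* : every item must be followed by a pipe.
TRAILING = re.compile(HEAD + rf"(?:{ITEM}\s*\|\s*)*" + TAIL)
# (r(|r)*)? : pipes appear only between items.
SEPARATED = re.compile(HEAD + rf"(?:{ITEM}(?:\s*\|\s*{ITEM})*)?" + TAIL)

with_pipe = "forall (1) [(first) | (second) | (third) |]"
without_pipe = "forall (1) [(first) | (second) | (third)]"

print(bool(TRAILING.fullmatch(with_pipe)), bool(TRAILING.fullmatch(without_pipe)))
print(bool(SEPARATED.fullmatch(with_pipe)), bool(SEPARATED.fullmatch(without_pipe)))
```

The separated form also accepts an empty list and a single item, because the whole group is optional and the repetition inside it may run zero times.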
Okay, for my third ANTLR question in two days:
My Grammar is meant to parse boolean statements, something like this:
AGE > 21 AND AGE < 35
Since this is a relatively simple grammar, I embedded the code rather than using an AST. The rule looks like this:
: a=singleEvaluation { $evalResult = $a.evalResult;}
(('AND') b=singleEvaluation {$evalResult = $evalResult && $b.evalResult;})+
{
// code
}
;
Now I need to implement order of operations using parentheses, to parse something like this:
AGE >= 21 AND (DEPARTMENT=1000 OR DEPARTMENT=1001)
or even worse:
AGE >= 21 AND (DEPARTMENT=1000 OR (EMPID=1000 OR EMPID=1001))
Can anyone suggest a way to implement the recursion needed? I'd rather not switch to an AST at this late stage, and I'm still a relative noob at this.
Jason
Since some of your rules evaluate to a boolean, and others to an integer (or only compare integers), you'd best let your rules return a generic Object, and cast accordingly.
Here's a quick demo (including making a recursive call in case of parenthesized expressions):
grammar T;
@parser::members {
private java.util.Map<String, Integer> memory = new java.util.HashMap<String, Integer>();
}
parse
@init{
// initialize some test values
memory.put("AGE", 42);
memory.put("DEPARTMENT", 999);
memory.put("EMPID", 1001);
}
: expression EOF {System.out.println($text + " -> " + $expression.value);}
;
expression returns [Object value]
: logical {$value = $logical.value;}
;
logical returns [Object value]
: e1=equality {$value = $e1.value;} ( 'AND' e2=equality {$value = (Boolean)$value && (Boolean)$e2.value;}
| 'OR' e2=equality {$value = (Boolean)$value || (Boolean)$e2.value;}
)*
;
equality returns [Object value]
: r1=relational {$value = $r1.value;} ( '=' r2=relational {$value = $value.equals($r2.value);}
| '!=' r2=relational {$value = !$value.equals($r2.value);}
)*
;
relational returns [Object value]
: a1=atom {$value = $a1.value;} ( '>=' a2=atom {$value = (Integer)$a1.value >= (Integer)$a2.value;}
| '>' a2=atom {$value = (Integer)$a1.value > (Integer)$a2.value;}
| '<=' a2=atom {$value = (Integer)$a1.value <= (Integer)$a2.value;}
| '<' a2=atom {$value = (Integer)$a1.value < (Integer)$a2.value;}
)?
;
atom returns [Object value]
: INTEGER {$value = Integer.valueOf($INTEGER.text);}
| ID {$value = memory.get($ID.text);}
| '(' expression ')' {$value = $expression.value;}
;
INTEGER : '0'..'9'+;
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : ' ' {$channel=HIDDEN;};
Parsing the input "AGE >= 21 AND (DEPARTMENT=1000 OR (EMPID=1000 OR EMPID=1001))" would result in the following output:
AGE >= 21 AND (DEPARTMENT=1000 OR (EMPID=1000 OR EMPID=1001)) -> true
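The recursion in the grammar above may be easier to follow as a hand-written recursive-descent evaluator. The Python sketch below mirrors the rule structure (logical, equality, relational, atom) but is not generated by ANTLR; the class name Evaluator, the helper evaluate and the MEMORY table with its test values are assumptions made for the illustration.

```python
import operator
import re

TOKEN = re.compile(r"\s*(>=|<=|!=|[()=<>]|\d+|[A-Za-z]+)")
RELATIONAL = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}
MEMORY = {"AGE": 42, "DEPARTMENT": 999, "EMPID": 1001}  # same test values as the demo


def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Evaluator:
    def __init__(self, text, memory):
        self.tokens, self.pos, self.memory = tokenize(text), 0, memory

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def logical(self):  # equality (('AND' | 'OR') equality)*
        value = self.equality()
        while self.peek() in ("AND", "OR"):
            op, right = self.take(), self.equality()
            value = (value and right) if op == "AND" else (value or right)
        return value

    def equality(self):  # relational (('=' | '!=') relational)*
        value = self.relational()
        while self.peek() in ("=", "!="):
            op, right = self.take(), self.relational()
            value = (value == right) if op == "=" else (value != right)
        return value

    def relational(self):  # atom (('>=' | '>' | '<=' | '<') atom)?
        left = self.atom()
        if self.peek() in RELATIONAL:
            op, right = self.take(), self.atom()
            return RELATIONAL[op](left, right)
        return left

    def atom(self):  # INTEGER | ID | '(' expression ')'
        token = self.take()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            value = self.logical()  # the recursive call for parentheses
            if self.take() != ")":
                raise SyntaxError("missing ')'")
            return value
        return int(token) if token.isdigit() else self.memory[token]


def evaluate(text, memory=MEMORY):
    parser = Evaluator(text, memory)
    value = parser.logical()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value


print(evaluate("AGE >= 21 AND (DEPARTMENT=1000 OR (EMPID=1000 OR EMPID=1001))"))
```

As in the grammar, AND and OR share one precedence level and are applied left to right; the only place the evaluator calls back into the top-level rule is the parenthesized alternative of atom.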
I would do it like this:
program : a=logicalExpression {System.out.println($a.evalResult);}
;
logicalExpression returns [boolean evalResult] : a=andExpression { $evalResult = $a.evalResult;} (('OR') b=andExpression {$evalResult = $evalResult || $b.evalResult;})*
;
andExpression returns [boolean evalResult] : a=atomicExpression { $evalResult = $a.evalResult;} (('AND') b=atomicExpression {$evalResult = $evalResult && $b.evalResult;})*
;
atomicExpression returns [boolean evalResult] : a=singleEvaluation {$evalResult = $a.evalResult;}
| '(' b=logicalExpression ')' {$evalResult = $b.evalResult;}
;
singleEvaluation returns [boolean evalResult ] : 'TRUE' {$evalResult = true;}
| 'FALSE' {$evalResult = false;}
;