ANTLR: minus expression precedence and different results with Grun - java

I have a grammar like this:
/* entry point */
parse: expr EOF;
expr
: value # argumentArithmeticExpr
| l=expr operator=(MULT|DIV) r=expr # multdivArithmeticExpr
| l=expr operator=(PLUS|MINUS) r=expr # addsubtArithmeticExpr
| operator=('-'|'+') r=expr # minusPlusArithmeticExpr
| IDENTIFIER '(' (expr ( COMMA expr )* ) ? ')'# functionExpr
| LPAREN expr RPAREN # parensArithmeticExpr
;
value
: number
| variable
| string // contains date
| bool
| null_value
;
/* Atomes */
bool
: BOOL
;
variable
: VARIABLE
;
string
: STRING_LITERAL
;
number
: ('+'|'-')? NUMERIC_LITERAL
;
null_value
: NULL // TODO: test this
;
IDENTIFIER
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )? // ex: 0.05e3
| '.' DIGIT+ ( E [-+]? DIGIT+ )? // ex: .05e3
;
INT: DIGIT+;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
| '"' ( ~'"' | '""' )* '"'
;
VARIABLE
: LBRACKET ( ~']' | ' ')* RBRACKET
;
Now, I want to parse this:
-1.3 * 5 + -2 * 7
With Grun, I get this:
antlr4 formula.g4 && javac *.java && time grun formula parse -gui
-1.3*5 + -2*7
^D
Which looks OK and I would be happy with that.
But in my Java code, I get called like this using the Visitor pattern:
visitMinusPlusArithmeticExpr -1.3*5+-2*7 // ugh ?? sees "- (1.3 * 5 + - 2 * 7 )" instead of "(-1.3*5) + (-2*7)"
visitAddsubtArithmeticExpr 1.3*5+-2*7
visitMultdivArithmeticExpr 1.3*5
visitArgumentArithmeticExpr 1.3
visitNumber 1.3
visitArgumentArithmeticExpr 5
visitValue 5
visitNumber 5
visitMinusPlusArithmeticExpr -2*7 // UHG? should see a MultDiv with -2 and 7
visitMultdivArithmeticExpr 2*7
visitArgumentArithmeticExpr 2
visitValue 2
visitNumber 2
visitArgumentArithmeticExpr 7
visitValue 7
visitNumber 7
Which means that I don't get my negative number (-1.3), but rather the 'minus expression', which I should not get.
Why is my Java result different from Grun ? I have verified that the grammar is recompiled and I use my parser like this:
formulaLexer lexer = new formulaLexer(new ANTLRInputStream(s));
formulaParser parser = new formulaParser(new CommonTokenStream(lexer));
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
parser.setErrorHandler(new BailErrorStrategy()); // will throw exceptions on failure
formula = tryParse(parser);
if( formula == null && errors.isEmpty() ){
// the parsing failed, retry in LL mode
parser.getInterpreter().setPredictionMode(PredictionMode.LL);
parser.reset();
tryParse(parser);
}
I disabled SLL mode to check whether it was the cause, and the result was the same.
I thought this could be a precedence problem, but in my expr rule the value alternative comes first and minusPlusArithmeticExpr only later.
I don't understand why the parser picks this 'minus' expression instead of my 'negative value'. Can you check this?
Also, why does Grun show the correct behavior and not my Java code?
EDIT
Following the advice in the comments, I modified the grammar to look like this:
expr
: value # argumentArithmeticExpr
| (PLUS|MINUS) expr # plusMinusExpr
| l=expr operator=(MULT|DIV) r=expr # multdivArithmeticExpr
| l=expr operator=(PLUS|MINUS) r=expr # addsubtArithmeticExpr
| function=IDENTIFIER '(' (expr ( COMMA expr )* ) ? ')'# functionExpr
| '(' expr ')' # parensArithmeticExpr
;
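To see why the order of alternatives matters: in an ANTLR 4 left-recursive rule, alternatives listed earlier bind tighter, so placing the unary (PLUS|MINUS) alternative ahead of the binary ones lets the sign attach to its operand before multiplication is considered. The sketch below is a hand-written recursive-descent parser in plain Java, not ANTLR output; the class name UnaryPrecedenceDemo and its methods are made up for illustration, and it only handles numbers, parentheses and the four operators. It returns the input fully parenthesised so the grouping is visible.

```java
// Hand-written recursive-descent parser (not ANTLR-generated), illustration only.
class UnaryPrecedenceDemo {
    private final String src;
    private int pos = 0;

    private UnaryPrecedenceDemo(String input) {
        this.src = input.replace(" ", "");
    }

    /** Returns the input fully parenthesised so the grouping is visible. */
    static String group(String input) {
        UnaryPrecedenceDemo p = new UnaryPrecedenceDemo(input);
        String result = p.addSub();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return result;
    }

    private boolean at(char a, char b) {
        return pos < src.length() && (src.charAt(pos) == a || src.charAt(pos) == b);
    }

    // Lowest precedence: addSub : mulDiv (('+'|'-') mulDiv)*
    private String addSub() {
        String left = mulDiv();
        while (at('+', '-')) {
            char op = src.charAt(pos++);
            left = "(" + left + op + mulDiv() + ")";
        }
        return left;
    }

    // Middle precedence: mulDiv : unary (('*'|'/') unary)*
    private String mulDiv() {
        String left = unary();
        while (at('*', '/')) {
            char op = src.charAt(pos++);
            left = "(" + left + op + unary() + ")";
        }
        return left;
    }

    // Highest precedence: unary : ('+'|'-') unary | atom
    private String unary() {
        if (at('+', '-')) {
            char sign = src.charAt(pos++);
            return "(" + sign + unary() + ")";
        }
        return atom();
    }

    // atom : NUMBER | '(' addSub ')'
    private String atom() {
        if (at('(', '(')) {
            pos++;
            String inner = addSub();
            if (!at(')', ')')) throw new IllegalArgumentException("')' expected at " + pos);
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < src.length()
                && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("number expected at " + pos);
        return src.substring(start, pos);
    }

    public static void main(String[] args) {
        // prints (((-1.3)*5)+((-2)*7))
        System.out.println(group("-1.3 * 5 + -2 * 7"));
    }
}
```

Moving the sign handling below mulDiv (so that it is tried last) would instead make the leading minus swallow the whole product, which is the shape seen in the visitor trace of the original grammar.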
But now I want to optimize the case where a single "-1.3" appears somewhere.
I don't know how to do it correctly, since when I land in visitPlusMinusExpr, I have to check whether the terminal node is a number.
Here is what I get while debugging:
ctx = {formulaParser$PlusMinusExprContext#929} "[16]"
children = {ArrayList#955} size = 2
0 = {TerminalNodeImpl#962} "-"
1 = {formulaParser$ArgumentArithmeticExprContext#963} "[21 16]"
children = {ArrayList#967} size = 1
0 = {formulaParser$ValueContext#990} "[22 21 16]"
children = {ArrayList#992} size = 1
0 = {formulaParser$NumberContext#997} "[53 22 21 16]"
children = {ArrayList#999} size = 1
0 = {TerminalNodeImpl#1004} "1.3"
I suspect I should walk down the tree and tell if the terminal node is a number, but it seems cumbersome. Do you have any idea on how to do that without compromising legibility of my code?

Ok, for those interested, Lucas and Bart got the answer, and my implementation is like this:
expr
: value # argumentArithmeticExpr
| (PLUS|MINUS) expr # plusMinusExpr
| l=expr operator=(MULT|DIV) r=expr # multdivArithmeticExpr
| l=expr operator=(PLUS|MINUS) r=expr # addsubtArithmeticExpr
| function=IDENTIFIER '(' (expr ( COMMA expr )* ) ? ')'# functionExpr
| '(' expr ')' # parensArithmeticExpr
;
And in the visitor of plusMinusExpr:
@Override
public Formula visitPlusMinusExpr(formulaParser.PlusMinusExprContext ctx) {
if( debug ) LOG.log(Level.INFO, "visitPlusMinusExpr " + ctx.getText());
Formula formulaExpr = visit(ctx.expr());
if( ctx.MINUS() == null ) return formulaExpr;
else {
if(formulaExpr instanceof DoubleFormula){
// optimization for numeric values: we don't return "(0.0 MINUS THEVALUE)" but directly "-THEVALUE"
Double v = - ((DoubleFormula) formulaExpr).getValue();
return new DoubleFormula( v );
} else {
return ArithmeticOperator.MINUS( 0, formulaExpr);
}
}
}
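The folding step in this visitor can be tried in isolation. Below is a minimal sketch with stand-in classes: Formula, DoubleFormula, Variable and Negate here are hypothetical simplifications written for this example, not the classes from my project, and Negate stands in for the ArithmeticOperator.MINUS(0, ...) call. The point is only that the check happens on the already-visited child, so there is no need to walk down the parse tree.

```java
// Stand-in types for illustration; they are not the real project classes.
class NegationFoldingDemo {
    interface Formula { double eval(); }

    static final class DoubleFormula implements Formula {
        final double value;
        DoubleFormula(double value) { this.value = value; }
        @Override public double eval() { return value; }
    }

    static final class Variable implements Formula {
        double current;
        Variable(double initial) { this.current = initial; }
        @Override public double eval() { return current; }
    }

    static final class Negate implements Formula {
        final Formula operand;
        Negate(Formula operand) { this.operand = operand; }
        @Override public double eval() { return -operand.eval(); }
    }

    /** Folds the minus into a constant operand, otherwise wraps the operand. */
    static Formula negate(Formula operand) {
        if (operand instanceof DoubleFormula) {
            return new DoubleFormula(-((DoubleFormula) operand).value);
        }
        return new Negate(operand);
    }

    public static void main(String[] args) {
        // prints DoubleFormula
        System.out.println(negate(new DoubleFormula(1.3)).getClass().getSimpleName());
        // prints Negate
        System.out.println(negate(new Variable(2.0)).getClass().getSimpleName());
    }
}
```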

Related

Antlr - Why it expect FunctionCall but PrintCommand gave

My ANTLR grammar expects a FunctionCall, but in the example code for my ANTLR-built compiler I wrote a print command. Does someone know why, and how to fix that? The print command is RetroBox.show(); it should be recognised by going from blockStatement to statement to localFunctionCall to printCommand.
Here is my ANTLR grammar:
grammar Mars;
// ******************************LEXER BEGIN*****************************************
// Keywords
FUNC: 'func';
ENTRY: 'entry';
VARI: 'vari';
VARF: 'varf';
VARC: 'varc';
VARS: 'vars';
LET: 'let';
INCREMENTS: 'increments';
RETROBOX: 'retrobox';
SHOW: 'show';
// Literals
DECIMAL_LITERAL: ('0' | [1-9] (Digits? | '_'+ Digits)) [lL]?;
FLOAT_LITERAL: (Digits '.' Digits? | '.' Digits) ExponentPart? [fFdD]?
| Digits (ExponentPart [fFdD]? | [fFdD])
;
CHAR_LITERAL: '\'' (~['\\\r\n] | EscapeSequence) '\'';
STRING_LITERAL: '"' (~["\\\r\n] | EscapeSequence)* '"';
// Separators
ORBRACKET: '(';
CRBRACKET: ')';
OEBRACKET: '{';
CEBRACKET: '}';
SEMI: ';';
POINT: '.';
// Operators
ASSIGN: '=';
// Whitespace and comments
WS: [ \t\r\n\u000C]+ -> channel(HIDDEN);
COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN);
// Identifiers
IDENTIFIER: Letter LetterOrDigit*;
// Fragment rules
fragment ExponentPart
: [eE] [+-]? Digits
;
fragment EscapeSequence
: '\\' [btnfr"'\\]
| '\\' ([0-3]? [0-7])? [0-7]
| '\\' 'u'+ HexDigit HexDigit HexDigit HexDigit
;
fragment HexDigits
: HexDigit ((HexDigit | '_')* HexDigit)?
;
fragment HexDigit
: [0-9a-fA-F]
;
fragment Digits
: [0-9] ([0-9_]* [0-9])?
;
fragment LetterOrDigit
: Letter
| [0-9]
;
fragment Letter
: [a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
// *******************************LEXER END****************************************
// *****************************PARSER BEGIN*****************************************
program
: mainfunction #Programm
| /*EMPTY*/ #Garnichts
;
mainfunction
: FUNC VARI ENTRY ORBRACKET CRBRACKET block #NormaleHauptmethode
;
block
: '{' blockStatement '}' #CodeBlock
| /*EMPTY*/ #EmptyCodeBlock
;
blockStatement
: statement* #Befehl
;
statement
: localVariableDeclaration
| localVariableInitialization
| localFunctionImplementation
| localFunctionCall
;
expression
: left=expression op='%'
| left=expression op=('*' | '/') right=expression
| left=expression op=('+' | '-') right=expression
| neg='-' right=expression
| number
| IDENTIFIER
| '(' expression ')'
;
number
: DECIMAL_LITERAL
| FLOAT_LITERAL
;
localFunctionImplementation
: FUNC primitiveType IDENTIFIER ORBRACKET CRBRACKET block #Methodenimplementierung
;
localFunctionCall
: IDENTIFIER ORBRACKET CRBRACKET SEMI #Methodenaufruf
| printCommand #RetroBoxShowCommand
;
printCommand
: RETROBOX POINT SHOW ORBRACKET params=primitiveLiteral CRBRACKET SEMI #PrintCommandWP
;
localVariableDeclaration
: varTypeDek=primitiveType IDENTIFIER SEMI #Variablendeklaration
;
localVariableInitialization
: varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=primitiveLiteral SEMI #VariableninitKonst
| varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=expression SEMI #VariableninitExpr
;
primitiveLiteral
: DECIMAL_LITERAL
| FLOAT_LITERAL
| STRING_LITERAL
| CHAR_LITERAL
;
primitiveType
: VARI
| VARC
| VARF
| VARS
;
// ******************************PARSER END****************************************
Here is my example code:
func vari entry()
{
RetroBox.show("Hallo"); //Should be recognised as print-command
}
And here is the AST printed by ANTLR:
[image: AST from Compiler]
The problem is that your RETROBOX keyword is 'retrobox' but your example code has it typed as 'RetroBox'. Antlr parses 'RetroBox' as an identifier so the following '.' is unexpected.
Antlr should emit an error: "line 3:12 mismatched input '.' expecting '('".
Then it attempts to recover and continue parsing. It tries single token deletion (just ignoring the '.') and finds that that works... except the rule it now matches is #Methodenaufruf instead of #RetroBoxShowCommand.
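The case sensitivity is easy to reproduce outside ANTLR. This is a minimal sketch, assuming a lexer that looks words up in a keyword table and falls back to IDENTIFIER; the class KeywordCaseDemo and its classify method are invented for the example and only mimic what the generated lexer does with the literals 'retrobox' and 'show'.

```java
import java.util.Map;

// Toy keyword lookup (not the generated ANTLR lexer), illustration only.
class KeywordCaseDemo {
    // Same spellings as the keyword rules in the grammar above.
    private static final Map<String, String> KEYWORDS = Map.of(
            "func", "FUNC",
            "retrobox", "RETROBOX",
            "show", "SHOW");

    /** Returns the keyword token name, or IDENTIFIER for anything else. */
    static String classify(String word) {
        return KEYWORDS.getOrDefault(word, "IDENTIFIER");
    }

    public static void main(String[] args) {
        System.out.println(classify("retrobox")); // prints RETROBOX
        System.out.println(classify("RetroBox")); // prints IDENTIFIER
    }
}
```

If that is what is happening, either writing retrobox.show("Hallo"); in the input or changing the lexer rule to RETROBOX: 'RetroBox'; should let the print command match.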

ANTLR - Join tokens to output

Using ANTLR3, I want to parse strings such as:
name IS NOT empty AND age NOT IN (14, 15)
For this input, I want to get the following AST:
n0 [label="QUERY"];
n1 [label="AND"];
n2 [label="IS NOT"];
n3 [label="name"];
n4 [label="empty"];
n5 [label="NOT IN"];
n6 [label="age"];
n7 [label="14"];
n8 [label="15"];
n0 -> n1 // "QUERY" -> "AND"
n1 -> n2 // "AND" -> "IS NOT"
n2 -> n3 // "IS NOT" -> "name"
n2 -> n4 // "IS NOT" -> "empty"
n1 -> n5 // "AND" -> "NOT IN"
n5 -> n6 // "NOT IN" -> "age"
n5 -> n7 // "NOT IN" -> "14"
n5 -> n8 // "NOT IN" -> "15"
But my n2 and n5 nodes appear like this:
n2 [label="IS"];
n5 [label="NOT"];
I.e., only the first word appears. How can I join both tokens into a single one?
My grammar is:
query
: expr EOF -> ^(QUERY expr)
;
expr
: logical_expr
;
logical_expr
: equality_expr (logical_op^ equality_expr)*
;
equality_expr
: ID equality_op+ atom -> ^(equality_op ID atom)
| '(' expr ')' -> ^('(' expr)
;
atom
: ID
| id_list
| Int
| Number
| String
| '*'
;
id_list
: '(' ID (',' ID)+ ')' -> ID+
| '(' Number (',' Number)* ')' -> Number+
| '(' String (',' String)* ')' -> String+
;
equality_op
: 'IN'
| 'IS'
| 'NOT'
| 'in'
| 'is'
| 'not'
;
logical_op
: 'AND'
| 'OR'
| 'and'
| 'or'
;
Number
: Int ('.' Digit*)?
;
ID
: ('a'..'z' | 'A'..'Z' | '_' | '.' | '-' | '*' | '/' | ':' | Digit)*
;
String
@after {
setText(getText().substring(1, getText().length()-1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"' | '\\') | '\\' ('\\' | '"'))* '"'
| '\'' (~('\'' | '\\') | '\\' ('\\' | '\''))* '\''
;
Comment
: '//' ~('\r' | '\n')* {skip();}
| '/*' .* '*/' {skip();}
;
Space
: (' ' | '\t' | '\r' | '\n' | '\u000C') {skip();}
;
fragment Int
: '1'..'9' Digit*
| '0'
;
fragment Digit
: '0'..'9'
;
indexes
: ('[' expr ']')+ -> ^(INDEXES expr+)
;
Do something like this instead (check the inline comments I added):
tokens {
IS_NOT; // added
NOT_IN; // added
QUERY;
INDEXES;
}
query
: expr EOF -> ^(QUERY expr)
;
expr
: logical_expr
;
logical_expr
: equality_expr (logical_op^ equality_expr)*
;
equality_expr
: ID equality_op atom -> ^(equality_op ID atom) // changed equality_op+ to equality_op
| '(' expr ')' -> ^('(' expr)
;
atom
: ID
| id_list
| Int
| Number
| String
| '*'
;
id_list
: '(' ID (',' ID)+ ')' -> ID+
| '(' Number (',' Number)* ')' -> Number+
| '(' String (',' String)* ')' -> String+
;
equality_op
: IS NOT -> IS_NOT // added
| NOT IN -> NOT_IN // added
| IN
| IS
| NOT
;
logical_op
: AND
| OR
;
IS : 'IS' | 'is'; // added
NOT : 'NOT' | 'not'; // added
IN : 'IN' | 'in'; // added
AND : 'AND' | 'and'; // added
OR : 'OR' | 'or'; // added
Number
: Int ('.' Digit*)?
;
ID
: ('a'..'z' | 'A'..'Z' | '_' | '.' | '-' | '*' | '/' | ':' | Digit)+
;
String
@after {
setText(getText().substring(1, getText().length()-1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"' | '\\') | '\\' ('\\' | '"'))* '"'
| '\'' (~('\'' | '\\') | '\\' ('\\' | '\''))* '\''
;
Comment
: '//' ~('\r' | '\n')* {skip();}
| '/*' .* '*/' {skip();}
;
Space
: (' ' | '\t' | '\r' | '\n' | '\u000C') {skip();}
;
fragment Int
: '1'..'9' Digit*
| '0'
;
fragment Digit
: '0'..'9'
;
indexes
: ('[' expr ']')+ -> ^(INDEXES expr+)
;
which produces the following AST:
Also, lexer rules should always match at least 1 character (I've mentioned this before to you). Your lexer rule ID possibly matched 0 chars.
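The remark about ID can be checked with ordinary regular expressions. The sketch below uses java.util.regex patterns as stand-ins for the two versions of the ID rule (the ( ... )* form from the question and the ( ... )+ form from the corrected grammar); EmptyMatchDemo and prefixLength are names made up for this example. A rule that can match zero characters "succeeds" at any position without consuming input, which is why it is a hazard in a lexer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex stand-ins for the ID lexer rule, illustration only.
class EmptyMatchDemo {
    static final Pattern ID_STAR = Pattern.compile("[A-Za-z0-9_.\\-*/:]*");
    static final Pattern ID_PLUS = Pattern.compile("[A-Za-z0-9_.\\-*/:]+");

    /** Length of the match at the start of the input, or -1 if there is none. */
    static int prefixLength(Pattern rule, String input) {
        Matcher m = rule.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(prefixLength(ID_STAR, "(14")); // prints 0: an empty "token"
        System.out.println(prefixLength(ID_PLUS, "(14")); // prints -1: no match
        System.out.println(prefixLength(ID_PLUS, "age")); // prints 3
    }
}
```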
The problem is that equality_op+ will only have the value of the first match.
I see different workarounds: create specific rules if it is just for "is not" or "not in", create subrules, or collect the matches in a list label, as I do here:
equality_expr
: ID (full_op+=equality_op) + atom -> ^(full_op ID atom)
| '(' expr ')' -> ^('(' expr)
;
The following problem is different, but it gave me the idea:
Antlr AST generating (possible) madness
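The idea behind collecting every operator word and using the joined text as the node label can be sketched in plain Java. Everything here is hypothetical (JoinOperatorDemo, toTree, the textual output format), and it only handles the simple shape ID op+ atom with a single atom; it is not what ANTLR 3 generates for the rewrite rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of joining consecutive operator words into one label.
class JoinOperatorDemo {
    private static final Set<String> OPS = new HashSet<>(Arrays.asList("IN", "IS", "NOT"));

    /** Expects tokens of the shape: ID op+ atom. Returns "LABEL(ID, atom)". */
    static String toTree(List<String> tokens) {
        String id = tokens.get(0);
        List<String> ops = new ArrayList<>();
        int i = 1;
        // Collect every consecutive operator word, not just the first one.
        while (i < tokens.size() && OPS.contains(tokens.get(i).toUpperCase(Locale.ROOT))) {
            ops.add(tokens.get(i).toUpperCase(Locale.ROOT));
            i++;
        }
        if (ops.isEmpty() || i != tokens.size() - 1) {
            throw new IllegalArgumentException("expected: ID op+ atom");
        }
        return String.join(" ", ops) + "(" + id + ", " + tokens.get(i) + ")";
    }

    public static void main(String[] args) {
        // prints IS NOT(name, empty)
        System.out.println(toTree(Arrays.asList("name", "IS", "NOT", "empty")));
    }
}
```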

Understanding the context data structure in Antlr4

I'm trying to write a code translator in Java with the help of Antlr4 and had great success with the grammar part so far. However I'm now banging my head against a wall wrapping my mind around the parse tree data structure that I need to work on after my input has been parsed.
I'm trying to use the visitor template to go over my parse tree. I'll show you an example to illustrate the points of my confusion.
My grammar:
grammar pqlc;
// Lexer
// Keywords
EXISTS: 'exists';
REDUCE: 'reduce';
QUERY: 'query';
INT: 'int';
DOUBLE: 'double';
CONST: 'const';
STDVECTOR: 'std::vector';
STDMAP: 'std::map';
STDSET: 'std::set';
C_EXPR: 'c_expr';
INTEGER_LITERAL : (DIGIT)+ ;
fragment DIGIT: '0'..'9';
DOUBLE_LITERAL : DIGIT '.' DIGIT+;
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
DOT : '.';
EQUAL : '==';
LE : '<=';
GE : '>=';
GT : '>';
LT : '<';
ADD : '+';
MUL : '*';
AND : '&&';
COLON : ':';
IDENTIFIER : JavaLetter JavaLetterOrDigit*;
fragment JavaLetter : [a-zA-Z$_]; // these are the "java letters" below 0xFF
fragment JavaLetterOrDigit : [a-zA-Z0-9$_]; // these are the "java letters or digits" below 0xFF
WS
: [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Parser
//start_rule: query;
query :
quant_expr
| qexpr+
| IDENTIFIER // order IDENTIFIER and qexpr+?
| numeral
| c_expr //TODO
;
c_type : INT | DOUBLE | CONST;
bin_op: AND | ADD | MUL | EQUAL | LT | GT | LE| GE;
qexpr:
LPAREN query RPAREN bin_op_query?
// query bin_op query
| IDENTIFIER bin_op_query? // copied from query to resolve left recursion problem
| numeral bin_op_query? // ^
| quant_expr bin_op_query? // ^
|c_expr bin_op_query?
// query.find(query)
| IDENTIFIER find_query? // copied from query to resolve left recursion problem
| numeral find_query? // ^
| quant_expr find_query?
|c_expr find_query?
// query[query]
| IDENTIFIER array_query? // copied from query to resolve left recursion problem
| numeral array_query? // ^
| quant_expr array_query?
|c_expr array_query?
// | qexpr bin_op_query // bad, resolved by quexpr+ in query
;
bin_op_query: bin_op query bin_op_query?; // resolve left recursion of query bin_op query
find_query: '.''find' LPAREN query RPAREN;
array_query: LBRACK query RBRACK;
quant_expr:
quant id ':' query
| QUERY LPAREN match RPAREN ':' query
| REDUCE LPAREN IDENTIFIER RPAREN id ':' query
;
match:
STDVECTOR LBRACK id RBRACK EQUAL cm
| STDMAP '.''find' LPAREN cm RPAREN EQUAL cm
| STDSET '.''find' LPAREN cm RPAREN
;
cm:
IDENTIFIER
| numeral
| c_expr //TODO
;
quant :
EXISTS;
id :
c_type IDENTIFIER
| IDENTIFIER // Per page 2, but not per the overview. According to the overview id -> , but then rule 1 would be without +
;
numeral :
INTEGER_LITERAL
| DOUBLE_LITERAL
;
c_expr:
C_EXPR
;
Now let's parse the following string:
double x: x >= c_expr
Visually I'll get this tree:
Let's say my visitor is in the visitQexpr(@NotNull pqlcParser.QexprContext ctx) routine when it hits the branch Qexpr(x bin_op_query).
My question is: how can I tell that the left child ("x" in the tree) is a terminal node, or more specifically an IDENTIFIER? There are no visit methods for terminal nodes, since they aren't rules.
ctx.getChild(0) has no RuleIndex. I guess I could use that to check whether I'm at a terminal or not, but that still wouldn't tell me whether it is an IDENTIFIER or another kind of terminal token. I need to be able to tell the difference somehow.
I had more questions but in the time it took me to write the explanation I forgot them :<
Thanks in advance.
You can add labels to tokens and access them/check if they exist in the surrounding context:
id :
c_type labelA = IDENTIFIER
| labelB = IDENTIFIER
;
You could also do this to create different visits:
id :
c_type IDENTIFIER #idType1 //choose more appropriate names!
| IDENTIFIER #idType2
;
This will create different visitors for the two alternatives and I suppose (i.e. have not verified) that the visitor for id will not be called.
I prefer the following approach though:
id :
typeDef
| otherId
;
typeDef: c_type IDENTIFIER;
otherId : IDENTIFIER ;
This is a more heavily typed system. But you can very specifically visit nodes. Some rules of thumb I use:
Use | only when all alternatives are parser rules.
Wrap each Token in a parser rule (like otherId) to give them "more meaning".
It's ok to mix parser rules and tokens, if the tokens are not really important (like ;) and therefore not needed in the parse tree.
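For the check asked about in the question itself, the ANTLR 4 runtime lets you test child instanceof TerminalNode and then compare ((TerminalNode) child).getSymbol().getType() with pqlcParser.IDENTIFIER; the generated context also has an IDENTIFIER() accessor that returns null when that token is absent. The sketch below reproduces the same logic with stand-in classes so it runs without the ANTLR jar; Node, Terminal, RuleNode and the token type constants are hypothetical and only imitate the shape of the real parse tree.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in tree classes (not the ANTLR runtime), illustration only.
class TerminalCheckDemo {
    static final int IDENTIFIER = 1;       // hypothetical token type numbers
    static final int INTEGER_LITERAL = 2;

    interface Node {}

    static final class Terminal implements Node {
        final int tokenType;
        final String text;
        Terminal(int tokenType, String text) {
            this.tokenType = tokenType;
            this.text = text;
        }
    }

    static final class RuleNode implements Node {
        final String ruleName;
        final List<Node> children = new ArrayList<>();
        RuleNode(String ruleName) { this.ruleName = ruleName; }
    }

    /** True only for a terminal whose token type is IDENTIFIER. */
    static boolean isIdentifier(Node node) {
        return node instanceof Terminal && ((Terminal) node).tokenType == IDENTIFIER;
    }

    public static void main(String[] args) {
        RuleNode qexpr = new RuleNode("qexpr");
        qexpr.children.add(new Terminal(IDENTIFIER, "x"));
        qexpr.children.add(new RuleNode("bin_op_query"));
        System.out.println(isIdentifier(qexpr.children.get(0))); // prints true
        System.out.println(isIdentifier(qexpr.children.get(1))); // prints false
    }
}
```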

ANTLR replace tokens in a recursive manner

I have the following grammar:
rule: q=QualifiedName {System.out.println($q.text);};
QualifiedName
:
i=Identifier { $i.setText($i.text + "_");}
('[' (QualifiedName+ | Integer)? ']')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
fragment
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
and the code from Java:
ANTLRStringStream stream = new ANTLRStringStream("array1[array2[array3[index]]]");
TestLexer lexer = new TestLexer(stream);
CommonTokenStream tokens = new TokenRewriteStream(lexer);
TestParser parser = new TestParser(tokens);
try {
parser.rule();
} catch (RecognitionException e) {
e.printStackTrace();
}
For the input array1[array2[array3[index]]], I want to modify each identifier. I was expecting to see the output array1_[array2_[array3_[index_]]], but the output was the same as the input.
So the question is: why doesn't the setText() method work here?
EDIT:
I modified Bart's answer in the following way:
rule: q=qualifiedName {System.out.println($q.modified);};
qualifiedName returns [String modified]
:
Identifier
('[' (qualifiedName+ | Integer)? ']')*
{
$modified = $text + "_";
}
;
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
I want to modify each token matched by the rule qualifiedName. I tried the code above, and for the input array1[array2[array3[index]]] I was expecting to see the output array1[array2[array3[index_]_]_]_, but instead only the last token was modified: array1[array2[array3[index]]]_.
How can I solve this?
You can only use setText(...) once a token is created. You're recursively calling this token and setting some other text, which won't work. You'll need to create a parser rule out of QualifiedName instead of a lexer rule, and remove the fragment before Identifier.
rule: q=qualifiedName {System.out.println($q.text);};
qualifiedName
:
i=Identifier { $i.setText($i.text + "_");}
('[' (qualifiedName+ | Integer)? ']')*
;
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
Now, it will print: array1_[array2_[array3_[index_]]] on the console.
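The reason the parser-rule version works is that each nested qualifiedName is entered separately and gets its own Identifier token to modify. Here is a hand-written Java sketch of that recursion, without ANTLR; QualifiedNameRewriteDemo and its methods are invented for the example and follow the shape of the qualifiedName rule above.

```java
// Hand-written recursive descent mirroring the qualifiedName parser rule.
class QualifiedNameRewriteDemo {
    private final String src;
    private final StringBuilder out = new StringBuilder();
    private int pos = 0;

    private QualifiedNameRewriteDemo(String src) { this.src = src; }

    /** Appends "_" to every identifier, leaving brackets and integers as-is. */
    static String rewrite(String input) {
        QualifiedNameRewriteDemo r = new QualifiedNameRewriteDemo(input);
        r.qualifiedName();
        if (r.pos != input.length()) {
            throw new IllegalArgumentException("unexpected input at " + r.pos);
        }
        return r.out.toString();
    }

    private boolean more() { return pos < src.length(); }

    private static boolean isIdentStart(char c) {
        return c == '_' || c == '$' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // qualifiedName : Identifier ('[' (qualifiedName+ | Integer)? ']')*
    private void qualifiedName() {
        identifier();
        while (more() && src.charAt(pos) == '[') {
            out.append(src.charAt(pos++)); // copy '['
            if (more() && Character.isDigit(src.charAt(pos))) {
                while (more() && Character.isDigit(src.charAt(pos))) {
                    out.append(src.charAt(pos++)); // copy the Integer unchanged
                }
            } else {
                while (more() && isIdentStart(src.charAt(pos))) {
                    qualifiedName(); // each nested name is rewritten on its own
                }
            }
            if (!more() || src.charAt(pos) != ']') {
                throw new IllegalArgumentException("']' expected at " + pos);
            }
            out.append(src.charAt(pos++)); // copy ']'
        }
    }

    private void identifier() {
        if (!more() || !isIdentStart(src.charAt(pos))) {
            throw new IllegalArgumentException("identifier expected at " + pos);
        }
        int start = pos;
        while (more() && (isIdentStart(src.charAt(pos)) || Character.isDigit(src.charAt(pos)))) {
            pos++;
        }
        out.append(src, start, pos).append('_');
    }

    public static void main(String[] args) {
        // prints array1_[array2_[array3_[index_]]]
        System.out.println(rewrite("array1[array2[array3[index]]]"));
    }
}
```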
EDIT
I have no idea why you'd want to do that, but it seems you're simply trying to rewrite ] into ]_, which can be done in the same way as I showed above:
qualifiedName
:
Identifier
('[' (qualifiedName+ | Integer)? t=']' {$t.setText("]_");} )*
;

lexer that takes "not" but not "not like"

I need a small trick to get my parser completely working.
I use ANTLR to parse boolean queries.
A query is composed of elements, linked together by ands, ors and nots.
So I can have something like:
"(P or not Q or R) or (( not A and B) or C)"
The thing is, an element can be long, and is generally of the form:
a an_operator b
For example:
"New-York matches NY"
The catch: one of the an_operator values is "not like".
So I would like to modify my lexer so that the not checks that no like follows it, to avoid parsing elements that contain the "not like" operator as negations.
My current grammar is here :
// save it in a file called Logic.g
grammar Logic;
options {
output=AST;
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: orexp
;
orexp
: andexp ('or'^ andexp)* // make `or` the root
;
andexp
: notexp ('and'^ notexp)* // make `and` the root
;
notexp
: 'not'^ atom // make `not` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
Any help would be appreciated.
Thanks !
Here's a possible solution:
grammar Logic;
options {
output=AST;
}
tokens {
NOT_LIKE;
}
parse
: expression EOF!
;
expression
: orexp
;
orexp
: andexp (Or^ andexp)*
;
andexp
: fuzzyexp (And^ fuzzyexp)*
;
fuzzyexp
: (notexp -> notexp) ( Matches e=notexp -> ^(Matches $fuzzyexp $e)
| Not Like e=notexp -> ^(NOT_LIKE $fuzzyexp $e)
| Like e=notexp -> ^(Like $fuzzyexp $e)
)?
;
notexp
: Not^ atom
| atom
;
atom
: ID
| '('! expression ')'!
;
And : 'and';
Or : 'or';
Not : 'not';
Like : 'like';
Matches : 'matches';
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
which will parse the input "A not like B or C like D and (E or not F) and G matches H" into the following AST:
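The distinction the grammar draws (a Not directly followed by Like is the binary operator, any other Not is the prefix negation) comes down to one token of lookahead. A plain-Java sketch of that lookahead over a list of words is below; NotLikeDemo and tokenize are hypothetical names, and this is a token-list pass written for illustration, not the ANTLR 3 grammar itself. Parentheses are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// One-token lookahead that merges "not" + "like" into a single NOT_LIKE token.
class NotLikeDemo {
    static List<String> tokenize(String input) {
        String[] words = input.trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            boolean likeFollows = i + 1 < words.length && words[i + 1].equals("like");
            if (words[i].equals("not") && likeFollows) {
                tokens.add("NOT_LIKE");
                i++; // skip the "like" that was just consumed
            } else {
                tokens.add(words[i]);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [A, NOT_LIKE, B, or, not, C]
        System.out.println(tokenize("A not like B or not C"));
    }
}
```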
