lexer that takes "not" but not "not like" - java

I need a small trick to get my parser completely working.
I use antlr to parse boolean queries.
a query is composed of elements, linked together by ands, ors and nots.
So I can have something like :
"(P or not Q or R) or (( not A and B) or C)"
Thing is, an element can be long, and is generally in the form :
a an_operator b
for example :
"New-York matches NY"
Trick, one of the an_operator is "not like"
So I would like to modify my lexer so that the not checks that there is no like after it, to avoid parsing elements containing "not like" operators.
My current grammar is here :
// save it in a file called Logic.g
grammar Logic;
options {
output=AST;
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: orexp
;
orexp
: andexp ('or'^ andexp)* // make `or` the root
;
andexp
: notexp ('and'^ notexp)* // make `and` the root
;
notexp
: 'not'^ atom // make `not` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` andexp `)`
;
// lexer/terminal rules start with an upper case letter
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
Any help would be appreciated.
Thanks !

Here's a possible solution:
grammar Logic;
options {
output=AST;
}
tokens {
NOT_LIKE;
}
parse
: expression EOF!
;
expression
: orexp
;
orexp
: andexp (Or^ andexp)*
;
andexp
: fuzzyexp (And^ fuzzyexp)*
;
fuzzyexp
: (notexp -> notexp) ( Matches e=notexp -> ^(Matches $fuzzyexp $e)
| Not Like e=notexp -> ^(NOT_LIKE $fuzzyexp $e)
| Like e=notexp -> ^(Like $fuzzyexp $e)
)?
;
notexp
: Not^ atom
| atom
;
atom
: ID
| '('! expression ')'!
;
And : 'and';
Or : 'or';
Not : 'not';
Like : 'like';
Matches : 'matches';
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
which will parse the input "A not like B or C like D and (E or not F) and G matches H" into the following AST:

Related

ANTLR4 Grammar only matching first part of parser rule

I'm using ANTLR 4 to try and parse task definitions. The task definitions look a little like the following:
task = { priority = 10; };
My grammar file then looks like the following:
grammar TaskGrammar;
/* Parser rules */
task : 'task' ASSIGNMENT_OP block EOF;
logical_entity : (TRUE | FALSE) # LogicalConst
| IDENTIFIER # LogicalVariable
;
numeric_entity : DECIMAL # NumericConst
| IDENTIFIER # NumericVariable
;
block : LBRACE (statement)* RBRACE SEMICOLON;
assignment : IDENTIFIER ASSIGNMENT_OP DECIMAL SEMICOLON
| IDENTIFIER ASSIGNMENT_OP block SEMICOLON
| IDENTIFIER ASSIGNMENT_OP QUOTED_STRING SEMICOLON
| IDENTIFIER ASSIGNMENT_OP CONSTANT SEMICOLON;
functionCall : IDENTIFIER LPAREN (parameter)*? RPAREN SEMICOLON;
parameter : DECIMAL
| QUOTED_STRING;
statement : assignment
| functionCall;
/* Lexxer rules */
IF : 'if' ;
THEN : 'then';
AND : 'and' ;
OR : 'or' ;
TRUE : 'true' ;
FALSE : 'false' ;
MULT : '*' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '==' ;
ASSIGNMENT_OP : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
SEMICOLON : ';' ;
// DECIMAL, IDENTIFIER, COMMENTS, WS are set using regular expressions
DECIMAL : '-'?[0-9]+('.'[0-9]+)? ;
IDENTIFIER : [a-zA-Z_][a-zA-Z_0-9]* ;
Value: STR_EXT | QUOTED_STRING | SINGLE_QUOTED
;
STR_EXT
:
[a-zA-Z0-9_/\.,\-:=~+!?$&^*\[\]#|]+;
Comment
:
'#' ~[\r\n]*;
CONSTANT : StringCharacters;
QUOTED_STRING
:
'"' StringCharacters? '"'
;
fragment
StringCharacters
: (~["\\] | EscapeSequence)+
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]?
;
SINGLE_QUOTED
:
'\'' ~['\\]* '\'';
// COMMENT and WS are stripped from the output token stream by sending
// to a different channel 'skip'
COMMENT : '//' .+? ('\n'|EOF) -> skip ;
WS : [ \r\t\u000C\n]+ -> skip ;
This grammar compiles fine in ANTLR, but when it comes to trying to use the parser, I get the following error:
line 1:0 mismatched input 'task = { priority = 10; return = AND; };' expecting 'task'
org.antlr.v4.runtime.InputMismatchException
It looks like the parser isn't recognising the block part of the definition, but I can't quite see why. The block parse rule definition should match as far as I can tell. I would expect to have a TaskContext, with a child BlockContext containing a single AssignmentContext. I get the TaskContext, but it has the above exception.
Am I missing something here? This is my first attempt at using Antler, so may be getting confused between Lexxer and Parser rules...
Your STR_EXT consumes the entire input. That rule has to go: ANTLR's lexer will always try to match as much characters as possible.
I also see that CONSTANT might consume that entire input. It has to go to, or at least be changed to consume less chars.

Understanding the context data structure in Antlr4

I'm trying to write a code translator in Java with the help of Antlr4 and had great success with the grammar part so far. However I'm now banging my head against a wall wrapping my mind around the parse tree data structure that I need to work on after my input has been parsed.
I'm trying to use the visitor template to go over my parse tree. I'll show you an example to illustrate the points of my confusion.
My grammar:
grammar pqlc;
// Lexer
//Schlüsselwörter
EXISTS: 'exists';
REDUCE: 'reduce';
QUERY: 'query';
INT: 'int';
DOUBLE: 'double';
CONST: 'const';
STDVECTOR: 'std::vector';
STDMAP: 'std::map';
STDSET: 'std::set';
C_EXPR: 'c_expr';
INTEGER_LITERAL : (DIGIT)+ ;
fragment DIGIT: '0'..'9';
DOUBLE_LITERAL : DIGIT '.' DIGIT+;
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
DOT : '.';
EQUAL : '==';
LE : '<=';
GE : '>=';
GT : '>';
LT : '<';
ADD : '+';
MUL : '*';
AND : '&&';
COLON : ':';
IDENTIFIER : JavaLetter JavaLetterOrDigit*;
fragment JavaLetter : [a-zA-Z$_]; // these are the "java letters" below 0xFF
fragment JavaLetterOrDigit : [a-zA-Z0-9$_]; // these are the "java letters or digits" below 0xFF
WS
: [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Parser
//start_rule: query;
query :
quant_expr
| qexpr+
| IDENTIFIER // order IDENTIFIER and qexpr+?
| numeral
| c_expr //TODO
;
c_type : INT | DOUBLE | CONST;
bin_op: AND | ADD | MUL | EQUAL | LT | GT | LE| GE;
qexpr:
LPAREN query RPAREN bin_op_query?
// query bin_op query
| IDENTIFIER bin_op_query? // copied from query to resolve left recursion problem
| numeral bin_op_query? // ^
| quant_expr bin_op_query? // ^
|c_expr bin_op_query?
// query.find(query)
| IDENTIFIER find_query? // copied from query to resolve left recursion problem
| numeral find_query? // ^
| quant_expr find_query?
|c_expr find_query?
// query[query]
| IDENTIFIER array_query? // copied from query to resolve left recursion problem
| numeral array_query? // ^
| quant_expr array_query?
|c_expr array_query?
// | qexpr bin_op_query // bad, resolved by quexpr+ in query
;
bin_op_query: bin_op query bin_op_query?; // resolve left recursion of query bin_op query
find_query: '.''find' LPAREN query RPAREN;
array_query: LBRACK query RBRACK;
quant_expr:
quant id ':' query
| QUERY LPAREN match RPAREN ':' query
| REDUCE LPAREN IDENTIFIER RPAREN id ':' query
;
match:
STDVECTOR LBRACK id RBRACK EQUAL cm
| STDMAP '.''find' LPAREN cm RPAREN EQUAL cm
| STDSET '.''find' LPAREN cm RPAREN
;
cm:
IDENTIFIER
| numeral
| c_expr //TODO
;
quant :
EXISTS;
id :
c_type IDENTIFIER
| IDENTIFIER // Nach Seite 2 aber nicht der Übersicht. Laut übersicht id -> aber dann wäre Regel 1 ohne +
;
numeral :
INTEGER_LITERAL
| DOUBLE_LITERAL
;
c_expr:
C_EXPR
;
Now let's parse the following string:
double x: x >= c_expr
Visually I'll get this tree:
Let's say my visitor is in the visitQexpr(#NotNull pqlcParser.QexprContext ctx) routine when it hits the branch Qexpr(x bin_op_query).
My question is, how can I tell that the left children ("x" in the tree) is a terminal node, or more specifically an "IDENTIFIER"? There are no visiting rules for Terminal nodes since they aren't rules.
ctx.getChild(0) has no RuleIndex. I guess I could use that to check if I'm in a terminal or not, but that still wouldn't tell me if I was in IDENTIFIER or another kind of terminal token. I need to be able to tell the difference somehow.
I had more questions but in the time it took me to write the explanation I forgot them :<
Thanks in advance.
You can add labels to tokens and access them/check if they exist in the surrounding context:
id :
c_type labelA = IDENTIFIER
| labelB = IDENTIFIER
;
You could also do this to create different visits:
id :
c_type IDENTIFIER #idType1 //choose more appropriate names!
| IDENTIFIER #idType2
;
This will create different visitors for the two alternatives and I suppose (i.e. have not verified) that the visitor for id will not be called.
I prefer the following approach though:
id :
typeDef
| otherId
;
typeDef: c_type IDENTIFIER;
otherId : IDENTIFIER ;
This is a more heavily typed system. But you can very specifically visit nodes. Some rules of thumb I use:
Use | only when all alternatives are parser rules.
Wrap each Token in a parser rule (like otherId) to give them "more meaning".
It's ok to mix parser rules and tokens, if the tokens are not really important (like ;) and therefore not needed in the parse tree.

Remove extra symbol from the repetitive ANTLR rule

Consider the following simple grammar.
grammar test;
options {
language = Java;
output = AST;
}
//imaginary tokens
tokens{
}
parse
: declaration
;
declaration
: forall
;
forall
:'forall' '('rule1')' '[' (( '(' rule2 ')' '|' )* ) ']'
;
rule1
: INT
;
rule2
: ID
;
ID
: ('a'..'z' | 'A'..'Z'|'_')('a'..'z' | 'A'..'Z'|'0'..'9'|'_')*
;
INT
: ('0'..'9')+
;
WHITESPACE
: ('\t' | ' ' | '\r' | '\n' | '\u000C')+ {$channel = HIDDEN;}
;
and here is the input
forall (1) [(first) | (second) | (third) | (fourth) | (fifth) |]
The grammar works fine for the above input but I want to get rid of the extra pipe symbol (2nd last character in the input) from the input.
Any thoughts/ideas?
My antlr syntax is a bit rusty but you should try something like this:
forall
:'forall' '('rule1')' '[' ('(' rule2 ')' ('|' '(' rule2 ')' )* )? ']'
;
That is, instead of (r|)* write (r(|r)*)?. You can see how the latter allows for zero, one or many rules with pipes inbetween.

ANTLR replace tokens in a recursive manner

I have the following grammar:
rule: q=QualifiedName {System.out.println($q.text);};
QualifiedName
:
i=Identifier { $i.setText($i.text + "_");}
('[' (QualifiedName+ | Integer)? ']')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
fragment
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
and the code from Java:
ANTLRStringStream stream = new ANTLRStringStream("array1[array2[array3[index]]]");
TestLexer lexer = new TestLexer(stream);
CommonTokenStream tokens = new TokenRewriteStream(lexer);
TestParser parser = new TestParser(tokens);
try {
parser.rule();
} catch (RecognitionException e) {
e.printStackTrace();
}
For the input: array1[array2[array3[index]]], i want to modify each identifier. I was expecting to see the output: array1_[array_2[array3_[index_]]], but the output was the same as the input.
So the question is: why the setText() method doesn't work here?
EDIT:
I modified Bart's answer in the following way:
rule: q=qualifiedName {System.out.println($q.modified);};
qualifiedName returns [String modified]
:
Identifier
('[' (qualifiedName+ | Integer)? ']')*
{
$modified = $text + "_";
}
;
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
I want to modify each token matched by the rule qualifiedName. I tried the code above, and for the input array1[array2[array3[index]]] i was expecting to see the output array1[array2[array3[index_]_]_]_, but instead only the last token was modified: array1[array2[array3[index]]]_.
How can i solve this?
You can only use setText(...) once a token is created. You're recursively calling this token and setting some other text, which won't work. You'll need to create a parser rule out of QualifiedName instead of a lexer rule, and remove the fragment before Identifier.
rule: q=qualifiedName {System.out.println($q.text);};
qualifiedName
:
i=Identifier { $i.setText($i.text + "_");}
('[' (qualifiedName+ | Integer)? ']')*
;
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
Now, it will print: array1_[array2_[array3_[index_]]] on the console.
EDIT
I have no idea why you'd want to do that, but it seems you're simply trying to rewrite ] into ]_, which can be done in the same way as I showed above:
qualifiedName
:
Identifier
('[' (qualifiedName+ | Integer)? t=']' {$t.setText("]_");} )*
;

The following alternatives can never be reached: 2

I'm trying to create a very simple grammar to learn to use ANTLR but I get the following message:
"The following alternatives can never be reached: 2"
This is my grammar attempt:
grammar Robot;
file : command+;
command : ( delay|type|move|click|rclick) ;
delay : 'wait' number ';';
type : 'type' id ';';
move : 'move' number ',' number ';';
click : 'click' ;
rclick : 'rlick' ;
id : ('a'..'z'|'A'..'Z')+ ;
number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n' ) { skip();} ;
I'm using ANTLRWorks plugin for IDEA:
The .. (range) inside parser rules means something different than inside lexer rules. Inside lexer rules, it means: "from char X to char Y", and inside parser rule it matches "from token M to token N". And since you made number a parser rule, it does not do what you think it does (and are therefor receiving an obscure error message).
The solution: make number a lexer rule instead (so, capitalize it: Number):
grammar Robot;
file : command+;
command : (delay | type | move | Click | RClick) ;
delay : 'wait' Number ';';
type : 'type' Id ';';
move : 'move' Number ',' Number ';';
Click : 'click' ;
RClick : 'rlick' ;
Id : ('a'..'z'|'A'..'Z')+ ;
Number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n') { skip();} ;
And as you can see, I also made id, click and rclick lexer rules instead. If you're not sure what the difference is between parser- and lexer rules, please say so and I'll add an explanation to this answer.

Categories

Resources