ANTLR parsing is not finding correct lexer parts - java

I am a complete newcomer to ANTLR.
I have the following ANTLR grammar:
grammar DrugEntityRecognition;
// Parser Rules
derSentence : ACTION (INT | FRACTION | RANGE) FORM TEXT;
// Lexer Rules
ACTION : 'TAKE' | 'INFUSE' | 'INJECT' | 'INHALE' | 'APPLY' | 'SPRAY' ;
INT : [0-9]+ ;
FRACTION : [1] '/' [1-9] ;
RANGE : INT '-' INT ;
FORM : ('TABLET' | 'TABLETS' | 'CAPSULE' | 'CAPSULES' | 'SYRINGE') ;
TEXT : ('A'..'Z' | WHITESPACE | ',')+ ;
WHITESPACE : ('\t' | ' ' | '\r' | '\n' | '\u000C')+ -> skip ;
And when I try to parse a sentence as follows:
String upperLine = line.toUpperCase();
org.antlr.v4.runtime.CharStream stream = new ANTLRInputStream(upperLine);
DrugEntityRecognitionLexer lexer = new DrugEntityRecognitionLexer(stream);
lexer.removeErrorListeners();
lexer.addErrorListener(ThrowingErrorListener.INSTANCE);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
DrugEntityRecognitionParser parser = new DrugEntityRecognitionParser(tokenStream);
try {
DrugEntityRecognitionParser.DerSentenceContext ctx = parser.derSentence();
StringBuilder sb = new StringBuilder();
sb.append("ACTION: ").append(ctx.ACTION());
sb.append(", ");
sb.append("FORM: ").append(ctx.FORM());
sb.append(", ");
sb.append("INT: ").append(ctx.INT());
sb.append(", ");
sb.append("FRACTION: ").append(ctx.FRACTION());
sb.append(", ");
sb.append("RANGE: ").append(ctx.RANGE());
System.out.println(upperLine);
System.out.println(sb.toString());
} catch (ParseCancellationException e) {
//e.printStackTrace();
}
An example of the input to lexer:
take 10 Tablet (25MG) by oral route every week
In this case ACTION node is not getting populated, but take is getting recognized only as a TEXT node, not an ACTION node. 10 is being recognized as an INT node, however.
How can I modify this grammar to work correctly, where ACTION node is populated correctly (as well as FORM, which is not being populated either)?

There are several problems in your grammar:
Your TEXT rule only matches uppercase letters. Same for ACTION.
You shouldn't mix punctuation and text in a single text rule (here the comma), otherwise you cannot freely allow whitespaces between tokens.
You don't match parentheses at all, hence (25MG) is not valid input and the parser returns in an error state.
You did not check for any syntax errors, to learn what went wrong during recognition.
Also, when in doubt, always print your token sequence from the token source to see if the input has actually been tokenized as you expect. Start there to fix your grammar before you go to the parser.
About case sensitivity: typically (if your language is case-insensitive) you have rules like these:
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
...
to match a letter in either case and then define your keywords so:
ACTION : T A K E | I N F U S E | I N J E C T | I N H A L E | A P P L Y | S P R A Y;

Related

Making generated parser work in Java for ANTLR 4.8

I've been having trouble getting my generated parser to work in Java for ANTLR 4.8. There are other answers to this question, but it seems that ANTLR has changed things since 4.7 and all the other answers are before this change. My code is:
String formula = "(fm.a < fm.b) | (fm.a = fm.b)";
CharStream input = CharStreams.fromString(formula);
Antlr.LogicGrammerLexer lexer = new Antlr.LogicGrammerLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
Antlr.LogicGrammerParser parser = new Antlr.LogicGrammerParser(tokens);
ParseTree pt = new ParseTree(parser);
It appears to be reading in the formula correctly into the CharStream, but anything I try to do past that just isn't working at all. For example, if I try to print out the parse tree, nothing will be printed. The following line will print out nothing:
System.out.println(lexer._input.getText(new Interval(0, 100)));
Any advice appreciated.
EDIT: added the grammar file:
grammar LogicGrammer;
logicalStmt: BOOL_EXPR | '('logicalStmt' '*LOGIC_SYMBOL' '*logicalStmt')';
BOOL_EXPR: '('IDENTIFIER' '*MATH_SYMBOL' '*IDENTIFIER')';
IDENTIFIER: CHAR+('.'CHAR*)*;
CHAR: 'a'..'z' | 'A'..'Z' | '1'..'9';
LOGIC_SYMBOL: '~' | '|' | '&';
MATH_SYMBOL: '<' | '≤' | '=' | '≥' | '>';
This line:
ParseTree pt = new ParseTree(parser);
is incorrect. You need to call the start rule method on your parser object to get your parse tree
Antlr.LogicGrammerParser parser = new Antlr.LogicGrammerParser(tokens);
ParseTree pt = parser.logicalStmt();
So far as printing out your input, generally fields starting with an _ (like _input) are not intended for external use. Though I suspect the failure may be that you don't have 100 characters in your input stream, so the Interval is invalid. (I haven't tried it to see the exact failure)
I you include your grammar, one of us could easily attempt to generate and compile and, perhaps, be more specific.
Using your grammar, this works for me:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.Interval;
import org.antlr.v4.runtime.tree.ParseTree;
public class Logic {
public static void main(String... args) {
String formula = "(fm.a < fm.b) | (fm.a = fm.b)";
CharStream input = CharStreams.fromString(formula);
LogicGrammerLexer lexer = new LogicGrammerLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LogicGrammerParser parser = new LogicGrammerParser(tokens);
ParseTree pt = parser.logicalStmt();
System.out.println(pt.toStringTree());
System.out.println(input.getText(new Interval(1, 28)));
}
}
output:
([] (fm.a < fm.b))
fm.a < fm.b) | (fm.a = fm.b)
BTW, a couple of minor suggestions for your grammar:
set up a rule to skip whitespace WS: [ \t\r\n]+ -> skip;
change BOOL_EXPR to a parser rule (since it's made up of a composition of tokens from other lexer rules:
grammar LogicGrammer
;
logicalStmt
: boolExpr
| '(' logicalStmt LOGIC_SYMBOL logicalStmt ')'
;
boolExpr: '(' IDENTIFIER MATH_SYMBOL IDENTIFIER ')';
IDENTIFIER: CHAR+ ('.' CHAR*)*;
CHAR: 'a' ..'z' | 'A' ..'Z' | '1' ..'9';
LOGIC_SYMBOL: '~' | '|' | '&';
MATH_SYMBOL: '<' | '≤' | '=' | '≥' | '>';
WS: [ \t\r\n]+ -> skip;
The BOOL_EXPR shouldn't be a lexer rule. I suggest you do something like this instead:
grammar LogicGrammer;
parse
: logicalStmt EOF
;
logicalStmt
: logicalStmt LOGIC_SYMBOL logicalStmt
| logicalStmt MATH_SYMBOL logicalStmt
| '(' logicalStmt ')'
| IDENTIFIER
;
IDENTIFIER
: CHAR+ ( '.'CHAR+ )*
;
LOGIC_SYMBOL
: [~|&]
;
MATH_SYMBOL
: [<≤=≥>]
;
SPACE
: [ \t\r\n] -> skip
;
fragment CHAR
: [a-zA-Z1-9]
;
which can be tested by running the following code:
String formula = "(fm.a < fm.b) | (fm.a = fm.b)";
LogicGrammerLexer lexer = new LogicGrammerLexer(CharStreams.fromString(formula));
LogicGrammerParser parser = new LogicGrammerParser(new CommonTokenStream(lexer));
ParseTree root = parser.parse();
System.out.println(root.toStringTree(parser));

no viable alternative at input ANTLR4

I'm designing a language that allows you to make predicates on data. Here is my lexer.
lexer grammar Studylexer;
fragment LETTER : [A-Za-z];
fragment DIGIT : [0-9];
fragment TWODIGIT : DIGIT DIGIT;
fragment MONTH: ('0' [1-9] | '1' [0-2]);
fragment DAY: ('0' [1-9] | '1' [1-9] | '2' [1-9] | '3' [0-1]);
TIMESTAMP: TWODIGIT ':' TWODIGIT; // représentation de la timestamp
DATE : TWODIGIT TWODIGIT MONTH DAY; // représentation de la date
ID : LETTER+; // match identifiers
STRING : '"' ( ~ '"' )* '"' ; // match string content
NEWLINE:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace
LIST: ( LISTSTRING | LISTDATE | LISTTIMESTAMP ) ; // list of variabels;
// list of operators
GT: '>';
LT: '<';
GTEQ: '>=';
LTEQ:'<=';
EQ: '=';
IN: 'in';
fragment LISTSTRING: STRING ',' STRING (',' STRING)*; // list of strings
fragment LISTDATE : DATE ',' DATE (',' DATE)*; // list of dates
fragment LISTTIMESTAMP:TIMESTAMP ',' TIMESTAMP (',' TIMESTAMP )*; // list of timestamps
NAMES: 'filename' | 'timestamp' | 'tso' | 'region' | 'processType' | 'businessDate' | 'lastModificationDate'; // name of variables in the where block
KEY: ID '[' NAMES ']' | ID '.' NAMES; // predicat key
and here is a part of my grammar.
expr: KEY op = ('>' | '<') value = ( DATE | TIMESTAMP ) NEWLINE # exprGTORLT
| KEY op = ('>='| '<=') value = ( DATE | TIMESTAMP ) NEWLINE # exprGTEQORLTEQ
| KEY '=' value = ( STRING | DATE | TIMESTAMP ) NEWLINE # exprEQ
| KEY 'in' LIST NEWLINE #exprIn
When I make a predicate for example.
tab [key] in "value1", "value2"
ANTLR generates an error.
no viable alternative at input tab [key] in
What can I do to resolve this problem?
First tab [key] does not produce a KEY token like you want it to for two reasons:
It contains spaces and KEY doesn't allow any spaces. The best way to fix that would be to remove the KEY rule from your lexer and instead turn it into a parser rule (meaning you also need to turn [ and ] into their own tokens). Then the white space in your input would be between tokens and thus successfully skipped.
key is not actually one of the words listed in NAMES.
Then another issue is that in is recognized as an ID token, not an IN token. That's because both ID and IN would produce a match of the same length and in cases like that the rule that's listed first takes precedence. So you should define ID after all of the keywords because otherwise the keywords will never be matched.

Understanding the context data structure in Antlr4

I'm trying to write a code translator in Java with the help of Antlr4 and had great success with the grammar part so far. However I'm now banging my head against a wall wrapping my mind around the parse tree data structure that I need to work on after my input has been parsed.
I'm trying to use the visitor template to go over my parse tree. I'll show you an example to illustrate the points of my confusion.
My grammar:
grammar pqlc;
// Lexer
//Schlüsselwörter
EXISTS: 'exists';
REDUCE: 'reduce';
QUERY: 'query';
INT: 'int';
DOUBLE: 'double';
CONST: 'const';
STDVECTOR: 'std::vector';
STDMAP: 'std::map';
STDSET: 'std::set';
C_EXPR: 'c_expr';
INTEGER_LITERAL : (DIGIT)+ ;
fragment DIGIT: '0'..'9';
DOUBLE_LITERAL : DIGIT '.' DIGIT+;
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
DOT : '.';
EQUAL : '==';
LE : '<=';
GE : '>=';
GT : '>';
LT : '<';
ADD : '+';
MUL : '*';
AND : '&&';
COLON : ':';
IDENTIFIER : JavaLetter JavaLetterOrDigit*;
fragment JavaLetter : [a-zA-Z$_]; // these are the "java letters" below 0xFF
fragment JavaLetterOrDigit : [a-zA-Z0-9$_]; // these are the "java letters or digits" below 0xFF
WS
: [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Parser
//start_rule: query;
query :
quant_expr
| qexpr+
| IDENTIFIER // order IDENTIFIER and qexpr+?
| numeral
| c_expr //TODO
;
c_type : INT | DOUBLE | CONST;
bin_op: AND | ADD | MUL | EQUAL | LT | GT | LE| GE;
qexpr:
LPAREN query RPAREN bin_op_query?
// query bin_op query
| IDENTIFIER bin_op_query? // copied from query to resolve left recursion problem
| numeral bin_op_query? // ^
| quant_expr bin_op_query? // ^
|c_expr bin_op_query?
// query.find(query)
| IDENTIFIER find_query? // copied from query to resolve left recursion problem
| numeral find_query? // ^
| quant_expr find_query?
|c_expr find_query?
// query[query]
| IDENTIFIER array_query? // copied from query to resolve left recursion problem
| numeral array_query? // ^
| quant_expr array_query?
|c_expr array_query?
// | qexpr bin_op_query // bad, resolved by quexpr+ in query
;
bin_op_query: bin_op query bin_op_query?; // resolve left recursion of query bin_op query
find_query: '.''find' LPAREN query RPAREN;
array_query: LBRACK query RBRACK;
quant_expr:
quant id ':' query
| QUERY LPAREN match RPAREN ':' query
| REDUCE LPAREN IDENTIFIER RPAREN id ':' query
;
match:
STDVECTOR LBRACK id RBRACK EQUAL cm
| STDMAP '.''find' LPAREN cm RPAREN EQUAL cm
| STDSET '.''find' LPAREN cm RPAREN
;
cm:
IDENTIFIER
| numeral
| c_expr //TODO
;
quant :
EXISTS;
id :
c_type IDENTIFIER
| IDENTIFIER // Nach Seite 2 aber nicht der Übersicht. Laut übersicht id -> aber dann wäre Regel 1 ohne +
;
numeral :
INTEGER_LITERAL
| DOUBLE_LITERAL
;
c_expr:
C_EXPR
;
Now let's parse the following string:
double x: x >= c_expr
Visually I'll get this tree:
Let's say my visitor is in the visitQexpr(#NotNull pqlcParser.QexprContext ctx) routine when it hits the branch Qexpr(x bin_op_query).
My question is, how can I tell that the left children ("x" in the tree) is a terminal node, or more specifically an "IDENTIFIER"? There are no visiting rules for Terminal nodes since they aren't rules.
ctx.getChild(0) has no RuleIndex. I guess I could use that to check if I'm in a terminal or not, but that still wouldn't tell me if I was in IDENTIFIER or another kind of terminal token. I need to be able to tell the difference somehow.
I had more questions but in the time it took me to write the explanation I forgot them :<
Thanks in advance.
You can add labels to tokens and access them/check if they exist in the surrounding context:
id :
c_type labelA = IDENTIFIER
| labelB = IDENTIFIER
;
You could also do this to create different visits:
id :
c_type IDENTIFIER #idType1 //choose more appropriate names!
| IDENTIFIER #idType2
;
This will create different visitors for the two alternatives and I suppose (i.e. have not verified) that the visitor for id will not be called.
I prefer the following approach though:
id :
typeDef
| otherId
;
typeDef: c_type IDENTIFIER;
otherId : IDENTIFIER ;
This is a more heavily typed system. But you can very specifically visit nodes. Some rules of thumb I use:
Use | only when all alternatives are parser rules.
Wrap each Token in a parser rule (like otherId) to give them "more meaning".
It's ok to mix parser rules and tokens, if the tokens are not really important (like ;) and therefore not needed in the parse tree.

Regular Expressions - tree grammar Antlr Java

I'm trying to write a program in ANTLR (Java) concerning simplifying regular expression. I have already written some code (grammar file contents below)
grammar Regexp_v7;
options{
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens{
DOT;
REPEAT;
RANGE;
NULL;
}
fragment
ZERO
: '0'
;
fragment
DIGIT
: '1'..'9'
;
fragment
EPSILON
: '#'
;
fragment
FI
: '%'
;
ID
: EPSILON
| FI
| 'a'..'z'
| 'A'..'Z'
;
NUMBER
: ZERO
| DIGIT (ZERO | DIGIT)*
;
WHITESPACE
: ('\r' | '\n' | ' ' | '\t' ) + {$channel = HIDDEN;}
;
list
: (reg_exp ';'!)*
;
term
: ID -> ID
| '('! reg_exp ')'!
;
repeat_exp
: term ('{' range_exp '}')+ -> ^(REPEAT term (range_exp)+)
| term -> term
;
range_exp
: NUMBER ',' NUMBER -> ^(RANGE NUMBER NUMBER)
| NUMBER (',') -> ^(RANGE NUMBER NULL)
| ',' NUMBER -> ^(RANGE NULL NUMBER)
| NUMBER -> ^(RANGE NUMBER NUMBER)
;
kleene_exp
: repeat_exp ('*'^)*
;
concat_exp
: kleene_exp (kleene_exp)+ -> ^(DOT kleene_exp (kleene_exp)+)
| kleene_exp -> kleene_exp
;
reg_exp
: concat_exp ('|'^ concat_exp)*
;
My next goal is to write down tree grammar code, which is able to simplify regular expressions (e.g. a|a -> a , etc.). I have done some coding (see text below), but I have troubles with defining rule that treats nodes as subtrees (in order to simplify following kind of expressions e.g.: (a|a)|(a|a) to a, etc.)
tree grammar Regexp_v7Walker;
options{
language = Java;
tokenVocab = Regexp_v7;
ASTLabelType = CommonTree;
output=AST;
backtrack = true;
}
tokens{
NULL;
}
bottomup
: ^('*' ^('*' e=.)) -> ^('*' $e) //a** -> a*
| ^('|' i=.* j=.* {$i.tree.toStringTree() == $j.tree.toStringTree()} )
-> $i // There are 3 errors while this line is up and running:
// 1. CommonTree cannot be resolved,
// 2. i.tree cannot be resolved or is not a field,
// 3. i cannot be resolved.
;
Small driver class:
public class Regexp_Test_v7 {
public static void main(String[] args) throws RecognitionException {
CharStream stream = new ANTLRStringStream("a***;a|a;(ab)****;ab|ab;ab|aa;");
Regexp_v7Lexer lexer = new Regexp_v7Lexer(stream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
Regexp_v7Parser parser = new Regexp_v7Parser(tokenStream);
list_return list = parser.list();
CommonTree t = (CommonTree) list.getTree();
System.out.println("Original tree: " + t.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
Regexp_v7Walker s = new Regexp_v7Walker(nodes);
t = (CommonTree)s.downup(t);
System.out.println("Simplified tree: " + t.toStringTree());
Can anyone help me with solving this case?
Thanks in advance and regards.
Now, I'm no expert, but in your tree grammar:
add filter=true
change the second line of bottomup rule to:
^('|' i=. j=. {i.toStringTree().equals(j.toStringTree()) }? ) -> $i }
If I'm not mistaken by using i=.* you're allowing i to be non-existent and you'll get a NullPointerException on conversion to a String.
Both i and j are of type CommonTree because you've set it up this way: ASTLabelType = CommonTree, so you should call i.toStringTree().
And since it's Java and you're comparing Strings, use equals().
Also to make the expression in curly brackets a predicate, you need a question mark after the closing one.

Match String to Lexer : 'expecting' error

I have a small problem with my grammar.
I am trying to detect whether my string is a date comparison or not.
But the DATE lexer I created seem not to be recognized by antlr, and I get an error that I cannot solve.
Here is my input expression :
"FDT > '2007/10/09 12:00:0.0'"
I simply expect such a tree as output :
COMP_OP
FDT my_DATE
Here is my grammar :
// Aiming at parsing a complete BQS formed Query
grammar Logic;
options {
output=AST;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
// precedence order is (low to high): or, and, not, [comp_op, geo_op, rel_geo_op, like, not like, exists], ()
parse
: expression EOF -> expression
; // ommit the EOF token
expression
: query
;
query
: atom (COMP_OP^ DATE)*
;
//END BIG PART
atom
: ID
| | '(' expression ')' -> expression
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// GENERAL OPERATORS:
DATE : '\'' YEAR '/' MONTH '/' DAY (' ' HOUR ':' MINUTE ':' SECOND)? '\'';
ID : (CHARACTER|DIGIT|','|'.'|'\''|':'|'/')+;
COMP_OP : '=' | '<' | '>' | '<>' | '<=' | '>=';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment YEAR : DIGIT DIGIT DIGIT DIGIT;
fragment MONTH : DIGIT DIGIT;
fragment DAY : DIGIT DIGIT;
fragment HOUR : DIGIT DIGIT;
fragment MINUTE : DIGIT DIGIT;
fragment SECOND : DIGIT DIGIT ('.' (DIGIT)+)?;
fragment DIGIT : '0'..'9' ;
fragment DIGIT_SEQ :(DIGIT)+;
fragment CHARACTER : ('a'..'z' | 'A'..'Z');
As an output error, I get :
line 1:25 mismatched character '.' expecting set null
line 1:27 mismatched input ''' expecting DATE
I also tried to remove the ' ' from my Date (thinking that it was perhaps the problem, as I remove them in the grammar.)
In this case, I get this error :
line 1:6 mismatched input ''2007/10/09' expecting DATE
Can anyone explain me why I get such an error, and how I could solve it ?
This question is q subset of my complete task, where I have to differentiate lots of omparisons (dates, geographic, strings, . . .). I would thus really need to be able to give 'tags' to my atoms.
Thank you very much !
As a complement, here is my current Java code :
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// the expression
String src = "FDT > '2007/10/09 12:00:0.0'";
// create a lexer & parser
//LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
//LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
// invoke the entry point of the parser (the parse() method) and get the AST
CommonTree tree = (CommonTree)parser.parse().getTree();
// print the DOT representation of the AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
I finally got it,
Seems like even though I remove whitespaces, I still have to include them in expressions that contain one.
In addition, there was a small error in second definition, as the second digit was optional.
The grammar is thus slightly modified :
fragment SECOND : DIGIT (DIGIT)? ('.' (DIGIT)+)?;
which gives this output :
I know hope this will still work in my more complete grammar :)
Hope it helps someone.

Categories

Resources