I'm trying to extend an existing grammar using Antlr4. In the .g4 file beside other rules the following is defined:
Digit
: ZeroDigit
| NonZeroDigit
;
NonZeroDigit
: NonZeroOctDigit
| '8'
| '9'
;
NonZeroOctDigit
: '1'
| '2'
| '3'
| '4'
| '5'
| '6'
| '7'
;
OctDigit
: ZeroDigit
| NonZeroOctDigit
;
ZeroDigit
: '0' ;
SP
: ( WHITESPACE )+ ;
so on top of that (not only as a figure of speech) I added the following rules which are supposed to make use of these existing rules:
ttQL_Query
: ttQL_TimeClause SP;
ttQL_TimeClause
: FROM SP? ttQL_DateTime SP? TO SP? ttQL_DateTime;
ttQL_DateTime
: ttQL_Date ('T' ttQL_Time ttQL_Timezone)?;
ttQL_Timezone: 'Z' | ( '+' | '-' ) ttQL_Hour ':' ttQL_Minute;
ttQL_Date: ttQL_Year '-' ttQL_Month '-' ttQL_Day;
ttQL_Time: ttQL_Hour (':' ttQL_Minute (':' ttQL_Second (ttQL_Millisecond)?)?)?;
ttQL_Year: Digit Digit Digit Digit;
ttQL_Month: Digit Digit;
ttQL_Day: Digit Digit;
ttQL_Hour: Digit Digit ;
ttQL_Minute: Digit Digit ;
ttQL_Second: Digit Digit ;
ttQL_Millisecond: '.' ( Digit )+;
FROM : ( 'F' | 'f' ) ( 'R' | 'r' ) ( 'O' | 'o' ) ( 'M' | 'm' ) ;
TO : ( 'T' | 't' ) ( 'O' | 'o' ) ;
This is supposed to be an extension of the open cypher query language (grammar can be found here: http://opencypher.org/resources/) but i dont get it to work. Its supposed to prefix a cypher query. The rule for that is simple:
ttQL
: SP? ttQL_Query SP? oC_Cypher ;
So all the other existing rules as well as the one i stated in the beginning are used in oC_Cypher. I put all my rules on top of the antlr file and when trying to parse a query like the following:
FROM 2123-12-13T12:34:39Z TO 2123-12-13T14:34:39.2222Z MATCH (a)-[x]->(b) WHERE a.ping > 22" RETURN a.ping, b"
I get the following error messages by my parser:
line 1:5 mismatched input '2123' expecting Digit
line 1:10 mismatched input '12' expecting Digit
line 1:13 mismatched input '13' expecting Digit
line 1:29 mismatched input '2123' expecting Digit
line 1:34 mismatched input '12' expecting Digit
line 1:37 mismatched input '13' expecting Digit
The weird thing is, when i put my part of the grammar in a new .g4 file and create a parser only for the prefix part FROM 2123-12-13T12:34:39Z TO 2123-12-13T14:34:39.2222Z then everything works like a charm. I'm kind of lost here. I am using vscode, java, maven and the ANTLR4 Plugin with ANTLR version 4.9.2, mvn-compiler-plugin 3.10.1, java version 11
what could be the catch here ?
With the help of the answers of kaby I could solve the problem for me. I don't know if this is the correct of handling this issue but for what I want to achieve it is sufficient. So please be careful with this solution if you have a similar problem and try to solve it.
As kaby noted the lexer seaches for the Token it can concatenate the most characters with, so i just made lexer rules out of the date and time so the numbers wouldnt get recognized as Integers. Here is my working solution:
ttQL_Query
: ttQL_TimeClause SP?;
ttQL_TimeClause
: FROM SP? DATETIME SP? TO SP? DATETIME;
DATETIME: DATE ('T' TIME TIMEZONE)?;
TIMEZONE: 'Z' | ( '+' | '-' ) Digit Digit ':' Digit Digit;
DATE: Digit Digit Digit Digit '-' Digit Digit '-' Digit Digit;
TIME: Digit Digit (':' Digit Digit (':' Digit Digit ('.' (Digit)+ )?)?)?;
FROM : ( 'F' | 'f' ) ( 'R' | 'r' ) ( 'O' | 'o' ) ( 'M' | 'm' ) ;
TO : ( 'T' | 't' ) ( 'O' | 'o' ) ;
EDIT:
I discovered my solution contains another pitfall which I will add here. In case you are parsing integers or any other sequence of digits where it is possible that two digits are concatenated my TIME rule will be invoked and a TIME token will be created - at least if this rule is above other rules which could fit here. As someone who dealt the first time with lexers and parsers I found that it is most important to be careful about already existing Lexer rules. As kaby mentioned: take care about the Lexer first, print out the tokens of Example input for debugging. In my case a simple solution was to merge the DATE, TIME and TIMEZONE rules to make a more unique rule to not run into compatibility issues with the existing Lexer rules:
DATETIME: (Digit Digit Digit Digit '-' Digit Digit '-' Digit Digit) ('T' (Digit Digit (':' Digit Digit (':' Digit Digit ('.' (Digit)+ )?)?)?) ('Z' | ( '+' | '-' ) Digit Digit ':' Digit Digit))?;
I suggest adding a fragment prefix to all lexer rules with the exception of Digits and SP.
Related
I have a problem with the rule mnemonic_format.
Instead to recognize a simple text like A100 it gives the following error :
mismatched input 'A100' expecting 'A'
The grammar is:
grammar SimpleMathGrammar;
INTEGER : [0-9]+;
FLOAT : [0-9]+ '.' [0-9]+;
ADD : '+';
SUB : '-';
DOT : '.';
AND : 'AND';
BACKSLASH : '\\';
fragment SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
VARIABLENAME
: SINGLELETTER
| (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
HASH : '#';
/* PARSER */
operation
: (INTEGER | FLOAT) ADD (INTEGER | FLOAT)
| (INTEGER | FLOAT) SUB (INTEGER | FLOAT);
operation_with_backslash : BACKSLASH operation BACKSLASH;
mnemonic: HASH VARIABLENAME HASH;
mnemonic_format
// Example: A100
: 'A' INTEGER;
At this point, i know that the token VARIABLENAME should not include the character A (correct me if im wrong)
So what can i do for include a single character (o fixed sequence) in distinct rule? (and which is my error?)
EDIT: I found the origin of the problem (by remove all of the other tokens and rules) in the following token case:
VARIABLENAME: (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
So how can i create a token or a lexer rule that give me the basic for detect some generic text (like a Class name or a Variable name) by also create rules where i must accept a fixed sequence of characters?
Ok,
The trick was the "general scope" of the token VARIABLENAME.
In other terms, the token is too much generic.
In my case the sub-condition VARIABLENAME: SINGLELETTER NUMBER* crash/collide with the condition mnemonic_format: 'A' INTEGER
(Indeed i can create the string A100 with VARIABLENAME or mnemonic_format and this create an ambiguity)
So i "specialize" VARIABLENAME for accept a prefix, for example:
VARIABLENAME
: HASH (SINGLELETTER|UNDERSCORE)(SINGLELETTER|UNDERSCORE|DOLLAR|NUMBER)*
| 'class ' (SINGLELETTER|UNDERSCORE)(SINGLELETTER|UNDERSCORE|DOLLAR|NUMBER)*
...
This should avoid an ambiguity between the token and the rule
I have to make a calculator in Java using AntLR .But when i try to calculate the square root using the command s 4 it shows me :no viable alternative at input 's4'.
I really need your help for this. I try everything and I dont know whats is wrong.
This is my grammar:
grammar Hello;
r : r SEMI r EOF
| r SEMI
| plus_op
| minus_op
| sqrt_op;
// match keyword hello followed by an identifier
ID : [a-z]+ ; // match lower-case identifiers
NUM : [0-9];
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
ADD : '+';
MINUS : '-';
SEMI: ';';
SQRT: 's';
plus_token: NUM | ID;
minus_token: NUM | ID;
sqrt_token: NUM;
plus_op : plus_token ADD plus_token;
minus_op : minus_token MINUS minus_token;
sqrt_op: SQRT sqrt_token;
SQRT will never be matched as ID matches the same input. This is because ANTLR won't try to match the input with multiple rules in the lexer but rather uses the first found rule that can match the current input.
The problem should be solved if SQRT is being defined before ID in the grammar.
I have a string that contains the following date range format variations. I need to find and replace with hour precision using a single java regex pattern. The date range is variable. Can you come up with a regex for me?
String Example
published_date:{05/31/16.23:41:24-?}
published_date:{05/31/16.23:41:24-06/21/16.23:41:24}
Expected Results
published_date:{05/31/16.23:00:00-?}
published_date:{05/31/16.23:00:00-06/21/16.23:00:00}
Description
This regex will find substrings that look like date/time stamps like 05/31/16.23:41:24. It'll capture the date and hour portions of and allow you to replace the minutes and seconds with 00.
([0-9]{2}\/[0-9]{2}\/[0-9]{2}\.[0-9]{2}):[0-9]{2}:[0-9]{2}
Replace With: $1:00:00
Example
Live Demo
https://regex101.com/r/qK8bL7/1
Sample text
published_date:{05/31/16.23:41:24-?}
published_date:{05/31/16.23:41:24-06/21/16.23:41:24}
After Replacement
published_date:{05/31/16.23:00:00-?}
published_date:{05/31/16.23:00:00-06/21/16.23:00:00}
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-9]{2} any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
[0-9]{2} any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
[0-9]{2} any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
[0-9]{2} any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
[0-9]{2} any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
[0-9]{2} any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
Try this. ":[0-9]+:[0-9]+".
public class Regex {
public static void main(String ar[]){
String st = "05/31/16.23:41:24-06/21/16.23:41:24";
st = st.replaceAll(":[0-9]+:[0-9]+", ":00:00");
System.out.println(st);
st = "05/31/16.23:41:24-?";
st = st.replaceAll(":[0-9]+:[0-9]+", ":00:00");
System.out.println(st);
}
}
I keep getting MissingTokenException, NullPointerException, and if I remember correctly NoViableAlterativeException. The logfile / console output from ANTLRWorks is not helpful enough for me.
What I'm after is a rewrite such as the following:
(expression | FLOAT) '(' -> (expression | FLOAT) '*('
Here below is a sample of my grammar that I snatched out to create a test file with.
grammar Test;
expression
: //FLOAT '(' -> (FLOAT '*(')+
| add EOF!
;
term
:
| '(' add ')'
| FLOAT
| IMULT
;
IMULT
: (add ('(' add)*) -> (add ('*' add)*)
;
negation
: '-'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/') unary)*
;
add
: mult (('+' | '-') mult)*
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')*// EXPONENT?
| '.' ('0'..'9')+ //EXPONENT?
| ('0'..'9')+ //EXPONENT
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
I've also tried :
imult
: FLOAT '(' -> FLOAT '*('
;
And this:
IMULT / imult
: expression '(' -> expression '*'
;
As well as countless other versions (hacks) that I have lost count of.
Can anyone help me out with this ?
I've run into this problem before. The basic answer is that ANTLR doesn't allow you to use tokens on the right hand side of a '->' statement that weren't present on the left hand side. However, what you can do is use extra tokens defined specifically for AST's.
Just create a tokens block before the grammar rules as follows:
tokens { ABSTRACTTOKEN; }
You can use them on the right hand side of the grammar statement like this.
imult
: FLOAT '(' -> ^(ABSTRACTTOKEN FLOAT)
;
Hope that helps.
I'd like to parse an UTF8 encoded text file that may contain something like this:
int 1
text " some text with \" and \\ "
int list[-45,54, 435 ,-65]
float list [ 4.0, 5.2,-5.2342e+4]
The numbers in the list are separated by commas. Whitespace is permitted but not required between any number and any symbol like commas and brackets here. Similarly for words and symbols, like in the case of list[
I've done the quoted string reading by forcing Scanner to give me single chars (setting its delimiter to an empty pattern) because I still thought it'll be useful for reading the ints and floats, but I'm not sure anymore.
The Scanner always takes a complete token and then tries to match it. What I need is try to match as much (or as little) as possible, disregarding delimiters.
Basically for this input
int list[-45,54, 435 ,-65]
I'd like to be able to call and get this
s.nextWord() // int
s.nextWord() // list
s.nextSymbol() // [
s.nextInt() // -45
s.nextSymbol() // ,
s.nextInt() // 54
s.nextSymbol() // ,
s.nextInt() // 435
s.nextSymbol() // ,
s.nextInt() // -65
s.nextSymbol() // ]
and so on.
Or, if it couldn't parse doubles and other types itself, at least a method that takes a regex, returns the biggest string that matches it (or an error) and sets the stream position to just after what it matched.
Can the Scanner somehow be used for this? Or is there another approach? I feel this must be quite a common thing to do, but I don't seem to be able to find the right tool for it.
I'm not an ANTLR expert, but this ANTLR grammar is capable to parse your code:
grammar Expressions;
expressions
: expression+ EOF
;
expression
: intExpression
| intListExpression
| floatExpression
| floatListExpression
| textExpression
| textListExpression
;
intExpression : intType INT;
intListExpression : intType listType '[' ( INT (',' INT)* )? ']';
floatExpression : floatType FLOAT;
floatListExpression : floatType listType '[' ( (INT|FLOAT) (',' (INT|FLOAT))* )? ']';
textExpression : textType STRING;
textListExpression : textType listType '[' ( STRING (',' STRING)* )? ']';
intType : 'int';
floatType : 'float';
textType : 'text';
listType : 'list';
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
Of course you will need to improve it, but I think that with this structure is easy to insert code in the parser to do what you want (a kind of token stream). Try it in ANTLRWorks debug to see what happens.
For your input, this is the parse tree:
Edit: I changed it to support empty lists.
Initiate the scanner with the file in the class constructor. then for the nextWord Method, do this,
public static nextWord(){
return(sc.findInLine("\\w+"));
}
You can derive the code for other methods using the above example with the findInLine method of the Scanner class and changing the regex pattern.