I am trying to match the string "match 'match content'" and extract the content inside the single quotes, but the following exception is thrown:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.antlr.runtime.Lexer.emit(Lexer.java:160)
at org.antlr.runtime.Lexer.nextToken(Lexer.java:91)
at org.antlr.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:133)
at org.antlr.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:127)
at org.antlr.runtime.CommonTokenStream.consume(CommonTokenStream.java:70)
at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:106)
I don't know why it throws an OutOfMemoryError, and I cannot find the mistake in my .g file.
My .g file:
grammar Contains;
options {
language=Java;
output=AST;
ASTLabelType=CommonTree;
backtrack=false;
k=3;
}
match
:
KW_MATCH SINGLE_QUOTE ( ~(SINGLE_QUOTE|'\\') | ('\\' .) )+ SINGLE_QUOTE
;
regexp
:
KW_REGEXP SINGLE_QUOTE RegexComponent+ SINGLE_QUOTE
;
range
:
KW_RANGE SINGLE_QUOTE left=(LPAREN | LSQUARE) start=Number COMMA end = Number right=(RPAREN | RSQUARE) SINGLE_QUOTE
;
DOT : '.'; // generated as a part of Number rule
COLON : ':' ;
COMMA : ',' ;
LPAREN : '(' ;
RPAREN : ')' ;
LSQUARE : '[' ;
RSQUARE : ']' ;
LCURLY : '{';
RCURLY : '}';
PLUS : '+';
MINUS : '-';
STAR : '*';
BITWISEOR : '|';
BITWISEXOR : '^';
QUESTION : '?';
DOLLAR : '$';
KW_RANGE : 'RANGE';
KW_REGEXP : 'REGEXP';
KW_MATCH : 'MATCH';
DOUBLE_QUOTE : '\"';
SINGLE_QUOTE : '\'';
fragment
Digit
:
'0'..'9'
;
fragment
Exponent
:
('e' | 'E') ( PLUS|MINUS )? (Digit)+
;
fragment
RegexComponent
: 'a'..'z' | 'A'..'Z' | '0'..'9' | '_'
| PLUS | STAR | QUESTION | MINUS | DOT
| LPAREN | RPAREN | LSQUARE | RSQUARE | LCURLY | RCURLY
| BITWISEXOR | BITWISEOR | DOLLAR | '\u0080'..'\u00FF' | '\u0400'..'\u04FF'
| '\u0600'..'\u06FF' | '\u0900'..'\u09FF' | '\u4E00'..'\u9FFF' | '\u0A00'..'\u0A7F'
;
Number
:
(Digit)+ ( DOT (Digit)* (Exponent)? | Exponent)?
;
WS : (' '|'\r'|'\t'|'\n'|'\u000C')* {$channel=HIDDEN;}
;
You could start by changing:
WS : (' '|'\r'|'\t'|'\n'|'\u000C')* {$channel=HIDDEN;}
;
to:
WS : (' '|'\r'|'\t'|'\n'|'\u000C')+ {$channel=HIDDEN;}
;
Your version can match the empty string, so the lexer may keep emitting zero-length tokens without consuming any input. Such a never-ending token stream would explain the OutOfMemoryError.
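To make the failure mode concrete, here is a minimal sketch in plain Java. It is a toy loop, not ANTLR's actual lexer: the class EmptyMatchDemo, the method countTokens, the maxTokens cap and the simplified "other rules" pattern are all assumptions made for illustration. It assumes that a character no rule matches is skipped as error recovery, and that a whitespace rule which matches nothing still yields a token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy stand-in for a lexer loop; this is NOT how ANTLR is implemented. */
final class EmptyMatchDemo {

    /**
     * Counts the tokens produced for input, stopping at maxTokens so the
     * demo always terminates even when the whitespace rule matches nothing.
     */
    static int countTokens(String input, String wsRule, int maxTokens) {
        // Simplified stand-ins for the grammar's KW_MATCH and SINGLE_QUOTE rules.
        Pattern other = Pattern.compile("MATCH|'");
        Pattern ws = Pattern.compile(wsRule);
        int pos = 0;
        int tokens = 0;
        while (pos < input.length() && tokens < maxTokens) {
            Matcher m = other.matcher(input).region(pos, input.length());
            if (!m.lookingAt()) {
                m = ws.matcher(input).region(pos, input.length());
                if (!m.lookingAt()) {
                    pos++;       // no rule matches: skip one character
                    continue;
                }
            }
            pos = m.end();       // an empty match leaves pos unchanged
            tokens++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "MATCH 'ab'";
        // With '*', the loop gets stuck at 'a' and only the cap stops it.
        System.out.println(countTokens(input, "[ \\t\\r\\n\\f]*", 1000)); // 1000
        // With '+', every token consumes input and the loop ends normally.
        System.out.println(countTokens(input, "[ \\t\\r\\n\\f]+", 1000)); // 4
    }
}
```

Without the cap, the first call would never return, which mirrors a token buffer growing until memory runs out.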
My ANTLR grammar expects a function call, but in the example code for the compiler I built with ANTLR I wrote a print command. Does someone know why this happens and how to fix it? The print command is written RetroBox.show(); and should be recognised by going from blockStatement to statement to localFunctionCall to printCommand.
Here is my ANTLR grammar:
grammar Mars;
// ******************************LEXER BEGIN*****************************************
// Keywords
FUNC: 'func';
ENTRY: 'entry';
VARI: 'vari';
VARF: 'varf';
VARC: 'varc';
VARS: 'vars';
LET: 'let';
INCREMENTS: 'increments';
RETROBOX: 'retrobox';
SHOW: 'show';
// Literals
DECIMAL_LITERAL: ('0' | [1-9] (Digits? | '_'+ Digits)) [lL]?;
FLOAT_LITERAL: (Digits '.' Digits? | '.' Digits) ExponentPart? [fFdD]?
| Digits (ExponentPart [fFdD]? | [fFdD])
;
CHAR_LITERAL: '\'' (~['\\\r\n] | EscapeSequence) '\'';
STRING_LITERAL: '"' (~["\\\r\n] | EscapeSequence)* '"';
// Separators
ORBRACKET: '(';
CRBRACKET: ')';
OEBRACKET: '{';
CEBRACKET: '}';
SEMI: ';';
POINT: '.';
// Operators
ASSIGN: '=';
// Whitespace and comments
WS: [ \t\r\n\u000C]+ -> channel(HIDDEN);
COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN);
// Identifiers
IDENTIFIER: Letter LetterOrDigit*;
// Fragment rules
fragment ExponentPart
: [eE] [+-]? Digits
;
fragment EscapeSequence
: '\\' [btnfr"'\\]
| '\\' ([0-3]? [0-7])? [0-7]
| '\\' 'u'+ HexDigit HexDigit HexDigit HexDigit
;
fragment HexDigits
: HexDigit ((HexDigit | '_')* HexDigit)?
;
fragment HexDigit
: [0-9a-fA-F]
;
fragment Digits
: [0-9] ([0-9_]* [0-9])?
;
fragment LetterOrDigit
: Letter
| [0-9]
;
fragment Letter
: [a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
// *******************************LEXER END****************************************
// *****************************PARSER BEGIN*****************************************
program
: mainfunction #Programm
| /*EMPTY*/ #Garnichts
;
mainfunction
: FUNC VARI ENTRY ORBRACKET CRBRACKET block #NormaleHauptmethode
;
block
: '{' blockStatement '}' #CodeBlock
| /*EMPTY*/ #EmptyCodeBlock
;
blockStatement
: statement* #Befehl
;
statement
: localVariableDeclaration
| localVariableInitialization
| localFunctionImplementation
| localFunctionCall
;
expression
: left=expression op='%'
| left=expression op=('*' | '/') right=expression
| left=expression op=('+' | '-') right=expression
| neg='-' right=expression
| number
| IDENTIFIER
| '(' expression ')'
;
number
: DECIMAL_LITERAL
| FLOAT_LITERAL
;
localFunctionImplementation
: FUNC primitiveType IDENTIFIER ORBRACKET CRBRACKET block #Methodenimplementierung
;
localFunctionCall
: IDENTIFIER ORBRACKET CRBRACKET SEMI #Methodenaufruf
| printCommand #RetroBoxShowCommand
;
printCommand
: RETROBOX POINT SHOW ORBRACKET params=primitiveLiteral CRBRACKET SEMI #PrintCommandWP
;
localVariableDeclaration
: varTypeDek=primitiveType IDENTIFIER SEMI #Variablendeklaration
;
localVariableInitialization
: varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=primitiveLiteral SEMI #VariableninitKonst
| varTypeIni=primitiveType IDENTIFIER ASSIGN varValue=expression SEMI #VariableninitExpr
;
primitiveLiteral
: DECIMAL_LITERAL
| FLOAT_LITERAL
| STRING_LITERAL
| CHAR_LITERAL
;
primitiveType
: VARI
| VARC
| VARF
| VARS
;
// ******************************PARSER END****************************************
Here is my example code:
func vari entry()
{
RetroBox.show("Hallo"); //Should be recognised as print-command
}
And here is the AST printed by ANTLR:
[image: AST from Compiler]
The problem is that your RETROBOX keyword is 'retrobox', but your example code spells it 'RetroBox'. Lexer literals are case-sensitive, so ANTLR tokenizes 'RetroBox' as an IDENTIFIER, and the following '.' is unexpected.
ANTLR should emit an error: "line 3:12 mismatched input '.' expecting '('".
Then it attempts to recover and continue parsing. It tries single-token deletion (just ignoring the '.') and finds that this works, except that the rule it now matches is #Methodenaufruf instead of #RetroBoxShowCommand.
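The case-sensitivity point can be sketched in plain Java. This is only an illustration of exact-case keyword lookup, not ANTLR code: the class KeywordCaseDemo and both methods are hypothetical, and only three of the grammar's keywords are included. Whether to fix the input (write retrobox.show("Hallo");) or to change the lexer literal is a design choice for the grammar author.

```java
import java.util.Locale;
import java.util.Map;

/** Illustrates exact-case keyword matching; a hypothetical helper, not ANTLR. */
final class KeywordCaseDemo {

    // Keyword spellings copied from the grammar's lexer rules.
    private static final Map<String, String> KEYWORDS =
            Map.of("retrobox", "RETROBOX", "show", "SHOW", "func", "FUNC");

    /** Exact-case lookup, like the literal 'retrobox' in the grammar. */
    static String classify(String word) {
        return KEYWORDS.getOrDefault(word, "IDENTIFIER");
    }

    /** Assumed case-insensitive variant: normalise the word before the lookup. */
    static String classifyIgnoringCase(String word) {
        return classify(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(classify("RetroBox"));             // IDENTIFIER
        System.out.println(classify("retrobox"));             // RETROBOX
        System.out.println(classifyIgnoringCase("RetroBox")); // RETROBOX
    }
}
```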
For this grammar: (I'm eliding the rest, will include it below.)
defaultBooleanExpression
: nested += maybeAndExpression (nested += maybeAndExpression)+
;
fuzzyQuery
: text =
( UNQUOTED
| UNSIGNED_NUMBER
| SIGNED_NUMBER
)
TILDE
(similarity = UNSIGNED_NUMBER)?
;
If I input this:
abc~0.5
I expect to get a structure like:
{ fuzzyQuery text=abc similarity=0.5 }
But what I actually get is:
{ defaultBooleanQuery
{ fuzzyQuery text=abc similarity=null }
{ unquoted text=0.5 }
}
The code I'm using to run the parser is as follows. We're applying a performance hack as suggested in the FAQ...
QueryLexer lexer = new QueryLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
QueryParser parser = new QueryParser(tokens);
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
ParseTree expression;
try
{
expression = parser.startExpression();
}
catch (Exception e)
{
parser.reset();
parser.getInterpreter()
.setPredictionMode(PredictionMode.LL);
expression = parser.startExpression();
}
Debugging it:
First round, it goes inside defaultBooleanExpression(), then goes inside fuzzyQuery(), recognises all three tokens as part of fuzzyQuery(), then exits fuzzyQuery(). Then there are no tokens left, so defaultBooleanExpression() fails because there aren't enough clauses, and this causes the parse to fail.
We catch the exception and then retry with LL(*).
Now, it goes inside fuzzyQuery(), but for some reason the prediction fails to see the third token, so it stops at the '~'. It then completes without throwing an exception, but returns the wrong result.
I attempted to use ANTLRWorks 2 and the IDEA plugin to debug the syntax rule by rule, but neither of these tools appears to work at the moment: both refuse to parse the grammar, without printing any kind of error to explain why.
Full grammar follows.
/**
* Grammar specification for our query syntax.
*/
grammar Query;
@header {
package com.nuix.storage.textindex.search.queryparser.antlr;
}
///////////////////////////////////////////////////////////////////////////
// Parser Rules
startExpression
: expression EOF
;
expression
: maybeOrExpression
;
maybeOrExpression
: orExpression
| maybeDefaultBooleanExpression
;
/**
* e.g., a OR b
*/
orExpression
: nested += maybeDefaultBooleanExpression (OR nested += maybeDefaultBooleanExpression)+
;
maybeDefaultBooleanExpression
: defaultBooleanExpression
| maybeAndExpression
;
/**
* e.g., a b
*/
defaultBooleanExpression
: nested += maybeAndExpression (nested += maybeAndExpression)+
;
maybeAndExpression
: andExpression
| maybeProximityExpression
;
/**
* e.g., a AND b
*/
andExpression
: nested += maybeProximityExpression (AND nested += maybeProximityExpression)+
;
maybeProximityExpression
: withinExpression
| notWithinExpression
| precedingExpression
| notPrecedingExpression
| maybeUnaryExpression
;
/**
* e.g., a W/4 b
*/
withinExpression
: left = maybeUnaryExpression op = W_SLASH_N right = maybeUnaryExpression
;
/**
* e.g., a NOT W/4 b
*/
notWithinExpression
: left = maybeUnaryExpression NOT op = W_SLASH_N right = maybeUnaryExpression
;
/**
* e.g., a PRE/4 b
*/
precedingExpression
: left = maybeUnaryExpression op = PRE_SLASH_N right = maybeUnaryExpression
;
/**
* e.g., a NOT PRE/4 b
*/
notPrecedingExpression
: left = maybeUnaryExpression NOT op = PRE_SLASH_N right = maybeUnaryExpression
;
maybeUnaryExpression
: notExpression
| plusExpression
| minusExpression
| maybeBoostedQueryFragment
;
/**
* e.g., NOT a
*/
notExpression
: NOT nested = maybeBoostedQueryFragment
;
/**
* e.g., +a
*/
plusExpression
: PLUS nested = maybeBoostedQueryFragment
;
/**
* e.g., -a
*/
minusExpression
: MINUS nested = maybeBoostedQueryFragment
;
maybeBoostedQueryFragment
: boostedQueryFragment
| maybeFieldedQueryFragment
;
/**
* e.g., a^2.0
*/
boostedQueryFragment
: nested = maybeFieldedQueryFragment CARET boost = UNSIGNED_NUMBER
;
maybeFieldedQueryFragment
: plainFieldedQueryFragment
| subFieldedQueryFragment
| wildcardSubFieldedQueryFragment
| queryFragment
;
/**
* e.g., properties:a
*/
plainFieldedQueryFragment
: fieldName =
( UNQUOTED
| LONE_WILDCARD
| TO
)
COLON
nested = queryFragment
;
/**
* e.g., integer-properties:"File Size":3
*/
subFieldedQueryFragment
: fieldName =
( UNQUOTED
| LONE_WILDCARD
| TO
)
COLON
subFieldName =
( QUOTED
| UNQUOTED
)
COLON
nested = queryFragment
;
/**
* e.g., date-properties:*:20010101
*/
wildcardSubFieldedQueryFragment
: fieldName =
( UNQUOTED
| LONE_WILDCARD
| TO
)
COLON
subFieldName =
( UNQUOTED_WILDCARD
| QUOTED_WILDCARD
| LONE_WILDCARD
)
COLON
nested = queryFragment
;
queryFragment
: fuzzyQuery
| unquotedQuery
| dateOffsetQuery
| unquotedWildcardQuery
| loneWildcardQuery
| slopQuery
| rangeQuery
| groupQuery
| unquotedMacro
| quotedMacro
| geoDistanceQuery
;
/**
* e.g., GEODISTANCE((40N 50E) 60km)
*/
geoDistanceQuery
: GEODISTANCE
LPAREN
LPAREN
latitude =
( UNQUOTED
| UNSIGNED_NUMBER
| SIGNED_NUMBER
)
longitude =
( UNQUOTED
| UNSIGNED_NUMBER
| SIGNED_NUMBER
)
RPAREN
distance =
( UNQUOTED
| UNSIGNED_NUMBER
)
RPAREN
;
/**
* e.g., "some query"~2
*/
slopQuery
: nested = slopCapableQuery
( TILDE slop = UNSIGNED_NUMBER )?
;
slopCapableQuery
: quotedQuery
| quotedWildcardQuery
| exactQuery
| regexQuery
;
/**
* e.g., a
*/
unquotedQuery
: UNQUOTED
| UNSIGNED_NUMBER
| SIGNED_NUMBER
;
/**
* e.g., +2Y
*/
dateOffsetQuery
: DATE_OFFSET
| TODAY
;
/**
* e.g., query~0.8
*/
fuzzyQuery
: text =
( UNQUOTED
| UNSIGNED_NUMBER
| SIGNED_NUMBER
)
TILDE
(similarity = UNSIGNED_NUMBER)?
;
/**
* e.g., a*
*/
unquotedWildcardQuery
: UNQUOTED_WILDCARD
;
/**
* e.g., *
*/
loneWildcardQuery
: LONE_WILDCARD
;
/**
* e.g., "a"
*/
quotedQuery
: QUOTED
;
/**
* e.g., "a*"
*/
quotedWildcardQuery
: QUOTED_WILDCARD
;
/**
* e.g., 'a'
*/
exactQuery
: QUOTED_EXACT
;
/**
* e.g., /a+/
*/
regexQuery
: QUOTED_REGEX
;
/**
* e.g., $a
*/
unquotedMacro
: DOLLAR
name = UNQUOTED
;
/**
* e.g., $"a"
*/
quotedMacro
: DOLLAR
name = QUOTED
;
/**
* e.g., [a TO b}
*/
rangeQuery
: lowerBoundSymbol = ( LBRACE | LBRACKET )
lowerBound = rangeQueryBound
TO?
upperBound = rangeQueryBound
upperBoundSymbol = ( RBRACE | RBRACKET )
;
rangeQueryBound
: unquotedQuery
| dateOffsetQuery
| quotedQuery
| loneWildcardQuery
;
/**
* <p>e.g., (a)</p>
*
* <p>If a ~N style suffix is present then the thing inside can only be an OR query. TODO: Not enforced yet though</p>
*/
groupQuery
: LPAREN nested = expression RPAREN
( TILDE minimumMatches = UNSIGNED_NUMBER )?
;
///////////////////////////////////////////////////////////////////////////
// Lexer Rules
// Most specific rules go first, otherwise the more general ones will blot them out.
AND : ('A'|'a')('N'|'n')('D'|'d') | '&' '&' ;
OR : ('O'|'o')('R'|'r') | '|' '|' ;
NOT : ('N'|'n')('O'|'o')('T'|'t') | '!' ;
TO : ('T'|'t')('O'|'o') ;
UNSIGNED_NUMBER : Digit+ ('.' Digit+)?
| '.' Digit+
;
SIGNED_NUMBER : ( '+' | '-' ) UNSIGNED_NUMBER ;
DATE_OFFSET : ( '+' | '-' ) UNSIGNED_NUMBER ('D'|'d'|'W'|'w'|'M'|'m'|'Y'|'y') ;
GEODISTANCE
: ('G'|'g')('E'|'e')('O'|'o')('D'|'d')('I'|'i')('S'|'s')('T'|'t')('A'|'a')('N'|'n')('C'|'c')('E'|'e')
;
PRE_SLASH_N
: ('P'|'p')('R'|'r')('E'|'e') '/' UNSIGNED_NUMBER
;
W_SLASH_N
: ('W'|'w') '/' UNSIGNED_NUMBER
;
TODAY
: ('T'|'t')('O'|'o')('D'|'d')('A'|'a')('Y'|'y')
;
UNQUOTED
: UnquotedStartChar
UnquotedChar*
;
LONE_WILDCARD
: '*'
;
UNQUOTED_WILDCARD
: ( UnquotedStartChar
UnquotedChar*
)?
( WildcardChar
UnquotedChar*
)+
;
fragment
UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\' | ':'
| '"' | '\u201C' | '\u201D' // DoubleQuote
| '\'' | '\u2018' | '\u2019' // SingleQuote
| '(' | ')' | '[' | ']' | '{' | '}' | '~'
| '&' | '|' | '!' | '^' | '?' | '*' | '/' | '+' | '-' | '$' )
;
fragment
UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\' | ':'
| '"' | '\u201C' | '\u201D' // DoubleQuote
| '\'' | '\u2018' | '\u2019' // SingleQuote
| '(' | ')' | '[' | ']' | '{' | '}' | '~'
| '&' | '|' | '!' | '^' | '?' | '*' )
;
QUOTED
: DoubleQuote
QuotedChar*
DoubleQuote
;
QUOTED_WILDCARD
: DoubleQuote
QuotedChar*
( WildcardChar
QuotedChar*
)+
DoubleQuote
;
fragment
QuotedChar
: EscapeSequence
| ~( '\\'
| '"' | '\u201C' | '\u201D' // DoubleQuote
| '\r' | '\n' | '?' | '*' )
;
fragment
WildcardChar
: ( '?' | '*' )
;
QUOTED_EXACT
: SingleQuote
( EscapeSequence
| ~( '\\' | '\'' | '\r' | '\n' )
)*
SingleQuote
;
QUOTED_REGEX
: '/'
( EscapeSequence
| ~( '\\' | '/' | '\r' | '\n' )
)*
'/'
;
fragment
EscapeSequence
: '\\'
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~( 'u' )
)
;
fragment
Digit
: ('0'..'9')
;
fragment
HexDigit
: ('0'..'9' | 'a'..'f' | 'A'..'F')
;
// Single character fragments (not tokens, but become part of tokens)
// U+2018 LEFT SINGLE QUOTATION MARK
// U+2019 RIGHT SINGLE QUOTATION MARK
fragment SingleQuote : '\'' | '\u2018' | '\u2019';
// U+201C LEFT DOUBLE QUOTATION MARK
// U+201D RIGHT DOUBLE QUOTATION MARK
fragment DoubleQuote : '"' | '\u201C' | '\u201D';
COLON : ':' ;
PLUS : '+' ;
MINUS : '-' ;
TILDE : '~' ;
CARET : '^' ;
DOLLAR : '$' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACKET : '[' ;
RBRACKET : ']' ;
LBRACE : '{' ;
RBRACE : '}' ;
WHITESPACE : ( ' ' | '\r' | '\t' | '\u000C' | '\n' ) -> skip;
I am parsing a SQL-like language and I have problems with strings that start with a number:
SELECT 90userN is parsed to SELECT 90 AS userN
Since I remove the whitespace, it somehow takes the digits as the name and the rest of the string as the alias.
I don't even know where to start.
Grammar:
result_column : '*'
| table_name '.' '*'
| table_name '.' any_name
| expr
;
any_name : keyword
| IDENTIFIER
| STRING_LITERAL
| '(' any_name ')'
;
expr: literal_value;
literal_value :
NUMERIC_LITERAL
| STRING_LITERAL
| DATE_LITERAL
| IDENTIFIER
| NULL
;
IDENTIFIER :
'"' (~'"' | '""')* '"'
| '`' (~'`' | '``')* '`'
| '[' ~']'* ']'
| [a-zA-Z_] [a-zA-Z_0-9]*;
STRING_LITERAL : '\'' ( ~'\'' | '\'\'' )* '\'' ;
NUMERIC_LITERAL :
DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )? ;
DATE_LITERAL: DIGIT DIGIT DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT;
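Identifiers in SQL cannot start with a digit, and the last alternative of your IDENTIFIER rule says exactly that: [a-zA-Z_] [a-zA-Z_0-9]*;
So the lexer splits 90userN into the NUMERIC_LITERAL 90 followed by the IDENTIFIER userN, and the parser then treats the second token as the alias. A name that starts with a digit would have to be quoted so that it matches one of the first three IDENTIFIER alternatives.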
I think you are already using it, but refer to the SQLite4 grammar example
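The token split can be reproduced with a small sketch in plain Java. It is a regex-based stand-in, not the generated ANTLR lexer: the class DigitPrefixDemo and its tokenize method are hypothetical, the number pattern omits the exponent part of NUMERIC_LITERAL, and only the last IDENTIFIER alternative is modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex stand-in for two of the grammar's lexer rules; not the ANTLR lexer. */
final class DigitPrefixDemo {

    // Simplified NUMERIC_LITERAL (no exponent), the unquoted IDENTIFIER
    // alternative, and whitespace, which is dropped.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<NUM>\\d+(?:\\.\\d*)?)|(?<ID>[a-zA-Z_][a-zA-Z_0-9]*)|(?<WS>\\s+)");

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("no token at index " + pos);
            }
            if (m.group("NUM") != null) {
                out.add("NUMERIC_LITERAL(" + m.group() + ")");
            } else if (m.group("ID") != null) {
                out.add("IDENTIFIER(" + m.group() + ")");
            }
            pos = m.end(); // every alternative consumes at least one character
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [NUMERIC_LITERAL(90), IDENTIFIER(userN)]
        System.out.println(tokenize("90userN"));
        // prints [IDENTIFIER(user90)]
        System.out.println(tokenize("user90"));
    }
}
```

Digits are fine anywhere after the first character, which is why user90 stays a single identifier while 90userN does not.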
Using ANTLR 3, I want to parse strings such as:
name IS NOT empty AND age NOT IN (14, 15)
And for these cases, I want to get the following ASTs:
n0 [label="QUERY"];
n1 [label="AND"];
n2 [label="IS NOT"];
n3 [label="name"];
n4 [label="empty"];
n5 [label="NOT IN"];
n6 [label="age"];
n7 [label="14"];
n8 [label="15"];
n0 -> n1 // "QUERY" -> "AND"
n1 -> n2 // "AND" -> "IS NOT"
n2 -> n3 // "IS NOT" -> "name"
n2 -> n4 // "IS NOT" -> "empty"
n1 -> n5 // "AND" -> "NOT IN"
n5 -> n6 // "NOT IN" -> "age"
n5 -> n7 // "NOT IN" -> "14"
n5 -> n8 // "NOT IN" -> "15"
But my n2 and n5 nodes come out as:
n2 [label="IS"];
n5 [label="NOT"];
I.e., only the first word appears. How can I join both tokens into a single node?
My grammar is:
query
: expr EOF -> ^(QUERY expr)
;
expr
: logical_expr
;
logical_expr
: equality_expr (logical_op^ equality_expr)*
;
equality_expr
: ID equality_op+ atom -> ^(equality_op ID atom)
| '(' expr ')' -> ^('(' expr)
;
atom
: ID
| id_list
| Int
| Number
| String
| '*'
;
id_list
: '(' ID (',' ID)+ ')' -> ID+
| '(' Number (',' Number)* ')' -> Number+
| '(' String (',' String)* ')' -> String+
;
equality_op
: 'IN'
| 'IS'
| 'NOT'
| 'in'
| 'is'
| 'not'
;
logical_op
: 'AND'
| 'OR'
| 'and'
| 'or'
;
Number
: Int ('.' Digit*)?
;
ID
: ('a'..'z' | 'A'..'Z' | '_' | '.' | '-' | '*' | '/' | ':' | Digit)*
;
String
@after {
setText(getText().substring(1, getText().length()-1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"' | '\\') | '\\' ('\\' | '"'))* '"'
| '\'' (~('\'' | '\\') | '\\' ('\\' | '\''))* '\''
;
Comment
: '//' ~('\r' | '\n')* {skip();}
| '/*' .* '*/' {skip();}
;
Space
: (' ' | '\t' | '\r' | '\n' | '\u000C') {skip();}
;
fragment Int
: '1'..'9' Digit*
| '0'
;
fragment Digit
: '0'..'9'
;
indexes
: ('[' expr ']')+ -> ^(INDEXES expr+)
;
Do something like this instead (check the inline comments I added):
tokens {
IS_NOT; // added
NOT_IN; // added
QUERY;
INDEXES;
}
query
: expr EOF -> ^(QUERY expr)
;
expr
: logical_expr
;
logical_expr
: equality_expr (logical_op^ equality_expr)*
;
equality_expr
: ID equality_op atom -> ^(equality_op ID atom) // changed equality_op+ to equality_op
| '(' expr ')' -> ^('(' expr)
;
atom
: ID
| id_list
| Int
| Number
| String
| '*'
;
id_list
: '(' ID (',' ID)+ ')' -> ID+
| '(' Number (',' Number)* ')' -> Number+
| '(' String (',' String)* ')' -> String+
;
equality_op
: IS NOT -> IS_NOT // added
| NOT IN -> NOT_IN // added
| IN
| IS
| NOT
;
logical_op
: AND
| OR
;
IS : 'IS' | 'is'; // added
NOT : 'NOT' | 'not'; // added
IN : 'IN' | 'in'; // added
AND : 'AND' | 'and'; // added
OR : 'OR' | 'or'; // added
Number
: Int ('.' Digit*)?
;
ID
: ('a'..'z' | 'A'..'Z' | '_' | '.' | '-' | '*' | '/' | ':' | Digit)+
;
String
@after {
setText(getText().substring(1, getText().length()-1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"' | '\\') | '\\' ('\\' | '"'))* '"'
| '\'' (~('\'' | '\\') | '\\' ('\\' | '\''))* '\''
;
Comment
: '//' ~('\r' | '\n')* {skip();}
| '/*' .* '*/' {skip();}
;
Space
: (' ' | '\t' | '\r' | '\n' | '\u000C') {skip();}
;
fragment Int
: '1'..'9' Digit*
| '0'
;
fragment Digit
: '0'..'9'
;
indexes
: ('[' expr ']')+ -> ^(INDEXES expr+)
;
which produces the following AST:
Also, lexer rules should always match at least one character (I've mentioned this to you before). Your lexer rule ID could match zero characters, which is why I changed its trailing * to +.
The problem is that equality_op+ will only have the value of the first match.
I see several workarounds: create specific rules if it is only needed for IS NOT and NOT IN, create subrules, or collect the matches in a list label as I do here:
equality_expr
: ID (full_op+=equality_op) + atom -> ^(full_op ID atom)
| '(' expr ')' -> ^('(' expr)
;
The following problem is different, but it gave me the idea:
Antlr AST generating (possible) madness
I have the following grammar:
rule: q=QualifiedName {System.out.println($q.text);};
QualifiedName
:
i=Identifier { $i.setText($i.text + "_");}
('[' (QualifiedName+ | Integer)? ']')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
fragment
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
and the code from Java:
ANTLRStringStream stream = new ANTLRStringStream("array1[array2[array3[index]]]");
TestLexer lexer = new TestLexer(stream);
CommonTokenStream tokens = new TokenRewriteStream(lexer);
TestParser parser = new TestParser(tokens);
try {
parser.rule();
} catch (RecognitionException e) {
e.printStackTrace();
}
For the input array1[array2[array3[index]]], I want to modify each identifier. I was expecting to see the output array1_[array2_[array3_[index_]]], but the output was the same as the input.
So the question is: why doesn't the setText() method work here?
EDIT:
I modified Bart's answer in the following way:
rule: q=qualifiedName {System.out.println($q.modified);};
qualifiedName returns [String modified]
:
Identifier
('[' (qualifiedName+ | Integer)? ']')*
{
$modified = $text + "_";
}
;
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
I want to modify each token matched by the rule qualifiedName. I tried the code above, and for the input array1[array2[array3[index]]] I was expecting to see the output array1[array2[array3[index_]_]_]_, but instead only the last token was modified: array1[array2[array3[index]]]_.
How can I solve this?
You can only use setText(...) once a token is created. You're recursively calling this token and setting some other text, which won't work. You'll need to create a parser rule out of QualifiedName instead of a lexer rule, and remove the fragment before Identifier.
rule: q=qualifiedName {System.out.println($q.text);};
qualifiedName
:
i=Identifier { $i.setText($i.text + "_");}
('[' (qualifiedName+ | Integer)? ']')*
;
Identifier
: ( '_'
| '$'
| ('a'..'z' | 'A'..'Z')
)
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$')*
;
Integer
: Digit Digit*
;
fragment
Digit
: '0'..'9'
;
Now, it will print: array1_[array2_[array3_[index_]]] on the console.
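The idea behind this fix, changing the text of tokens that already exist instead of trying to do it while a single token is still being built, can be sketched in plain Java. This is not ANTLR code: TokenRewriteDemo, its Tok class and the regex-based lex method are hypothetical stand-ins for the generated lexer and for ANTLR's token objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of rewriting token text after lexing; a stand-in, not ANTLR. */
final class TokenRewriteDemo {

    /** Minimal mutable token, loosely modelled on a token with setText. */
    static final class Tok {
        final boolean identifier;
        private String text;

        Tok(boolean identifier, String text) {
            this.identifier = identifier;
            this.text = text;
        }

        String getText() { return text; }
        void setText(String text) { this.text = text; }
    }

    // Identifier, Integer and the two bracket characters from the grammar.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<ID>[a-zA-Z_$][a-zA-Z0-9_$]*)|(?<INT>[0-9]+)|(?<PUNCT>[\\[\\]])");

    /** Step 1: create every token first. */
    static List<Tok> lex(String input) {
        List<Tok> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("no token at index " + pos);
            }
            tokens.add(new Tok(m.group("ID") != null, m.group()));
            pos = m.end();
        }
        return tokens;
    }

    /** Step 2: the "parser action" edits each identifier token that exists. */
    static String rewrite(String input) {
        StringBuilder out = new StringBuilder();
        for (Tok t : lex(input)) {
            if (t.identifier) {
                t.setText(t.getText() + "_");
            }
            out.append(t.getText());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints array1_[array2_[array3_[index_]]]
        System.out.println(rewrite("array1[array2[array3[index]]]"));
    }
}
```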
EDIT
I have no idea why you'd want to do that, but it seems you're simply trying to rewrite ] into ]_, which can be done in the same way as I showed above:
qualifiedName
:
Identifier
('[' (qualifiedName+ | Integer)? t=']' {$t.setText("]_");} )*
;