I'm trying to change a grammar in the JSqlParser project, which uses a JavaCC grammar file (.jj) specifying the standard SQL syntax. I had difficulty getting one section to work, so I narrowed it down to the following much-simplified grammar.
Basically, I have a definition of Column: [table .] field
But table itself can also contain the "." character, which causes confusion.
Intuitively, I think the following grammar should accept all of these sentences:
select mytable.myfield
select myfield
select mydb.mytable.myfield
But in practice it only accepts the 2nd and 3rd of these. Whenever it sees the ".", it goes on to demand the dotted version of table (i.e. the first derivation rule for Table).
How can I make this grammar work?
Thanks a lot
Yang
options{
IGNORE_CASE=true ;
STATIC=false;
DEBUG_PARSER=true;
DEBUG_LOOKAHEAD=true;
DEBUG_TOKEN_MANAGER=false;
// FORCE_LA_CHECK=true;
UNICODE_INPUT=true;
}
PARSER_BEGIN(TT)
import java.util.*;
public class TT {
}
PARSER_END(TT)
///////////////////////////////////////////// main stuff concerned
void Statement() :
{ }
{
<K_SELECT> Column()
}
void Column():
{
}
{
[LOOKAHEAD(3) Table() "." ]
//[
//LOOKAHEAD(2) (
// LOOKAHEAD(5) <S_IDENTIFIER> "." <S_IDENTIFIER>
// |
// LOOKAHEAD(3) <S_IDENTIFIER>
//)
//
//
//
//]
Field()
}
void Field():
{}{
<S_IDENTIFIER>
}
void Table():
{}{
LOOKAHEAD(5) <S_IDENTIFIER> "." <S_IDENTIFIER>
|
LOOKAHEAD(3) <S_IDENTIFIER>
}
////////////////////////////////////////////////////////
SKIP:
{
" "
| "\t"
| "\r"
| "\n"
}
TOKEN: /* SQL Keywords. prefixed with K_ to avoid name clashes */
{
<K_CREATE: "CREATE">
|
<K_SELECT: "SELECT">
}
TOKEN : /* Numeric Constants */
{
< S_DOUBLE: ((<S_LONG>)? "." <S_LONG> ( ["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> "." (["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> ["e","E"] (["+", "-"])? <S_LONG>
)>
| < S_LONG: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}
TOKEN:
{
< S_IDENTIFIER: ( <LETTER> | <ADDITIONAL_LETTERS> )+ ( <DIGIT> | <LETTER> | <ADDITIONAL_LETTERS> | <SPECIAL_CHARS>)* >
| < #LETTER: ["a"-"z", "A"-"Z", "_", "$"] >
| < #SPECIAL_CHARS: "$" | "_" | "#" | "@">
| < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")*>
| < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])+ "\"" | ("`" (~["\n","\r","`"])+ "`") | ( "[" ~["0"-"9","]"] (~["\n","\r","]"])* "]" ) >
/*
To deal with database names (columns, tables) using not only latin base characters, one
can expand the following rule to accept additional letters. Here is the addition of german umlauts.
There seems to be no way to recognize letters by an external function to allow
a configurable addition. One must rebuild JSqlParser with this new "Letterset".
*/
| < #ADDITIONAL_LETTERS: ["ä","ö","ü","Ä","Ö","Ü","ß"] >
}
You could rewrite your grammar like this:
Statement --> "select" Column
Column --> Prefix <ID>
Prefix --> (<ID> ".")*
Now the only choice is whether to iterate or not. Assuming a "." can't follow a Column, this is easily done with a lookahead of 2:
Statement --> "select" Column
Column --> Prefix <ID>
Prefix --> (LOOKAHEAD( <ID> ".") <ID> ".")*
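To make the two-token lookahead concrete, here is a minimal hand-written Java sketch of the same decision. It is not JavaCC output; the class name ColumnSketch, the helper names, and the simplified tokenizer are assumptions made only for illustration. The loop keeps consuming <ID> "." pairs only while the next two tokens really are an identifier followed by a dot, which is what LOOKAHEAD(<ID> ".") expresses.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSketch {
    /** Simplified tokenizer (an assumption): identifiers and "." only. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '.') {
                tokens.add(".");
                i++;
            } else if (Character.isLetter(c) || c == '_') {
                int start = i;
                while (i < input.length()
                        && (Character.isLetterOrDigit(input.charAt(i)) || input.charAt(i) == '_')) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return tokens;
    }

    static boolean isId(List<String> t, int pos) {
        return pos < t.size() && !t.get(pos).equals(".");
    }

    static boolean isDot(List<String> t, int pos) {
        return pos < t.size() && t.get(pos).equals(".");
    }

    /** Statement --> "select" Column; returns the prefix names followed by the field name. */
    static List<String> parseStatement(String input) {
        List<String> t = tokenize(input);
        if (t.isEmpty() || !t.get(0).equalsIgnoreCase("select")) {
            throw new IllegalArgumentException("Expected \"select\"");
        }
        int pos = 1;
        List<String> parts = new ArrayList<>();
        // Prefix --> (LOOKAHEAD(<ID> ".") <ID> ".")*
        // Iterate only if the next TWO tokens are an identifier and a dot.
        while (isId(t, pos) && isDot(t, pos + 1)) {
            parts.add(t.get(pos));
            pos += 2;
        }
        // Column --> Prefix <ID>
        if (!isId(t, pos)) {
            throw new IllegalArgumentException("Expected an identifier at token " + pos);
        }
        parts.add(t.get(pos++));
        if (pos != t.size()) {
            throw new IllegalArgumentException("Unexpected trailing input");
        }
        return parts;
    }
}
```

With this sketch, "select myfield", "select mytable.myfield" and "select mydb.mytable.myfield" are all accepted, because the loop stops as soon as the identifier it is looking at is not followed by a dot.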
Indeed, the following grammar in flex+bison (an LR parser) works fine, recognizing all of the following sentences correctly:
create mydb.mytable
create mytable
select mydb.mytable.myfield
select mytable.myfield
select myfield
So the problem is indeed due to a limitation of LL parsing.
%%
statement:
create_sentence
|
select_sentence
;
create_sentence: CREATE table
;
select_sentence: SELECT table '.' ID
|
SELECT ID
;
table : table '.' ID
|
ID
;
%%
If you need Table to be its own nonterminal, you can do this by using a boolean parameter that says whether the table is expected to be followed by a dot.
void Statement():{}{
"select" Column() | "create" "table" Table(false) }
void Column():{}{
[LOOKAHEAD(<ID> ".") Table(true) "."] <ID> }
void Table(boolean expectDot):{}{
<ID> MoreTable(expectDot) }
void MoreTable(boolean expectDot):{}{
LOOKAHEAD("." <ID> ".", {expectDot}) "." <ID> MoreTable(expectDot)
|
LOOKAHEAD(".", {!expectDot}) "." <ID> MoreTable(expectDot)
|
{}
}
Doing it this way precludes using Table in any syntactic lookahead specifications either directly or indirectly. E.g. you shouldn't have LOOKAHEAD( Table()) anywhere in your grammar, because semantic lookahead is not used during syntactic lookahead. See the FAQ for more information on that.
Your examples are parsed perfectly well using JSqlParser V0.9.x (https://github.com/JSQLParser/JSqlParser)
CCJSqlParserUtil.parse("SELECT mycolumn");
CCJSqlParserUtil.parse("SELECT mytable.mycolumn");
CCJSqlParserUtil.parse("SELECT mydatabase.mytable.mycolumn");
I'm using ANTLR 4 to parse a protocol's messages, let's name it 'X'. Before extracting a message's information , I have to check if it complies with X's rules.
Suppose we have to parse X's 'FOO' message that follows the following rules:
Message starts with the 'messageIdentifier' that consists of the 3-letter reserved word FOO.
Message contains 5 fields, of which the first 2 are mandatory (must be included) and the remaining 3 are optional (may be omitted).
Message's fields are separated by the character '/'. If there is no information in a field (that means that the field is optional and is omitted) the '/' character must be preserved. Optional fields and their associated field separators '/' at the end of the message may be omitted where no further information within the message is reported.
A message can span multiple lines. Each line must have at least one non-empty field (mandatory or optional). Moreover, each line must start with a '/' character and end with a non-empty field followed by a '\n' character. The exception is the first line, which always starts with the reserved word FOO.
Each message's field also has its own rules regarding the accepted tokens, which will be shown in the grammar below.
Sample examples of valid FOO messages:
FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
FOO/MANDATORY_1/MANDATORY2\n
/OPT 1\n
/HELLO\n
/100\n
FOO/MANDATORY_1/MANDATORY2\n
FOO/MANDATORY_1/MANDATORY2//HELLO/100\n
FOO/MANDATORY_1/MANDATORY2///100\n
FOO/MANDATORY_1/MANDATORY2/OPT 1\n
FOO/MANDATORY_1/MANDATORY2
///100\n
Sample examples of non-valid FOO messages:
FOO\n
/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
FOO/MANDATORY_1/\n
MANDATORY2/OPT 1/HELLO/100\n
FOO/MANDATORY_1/MANDATORY2/OPT 1//\n
FOO/MANDATORY_1/MANDATORY2/OPT 1/\n
/100\n
FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
FOO/MANDATORY_1/MANDATORY2/\n
FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100
Below follows the grammar for the above message:
grammar Foo_Message;
/* Parser Rules */
startRule : 'FOO' mandatoryField_1 ;
mandatoryField_1 : '/' field_1 NL? mandatoryField_2 ;
mandatoryField_2 : '/' field_2 NL? optionalField_3 ;
optionalField_3 : '/' field_3 NL? optionalField_4
| '/' optionalField_4
| optionalField_4
;
optionalField_4 : '/' field_4 NL? optionalField_5
| '/' optionalField_5
| optionalField_5
;
optionalField_5 : '/' field_5 NL?
| NL
;
field_1 : (A | N | B | S)+ ;
field_2 : (A | N)+ ;
field_3 : (A | N | B)+ ;
field_4 : A+ ;
field_5 : N+ ;
/* Lexer Rules */
A : [A-Z]+ ;
N : [0-9]+ ;
B : ' ' -> skip ;
S : [*&##-_<>?!]+ ;
NL : '\r'? '\n' ;
The above grammar parses correctly any input that complies with FOO message's rules.
The problem resides in parsing a line that ends with the '/' character, which according to the protocol's FOO message's rules is an invalid input.
I understand that the second alternatives of rules 'optionalField_3', 'optionalField_4' and 'optionalField_5' lead to this behavior but I can't figure out how to make a rule for this.
Somehow I need the parser to remember that it reached the 'optionalField_5' rule after seeing a non-omitted field in the previous rule, which, if I am not mistaken, can't be done in ANTLR, as I can't check from which alternative of the previous rule I reached the current rule.
Is there a way to make the parser 'remember' this by some explicit option-rule? Or does my grammar need to be rearranged and if yes how?
This grammar accepts all the valid examples, copied and pasted character for character from your post, and flags a parse error for all the "non-valid FOO messages".
grammar X;
file_ : s* EOF ;
s : FOO '/' f1 '/' f2 (
| NL? '/' f3
| NL? ('/' f3 NL? | '/' ) '/' f4
| NL? ('/' f3 NL? | '/' ) ('/' f4 NL? | '/') '/' f5
) NL;
f1 : (A | N | B | S)+ ;
f2 : (A | N | B)+ ;
f3 : (A | N | B)+ ;
f4 : A+ ;
f5 : N+ ;
FOO: 'FOO';
A : [A-Z]+ ;
N : [0-9]+ ;
B : ' ';
S : [*&##\-_<>?!]+ ;
NL : '\r'? '\n' ;
One can easily refactor this with folds and groupings.
In your previous grammar, lexer symbol B was marked as "skip". Skipped symbols do not appear on any token stream, and they should not be used directly on the right-hand side of a parser rule (see field_1 from your original grammar). It is innocuous because it is alted with other symbols, i.e. field_3:(A|N|B)+; will operate the same as field_3:(A|N)+;, but the rule field_3:(A|N|B)+; may be misleading to others because B will never appear in the parse tree. I felt that you wanted to include spaces in the fields, because perhaps you would want to compute the text for a field. Therefore, I changed the rule for B to appear as a token.
#5 from "non-valid FOO messages" is exactly the same character for character of #1 from "valid FOO messages", which you can see here:
#1: FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
#5: FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
I don't understand your comment "this allows the optional fields of the FOO message to come in any order". The grammar here and the previous grammar I mentioned in the comments force field3 to occur before field4, which occurs before field5. There is no way that field5 could occur before a field3: the requisite number of '/' must appear before field5. Fields can be empty (see #4 of "valid FOO messages"). To handle that, the field specified is a grouping, e.g., ('/' f3 NL? | '/' ). For this grouping, the only sentential forms are "/", "/f3", "/f3\n". Note, this grouping can only occur with a succeeding field, so it is impossible for two "\n" to be next to each other.
The other way to approach this is to use semantic predicates or evaluate the semantic equations after the entire parse.
If there are many more fields, then you will likely not want to add alts for f6, f7, ..., f10000. In that case, I would suggest that you allow an arbitrary type for each field in the parse:
s : FOO '/' f1 '/' f2 (
| NL? ('/' f NL? | '/' )* '/' f
) NL;
and validate the semantics afterwards.
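As a rough illustration of what validating the semantics afterwards could look like, here is a small Java sketch. It assumes the field texts have already been collected from the parse tree in order (an empty string standing for an omitted field); that collection step is not shown. The class name FooFieldCheck and the method validate are hypothetical, and the character sets are transcribed from the f1..f5 and lexer rules above, so treat them as an assumption rather than the protocol's definition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FooFieldCheck {
    // One pattern per field position, transcribed from f1..f5 in the grammar above.
    private static final Pattern[] RULES = {
        Pattern.compile("[A-Z0-9 *&#\\-_<>?!]+"), // f1: (A | N | B | S)+
        Pattern.compile("[A-Z0-9 ]+"),             // f2: (A | N | B)+
        Pattern.compile("[A-Z0-9 ]+"),             // f3: (A | N | B)+
        Pattern.compile("[A-Z]+"),                 // f4: A+
        Pattern.compile("[0-9]+"),                 // f5: N+
    };

    /** Returns a list of problems; an empty list means the fields look valid. */
    static List<String> validate(List<String> fields) {
        List<String> errors = new ArrayList<>();
        if (fields.size() < 2 || fields.size() > RULES.length) {
            errors.add("expected 2 to 5 fields, got " + fields.size());
            return errors;
        }
        for (int i = 0; i < fields.size(); i++) {
            String text = fields.get(i);
            boolean mandatory = i < 2;
            boolean last = i == fields.size() - 1;
            if (text.isEmpty()) {
                if (mandatory) {
                    errors.add("field " + (i + 1) + " is mandatory");
                } else if (last) {
                    // A trailing '/' with nothing after it is not allowed.
                    errors.add("message must not end with an empty field");
                }
                continue;
            }
            if (!RULES[i].matcher(text).matches()) {
                errors.add("field " + (i + 1) + " has invalid content: " + text);
            }
        }
        return errors;
    }
}
```

The grammar then only has to enforce the '/' and line structure, while per-field content rules live in one table that is easy to extend when more fields are added.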
Solution was to refactor my grammar to include rules for filledField and emptyField.
kaby76's post is marked as an answer as it helped towards the solution.
The refactored grammar:
grammar Foo_Message;
/* Parser Rules */
startRule : 'FOO' mandatoryField_1 endRule ;
mandatoryField_1 : '/' field_1 NL? mandatoryField_2 ;
mandatoryField_2 : '/' field_2 NL? (filledOptionalField_3 | emptyOptionalField_3 )? ;
filledOptionalField_3 : '/' field_3 NL? (filledOptionalField_4 | emptyOptionalField_4)? ;
emptyOptionalField_3 : '/' (filledOptionalField_4 | emptyOptionalField_4) ;
filledOptionalField_4 : '/' field_4 NL? filledOptionalField_5? ;
emptyOptionalField_4 : '/' filledOptionalField_5 ;
filledOptionalField_5 : '/' field_5 ;
endRule : NL;
field_1 : (A | N | B | S)+ ;
field_2 : (A | N)+ ;
field_3 : (A | N | B)+ ;
field_4 : A+ ;
field_5 : N+ ;
/* Lexer Rules */
A : [A-Z]+ ;
N : [0-9]+ ;
B : ' ' -> skip ;
S : [*&##-_<>?!]+ ;
NL : '\r'? '\n' ;
I want to create a recursive descent parser in Java for the following grammar (I have managed to create the tokens). This is the relevant part of the grammar:
expression ::= numeric_expression | identifier | "null"
identifier ::= "a..z,$,_"
numeric_expression ::= ( ( "-" | "++" | "--" ) expression )
| ( expression ( "++" | "--" ) )
| ( expression ( "+" | "+=" | "-" | "-=" | "*" | "*=" | "/" | "/=" | "%" | "%=" ) expression )
arglist ::= expression { "," expression }
I have written code for parsing numeric_expression (assuming if invalid token, return null):
NumericAST<? extends OpAST> parseNumericExpr() {
    OpAST op;
    if (token.getCodes() == Lexer.CODES.UNARY_OP) { // Check for a unary operator like "++" or "--"
        op = new UnaryOpAST(token.getValue());
        token = getNextToken();
        AST expr = parseExpr(); // Method that returns an expression node.
        if (expr == null) {
            op = null;
            return null;
        } else {
            if (checkSemi()) {
                System.out.println("UNARY AST CREATED");
                return new NumericAST<OpAST>(expr, op, false);
            }
            else {
                return null;
            }
        }
    } else { // Binary operation like "a+b", where a,b -> expression
        AST expr = parseExpr();
        if (expr == null) {
            return null;
        } else {
            token = getNextToken();
            if (token.getCodes() == Lexer.CODES.UNARY_OP) {
                op = new UnaryOpAST(token.getValue());
                return new NumericAST<OpAST>(expr, op, true);
            } else if (token.getCodes() == Lexer.CODES.BIN_OP) {
                op = new BinaryOpAST(token.getValue());
                token = getNextToken();
                AST expr2 = parseExpr();
                if (expr2 == null) {
                    op = null;
                    expr = null;
                    return null;
                } else {
                    if (checkSemi()) {
                        System.out.println("BINARY AST CREATED");
                        return new NumericAST<OpAST>(expr, op, expr2);
                    }
                    else {
                        return null;
                    }
                }
            } else {
                expr = null;
                return null;
            }
        }
    }
}
Now, if I get a unary operator like ++, I can call this method directly, but I don't know how to recognize other productions that start with the same symbol, such as arglist and numeric_expression, which both have "expression" as their first symbol.
My question is:
How do I decide whether to call parseNumericExpr() or parseArgList() (method not shown above) when I get an expression token?
In order to write a recursive descent parser, you need an appropriate top-down grammar, normally an LL(1) grammar, although it's common to write the grammar using EBNF operators, as shown in the example grammar on Wikipedia's page on recursive descent parsers.
Unfortunately, your grammar is not LL(1), and the question you raise is a consequence of that fact. An LL(1) grammar has the property that the parser can always determine which production to use by examining only the next input token, which puts some severe constraints on the grammar, including:
No two productions for the same non-terminal can start with the same symbol.
No production can be left-recursive (i.e. the first symbol on the right-hand side is the defining non-terminal).
Here's a small rearrangement of your grammar which will work:
-- I added number here in order to be explicit.
atom ::= identifier | number | "null" | "(" expression ")"
-- I added function calls here, but it's arguable that this syntax accepts
-- a lot of invalid expressions
primary ::= atom { "++" | "--" | "(" [ arglist ] ")" }
factor ::= [ "-" | "++" | "--" ] primary
term ::= factor { ( "*" | "/" | "%" ) factor }
value ::= term { ( "+" | "-" ) term }
-- This adds the ordinary "=" assignment to the list in case it was
-- omitted by accident. Also, see the note below.
expression ::= { value ( "=" | "+=" | "-=" | "*=" | "/=" | "%=" ) } value
arglist ::= expression { "," expression }
The last expression rule is an attempt to capture the usual syntax of assignment operators (which associate to the right, not to the left), but it suffers from a classic problem addressed by this highly related question. I don't think I have a better answer to this issue than the one I wrote three years ago, so I hope it is still useful.
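To show how the rearranged grammar maps onto code, here is a minimal Java sketch with one method per non-terminal. It is an illustration under assumptions, not the asker's parser: the class name ExprSketch is made up, it has its own tiny tokenizer, it returns a fully parenthesized string instead of AST nodes, and it leaves out function calls, arglist and the assignment operators (so a parenthesized atom contains a value rather than a full expression). Each method decides what to do by looking only at the next token, which is the LL(1) property described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExprSketch {
    // Two-character operators are listed first so "++" is not split into "+" "+".
    private static final Pattern TOKEN =
        Pattern.compile("\\s*(\\+\\+|--|[-+*/%(),]|[A-Za-z_$][A-Za-z0-9_$]*|[0-9]+)");

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    ExprSketch(String input) {
        Matcher m = TOKEN.matcher(input);
        int end = 0;
        while (m.lookingAt()) {
            tokens.add(m.group(1));
            end = m.end();
            m.region(end, input.length());
        }
        if (!input.substring(end).trim().isEmpty()) {
            throw new IllegalArgumentException("Cannot tokenize: " + input.substring(end));
        }
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }

    private String next() { return tokens.get(pos++); }

    private boolean peekIs(String... options) {
        for (String o : options) {
            if (o.equals(peek())) return true;
        }
        return false;
    }

    // value ::= term { ( "+" | "-" ) term }
    String value() {
        String left = term();
        while (peekIs("+", "-")) {
            String op = next();
            left = "(" + left + " " + op + " " + term() + ")";
        }
        return left;
    }

    // term ::= factor { ( "*" | "/" | "%" ) factor }
    String term() {
        String left = factor();
        while (peekIs("*", "/", "%")) {
            String op = next();
            left = "(" + left + " " + op + " " + factor() + ")";
        }
        return left;
    }

    // factor ::= [ "-" | "++" | "--" ] primary
    String factor() {
        if (peekIs("-", "++", "--")) {
            String op = next();
            return "(" + op + primary() + ")";
        }
        return primary();
    }

    // primary ::= atom { "++" | "--" }   (function calls omitted in this sketch)
    String primary() {
        String result = atom();
        while (peekIs("++", "--")) {
            result = "(" + result + next() + ")";
        }
        return result;
    }

    // atom ::= identifier | number | "null" | "(" value ")"
    String atom() {
        if (peekIs("(")) {
            pos++;
            String inner = value();
            if (!peekIs(")")) {
                throw new IllegalStateException("Expected ) but found '" + peek() + "'");
            }
            pos++;
            return inner;
        }
        String t = peek();
        boolean startsAtom = !t.isEmpty()
            && (Character.isLetterOrDigit(t.charAt(0)) || "_$".indexOf(t.charAt(0)) >= 0);
        if (!startsAtom) {
            throw new IllegalStateException("Expected an atom but found '" + t + "'");
        }
        pos++;
        return t;
    }

    static String parse(String input) {
        ExprSketch p = new ExprSketch(input);
        String tree = p.value();
        if (p.pos != p.tokens.size()) {
            throw new IllegalStateException("Unexpected '" + p.peek() + "'");
        }
        return tree;
    }
}
```

For example, ExprSketch.parse("a + b * c") returns "(a + (b * c))", which shows that the term level binds tighter than the value level without any backtracking.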
I'm trying to write a code translator in Java with the help of Antlr4 and had great success with the grammar part so far. However I'm now banging my head against a wall wrapping my mind around the parse tree data structure that I need to work on after my input has been parsed.
I'm trying to use the visitor template to go over my parse tree. I'll show you an example to illustrate the points of my confusion.
My grammar:
grammar pqlc;
// Lexer
// Keywords
EXISTS: 'exists';
REDUCE: 'reduce';
QUERY: 'query';
INT: 'int';
DOUBLE: 'double';
CONST: 'const';
STDVECTOR: 'std::vector';
STDMAP: 'std::map';
STDSET: 'std::set';
C_EXPR: 'c_expr';
INTEGER_LITERAL : (DIGIT)+ ;
fragment DIGIT: '0'..'9';
DOUBLE_LITERAL : DIGIT '.' DIGIT+;
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
DOT : '.';
EQUAL : '==';
LE : '<=';
GE : '>=';
GT : '>';
LT : '<';
ADD : '+';
MUL : '*';
AND : '&&';
COLON : ':';
IDENTIFIER : JavaLetter JavaLetterOrDigit*;
fragment JavaLetter : [a-zA-Z$_]; // these are the "java letters" below 0xFF
fragment JavaLetterOrDigit : [a-zA-Z0-9$_]; // these are the "java letters or digits" below 0xFF
WS
: [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Parser
//start_rule: query;
query :
quant_expr
| qexpr+
| IDENTIFIER // order IDENTIFIER and qexpr+?
| numeral
| c_expr //TODO
;
c_type : INT | DOUBLE | CONST;
bin_op: AND | ADD | MUL | EQUAL | LT | GT | LE| GE;
qexpr:
LPAREN query RPAREN bin_op_query?
// query bin_op query
| IDENTIFIER bin_op_query? // copied from query to resolve left recursion problem
| numeral bin_op_query? // ^
| quant_expr bin_op_query? // ^
|c_expr bin_op_query?
// query.find(query)
| IDENTIFIER find_query? // copied from query to resolve left recursion problem
| numeral find_query? // ^
| quant_expr find_query?
|c_expr find_query?
// query[query]
| IDENTIFIER array_query? // copied from query to resolve left recursion problem
| numeral array_query? // ^
| quant_expr array_query?
|c_expr array_query?
// | qexpr bin_op_query // bad, resolved by quexpr+ in query
;
bin_op_query: bin_op query bin_op_query?; // resolve left recursion of query bin_op query
find_query: '.''find' LPAREN query RPAREN;
array_query: LBRACK query RBRACK;
quant_expr:
quant id ':' query
| QUERY LPAREN match RPAREN ':' query
| REDUCE LPAREN IDENTIFIER RPAREN id ':' query
;
match:
STDVECTOR LBRACK id RBRACK EQUAL cm
| STDMAP '.''find' LPAREN cm RPAREN EQUAL cm
| STDSET '.''find' LPAREN cm RPAREN
;
cm:
IDENTIFIER
| numeral
| c_expr //TODO
;
quant :
EXISTS;
id :
c_type IDENTIFIER
| IDENTIFIER // According to page 2 but not the overview. According to the overview, id -> but then rule 1 would be without +
;
numeral :
INTEGER_LITERAL
| DOUBLE_LITERAL
;
c_expr:
C_EXPR
;
Now let's parse the following string:
double x: x >= c_expr
Visually I'll get this tree:
Let's say my visitor is in the visitQexpr(@NotNull pqlcParser.QexprContext ctx) routine when it hits the branch Qexpr(x bin_op_query).
My question is, how can I tell that the left child ("x" in the tree) is a terminal node, or more specifically an "IDENTIFIER"? There are no visiting rules for terminal nodes since they aren't rules.
ctx.getChild(0) has no RuleIndex. I guess I could use that to check if I'm in a terminal or not, but that still wouldn't tell me if I was in IDENTIFIER or another kind of terminal token. I need to be able to tell the difference somehow.
I had more questions but in the time it took me to write the explanation I forgot them :<
Thanks in advance.
You can add labels to tokens and access them/check if they exist in the surrounding context:
id :
c_type labelA = IDENTIFIER
| labelB = IDENTIFIER
;
You could also do this to create different visits:
id :
c_type IDENTIFIER #idType1 //choose more appropriate names!
| IDENTIFIER #idType2
;
This will create different visit methods for the two alternatives and I suppose (i.e. have not verified) that the visit method for id will not be called.
I prefer the following approach though:
id :
typeDef
| otherId
;
typeDef: c_type IDENTIFIER;
otherId : IDENTIFIER ;
This is a more heavily typed system. But you can very specifically visit nodes. Some rules of thumb I use:
Use | only when all alternatives are parser rules.
Wrap each Token in a parser rule (like otherId) to give them "more meaning".
It's ok to mix parser rules and tokens, if the tokens are not really important (like ;) and therefore not needed in the parse tree.
Here is a short JavaCC code:
PARSER_BEGIN(TestParser)
public class TestParser
{
}
PARSER_END(TestParser)
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}
TOKEN : /* LITERALS */
{
<VOID: "void">
| <LPAR: "("> | <RPAR: ")">
| <LBRAC: "{"> | <RBRAC: "}">
| <COMMA: ",">
| <DATATYPE: "int">
| <#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <DOUBLE_QUOTE_LITERAL: "\"" (~["\""])*"\"" >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
| <VARIABLE: "$"<IDENTIFIER> >
}
public void input():{} { (statement())+ <EOF> }
private void statement():{}
{
<VOID> <IDENTIFIER> <LPAR> (<DATATYPE> <IDENTIFIER> (<COMMA> <DATATYPE> <IDENTIFIER>)*)? <RPAR>
<LBRAC>
<RBRAC>
}
I'd like this parser to handle the following kind of input with a "grammar-free" section (character '}' would be the end of the section ):
void fun(int i, int j)
{
Hello world the value of i is ${i}
and j=${j}.
}
the grammar-free section would return a
java.util.List<String_or_VariableReference>
How should I modify my javacc parser to handle this section ?
Thanks.
If I understand the question correctly, you want to allow essentially arbitrary input for a while and then switch back to your language. If you can decide when to make the switch based purely on tokens, then this is easy to do using two lexical states. Use the default state for your programming language. When a "{" is seen in the DEFAULT state, switch to the other state:
TOKEN: { <LBRACE : "{" > : FREE }
In the FREE state, when a "}" is seen, switch back to the DEFAULT state; when any other character is seen, pass it on to the parser.
<FREE> TOKEN : { <RBRACE : "}" > : DEFAULT }
<FREE> TOKEN : { <OTHER : ~["}"] > : FREE }
In the parser you can have
void freeSection() : {} { <LBRACE> (<OTHER>)* <RBRACE> }
If you want to do something with all those OTHER characters, see question 5.2 in the FAQ. http://www.engr.mun.ca/~theo/JavaCC-FAQ
If you want to capture variable references such as "${i}" in the FREE state, you can do that too. Add
<FREE> TOKEN : { <VARREF : "${" (["a"-"z"]|["A"-"Z"])* "}" > }
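As a rough picture of the list such a free section could produce, here is a plain-Java sketch that splits the section's text into literal chunks and variable references. It does not use JavaCC; in a real parser the same list would be built from the OTHER and VARREF tokens in the production's actions. The class FreeSectionSketch and the Part type (standing in for the asker's String_or_VariableReference) are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FreeSectionSketch {
    /** Hypothetical element type: either literal text or a variable reference. */
    static final class Part {
        final boolean variable;
        final String text;

        Part(boolean variable, String text) {
            this.variable = variable;
            this.text = text;
        }

        @Override
        public String toString() {
            return (variable ? "VAR(" : "TEXT(") + text + ")";
        }
    }

    // Same shape as the VARREF token: "${", letters, "}".
    private static final Pattern VARREF = Pattern.compile("\\$\\{([A-Za-z]*)\\}");

    /** Splits the body of a free section into text chunks and variable references. */
    static List<Part> split(String body) {
        List<Part> parts = new ArrayList<>();
        Matcher m = VARREF.matcher(body);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                parts.add(new Part(false, body.substring(last, m.start())));
            }
            parts.add(new Part(true, m.group(1)));
            last = m.end();
        }
        if (last < body.length()) {
            parts.add(new Part(false, body.substring(last)));
        }
        return parts;
    }
}
```

Calling FreeSectionSketch.split("i is ${i} and j=${j}.") yields the parts TEXT(i is ), VAR(i), TEXT( and j=), VAR(j), TEXT(.).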
I am new to Antlr and I have defined a basic grammar using Antlr 3. The grammar compiles and ANTLRWorks generates the Parser and Lexer code without any problems.
The grammar can be seen below:
grammar i;
@header {
package i;
}
module : 'Module1'| 'Module2';
object : 'I';
objectType : 'Name';
filters : EMPTY | 'WHERE' module;
table : module object objectType;
STRING : ('a'..'z'|'A'..'Z')+;
EMPTY : ' ';
The problem is that when I interpret the table Parser I get a MismatchedSetException. This is due to having the EMPTY. As soon as I remove EMPTY from the grammar, the interpretation works. I have looked on the Antlr website and some other examples and the Empty space is ' '. I am not sure what to do. I need this EMPTY.
When it interprets, I get the following Exception:
Interpreting...
[11:02:14] problem matching token at 1:4 NoViableAltException(' '@[1:1: Tokens : ( T__4 | T__5 | T__6 | T__7 | T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | T__15 );])
[11:02:14] problem matching token at 1:9 NoViableAltException(' '@[1:1: Tokens : ( T__4 | T__5 | T__6 | T__7 | T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | T__15 );])
As soon as I change the EMPTY to be the following:
EMPTY : '';
instead of:
EMPTY : ' ';
It actually interprets it. However, I am getting the following Exception:
Interpreting...
[10:57:23] problem matching token at 1:4 NoViableAltException(' '@[1:1: Tokens : ( T__4 | T__5 | T__6 | T__7 | T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | T__15 | T__16 );])
[10:57:23] problem matching token at 1:9 NoViableAltException(' '@[1:1: Tokens : ( T__4 | T__5 | T__6 | T__7 | T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | T__15 | T__16 );])
However, ANTLRWorks still generates the Lexer and Parser code.
I hope you can help.
EDIT:
grammar i;
@header {
package i;
}
select : 'SELECT *' 'FROM' table filters';';
filters : EMPTY | 'WHERE' conditions;
conditions : STRING operator value;
operator : '=' | '!=';
true : 'true';
value : true;
STRING : ('a'..'z'|'A'..'Z')+;
EMPTY : ' ';
I'm still a bit unsure about usage, but I think we're talking about the same thing when we say "empty input". Here's an answer to get the ball rolling, starting with a modified grammar.
grammar i;
@header {
package i;
}
module : 'Module1'| 'Module2';
object : 'I';
objectType : 'Name';
filters : | 'WHERE' module;
table : module object objectType filters;
STRING : ('a'..'z'|'A'..'Z')+;
WS : (' '|'\t'|'\f'|'\n'|'\r')+ {skip();}; //ignore whitespace
Note that I tacked filters onto the end of the table rule to explain what I'm talking about.
This grammar accepts the following input (starting with rule table) as it did before:
Module1 I Name
It works because filters matches even though nothing follows the text Name: it matches on empty input using the first alternative.
The grammar also accepts this:
Module1 I Name WHERE Module2
The filters rule is satisfied with the text WHERE Module2 matching the second alternative (defined as 'WHERE' module in the grammar).
A cleaner approach would be to change filters and table to the following rules (recognizing, of course, that I changed table in the first place).
filters : 'WHERE' module; //no more '|'
table : module object objectType filters?; //added '?'
The grammar matches the same input as before, but the terms are a little clearer: instead of saying "filters is required in table and filters matches on empty", we now say "filters are optional in table and filters doesn't match on empty".
It amounts to the same thing in this case. Matching on empty (foo: | etc;) is perfectly valid, but I've run into more problems using it than I have with matching optional (foo?) rules.
Update following your update.
I'm taking a step back here to get us out of the theoretical and into the practical. Here is an updated grammar, Java test code that invokes it, test input, and test output. Please give it a run.
Grammar
Altered for testing but follows the same idea as before.
grammar i;
@header {
package i;
}
selects : ( //test rule to allow processing multiple select calls. Don't worry about the details.
{System.out.println(">>select");}
select
{System.out.println("<<select");}
)+
;
select : 'SELECT *' 'FROM' table filters? ';'
{System.out.println("\tFinished select.");} //test output
;
module : 'Module1'| 'Module2';
object : 'I';
objectType : 'Name';
filters : 'WHERE' conditions
{System.out.println("\tFinished filters.");} //test output
;
table : module object objectType
{System.out.println("\tFinished table.");} //test output
;
conditions : STRING operator value
{System.out.println("\tCondition test on " + $STRING.text);}
;
operator : '=' | '!=';
true_ : 'true'; //changed so that Java code could be generated
value : true_;
STRING : ('a'..'z'|'A'..'Z')+;
WS : (' '|'\t'|'\f'|'\n'|'\r')+ {skip();}; //ignore whitespace
TestiGrammar.java
package i;

import java.io.InputStream;
import org.antlr.runtime.ANTLRInputStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;

public class TestiGrammar {
    public static void main(String[] args) throws Exception {
        // Load the test input from the classpath, next to this class.
        InputStream resource = TestiGrammar.class.getResourceAsStream("itest.txt");
        CharStream input = new ANTLRInputStream(resource);
        resource.close();
        iLexer lexer = new iLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        iParser parser = new iParser(tokens);
        // Start parsing at the "selects" test rule.
        parser.selects();
    }
}
itest.txt Test input file
SELECT * FROM Module2 I Name;
SELECT * FROM Module2 I Name WHERE foobar = true;
SELECT * FROM Module2 I Name WHERE dingdong != true;
test output
>>select
	Finished table.
	Finished select.
<<select
>>select
	Finished table.
	Condition test on foobar
	Finished filters.
	Finished select.
<<select
>>select
	Finished table.
	Condition test on dingdong
	Finished filters.
	Finished select.
<<select