Ambiguity in ANTLR grammar

Ambiguity in ANTLR grammar - java

AntlrWorks says that input {'AND','OR'..'XOR'} can be matched by two alternatives. Even with the graphical display, I could not figure out how the match happens!
How on earth the ambiguity occurs in the grammar below, and is there a way to remove it?
grammar testg;
rul : contains_expr ;
contains_expr: 'CONTAINS' contains_expression
//'CONTAINS' contains_or
;
contains_expression : primary (('OR'|'AND'|'XOR') primary)*
;
primary options{backtrack = true;}
: '(' contains_expression ')'
| class_expression
;
class_expression : simple_class_expr
| '(' simple_class_expr contains_expr ')'
|( simple_class_expr contains_expr)
;
simple_class_expr: identifier // RM_TYPE_NAME
| identifier identifier // RM_TYPE_NAME variable
| archetype_class_expr
| versioned_class_expression
| version_class_expression
// | identified_obj_expression // need to be used once VersionedClassExpr is removed
;
identifier
: ID
;
archetype_class_expr
: '.ace'
;
versioned_class_expression
: '.vce'
;
version_class_expression
: '.vnce'
;
temp :
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;

How would you expect your grammar to parse CONTAINS foo bar baz?
contains_expr matches CONTAINS.
contains_expression "calls" primary.
primary "calls" class_expression.
class_expression "calls" simple_class_expr.
simple_class_expr can match: identifier or identifier identifier.
Thus I can see several possible parsings here; I've put individual simple_class_expr matches into parenthesis:
CONTAINS (foo bar) (baz)
CONTAINS (foo) (bar) (baz)
CONTAINS (foo) (bar baz)
I'm sorry to say that I'm new enough to parsing tools to not have suggestions how to fix this except wondering what identifier identifier might mean.

Related

How to fix the error in left-recursion used with semantic predicates?

I would like to parse two type of expression with boolean :
- the first would be an init expression with boolean like : init : false
- and the last one would be a derive expression with boolean like : derive : !express or (express and (amount >= 100))
My idea is to put semantic predicates in a set of rules,
the goal is when I'm parsing a boolean expression beginning with the word 'init' then it has to go to only one alternative rule proposed who is boolliteral, the last alternative in boolExpression. And if it's an expression beginning with the word 'derive' then it could have access to all alternatives of boolExpression.
I know that I could make two type of boolExpression without semantic predicates like boolExpressionInit and boolExpressionDerive... But I would like to try with my idea if it's could work with a only one boolExpression with semantic predicates.
Here's my grammar
grammar TestExpression;
#header
{
package testexpressionparser;
}
#parser::members {
int vConstraintType;
}
/* SYNTAX RULES */
textInput : initDefinition
| derDefinition ;
initDefinition : t=INIT {vConstraintType = $t.type;} ':' boolExpression ;
derDefinition : t=DERIVE {vConstraintType = $t.type;} ':' boolExpression ;
boolExpression : {vConstraintType != INIT || vConstraintType == DERIVE}? boolExpression (boolOp|relOp) boolExpression
| {vConstraintType != INIT || vConstraintType == DERIVE}? NOT boolExpression
| {vConstraintType != INIT || vConstraintType == DERIVE}? '(' boolExpression ')'
| {vConstraintType != INIT || vConstraintType == DERIVE}? attributeName
| {vConstraintType != INIT || vConstraintType == DERIVE}? numliteral
| {vConstraintType == INIT || vConstraintType == DERIVE}? boolliteral
;
boolOp : OR | AND ;
relOp : EQ | NEQ | GT | LT | GEQT | LEQT ;
attributeName : WORD;
numliteral : intliteral | decliteral;
intliteral : INT ;
decliteral : DEC ;
boolliteral : BOOLEAN;
/* LEXICAL RULES */
INIT : 'init';
DERIVE : 'derive';
BOOLEAN : 'true' | 'false' ;
BRACKETSTART : '(' ;
BRACKETSTOP : ')' ;
BRACESTART : '{' ;
BRACESTOP : '}' ;
EQ : '=' ;
NEQ : '!=' ;
NOT : '!' ;
GT : '>' ;
LT : '<' ;
GEQT : '>=' ;
LEQT : '<=' ;
OR : 'or' ;
AND : 'and' ;
DEC : [0-9]* '.' [0-9]* ;
INT : ZERO | POSITIF;
ZERO : '0';
POSITIF : [1-9] [0-9]* ;
WORD : [a-zA-Z] [_0-9a-zA-Z]* ;
WS : (SPACE | NEWLINE)+ -> skip ;
SPACE : [ \t] ; /* Space or tab */
NEWLINE : '\r'? '\n' ; /* Carriage return and new line */
I except that the grammar would run successfully, but what i receive is : "error(119): TestExpression.g4::: The following sets of rules are mutually left-recursive [boolExpression]
1 error(s)
BUILD FAIL"

Apparently ANTLR4's support for (direct) left-recursion does not work when a predicate appears before a left-recursive rule invocation. So you can fix the error by moving the predicate after the first boolExpression in the left-recursive alternatives.
That said, it seems like the predicates aren't really necessary in the first place - at least not in the example you've shown us (or the one before your edit as far as I could tell). Since a boolExpression with the constraint type INIT can apparently only match boolLiteral, you can just change initDefinition as follows:
initDefinition : t=INIT ':' boolLiteral ;
Then boolExpression will always have the constraint type DERIVE and no predicates are necessary anymore.
Generally, if you want to allow different alternatives in non-terminal x based on whether it was invoked by y or z, you should simply have multiple versions of x and then call one from y and the other from z. That's usually a lot less hassle than littering the code with actions and predicates.
Similarly it can also make sense to have a rule that matches more than it should and then detect illegal expressions in a later phase instead of trying to reject them at the syntax level. Specifically beginners often try to write grammars that only allow well-typed expressions (rejecting something like 1+true with a syntax error) and that never works out well.

Antlr4 regular expression grammar, data structure for NFA transition table

I apologize for the extremely long explanation but I'm stuck for a month now and I really can't figure out how to solve this.
I have to derive, as a project, a compiler with antlr4 for a regex grammar that generate a program (JAVA) able to distinguish words belonging to the language generated by a regex used as input for antlr4 compiler.
The grammar that we have to use is this one:
RE ::= union | simpleRE
union ::= simpleRE + RE
simpleRE ::= concatenation | basicRE
concatenation ::= basicRE simpleRE
basicRE ::= group | any | char
group ::= (RE) | (RE)∗ | (RE)+
any ::= ?
char ::= a | b | c | ··· | z | A | B | C | ··· | Z | 0 | 1 | 2 | ··· | 9 | . | − | _
and from that, I gave this grammar to antrl4
Regexp.g4
grammar Regxp;
start_rule
: re # start
;
re
: union
| simpleRE
;
union
: simpleRE '+' re # unionOfREs
;
simpleRE
: concatenation
| basicRE
;
concatenation
: basicRE simpleRE #concatOfREs
;
basicRE
: group
| any
| cHAR
;
group
: LPAREN re RPAREN '*' # star
| LPAREN re RPAREN '+' # plus
| LPAREN re RPAREN # singleWithParenthesis
;
any
: '?'
;
cHAR
: CHAR #singleChar
;
WS : [ \t\r\n]+ -> skip ;
LPAREN : '(' ;
RPAREN : ')' ;
CHAR : LETTER | DIGIT | DOT | D | UNDERSCORE
;
/* tokens */
fragment LETTER: [a-zA-Z]
;
fragment DIGIT: [0-9]
;
fragment DOT: '.'
;
fragment D: '-'
;
fragment UNDERSCORE: '_'
;
Then i generated the java files from antlr4 with visitors.
As far as i understood the logic of the project, when the visitor is traversing the parse tree, it has to generate lines of code to fill the transition table of the NFA derived as applying the Thompson rules on the input regexp.
Then these lines of code are to be saved as a .java text file, and compiled to a program that takes in input a string (word) and tells if the word belongs or not to the language generated by the regex.
The result should be like this:
RE word Result
a+b a OK
b OK
ac KO
a∗b aab OK
b OK
aaaab OK
abb KO
So I'm asking, how can I represent the transition table in a way such that it can be filled during the visit of the parse tree and then exported in order to be used by a simple java program implementing the acceptance algorithm for an NFA? (i'm considering this pseudo-code):
S = ε−closure(s0);
c = nextChar();
while (c ≠ eof) do
S = ε−closure(move(S,c));
c = nextChar();
end while
if (S ∩ F ≠ ∅) then return “yes”;
else return “no”;
end if
As of now I managed to make that, when the visitor is for example in the unionOfREs rule, it will do something like this:
MyVisitor.java
private List<String> generatedCode = new ArrayList<String>();
/* ... */
#Override
public String visitUnionOfREs(RegxpParser.UnionOfREsContext ctx) {
System.out.println("unionOfRExps");
String char1 = visit(ctx.simpleRE());
String char2 = visit(ctx.re());
generatedCode.add("tTable.addUnion("+char1+","+char2+");");
//then this line of code will populate the transition table
return char1+"+"+char2;
}
/* ... */
The addUnion it's inside a java file that will contains all the methods to fill the transition table. I wrote code for the union, but i dont' like it because it's like to write the transition table of the NFA, as you would write it on a paper: example.
I got this when I noticed that by building the table iteratively, you can define 2 "pointers" on the table, currentBeginning and currentEnd, that tell you where to expand again the character written on the table, with the next rule that the visitor will find on the parse tree. Because this character can be another production or just a single character. On the link it is represented the written-on-paper example that convinced me to use this approach.
TransitionTable.java
/* ... */
public void addUnion(String char1, String char2) {
if (transitionTable.isEmpty()) {
List<List<Integer>> lc1 = Arrays.asList(Arrays.asList(null)
,Arrays.asList(currentBeginning+3)
,Arrays.asList(null)
,Arrays.asList(null)
,Arrays.asList(null)
,Arrays.asList(null));
List<List<Integer>> lc2 = Arrays.asList(Arrays.asList(null)
,Arrays.asList(null)
,Arrays.asList(currentBeginning+4)
,Arrays.asList(null)
,Arrays.asList(null)
,Arrays.asList(null));
List<List<Integer>> le = Arrays.asList(Arrays.asList(currentBeginning+1,currentBeginning+2)
,Arrays.asList(null)
,Arrays.asList(null)
,Arrays.asList(currentBeginning+5)
,Arrays.asList(currentBeginning+5)
,Arrays.asList(null));
transitionTable.put(char1, lc1);
transitionTable.put(char2, lc2);
transitionTable.put("epsilon", le);
//currentBeginning += 2;
//currentEnd = transitionTable.get(char2).get(currentBeginning).get(0);
currentEnd = transitionTable.get("epsilon").size()-1;//il 5
} else { //not the first time it encounters this rule, beginning and end changed
//needs to add 2 less states
}
}
/* ... */
At the moment I'm representing the transition table as HashMap<String, List<List<Integer>>> strings are for chars on the edges of the NFA and List<List<Integer>> because by being non deterministic, it needs to represent more transitions from a single state.
But going this way, for a parse tree like this i will obtain this line of code for the union : "tTable.addUnion("tTable.addConcat(a,b)","+char2+");"
And i'm blocked here, i don't know how to solve this and i really can't think a different way to represent the transition table or to fill it while visiting the parse tree.
Thank You.

Using Thompson's construction, every regular (sub-)expression produces an NFA, and every regular expression operator (union, cat, *) can be implemented by adding a couple states and connecting them to states that already exists. See:
https://en.wikipedia.org/wiki/Thompson%27s_construction
So, when parsing the regex, every terminal or non-terminal production should add the required states and transitions to the NFA, and return its start and end state to the containing production. Non-terminal productions will combine their children and return their own start+end states so that your NFA can be built from the leaves of the regular expression up.
The representation of the state table is not critical for building. Thompson's construction will never require you to modify a state or transition that you built before, so you just need to be able to add new ones. You will also never need more than one transition from a state on the same character, or even more than one non-epsilon transition. In fact, if all your operators are binary you will never need more than 2 transitions on a state. Usually the representation is designed to make it easy to do the next steps, like DFA generation or direct execution of the NFA against strings.
For example, a class like this can completely represent a state:
class State
{
public char matchChar;
public State matchState; //where to go if you match matchChar, or null
public State epsilon1; //or null
public State epsilon2; //or null
}
This would actually be a pretty reasonable representation for directly executing an NFA. But if you already have code for directly executing an NFA, then you should probably just build whatever it uses so you don't have to do another transformation.

Antlr4 doesn't recognize identifiers

I'm trying to create a grammar which parses a file line by line.
grammar Comp;
options
{
language = Java;
}
#header {
package analyseur;
import java.util.*;
import component.*;
}
#parser::members {
/** Line to write in the new java file */
public String line;
}
start
: objectRule {System.out.println("OBJ"); line = $objectRule.text;}
| anyString {System.out.println("ANY"); line = $anyString.text;}
;
objectRule : ObjectKeyword ID ;
anyString : ANY_STRING ;
ObjectKeyword : 'Object' ;
ID : [a-zA-Z]+ ;
ANY_STRING : (~'\n')+ ;
WhiteSpace : (' '|'\t') -> skip;
When I send the lexem 'Object o' to the grammar, the output is ANY instead of OBJ.
'Object o' => 'ANY' // I would like OBJ
I know the ANY_STRING is longer but I wrote lexer tokens in the order. What is the problem ?
Thank you very much for your help ! ;)

For lexer rules, the rule with the longest match wins, independent of rule ordering. If the match length is the same, then the first listed rule wins.
To make rule order meaningful, reduce the possible match length of the ANY_STRING rule to be the same or less than any key word or id:
ANY_STRING: ~( ' ' | '\n' | '\t' ) ; // also?: '\r' | '\f' | '_'
Update
To see what the lexer is actually doing, dump the token stream.

Java CC issue - "Expansion within "(...)*" can be matched by empty string"

We've been given a grammar to patch up and parse using Java CC. One of the problems with it is several occurrences of "expansion within "(...)*" can be matched by empty string. I understand this error is caused when something can be matched zero or more times inside something else that can be matched zero or more times.
What I don't understand is how to fix it. (Our instructor hasn't been saying much, "You have to be careful how you word it."
The problem area of the grammar, along with its associated Java CC code is shown below. Any ideas or advice would be greatly appreciated.
program := ( decl )*
( function ) *
main_prog
decl := ( var_decl | const_decl )*
var_decl := var ident_list : type ( , ident_list : type)* ;
const_decl := const identifier : type = expression ( , identifier : type = expression)* ;
function :=
type identifier ( param_list)
( decl )*
( statement ; )*
return ( expression | e ); //e is greek epsilon character
main_prog :=
main
( decl ) *
(statement ; )*
The issue is with the way decl is declared I think. It is declared here in actual Java CC code:
void decl():{}
{
( var_decl() | const_decl())*
}
If I change that Kleene closure above to + , all the other errors caused by this go away. However the instructor says the star should remain, and we need to be careful how we word it. I've found lots of resources on left factoring, left recursion removal and the like, but scant little on this particular issue. The above code doesn't actually have an error in Java CC, but is the cause of further ones as below:
void program():{}
{
( decl() )* //error here - Expansion within "(...)*" can be matched by empty string
( function() )*
main_prog()
}
void main_prog(): {}
{
< MAIN >
( decl() )* //same error on this line
(statement() < SCOLON >)*
}
void function(): {}
{
type() < ID > <LPARENT > param_list() < RPARENT >
( decl() )* //same error on this line
( statement() < SCOLON > )*
< RET> ( expression() | {} ) <SCOLON > // {} is epsilon
}
Any ideas on how to go about fixing this would be very much appreciated.

As it stands your grammar is ambiguous - it says that a decl means zero or more declarations, and there are a number of places where you allow zero or more decls. You don't need * in both these places, just pick one or the other, either approach will parse the same programs but they're conceptually slightly different.
You could take out the * in decl:
decl := ( var_decl | const_decl )
program := ( decl )*
( function ) *
main_prog
so decl represents a single declaration, and a program may start with a sequence of decls but doesn't have to. Alternatively you could leave the * in decl but take it out from the places where you reference it:
decl := ( var_decl | const_decl )*
program := decl
( function ) *
main_prog
so now decl represents something like a "declarations block" rather than a single declaration - every program must start with a declarations block but that block is itself allowed to be empty.

How to iterate over a production in ANTLR

Lets suppose the following scenarios with 2 ANTLR grammars:
1)
expr : antExp+;
antExpr : '{' T '}' ;
T : 'foo';
2)
expr : antExpr;
antExpr : '{' T* '}' ;
T : 'bar';
In both cases I need to know how to iterate over antExp+ and T*, because I need to generate an ArrayList of each element of them. Of course my grammar is more complex, but I think that this example should explain what I'm needing. Thank you!

Production rules in ANTLR can have one or more return types which you can reference inside a loop (a (...)* or (...)+). So, let's say you want to print each of the T's text the antExp rule matches. This could be done like this:
expr
: (antExp {System.out.println($antExp.str);} )+
;
antExpr returns [String str]
: '{' T '}' {$str = $T.text;}
;
T : 'foo';
The same principle holds for example grammar #2:
expr : antExpr;
antExpr : '{' (T {System.out.println($T.text);} )* '}' ;
T : 'bar';
EDIT
Note that you're not restricted to returning a single reference. Running the parser generated from:
grammar T;
parse
: ids {System.out.println($ids.firstId + "\n" + $ids.allIds);}
;
ids returns [String firstId, List<String> allIds]
#init{$allIds = new ArrayList<String>();}
#after{$firstId = $allIds.get(0);}
: (ID {$allIds.add($ID.text);})+
;
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : ' ' {skip();};
on the input "aaa bbb ccc" would print the following:
aaa
[aaa, bbb, ccc]

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Ambiguity in ANTLR grammar - java

Related

How to fix the error in left-recursion used with semantic predicates?

Antlr4 regular expression grammar, data structure for NFA transition table

Antlr4 doesn't recognize identifiers

Java CC issue - "Expansion within "(...)*" can be matched by empty string"

How to iterate over a production in ANTLR

Categories

Resources