Java CC issue - "Expansion within "(...)*" can be matched by empty string"

We've been given a grammar to patch up and parse using JavaCC. One of the problems with it is several occurrences of the error "Expansion within "(...)*" can be matched by empty string". I understand this error is caused when something that can be matched zero or more times sits inside something else that can be matched zero or more times.
What I don't understand is how to fix it. (Our instructor hasn't been saying much beyond "You have to be careful how you word it.")
The problem area of the grammar, along with its associated Java CC code is shown below. Any ideas or advice would be greatly appreciated.
program := ( decl )*
( function ) *
main_prog
decl := ( var_decl | const_decl )*
var_decl := var ident_list : type ( , ident_list : type)* ;
const_decl := const identifier : type = expression ( , identifier : type = expression)* ;
function :=
type identifier ( param_list)
( decl )*
( statement ; )*
return ( expression | e ); //e is greek epsilon character
main_prog :=
main
( decl ) *
(statement ; )*
The issue is with the way decl is declared I think. It is declared here in actual Java CC code:
void decl():{}
{
( var_decl() | const_decl())*
}
If I change that Kleene closure above to + , all the other errors caused by this go away. However the instructor says the star should remain, and we need to be careful how we word it. I've found lots of resources on left factoring, left recursion removal and the like, but scant little on this particular issue. The above code doesn't actually have an error in Java CC, but is the cause of further ones as below:
void program():{}
{
( decl() )* //error here - Expansion within "(...)*" can be matched by empty string
( function() )*
main_prog()
}
void main_prog(): {}
{
< MAIN >
( decl() )* //same error on this line
(statement() < SCOLON >)*
}
void function(): {}
{
type() < ID > <LPARENT > param_list() < RPARENT >
( decl() )* //same error on this line
( statement() < SCOLON > )*
< RET> ( expression() | {} ) <SCOLON > // {} is epsilon
}
Any ideas on how to go about fixing this would be very much appreciated.

As it stands your grammar is ambiguous - it says that a decl means zero or more declarations, and there are a number of places where you allow zero or more decls. You don't need * in both these places, just pick one or the other, either approach will parse the same programs but they're conceptually slightly different.
You could take out the * in decl:
decl := ( var_decl | const_decl )
program := ( decl )*
( function ) *
main_prog
so decl represents a single declaration, and a program may start with a sequence of decls but doesn't have to. Alternatively you could leave the * in decl but take it out from the places where you reference it:
decl := ( var_decl | const_decl )*
program := decl
( function ) *
main_prog
so now decl represents something like a "declarations block" rather than a single declaration - every program must start with a declarations block but that block is itself allowed to be empty.
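For intuition, here is a minimal hand-written sketch in plain Java of the first option, where decl consumes exactly one declaration and the caller does the looping. This is not JavaCC output: the class name, the method names and the one-token "declarations" are all made up for illustration, and functions are left out. If decl() could succeed without consuming a token, the while loop would never advance, which is the situation the JavaCC error is guarding against.

```java
import java.util.List;

/** Hypothetical sketch: tokens are pre-split strings and each declaration is reduced to one keyword. */
final class DeclLoopSketch {
    private final List<String> tokens;
    private int pos = 0;

    DeclLoopSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    /** decl := var_decl | const_decl   (exactly one declaration, never empty) */
    private boolean decl() {
        if (pos < tokens.size()
                && (tokens.get(pos).equals("var") || tokens.get(pos).equals("const"))) {
            pos++;            // consume the declaration (simplified to a single token)
            return true;
        }
        return false;         // nothing consumed, so the caller's loop stops
    }

    /** program := ( decl )* main_prog   (the repetition lives here, not inside decl) */
    int parseProgram() {
        int declCount = 0;
        while (decl()) {      // every successful iteration consumes at least one token
            declCount++;
        }
        if (pos >= tokens.size() || !tokens.get(pos).equals("main")) {
            throw new IllegalStateException("expected 'main' at token " + pos);
        }
        pos++;
        return declCount;
    }

    public static void main(String[] args) {
        System.out.println(new DeclLoopSketch(List.of("var", "const", "main")).parseProgram());
        System.out.println(new DeclLoopSketch(List.of("main")).parseProgram());
    }
}
```

The second option from the answer would move the while loop into decl() and have parseProgram() call it exactly once; either way only one of the two levels repeats.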

Related

How to fix the error in left-recursion used with semantic predicates?

I would like to parse two types of boolean expression:
- the first is an init expression with a boolean, like: init : false
- the second is a derive expression with a boolean, like: derive : !express or (express and (amount >= 100))
My idea is to put semantic predicates in a set of rules.
The goal is that when I'm parsing a boolean expression beginning with the word 'init', only one alternative is available, namely boolliteral, the last alternative in boolExpression. If the expression begins with the word 'derive', all alternatives of boolExpression are available.
I know that I could write two kinds of boolExpression without semantic predicates, such as boolExpressionInit and boolExpressionDerive. But I would like to find out whether my idea can work with a single boolExpression and semantic predicates.
Here's my grammar
grammar TestExpression;
@header
{
package testexpressionparser;
}
@parser::members {
int vConstraintType;
}
/* SYNTAX RULES */
textInput : initDefinition
| derDefinition ;
initDefinition : t=INIT {vConstraintType = $t.type;} ':' boolExpression ;
derDefinition : t=DERIVE {vConstraintType = $t.type;} ':' boolExpression ;
boolExpression : {vConstraintType != INIT || vConstraintType == DERIVE}? boolExpression (boolOp|relOp) boolExpression
| {vConstraintType != INIT || vConstraintType == DERIVE}? NOT boolExpression
| {vConstraintType != INIT || vConstraintType == DERIVE}? '(' boolExpression ')'
| {vConstraintType != INIT || vConstraintType == DERIVE}? attributeName
| {vConstraintType != INIT || vConstraintType == DERIVE}? numliteral
| {vConstraintType == INIT || vConstraintType == DERIVE}? boolliteral
;
boolOp : OR | AND ;
relOp : EQ | NEQ | GT | LT | GEQT | LEQT ;
attributeName : WORD;
numliteral : intliteral | decliteral;
intliteral : INT ;
decliteral : DEC ;
boolliteral : BOOLEAN;
/* LEXICAL RULES */
INIT : 'init';
DERIVE : 'derive';
BOOLEAN : 'true' | 'false' ;
BRACKETSTART : '(' ;
BRACKETSTOP : ')' ;
BRACESTART : '{' ;
BRACESTOP : '}' ;
EQ : '=' ;
NEQ : '!=' ;
NOT : '!' ;
GT : '>' ;
LT : '<' ;
GEQT : '>=' ;
LEQT : '<=' ;
OR : 'or' ;
AND : 'and' ;
DEC : [0-9]* '.' [0-9]* ;
INT : ZERO | POSITIF;
ZERO : '0';
POSITIF : [1-9] [0-9]* ;
WORD : [a-zA-Z] [_0-9a-zA-Z]* ;
WS : (SPACE | NEWLINE)+ -> skip ;
SPACE : [ \t] ; /* Space or tab */
NEWLINE : '\r'? '\n' ; /* Carriage return and new line */
I expected the grammar to build successfully, but what I receive is: "error(119): TestExpression.g4::: The following sets of rules are mutually left-recursive [boolExpression]
1 error(s)
BUILD FAIL"

Apparently ANTLR4's support for (direct) left-recursion does not work when a predicate appears before a left-recursive rule invocation. So you can fix the error by moving the predicate after the first boolExpression in the left-recursive alternatives.
That said, it seems like the predicates aren't really necessary in the first place - at least not in the example you've shown us (or the one before your edit as far as I could tell). Since a boolExpression with the constraint type INIT can apparently only match boolliteral, you can just change initDefinition as follows:
initDefinition : t=INIT ':' boolliteral ;
Then boolExpression will always have the constraint type DERIVE and no predicates are necessary anymore.
Generally, if you want to allow different alternatives in non-terminal x based on whether it was invoked by y or z, you should simply have multiple versions of x and then call one from y and the other from z. That's usually a lot less hassle than littering the code with actions and predicates.
Similarly it can also make sense to have a rule that matches more than it should and then detect illegal expressions in a later phase instead of trying to reject them at the syntax level. Specifically beginners often try to write grammars that only allow well-typed expressions (rejecting something like 1+true with a syntax error) and that never works out well.

ANTLR4 AST Creation - How to create an AstVistor

With the help of this SO question How to create AST with ANTLR4? I was able to create the AST Nodes, but I'm stuck at coding the BuildAstVisitor as depicted in the accepted answer's example.
I have a grammar that starts like this:
mini: (constDecl | varDef | funcDecl | funcDef)* ;
I can't assign a label to the block (antlr4 says label X assigned to a block which is not a set), and I have no idea how to visit the next node.
public Expr visitMini(MiniCppParser.MiniContext ctx) {
return visitConstDecl(ctx.constDecl());
}
I have the following problems with the code above: I don't know how to decide whether it's a constDecl, varDef or any other option and ctx.constDecl() returns a List<ConstDeclContext> whereas I only need one element for the visitConstDecl function.
edit:
More grammar rules:
mini: (constDecl | varDef | funcDecl | funcDef)* ;
//--------------------------------------------------
constDecl: 'const' type ident=ID init ';' ;
init: '=' ( value=BOOLEAN | sign=('+' | '-')? value=NUMBER ) ;
// ...
//--------------------------------------------------
OP_ADD: '+';
OP_SUB: '-';
OP_MUL: '*';
OP_DIV: '/';
OP_MOD: '%';
BOOLEAN : 'true' | 'false' ;
NUMBER : '-'? INT ;
fragment INT : '0' | [1-9] [0-9]* ;
ID : [a-zA-Z]+ ;
// ...
I'm still not entirely sure on how to implement the BuildAstVisitor. I now have something along the lines of the following, but it certainly doesn't look right to me...
@Override
public Expr visitMini(MiniCppParser.MiniContext ctx) {
for (MiniCppParser.ConstDeclContext constDeclCtx : ctx.constDecl()) {
visit(constDeclCtx);
}
return null;
}
@Override
public Expr visitConstDecl(MiniCppParser.ConstDeclContext ctx) {
visit(ctx.type());
return visit(ctx.init());
}

If you want to get the individual subrules then implement the visitXXX functions for them (visitConstDecl(), visitVarDef() etc.) instead of the visitMini() function. They will only be called if there's really a match for them in the input. Hence you don't need to do any checks for occurrences.
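The dispatch this relies on can be sketched without ANTLR at all. Everything below is a hypothetical stand-in (Node, BaseVisitor, CollectingVisitor and the two node classes are invented for illustration and are not the generated MiniCpp classes): the base visitor's visitMini just walks the children, each child calls its own visit method, and the subclass overrides only the per-rule methods.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for generated parse-tree classes; none of these names come from ANTLR. */
final class VisitorDispatchSketch {
    interface Node {
        void accept(BaseVisitor visitor);
    }

    static final class ConstDecl implements Node {
        final String name;
        ConstDecl(String name) { this.name = name; }
        @Override public void accept(BaseVisitor visitor) { visitor.visitConstDecl(this); }
    }

    static final class VarDef implements Node {
        final String name;
        VarDef(String name) { this.name = name; }
        @Override public void accept(BaseVisitor visitor) { visitor.visitVarDef(this); }
    }

    /** Plays the role of a generated base visitor: by default it only walks the children. */
    static class BaseVisitor {
        void visitMini(List<Node> children) {
            for (Node child : children) {
                child.accept(this);   // each child picks its own visit method
            }
        }
        void visitConstDecl(ConstDecl node) { }
        void visitVarDef(VarDef node) { }
    }

    /** Only the per-rule methods are overridden; visitMini is left alone. */
    static final class CollectingVisitor extends BaseVisitor {
        final List<String> seen = new ArrayList<>();
        @Override void visitConstDecl(ConstDecl node) { seen.add("const " + node.name); }
        @Override void visitVarDef(VarDef node) { seen.add("var " + node.name); }
    }

    static List<String> run(List<Node> children) {
        CollectingVisitor visitor = new CollectingVisitor();
        visitor.visitMini(children);
        return visitor.seen;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(new VarDef("x"), new ConstDecl("PI"), new VarDef("y"))));
    }
}
```

No check for "which alternative matched" appears anywhere: a visit method simply never runs if its rule did not occur in the input.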

Antlr4 doesn't recognize identifiers

I'm trying to create a grammar which parses a file line by line.
grammar Comp;
options
{
language = Java;
}
@header {
package analyseur;
import java.util.*;
import component.*;
}
@parser::members {
/** Line to write in the new java file */
public String line;
}
start
: objectRule {System.out.println("OBJ"); line = $objectRule.text;}
| anyString {System.out.println("ANY"); line = $anyString.text;}
;
objectRule : ObjectKeyword ID ;
anyString : ANY_STRING ;
ObjectKeyword : 'Object' ;
ID : [a-zA-Z]+ ;
ANY_STRING : (~'\n')+ ;
WhiteSpace : (' '|'\t') -> skip;
When I feed the input 'Object o' to the grammar, the output is ANY instead of OBJ.
'Object o' => 'ANY' // I would like OBJ
I know ANY_STRING gives the longer match, but I wrote the lexer rules in priority order. What is the problem?
Thank you very much for your help ! ;)

For lexer rules, the rule with the longest match wins, independent of rule ordering. If the match length is the same, then the first listed rule wins.
To make rule order meaningful, reduce the possible match length of the ANY_STRING rule to be the same or less than any key word or id:
ANY_STRING: ~( ' ' | '\n' | '\t' ) ; // also?: '\r' | '\f' | '_'
Update
To see what the lexer is actually doing, dump the token stream.
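The longest-match rule with first-listed tie-breaking can be imitated in a few lines of plain Java. This is only a sketch built on java.util.regex, not how ANTLR is implemented; the class LongestMatchSketch and its helpers are hypothetical, and the regexes are hand translations of the lexer rules above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical imitation of the longest-match rule; this is not ANTLR's implementation. */
final class LongestMatchSketch {
    /** Returns "RULE:text" for each token; rules are tried in insertion order. */
    static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // A strictly longer match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at position " + pos);
            }
            if (!bestRule.equals("WhiteSpace")) {   // mimic "-> skip"
                tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    /** Builds the rule table in grammar order, with a configurable ANY_STRING pattern. */
    static Map<String, Pattern> rules(String anyStringRegex) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("ObjectKeyword", Pattern.compile("Object"));
        rules.put("ID", Pattern.compile("[a-zA-Z]+"));
        rules.put("ANY_STRING", Pattern.compile(anyStringRegex));
        rules.put("WhiteSpace", Pattern.compile("[ \\t]"));
        return rules;
    }

    public static void main(String[] args) {
        // Original rule, (~'\n')+ : it swallows the whole line.
        System.out.println(tokenize("Object o", rules("[^\\n]+")));
        // Restricted rule: a single non-whitespace character.
        System.out.println(tokenize("Object o", rules("[^ \\n\\t]")));
    }
}
```

With the original ANY_STRING the first call yields a single ANY_STRING token covering "Object o"; with the restricted version the second call yields ObjectKeyword followed by ID, because ANY_STRING can no longer out-match the rules listed before it.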

How to match a multiline regex in Clojure to parse a Groovy source file?

I'm trying to run a Clojure regex on a Groovy source file to parse out the individual functions.
// gremlin.groovy
def warm_cache() {
for (vertex in g.getVertices()) {
vertex.getOutEdges()
}
}
def clear() {
g.clear()
}
This is the pattern I'm using in Clojure:
(def source (read-file "gremlin.groovy"))
(def pattern #"(?m)^def.*[^}]")
(re-seq pattern source)
However, it's only grabbing the first line, not the multiline func.

As a demonstration of how you can grab the AST from the GroovyRecognizer, and avoid having to cope with trying to parse a language using regular expressions, you can do this in Groovy:
import org.codehaus.groovy.antlr.*
import org.codehaus.groovy.antlr.parser.*
def code = '''
// gremlin.groovy
def warm_cache() {
for (vertex in g.getVertices()) {
vertex.getOutEdges()
}
}
def clear() {
g.clear()
}
'''
def ast = new GroovyRecognizer( new GroovyLexer( new StringReader( code ) ).plumb() ).with { p ->
p.compilationUnit()
p.AST
}
while( ast ) {
println ast.toStringTree()
ast = ast.nextSibling
}
That prints out the AST for each GroovySourceAST node in the AST, giving you (for this example):
( METHOD_DEF MODIFIERS TYPE warm_cache PARAMETERS ( { ( for ( in vertex ( ( ( . g getVertices ) ELIST ) ) ( { ( EXPR ( ( ( . vertex getOutEdges ) ELIST ) ) ) ) ) )
( METHOD_DEF MODIFIERS TYPE clear PARAMETERS ( { ( EXPR ( ( ( . g clear ) ELIST ) ) ) )
You should be able to do the same thing with Clojure's java interop and the groovy-all jar file
Edit
To get a bit more info, you just need to drill down into the AST and manipulate the input script a bit. Changing the while loop in the above code to:
while( ast ) {
if( ast.type == GroovyTokenTypes.METHOD_DEF ) {
println """Lines $ast.line to $ast.lineLast
| Name: $ast.firstChild.nextSibling.nextSibling.text
| Code: ${code.split('\n')[ (ast.line-1)..<ast.lineLast ]*.trim().join( ' ' )}
| AST: ${ast.toStringTree()}""".stripMargin()
}
ast = ast.nextSibling
}
prints out:
Lines 4 to 8
Name: warm_cache
Code: def warm_cache() { for (vertex in g.getVertices()) { vertex.getOutEdges() } }
AST: ( METHOD_DEF MODIFIERS TYPE warm_cache PARAMETERS ( { ( for ( in vertex ( ( ( . g getVertices ) ELIST ) ) ( { ( EXPR ( ( ( . vertex getOutEdges ) ELIST ) ) ) ) ) )
Lines 10 to 12
Name: clear
Code: def clear() { g.clear() }
AST: ( METHOD_DEF MODIFIERS TYPE clear PARAMETERS ( { ( EXPR ( ( ( . g clear ) ELIST ) ) ) )
Obviously, the Code: section is just the lines joined back together, so might not work if pasted back into groovy, but they give you an idea of the original code...

It's your regex, not Clojure. You request to match def, then anything, then one char that is not equal to the closing brace. That char can be anywhere. What you want to achieve is this: (?sm)def.*?^}.
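Since Clojure regexes are java.util.regex patterns underneath, the suggested pattern can be tried out in plain Java. The sketch below is an assumption-laden illustration: the class name is made up, a leading ^ has been added before def, the closing brace is escaped, and the sample source is indented so that only top-level closing braces sit at the start of a line (the pattern depends on that).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that splits a source string into top-level "def" blocks. */
final class GroovyFunctionRegexSketch {
    // (?s): '.' also matches line terminators; (?m): '^' matches at the start of every line.
    // The lazy ".*?" stops at the first line that begins with '}'.
    private static final Pattern FUNCTION = Pattern.compile("(?sm)^def.*?^\\}");

    static List<String> functions(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = FUNCTION.matcher(source);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String source = String.join("\n",
            "// gremlin.groovy",
            "def warm_cache() {",
            "  for (vertex in g.getVertices()) {",
            "    vertex.getOutEdges()",
            "  }",
            "}",
            "def clear() {",
            "  g.clear()",
            "}");
        for (String function : functions(source)) {
            System.out.println(function);
            System.out.println("----");
        }
    }
}
```

The same pattern string should carry over to a Clojure regex literal, though a source file whose nested closing braces are not indented would be cut short at the first one.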

Short answer
(re-seq (Pattern/compile "(?m)^def.*[^}]" Pattern/MULTILINE) source)
From http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.
You need to be able to pass in
Pattern.MULTILINE
when the pattern is compiled. But there is no option for this on re-seq, so you'll probably need to drop down into Java interop to get this to work properly? Ideally, you really should be able to specify this in Clojure land... :(
UPDATE:
Actually, it's not all that bad. Instead of using the literal expression for a regex, just use Java interop for your pattern. Use (re-seq (Pattern/compile "(?m)^def.*[^}]" Pattern/MULTILINE) source) instead (assuming that you've imported java.util.regex.Pattern). I haven't tested this, but I think that will do the trick for you.

How to iterate over a production in ANTLR

Let's suppose the following scenarios with 2 ANTLR grammars:
1)
expr : antExpr+;
antExpr : '{' T '}' ;
T : 'foo';
2)
expr : antExpr;
antExpr : '{' T* '}' ;
T : 'bar';
In both cases I need to know how to iterate over antExpr+ and T*, because I need to generate an ArrayList of each element of them. Of course my grammar is more complex, but I think that this example should explain what I need. Thank you!

Production rules in ANTLR can have one or more return types which you can reference inside a loop (a (...)* or (...)+). So, let's say you want to print each of the T's text the antExpr rule matches. This could be done like this:
expr
: (antExpr {System.out.println($antExpr.str);} )+
;
antExpr returns [String str]
: '{' T '}' {$str = $T.text;}
;
T : 'foo';
The same principle holds for example grammar #2:
expr : antExpr;
antExpr : '{' (T {System.out.println($T.text);} )* '}' ;
T : 'bar';
EDIT
Note that you're not restricted to returning a single reference. Running the parser generated from:
grammar T;
parse
: ids {System.out.println($ids.firstId + "\n" + $ids.allIds);}
;
ids returns [String firstId, List<String> allIds]
@init{$allIds = new ArrayList<String>();}
@after{$firstId = $allIds.get(0);}
: (ID {$allIds.add($ID.text);})+
;
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : ' ' {skip();};
on the input "aaa bbb ccc" would print the following:
aaa
[aaa, bbb, ccc]
