I apologize for the extremely long explanation, but I've been stuck for a month now and I really can't figure out how to solve this.
As a project, I have to build a compiler with ANTLR4 for a regex grammar: given a regex as input, the compiler must generate a Java program that decides whether a word belongs to the language generated by that regex.
The grammar that we have to use is this one:
RE ::= union | simpleRE
union ::= simpleRE + RE
simpleRE ::= concatenation | basicRE
concatenation ::= basicRE simpleRE
basicRE ::= group | any | char
group ::= (RE) | (RE)* | (RE)+
any ::= ?
char ::= a | b | c | ... | z | A | B | C | ... | Z | 0 | 1 | 2 | ... | 9 | . | - | _
and from that I derived this grammar for ANTLR4:
Regxp.g4
grammar Regxp;
start_rule
: re # start
;
re
: union
| simpleRE
;
union
: simpleRE '+' re # unionOfREs
;
simpleRE
: concatenation
| basicRE
;
concatenation
: basicRE simpleRE #concatOfREs
;
basicRE
: group
| any
| cHAR
;
group
: LPAREN re RPAREN '*' # star
| LPAREN re RPAREN '+' # plus
| LPAREN re RPAREN # singleWithParenthesis
;
any
: '?'
;
cHAR
: CHAR #singleChar
;
WS : [ \t\r\n]+ -> skip ;
LPAREN : '(' ;
RPAREN : ')' ;
CHAR : LETTER | DIGIT | DOT | D | UNDERSCORE
;
/* tokens */
fragment LETTER: [a-zA-Z]
;
fragment DIGIT: [0-9]
;
fragment DOT: '.'
;
fragment D: '-'
;
fragment UNDERSCORE: '_'
;
Then I generated the Java files from ANTLR4 with visitors.
As far as I understand the logic of the project, while the visitor traverses the parse tree it has to generate lines of code that fill the transition table of the NFA obtained by applying Thompson's construction rules to the input regex.
These lines of code are then saved as a .java text file and compiled into a program that takes a string (word) as input and tells whether or not the word belongs to the language generated by the regex.
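Concretely, the export step could look roughly like the sketch below; the Matcher skeleton and the TransitionTable.accepts(String) method are assumptions, things I still have to write:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch of writing the collected lines into a compilable Java source file.
public class CodeEmitter {
    public static void emit(List<String> generatedCode) throws Exception {
        List<String> out = new ArrayList<>();
        out.add("public class Matcher {");
        out.add("    public static void main(String[] args) {");
        out.add("        TransitionTable tTable = new TransitionTable();");
        out.addAll(generatedCode); // the lines produced by the visitor, e.g. tTable.addUnion(...)
        out.add("        System.out.println(tTable.accepts(args[0]) ? \"OK\" : \"KO\");");
        out.add("    }");
        out.add("}");
        Files.write(Paths.get("Matcher.java"), out); // then compile with javac and run on a word
    }
}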
The result should be like this:
RE     word    Result
a+b    a       OK
       b       OK
       ac      KO
a*b    aab     OK
       b       OK
       aaaab   OK
       abb     KO
So my question is: how can I represent the transition table in a way that lets it be filled during the visit of the parse tree and then exported, so that it can be used by a simple Java program implementing the acceptance algorithm for an NFA? (I'm considering this pseudocode:)
S = ε−closure(s0);
c = nextChar();
while (c ≠ eof) do
S = ε−closure(move(S,c));
c = nextChar();
end while
if (S ∩ F ≠ ∅) then return “yes”;
else return “no”;
end if
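(For concreteness, a minimal self-contained Java version of that loop could look like the sketch below; the adjacency-map representation and every name in it are just an assumption for illustration, not my actual table layout.)
import java.util.*;

// Sketch of the acceptance algorithm above on an assumed NFA representation.
class NfaRunner {
    // state -> (input symbol -> target states)
    Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();
    // state -> epsilon target states
    Map<Integer, Set<Integer>> eps = new HashMap<>();
    int start;
    Set<Integer> accepting = new HashSet<>();

    // epsilon-closure(S): every state reachable from S through epsilon edges only
    Set<Integer> epsilonClosure(Set<Integer> states) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> closure = new HashSet<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : eps.getOrDefault(s, Set.of())) {
                if (closure.add(t)) work.push(t); // newly reached via epsilon
            }
        }
        return closure;
    }

    // move(S, c): every state reachable from S on the symbol c
    Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> result = new HashSet<>();
        for (int s : states) {
            result.addAll(delta.getOrDefault(s, Map.of()).getOrDefault(c, Set.of()));
        }
        return result;
    }

    boolean accepts(String word) {
        Set<Integer> current = epsilonClosure(Set.of(start));
        for (char c : word.toCharArray()) {
            current = epsilonClosure(move(current, c));
        }
        current.retainAll(accepting); // S intersected with F
        return !current.isEmpty();    // non-empty => "yes"
    }
}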
So far I have managed to make the visitor do something like the following when, for example, it reaches the unionOfREs rule:
MyVisitor.java
private List<String> generatedCode = new ArrayList<String>();
/* ... */
@Override
public String visitUnionOfREs(RegxpParser.UnionOfREsContext ctx) {
System.out.println("unionOfRExps");
String char1 = visit(ctx.simpleRE());
String char2 = visit(ctx.re());
generatedCode.add("tTable.addUnion("+char1+","+char2+");");
//then this line of code will populate the transition table
return char1+"+"+char2;
}
/* ... */
The addUnion method is inside a Java file that will contain all the methods to fill the transition table. I wrote the code for the union, but I don't like it, because it amounts to writing out the NFA's transition table exactly as you would write it on paper: example.
I got to this when I noticed that, by building the table iteratively, you can define two "pointers" into the table, currentBeginning and currentEnd, which tell you where to expand again the character written in the table when the visitor finds the next rule in the parse tree, since that character can be either another production or just a single character. The link shows the written-on-paper example that convinced me to use this approach.
TransitionTable.java
/* ... */
public void addUnion(String char1, String char2) {
if (transitionTable.isEmpty()) {
List<List<Integer>> lc1 = Arrays.asList(Arrays.asList((Integer) null)
,Arrays.asList(currentBeginning+3)
,Arrays.asList((Integer) null)
,Arrays.asList((Integer) null)
,Arrays.asList((Integer) null)
,Arrays.asList((Integer) null));
List<List<Integer>> lc2 = Arrays.asList(Arrays.asList((Integer) null)
,Arrays.asList((Integer) null)
,Arrays.asList(currentBeginning+4)
,Arrays.asList((Integer) null)
,Arrays.asList((Integer) null)
,Arrays.asList((Integer) null));
List<List<Integer>> le = Arrays.asList(Arrays.asList(currentBeginning+1,currentBeginning+2)
,Arrays.asList((Integer) null)
,Arrays.asList((Integer) null)
,Arrays.asList(currentBeginning+5)
,Arrays.asList(currentBeginning+5)
,Arrays.asList((Integer) null));
transitionTable.put(char1, lc1);
transitionTable.put(char2, lc2);
transitionTable.put("epsilon", le);
//currentBeginning += 2;
//currentEnd = transitionTable.get(char2).get(currentBeginning).get(0);
currentEnd = transitionTable.get("epsilon").size()-1; // state 5
} else { //not the first time it encounters this rule, beginning and end changed
//needs to add 2 less states
}
}
/* ... */
At the moment I'm representing the transition table as a HashMap<String, List<List<Integer>>>: the String keys are the characters on the edges of the NFA, and the values are List<List<Integer>> because, the automaton being non-deterministic, a single state may need several transitions on the same character.
But going this way, for a parse tree like this I will obtain, for the union, a generated line of code such as: "tTable.addUnion("tTable.addConcat(a,b)","+char2+");"
And I'm blocked here: I don't know how to solve this, and I really can't think of a different way to represent the transition table or to fill it while visiting the parse tree.
Thank You.
Using Thompson's construction, every regular (sub-)expression produces an NFA, and every regular-expression operator (union, concatenation, *) can be implemented by adding a couple of states and connecting them to states that already exist. See:
https://en.wikipedia.org/wiki/Thompson%27s_construction
So, when parsing the regex, every terminal or non-terminal production should add the required states and transitions to the NFA, and return its start and end state to the containing production. Non-terminal productions will combine their children and return their own start+end states so that your NFA can be built from the leaves of the regular expression up.
The representation of the state table is not critical for building. Thompson's construction will never require you to modify a state or transition that you built before, so you just need to be able to add new ones. You will also never need more than one transition from a state on the same character, or even more than one non-epsilon transition. In fact, if all your operators are binary you will never need more than 2 transitions on a state. Usually the representation is designed to make it easy to do the next steps, like DFA generation or direct execution of the NFA against strings.
For example, a class like this can completely represent a state:
class State
{
public char matchChar;
public State matchState; //where to go if you match matchChar, or null
public State epsilon1; //or null
public State epsilon2; //or null
}
This would actually be a pretty reasonable representation for directly executing an NFA. But if you already have code for directly executing an NFA, then you should probably just build whatever it uses so you don't have to do another transformation.
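To make the leaves-up combination concrete, here is a rough sketch of fragment builders on top of that State class. The Frag wrapper and the builder names are assumptions, not a finished implementation, and marking the final accepting state is left out:
// Sketch only: each builder returns the start and end of the sub-NFA it created,
// so the enclosing production can wire the returned fragments together.
class Frag {
    State start;
    State end;   // the fragment's single dangling end state
    Frag(State s, State e) { start = s; end = e; }
}

class NfaBuilder {
    Frag literal(char c) {               // a single character
        State s = new State();
        State e = new State();
        s.matchChar = c;
        s.matchState = e;
        return new Frag(s, e);
    }

    Frag concat(Frag a, Frag b) {        // a then b: glue a's end to b's start
        a.end.epsilon1 = b.start;
        return new Frag(a.start, b.end);
    }

    Frag union(Frag a, Frag b) {         // a + b: new start/end with epsilon edges
        State s = new State();
        State e = new State();
        s.epsilon1 = a.start;
        s.epsilon2 = b.start;
        a.end.epsilon1 = e;
        b.end.epsilon1 = e;
        return new Frag(s, e);
    }

    Frag star(Frag a) {                  // (a)* : loop back, and allow skipping a entirely
        State s = new State();
        State e = new State();
        s.epsilon1 = a.start;
        s.epsilon2 = e;
        a.end.epsilon1 = a.start;
        a.end.epsilon2 = e;
        return new Frag(s, e);
    }

    Frag plus(Frag a) {                  // (a)+ : like star, but a must occur at least once
        State e = new State();
        a.end.epsilon1 = a.start;
        a.end.epsilon2 = e;
        return new Frag(a.start, e);
    }
}
Each visitXXX in your visitor would then return the Frag for its sub-expression, so for example unionOfREs simply combines the fragments returned by its two children instead of concatenating strings.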
Related
I have a problem with the rule mnemonic_format.
Instead of recognizing simple text like A100, it gives the following error:
mismatched input 'A100' expecting 'A'
The grammar is:
grammar SimpleMathGrammar;
INTEGER : [0-9]+;
FLOAT : [0-9]+ '.' [0-9]+;
ADD : '+';
SUB : '-';
DOT : '.';
AND : 'AND';
BACKSLASH : '\\';
fragment SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
VARIABLENAME
: SINGLELETTER
| (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
HASH : '#';
/* PARSER */
operation
: (INTEGER | FLOAT) ADD (INTEGER | FLOAT)
| (INTEGER | FLOAT) SUB (INTEGER | FLOAT);
operation_with_backslash : BACKSLASH operation BACKSLASH;
mnemonic: HASH VARIABLENAME HASH;
mnemonic_format
// Example: A100
: 'A' INTEGER;
At this point, I know that the token VARIABLENAME should not swallow the character A (correct me if I'm wrong).
So what can I do to include a single character (or a fixed sequence) in a distinct rule? (And what is my error?)
EDIT: I found the origin of the problem (by removing all of the other tokens and rules) in the following token definition:
VARIABLENAME: (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
So how can I create a token or lexer rule generic enough to detect arbitrary text (like a class name or a variable name) while also creating rules that must accept a fixed sequence of characters?
OK, the trick was the "general scope" of the token VARIABLENAME.
In other words, the token is too generic.
In my case the sub-pattern VARIABLENAME: SINGLELETTER NUMBER* collides with the rule mnemonic_format: 'A' INTEGER.
(Indeed, the string A100 can be produced either by VARIABLENAME or by mnemonic_format, and this creates an ambiguity.)
So I "specialized" VARIABLENAME to require a prefix, for example:
VARIABLENAME
: HASH (SINGLELETTER|UNDERSCORE)(SINGLELETTER|UNDERSCORE|DOLLAR|NUMBER)*
| 'class ' (SINGLELETTER|UNDERSCORE)(SINGLELETTER|UNDERSCORE|DOLLAR|NUMBER)*
...
This should avoid the ambiguity between the token and the parser rule.
I have to make a calculator in Java using ANTLR, but when I try to calculate the square root using the command s 4 it shows me: no viable alternative at input 's4'.
I really need your help with this. I have tried everything and I don't know what is wrong.
This is my grammar:
grammar Hello;
r : r SEMI r EOF
| r SEMI
| plus_op
| minus_op
| sqrt_op;
// match keyword hello followed by an identifier
ID : [a-z]+ ; // match lower-case identifiers
NUM : [0-9];
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
ADD : '+';
MINUS : '-';
SEMI: ';';
SQRT: 's';
plus_token: NUM | ID;
minus_token: NUM | ID;
sqrt_token: NUM;
plus_op : plus_token ADD plus_token;
minus_op : minus_token MINUS minus_token;
sqrt_op: SQRT sqrt_token;
SQRT will never be matched because ID matches the same input. When two lexer rules match input of the same length, ANTLR does not try all of them; it uses the rule that is defined first, and here ID is defined before SQRT.
The problem should be solved if SQRT is being defined before ID in the grammar.
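After reordering, you can check which rule wins by looking at the first token the lexer produces. A quick sketch, assuming a recent ANTLR 4 runtime (CharStreams) and the HelloLexer generated from the grammar above:
import org.antlr.v4.runtime.*;

public class CheckSqrtToken {
    public static void main(String[] args) {
        HelloLexer lexer = new HelloLexer(CharStreams.fromString("s 4;"));
        Token first = lexer.nextToken(); // WS is skipped, so this is the token for 's'
        System.out.println(lexer.getVocabulary().getSymbolicName(first.getType()));
        // Prints ID with the original rule order, SQRT once SQRT is moved above ID.
    }
}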
With the help of this SO question How to create AST with ANTLR4? I was able to create the AST Nodes, but I'm stuck at coding the BuildAstVisitor as depicted in the accepted answer's example.
I have a grammar that starts like this:
mini: (constDecl | varDef | funcDecl | funcDef)* ;
And I can neither assign a label to the block (ANTLR4 says label X assigned to a block which is not a set) nor figure out how to visit the next node.
public Expr visitMini(MiniCppParser.MiniContext ctx) {
return visitConstDecl(ctx.constDecl());
}
I have the following problems with the code above: I don't know how to decide whether it's a constDecl, varDef or any other option and ctx.constDecl() returns a List<ConstDeclContext> whereas I only need one element for the visitConstDecl function.
edit:
More grammar rules:
mini: (constDecl | varDef | funcDecl | funcDef)* ;
//--------------------------------------------------
constDecl: 'const' type ident=ID init ';' ;
init: '=' ( value=BOOLEAN | sign=('+' | '-')? value=NUMBER ) ;
// ...
//--------------------------------------------------
OP_ADD: '+';
OP_SUB: '-';
OP_MUL: '*';
OP_DIV: '/';
OP_MOD: '%';
BOOLEAN : 'true' | 'false' ;
NUMBER : '-'? INT ;
fragment INT : '0' | [1-9] [0-9]* ;
ID : [a-zA-Z]+ ;
// ...
I'm still not entirely sure how to implement the BuildAstVisitor. I now have something along the lines of the following, but it certainly doesn't look right to me...
@Override
public Expr visitMini(MiniCppParser.MiniContext ctx) {
for (MiniCppParser.ConstDeclContext constDeclCtx : ctx.constDecl()) {
visit(constDeclCtx);
}
return null;
}
@Override
public Expr visitConstDecl(MiniCppParser.ConstDeclContext ctx) {
visit(ctx.type());
return visit(ctx.init());
}
If you want to get the individual subrules then implement the visitXXX functions for them (visitConstDecl(), visitVarDef() etc.) instead of the visitMini() function. They will only be called if there's really a match for them in the input. Hence you don't need to do any checks for occurrences.
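A rough sketch of what that could look like; ConstDeclExpr and VarDefExpr are hypothetical AST classes of your own (ANTLR does not generate them), and Expr is the base type from your question:
// Rely on the generated base visitor's default visitMini/visitChildren to descend,
// and only override the alternatives you care about.
public class BuildAstVisitor extends MiniCppBaseVisitor<Expr> {

    @Override
    public Expr visitConstDecl(MiniCppParser.ConstDeclContext ctx) {
        // called once per constDecl actually present in the input
        Expr init = visit(ctx.init());
        return new ConstDeclExpr(ctx.ident.getText(), init);
    }

    @Override
    public Expr visitVarDef(MiniCppParser.VarDefContext ctx) {
        // same idea: build and return the matching AST node from ctx
        return new VarDefExpr(ctx.getText());
    }
}
If you also need a node for the whole program you can still override visitMini, but then collect the child results into a list yourself instead of relying on the default aggregation.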
I'm trying to create a grammar which parses a file line by line.
grammar Comp;
options
{
language = Java;
}
#header {
package analyseur;
import java.util.*;
import component.*;
}
#parser::members {
/** Line to write in the new java file */
public String line;
}
start
: objectRule {System.out.println("OBJ"); line = $objectRule.text;}
| anyString {System.out.println("ANY"); line = $anyString.text;}
;
objectRule : ObjectKeyword ID ;
anyString : ANY_STRING ;
ObjectKeyword : 'Object' ;
ID : [a-zA-Z]+ ;
ANY_STRING : (~'\n')+ ;
WhiteSpace : (' '|'\t') -> skip;
When I send the input 'Object o' to the grammar, the output is ANY instead of OBJ.
'Object o' => 'ANY' // I would like OBJ
I know that ANY_STRING is longer, but I wrote the lexer tokens in order. What is the problem?
Thank you very much for your help! ;)
For lexer rules, the rule with the longest match wins, independent of rule ordering. If the match length is the same, then the first listed rule wins.
To make rule order meaningful, reduce the possible match length of the ANY_STRING rule to be the same or less than any key word or id:
ANY_STRING: ~( ' ' | '\n' | '\t' ) ; // also?: '\r' | '\f' | '_'
Update
To see what the lexer is actually doing, dump the token stream.
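For example, a small sketch assuming a recent ANTLR 4 runtime (CharStreams) and the CompLexer generated from the grammar above:
import org.antlr.v4.runtime.*;

public class DumpCompTokens {
    public static void main(String[] args) {
        CompLexer lexer = new CompLexer(CharStreams.fromString("Object o"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            String name = lexer.getVocabulary().getSymbolicName(t.getType());
            System.out.println(name + " : '" + t.getText() + "'");
        }
        // With the original grammar the whole line comes back as one ANY_STRING token
        // (followed by EOF), which is why the parser reports ANY instead of OBJ.
    }
}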
I'm missing some basic knowledge. I started playing around with ANTLR today and couldn't find any source telling me how to do the following:
I'd like to parse a configuration file that a program of mine currently reads in a very ugly way. Basically it looks like:
A [Data] [Data]
B [Data] [Data] [Data]
where A/B/... are objects, each followed by its associated data (a dynamic amount, only simple digits).
A grammar should not be that hard to write, but how do I use ANTLR now?
Lexer only: A/B are tokens and I ask for the tokens it read. How do I ask for them, and how do I detect malformed input?
Lexer & parser: A/B are parser rules and... how do I know the parser successfully processed A/B? The same object could appear multiple times in the file and I need to consider every single one. It's more like listing instances in the config file.
Edit:
My problem is not the grammar but how to get informed by the parser/lexer about what they actually found/parsed. Best would be: invoke a function upon recognition of a rule, as in recursive descent.
ANTLR production rules can have return value(s) you can use to get the contents of your configuration file.
Here's a quick demo:
grammar T;
parse returns [java.util.Map<String, List<Integer>> map]
#init{$map = new java.util.HashMap<String, List<Integer>>();}
: (line {$map.put($line.key, $line.values);} )+ EOF
;
line returns [String key, List<Integer> values]
: Id numbers (NL | EOF)
{
$key = $Id.text;
$values = $numbers.list;
}
;
numbers returns [List<Integer> list]
#init{$list = new ArrayList<Integer>();}
: (Num {$list.add(Integer.parseInt($Num.text));} )+
;
Num : '0'..'9'+;
Id : ('a'..'z' | 'A'..'Z')+;
NL : '\r'? '\n' | '\r';
Space : (' ' | '\t')+ {skip();};
If you run the class below:
import org.antlr.runtime.*;
import java.util.*;
public class Main {
public static void main(String[] args) throws Exception {
String input = "A 12 34\n" +
"B 5 6 7 8\n" +
"C 9";
TLexer lexer = new TLexer(new ANTLRStringStream(input));
TParser parser = new TParser(new CommonTokenStream(lexer));
Map<String, List<Integer>> values = parser.parse();
System.out.println(values);
}
}
the following will be printed to the console:
{A=[12, 34], B=[5, 6, 7, 8], C=[9]}
The grammar should be something like this (it's pseudocode, not ANTLR):
FILE ::= STATEMENT ('\n' STATEMENT)*
STATEMENT ::= NAME ITEM*
ITEM = '[' \d+ ']'
NAME = \w+
If you are looking for a way to execute code when something is parsed, you should either use actions or an AST (look them up in the documentation).
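For example, with ANTLR 4 a parse-tree listener gives you exactly the "invoke a function upon recognition of a rule" behaviour you describe. A sketch, assuming you turn the pseudocode above into an ANTLR 4 grammar named Config with rules file and statement (all of these names are assumptions, not existing classes):
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

public class ConfigLoader extends ConfigBaseListener {

    @Override
    public void exitStatement(ConfigParser.StatementContext ctx) {
        // invoked once per recognized STATEMENT, like a callback in recursive descent
        System.out.println("found: " + ctx.getText());
    }

    public static void main(String[] args) {
        ConfigLexer lexer = new ConfigLexer(CharStreams.fromString("A 12 34\nB 5 6 7\n"));
        ConfigParser parser = new ConfigParser(new CommonTokenStream(lexer));
        ParseTreeWalker.DEFAULT.walk(new ConfigLoader(), parser.file());
    }
}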