ANTLR4 AST Creation - How to create an AstVisitor - java

With the help of this SO question How to create AST with ANTLR4? I was able to create the AST Nodes, but I'm stuck at coding the BuildAstVisitor as depicted in the accepted answer's example.
I have a grammar that starts like this:
mini: (constDecl | varDef | funcDecl | funcDef)* ;
I can't assign a label to the block (ANTLR4 says "label X assigned to a block which is not a set"), and I have no idea how to visit the next node.
public Expr visitMini(MiniCppParser.MiniContext ctx) {
    return visitConstDecl(ctx.constDecl());
}
I have the following problems with the code above: I don't know how to decide whether it's a constDecl, varDef or one of the other alternatives, and ctx.constDecl() returns a List<ConstDeclContext>, whereas visitConstDecl only takes a single element.
edit:
More grammar rules:
mini: (constDecl | varDef | funcDecl | funcDef)* ;
//--------------------------------------------------
constDecl: 'const' type ident=ID init ';' ;
init: '=' ( value=BOOLEAN | sign=('+' | '-')? value=NUMBER ) ;
// ...
//--------------------------------------------------
OP_ADD: '+';
OP_SUB: '-';
OP_MUL: '*';
OP_DIV: '/';
OP_MOD: '%';
BOOLEAN : 'true' | 'false' ;
NUMBER : '-'? INT ;
fragment INT : '0' | [1-9] [0-9]* ;
ID : [a-zA-Z]+ ;
// ...
I'm still not entirely sure how to implement the BuildAstVisitor. I now have something along the lines of the following, but it certainly doesn't look right to me...
@Override
public Expr visitMini(MiniCppParser.MiniContext ctx) {
    for (MiniCppParser.ConstDeclContext constDeclCtx : ctx.constDecl()) {
        visit(constDeclCtx);
    }
    return null;
}

@Override
public Expr visitConstDecl(MiniCppParser.ConstDeclContext ctx) {
    visit(ctx.type());
    return visit(ctx.init());
}

If you want to get the individual subrules then implement the visitXXX functions for them (visitConstDecl(), visitVarDef() etc.) instead of the visitMini() function. They will only be called if there's really a match for them in the input. Hence you don't need to do any checks for occurrences.
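For example, a BuildAstVisitor could look like the sketch below. This is only an illustration: ConstDeclNode and VarDefNode are hypothetical AST classes, and the base class name assumes the grammar is called MiniCpp so that ANTLR generates MiniCppBaseVisitor.

public class BuildAstVisitor extends MiniCppBaseVisitor<Expr> {

    @Override
    public Expr visitConstDecl(MiniCppParser.ConstDeclContext ctx) {
        // Called once per matched constDecl; no need to test which alternative matched.
        String name = ctx.ident.getText();    // ident=ID from the grammar
        Expr init = visit(ctx.init());
        return new ConstDeclNode(name, init); // hypothetical AST node
    }

    @Override
    public Expr visitVarDef(MiniCppParser.VarDefContext ctx) {
        // Likewise only called when the input actually contained a varDef.
        return new VarDefNode(ctx.getText()); // hypothetical AST node
    }
}

The default visitChildren() implementation takes care of descending from mini into each child, so visitMini() does not need to be overridden at all.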

Related

How to fix the error in left-recursion used with semantic predicates?

I would like to parse two types of boolean expressions:
- the first would be an init expression with a boolean, like: init : false
- and the second would be a derive expression with a boolean, like: derive : !express or (express and (amount >= 100))
My idea is to put semantic predicates in a set of rules. The goal is that when I'm parsing a boolean expression beginning with the word 'init', it should only be allowed to take one alternative, boolliteral, the last alternative in boolExpression. And if the expression begins with the word 'derive', it should have access to all alternatives of boolExpression.
I know that I could make two versions of boolExpression without semantic predicates, like boolExpressionInit and boolExpressionDerive... but I would like to see whether my idea could work with only one boolExpression and semantic predicates.
Here's my grammar
grammar TestExpression;
@header
{
package testexpressionparser;
}
@parser::members {
int vConstraintType;
}
/* SYNTAX RULES */
textInput : initDefinition
| derDefinition ;
initDefinition : t=INIT {vConstraintType = $t.type;} ':' boolExpression ;
derDefinition : t=DERIVE {vConstraintType = $t.type;} ':' boolExpression ;
boolExpression : {vConstraintType != INIT || vConstraintType == DERIVE}? boolExpression (boolOp|relOp) boolExpression
| {vConstraintType != INIT || vConstraintType == DERIVE}? NOT boolExpression
| {vConstraintType != INIT || vConstraintType == DERIVE}? '(' boolExpression ')'
| {vConstraintType != INIT || vConstraintType == DERIVE}? attributeName
| {vConstraintType != INIT || vConstraintType == DERIVE}? numliteral
| {vConstraintType == INIT || vConstraintType == DERIVE}? boolliteral
;
boolOp : OR | AND ;
relOp : EQ | NEQ | GT | LT | GEQT | LEQT ;
attributeName : WORD;
numliteral : intliteral | decliteral;
intliteral : INT ;
decliteral : DEC ;
boolliteral : BOOLEAN;
/* LEXICAL RULES */
INIT : 'init';
DERIVE : 'derive';
BOOLEAN : 'true' | 'false' ;
BRACKETSTART : '(' ;
BRACKETSTOP : ')' ;
BRACESTART : '{' ;
BRACESTOP : '}' ;
EQ : '=' ;
NEQ : '!=' ;
NOT : '!' ;
GT : '>' ;
LT : '<' ;
GEQT : '>=' ;
LEQT : '<=' ;
OR : 'or' ;
AND : 'and' ;
DEC : [0-9]* '.' [0-9]* ;
INT : ZERO | POSITIF;
ZERO : '0';
POSITIF : [1-9] [0-9]* ;
WORD : [a-zA-Z] [_0-9a-zA-Z]* ;
WS : (SPACE | NEWLINE)+ -> skip ;
SPACE : [ \t] ; /* Space or tab */
NEWLINE : '\r'? '\n' ; /* Carriage return and new line */
I expect the grammar to build successfully, but what I receive is: "error(119): TestExpression.g4::: The following sets of rules are mutually left-recursive [boolExpression]
1 error(s)
BUILD FAIL"
Apparently ANTLR4's support for (direct) left-recursion does not work when a predicate appears before a left-recursive rule invocation. So you can fix the error by moving the predicate after the first boolExpression in the left-recursive alternatives.
That said, it seems like the predicates aren't really necessary in the first place - at least not in the example you've shown us (or the one before your edit as far as I could tell). Since a boolExpression with the constraint type INIT can apparently only match boolLiteral, you can just change initDefinition as follows:
initDefinition : t=INIT ':' boolLiteral ;
Then boolExpression will always have the constraint type DERIVE and no predicates are necessary anymore.
Generally, if you want to allow different alternatives in non-terminal x based on whether it was invoked by y or z, you should simply have multiple versions of x and then call one from y and the other from z. That's usually a lot less hassle than littering the code with actions and predicates.
Similarly, it can also make sense to have a rule that matches more than it should and then detect illegal expressions in a later phase, instead of trying to reject them at the syntax level. In particular, beginners often try to write grammars that only allow well-typed expressions (rejecting something like 1+true with a syntax error), and that never works out well.
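As a sketch of that "accept more, check later" idea (assuming the predicates and the vConstraintType action are dropped so the grammar builds, and that ANTLR generated TestExpressionBaseVisitor and TestExpressionParser with the -visitor option), a later pass could reject anything but a boolean literal after 'init':

public class InitChecker extends TestExpressionBaseVisitor<Void> {

    @Override
    public Void visitInitDefinition(TestExpressionParser.InitDefinitionContext ctx) {
        // The grammar accepts any boolExpression after 'init'; the restriction to a
        // bare boolean literal is enforced here instead of in the syntax.
        if (ctx.boolExpression().boolliteral() == null) {
            throw new IllegalArgumentException("'init' expects a boolean literal (true/false)");
        }
        return null;
    }
}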

Prevent left recursion in ANTLR 4 from matching invalid inputs

I am making a simple programming language. It has the following grammar:
program: declaration+;
declaration: varDeclaration
| statement
;
varDeclaration: 'var' IDENTIFIER ('=' expression)?';';
statement: exprStmt
| assertStmt
| printStmt
| block
;
exprStmt: expression';';
assertStmt: 'assert' expression';';
printStmt: 'print' expression';';
block: '{' declaration* '}';
//expression without left recursion
/*
expression: assignment
;
assignment: IDENTIFIER '=' assignment
| equality;
equality: comparison (op=('==' | '!=') comparison)*;
comparison: addition (op=('>' | '>=' | '<' | '<=') addition)* ;
addition: multiplication (op=('-' | '+') multiplication)* ;
multiplication: unary (op=( '/' | '*' ) unary )* ;
unary: op=( '!' | '-' ) unary
| primary
;
*/
//expression with left recursion
expression: IDENTIFIER '=' expression
| expression op=('==' | '!=') expression
| expression op=('>' | '>=' | '<' | '<=') expression
| expression op=('-' | '+') expression
| expression op=( '/' | '*' ) expression
| op=( '!' | '-' ) expression
| primary
;
primary: intLiteral
| booleanLiteral
| stringLiteral
| identifier
| group
;
intLiteral: NUMBER;
booleanLiteral: value=('True' | 'False');
stringLiteral: STRING;
identifier: IDENTIFIER;
group: '(' expression ')';
TRUE: 'True';
FALSE: 'False';
NUMBER: [0-9]+ ;
STRING: '"' ~('\n'|'"')* '"' ;
IDENTIFIER : [a-zA-Z]+ ;
This left recursive grammar is useful because it ensures every node in the parse tree has at most 2 children. For example,
var a = 1 + 2 + 3 will turn into two nested addition expressions, rather than one addition expression with three children. That behavior is useful because it makes writing an interpreter easy, since I can just do (highly simplified):
public Object visitAddition(AdditionContext ctx) {
    return visit(ctx.addition(0)) + visit(ctx.addition(1));
}
instead of iterating through all the child nodes.
However, this left recursive grammar has one flaw, which is that it accepts invalid statements.
For example:
var a = 3;
var b = 4;
a = b == b = a;
is valid under this grammar even though the expected behavior would be:
- b == b is parsed first, since == has higher precedence than assignment (=).
- Because b == b is parsed first, the expression becomes incoherent and parsing fails.
Instead, the following undesired behavior occurs: the final line is parsed as (a = b) == (b = a).
How can I prevent left recursion from parsing incoherent statements, such as a = b == b = a?
The non-left-recursive grammar recognizes that this input is invalid and throws a parsing error, which is the desired behavior.

Antlr4 regular expression grammar, data structure for NFA transition table

I apologize for the extremely long explanation, but I've been stuck for a month now and I really can't figure out how to solve this.
For a project, I have to build a compiler with ANTLR4 for a regex grammar; it should generate a Java program able to recognize the words belonging to the language generated by a regex given as input to the ANTLR4-based compiler.
The grammar that we have to use is this one:
RE ::= union | simpleRE
union ::= simpleRE + RE
simpleRE ::= concatenation | basicRE
concatenation ::= basicRE simpleRE
basicRE ::= group | any | char
group ::= (RE) | (RE)∗ | (RE)+
any ::= ?
char ::= a | b | c | ··· | z | A | B | C | ··· | Z | 0 | 1 | 2 | ··· | 9 | . | − | _
and from that, I gave this grammar to ANTLR4:
Regxp.g4
grammar Regxp;
start_rule
: re # start
;
re
: union
| simpleRE
;
union
: simpleRE '+' re # unionOfREs
;
simpleRE
: concatenation
| basicRE
;
concatenation
: basicRE simpleRE #concatOfREs
;
basicRE
: group
| any
| cHAR
;
group
: LPAREN re RPAREN '*' # star
| LPAREN re RPAREN '+' # plus
| LPAREN re RPAREN # singleWithParenthesis
;
any
: '?'
;
cHAR
: CHAR #singleChar
;
WS : [ \t\r\n]+ -> skip ;
LPAREN : '(' ;
RPAREN : ')' ;
CHAR : LETTER | DIGIT | DOT | D | UNDERSCORE
;
/* tokens */
fragment LETTER: [a-zA-Z]
;
fragment DIGIT: [0-9]
;
fragment DOT: '.'
;
fragment D: '-'
;
fragment UNDERSCORE: '_'
;
Then I generated the Java files from ANTLR4 with visitors.
As far as I understand the logic of the project, while the visitor traverses the parse tree it has to generate lines of code that fill the transition table of the NFA obtained by applying Thompson's rules to the input regexp.
These lines of code are then saved as a .java text file and compiled into a program that takes a string (a word) as input and tells whether or not the word belongs to the language generated by the regex.
The result should be like this:
RE     word    Result
a+b    a       OK
       b       OK
       ac      KO
a∗b    aab     OK
       b       OK
       aaaab   OK
       abb     KO
So I'm asking: how can I represent the transition table in such a way that it can be filled during the visit of the parse tree and then exported, in order to be used by a simple Java program implementing the acceptance algorithm for an NFA? (I'm considering this pseudo-code):
S = ε−closure(s0);
c = nextChar();
while (c ≠ eof) do
S = ε−closure(move(S,c));
c = nextChar();
end while
if (S ∩ F ≠ ∅) then return “yes”;
else return “no”;
end if
As of now I have managed to make it so that, when the visitor is, for example, in the unionOfREs rule, it does something like this:
MyVisitor.java
private List<String> generatedCode = new ArrayList<String>();
/* ... */
@Override
public String visitUnionOfREs(RegxpParser.UnionOfREsContext ctx) {
    System.out.println("unionOfRExps");
    String char1 = visit(ctx.simpleRE());
    String char2 = visit(ctx.re());
    generatedCode.add("tTable.addUnion("+char1+","+char2+");");
    // this line of code will then populate the transition table
    return char1+"+"+char2;
}
/* ... */
addUnion is inside a Java file that will contain all the methods to fill the transition table. I wrote code for the union, but I don't like it, because it amounts to writing the transition table of the NFA just as you would write it on paper: example.
I came up with this when I noticed that, by building the table iteratively, you can define two "pointers" into the table, currentBeginning and currentEnd, that tell you where to expand the character written in the table with the next rule that the visitor finds in the parse tree, because that character can be another production or just a single character. The link shows the written-on-paper example that convinced me to use this approach.
TransitionTable.java
/* ... */
public void addUnion(String char1, String char2) {
if (transitionTable.isEmpty()) {
List<List<Integer>> lc1 = Arrays.asList(Arrays.asList(null)
,Arrays.asList(currentBeginning+3)
,Arrays.asList(null)
,Arrays.asList(null)
,Arrays.asList(null)
,Arrays.asList(null));
List<List<Integer>> lc2 = Arrays.asList(Arrays.asList(null)
,Arrays.asList(null)
,Arrays.asList(currentBeginning+4)
,Arrays.asList(null)
,Arrays.asList(null)
,Arrays.asList(null));
List<List<Integer>> le = Arrays.asList(Arrays.asList(currentBeginning+1,currentBeginning+2)
,Arrays.asList(null)
,Arrays.asList(null)
,Arrays.asList(currentBeginning+5)
,Arrays.asList(currentBeginning+5)
,Arrays.asList(null));
transitionTable.put(char1, lc1);
transitionTable.put(char2, lc2);
transitionTable.put("epsilon", le);
//currentBeginning += 2;
//currentEnd = transitionTable.get(char2).get(currentBeginning).get(0);
currentEnd = transitionTable.get("epsilon").size()-1; // i.e. state 5
} else { // not the first time this rule is encountered; beginning and end have changed
//needs to add 2 fewer states
}
}
/* ... */
At the moment I'm representing the transition table as a HashMap<String, List<List<Integer>>>: the strings are the chars on the edges of the NFA, and the List<List<Integer>> is needed because, the automaton being non-deterministic, it has to represent multiple transitions from a single state.
But going this way, for a parse tree like this I will obtain this line of code for the union: "tTable.addUnion("tTable.addConcat(a,b)","+char2+");"
And I'm blocked here; I don't know how to solve this, and I really can't think of a different way to represent the transition table or to fill it while visiting the parse tree.
Thank you.
Using Thompson's construction, every regular (sub-)expression produces an NFA, and every regular expression operator (union, concatenation, *) can be implemented by adding a couple of states and connecting them to states that already exist. See:
https://en.wikipedia.org/wiki/Thompson%27s_construction
So, when parsing the regex, every terminal or non-terminal production should add the required states and transitions to the NFA, and return its start and end state to the containing production. Non-terminal productions will combine their children and return their own start+end states so that your NFA can be built from the leaves of the regular expression up.
The representation of the state table is not critical for building. Thompson's construction will never require you to modify a state or transition that you built before, so you just need to be able to add new ones. You will also never need more than one transition from a state on the same character, or even more than one non-epsilon transition. In fact, if all your operators are binary you will never need more than 2 transitions on a state. Usually the representation is designed to make it easy to do the next steps, like DFA generation or direct execution of the NFA against strings.
For example, a class like this can completely represent a state:
class State
{
    public char matchChar;
    public State matchState; // where to go if you match matchChar, or null
    public State epsilon1;   // or null
    public State epsilon2;   // or null
}
This would actually be a pretty reasonable representation for directly executing an NFA. But if you already have code for directly executing an NFA, then you should probably just build whatever it uses so you don't have to do another transformation.
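As a concrete illustration of returning start and end states, the visitor can return small NFA fragments instead of strings. The following is only a sketch, assuming the State class above, a hypothetical Fragment helper, and the RegxpBaseVisitor/RegxpParser classes generated for the question's grammar; concatenation and the (RE)*/(RE)+ groups would follow the same pattern.

// Sketch: Fragment is a hypothetical helper pairing an NFA's start and end state.
class Fragment {
    final State start;
    final State end;
    Fragment(State start, State end) { this.start = start; this.end = end; }
}

class NfaBuildVisitor extends RegxpBaseVisitor<Fragment> {

    @Override
    public Fragment visitSingleChar(RegxpParser.SingleCharContext ctx) {
        // A single character c becomes: start --c--> end
        State start = new State();
        State end = new State();
        start.matchChar = ctx.CHAR().getText().charAt(0);
        start.matchState = end;
        return new Fragment(start, end);
    }

    @Override
    public Fragment visitUnionOfREs(RegxpParser.UnionOfREsContext ctx) {
        Fragment left = visit(ctx.simpleRE());
        Fragment right = visit(ctx.re());
        // A new start state forks into both alternatives via epsilon transitions,
        // and both alternatives flow into a new common end state.
        State start = new State();
        State end = new State();
        start.epsilon1 = left.start;
        start.epsilon2 = right.start;
        left.end.epsilon1 = end;
        right.end.epsilon1 = end;
        return new Fragment(start, end);
    }
}

The Fragment returned for the start rule then gives you the NFA's initial and accepting state, which is all the acceptance algorithm from the question needs.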

ANTLR4 -> Check a literal is STRING_LITERAL or a NUMBER_LITERAL

Currently, I'm using this ANTLR4 grammar (only part of it is shown) in order to get strings and numbers.
Here is the summarized grammar:
gramm
: expr SCOL
;
expr
: literal #LiteralExpression
;
literal
: NUMERIC_LITERAL
| STRING_LITERAL
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;
fragment DIGIT : [0-9];
So, I'm implementing a GrammBaseVisitor<Void>.
I can't quite figure out how to check whether a literal is a NUMERIC_LITERAL or a STRING_LITERAL.
As far as I've been able to get, I've overridden visitLiteral() and visitLiteralExpression():
@Override
public Void visitLiteral(LiteralContext ctx) {
    // TODO What should I do here in order to check whether
    // ctx contains a STRING_LITERAL or a NUMERIC_LITERAL?
    return super.visitLiteral(ctx);
}

@Override
public Void visitLiteralExpression(LiteralExpressionContext ctx) {
    return super.visitLiteralExpression(ctx);
}
What's the difference between visitLiteral and visitLiteralExpression()?
Your literal production consists of two possible terminals, the numeric and the string literal. Whether the parsed input contains one or the other can be determined with null checks inside visitLiteral.
@Override
public Object visitLiteral(LiteralContext ctx) {
    TerminalNode numeric = ctx.NUMERIC_LITERAL();
    TerminalNode string = ctx.STRING_LITERAL();
    if (numeric != null) {
        System.out.println(numeric.getSymbol().getType());
    } else if (string != null) {
        System.out.println(string.getSymbol().getType());
    }
    return super.visitLiteral(ctx);
}
You can visit all terminals by overriding visitTerminal.
@Override
public Object visitTerminal(TerminalNode node) {
    int type = node.getSymbol().getType(); // matches a constant in your parser
    switch (type) {
        case GrammParser.NUMERIC_LITERAL:
            System.out.println("numeric literal");
            break;
        case GrammParser.STRING_LITERAL:
            System.out.println("string literal");
            break;
    }
    System.out.println(node.getSymbol().getText());
    return super.visitTerminal(node);
}
What's the difference between visitLiteral() and visitLiteralExpression()?
The former represents your literal production and the latter represents your expr production. Note that the # symbol has a special meaning in ANTLR 4 syntax: it marks a label, a name for an alternative inside a production. It is not a comment. Since your expr has only one alternative, it becomes visitLiteralExpression. Try commenting it out (//) and watch your generated code change.

Rewriting parts of a string

I have a string I would like to rewrite. The string contains substrings that look like "DDT" plus four digits. I'll call these blocks. It also contains connectives like "&" and "|", where | represents "or", as well as parentheses.
Now I would like to rewrite this string such that blocks separated by &s should be written as "min(x(block1), x(block2), etc.)", whereas blocks separated by |s should be written as "max(x(block1), x(block2), etc.)".
Looking at an example should help:
public class test {
    public static void main(String[] arg) throws Exception {
        String str = "(DDT1453 & DDT1454) | (DDT3524 & DDT3523 & DDT3522 & DDT3520)";
        System.out.println(str.replaceAll("DDT\\d+", "x($0)"));
    }
}
My desired output is:
max(min(x(DDT1453),x(DDT1454)),min(x(DDT3524),x(DDT3523),x(DDT3522),x(DDT3520)))
As you can see, I performed an initial substitution to include the x(block) part of the output, but I cannot get the rest. Any ideas on how to achieve my desired output?
Just doing string substitution is the wrong way to go about this. Use recursive-descent parsing instead.
First you want to define what symbols produce what, for example:
program -> LiteralArg|fn(x)|program
LiteralArg -> LiteralArg
LiteralArg&LiteralArg -> fn(LiteralArg) & fn'(LiteralArg)
fn(x) -> fn(x)
fn(x) |fn(y) -> fn(x),fn(y)
From there you make functions which will recursively parse your data, expecting certain things to happen. For example:
String finalResult = "";
function parse(baseString) {
if (baseString.isLiteralArg)
{
if(peekAheadToCheckForAmpersand())
{
expectAnotherLiteralArgAfterAmpersandOtherwiseThrowError();
finalResult += fn(LiteralArg) & fn'(LiteralArg)
parse(baseString - recentToken);
}
else
{
finalResult += literalArg;
parse(baseString - recentToken);
}
}
else if (baseString.isFunction())
{
if(peekAheadToCheckForPipe())
{
expectAnotherFunctionAfterAmpersandOtherwiseThrowError();
finalResult += fn(x),fn(y)
parse(baseString - recentToken);
}
else
{
finalResult += fn(x)
parse(baseString - recentToken);
}
}
}
As you find tokens, take them off the string and call the parse function on the remaining string.
This is a rough example which I'm basing on a project I did years ago. Here is the relevant lecture:
http://faculty.ycp.edu/~dhovemey/fall2009/cs340/lecture/lecture7.html
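To make the idea more concrete, below is a minimal hand-rolled recursive-descent sketch for this particular input format. Everything in it (the DdtParser class and the orExpr/andExpr/atom methods) is made up for illustration; it assumes blocks always match DDT followed by digits and that parentheses are balanced.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DdtParser {
    private final List<String> tokens = new ArrayList<>();
    private int pos;

    public DdtParser(String input) {
        // Tokenize into DDT blocks, '&', '|', '(' and ')'; whitespace is ignored.
        Matcher m = Pattern.compile("DDT\\d+|[()&|]").matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
    }

    // orExpr -> andExpr ('|' andExpr)*   becomes max(...)
    private String orExpr() {
        List<String> parts = new ArrayList<>();
        parts.add(andExpr());
        while (pos < tokens.size() && tokens.get(pos).equals("|")) {
            pos++;
            parts.add(andExpr());
        }
        return parts.size() == 1 ? parts.get(0) : "max(" + String.join(",", parts) + ")";
    }

    // andExpr -> atom ('&' atom)*   becomes min(...)
    private String andExpr() {
        List<String> parts = new ArrayList<>();
        parts.add(atom());
        while (pos < tokens.size() && tokens.get(pos).equals("&")) {
            pos++;
            parts.add(atom());
        }
        return parts.size() == 1 ? parts.get(0) : "min(" + String.join(",", parts) + ")";
    }

    // atom -> DDT block | '(' orExpr ')'
    private String atom() {
        if (tokens.get(pos).equals("(")) {
            pos++;                     // consume '('
            String inner = orExpr();
            pos++;                     // consume ')'
            return inner;
        }
        return "x(" + tokens.get(pos++) + ")";
    }

    public static void main(String[] args) {
        String str = "(DDT1453 & DDT1454) | (DDT3524 & DDT3523 & DDT3522 & DDT3520)";
        System.out.println(new DdtParser(str).orExpr());
        // prints max(min(x(DDT1453),x(DDT1454)),min(x(DDT3524),x(DDT3523),x(DDT3522),x(DDT3520)))
    }
}

The '|' level maps to max(...) and the '&' level to min(...), so arbitrary nesting falls out of the recursion without any extra bookkeeping.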
If you insist on using regex substitutions, then the following code seems to work:
str = str.replaceAll("\\([^)]*\\)", "min$0");
str = str.replaceAll("DDT\\d+","x($0)");
str = str.replaceAll("&|\\|",",");
str = "max(" + str + ")";
However, I would consider what the others suggest - using parsing logic instead.
This way you can extend your grammar easily in the future, and you'll also be able to validate the input and report meaningful error messages.
--EDIT--
The solution above assumes there's no nesting. If nesting is legal, then you definitely can't use the regex solution.
If you are interested in learning and using ANTLR, the following ANTLR grammar
grammar DDT;
options {
output = AST;
ASTLabelType = CommonTree;
}
tokens { DDT; AMP; PIPE;}
@members {}
expr : op1=amp (oper=PIPE^ op2=amp)*;
amp : op1=atom (oper=AMP^ op2=atom)*;
atom : DDT! INT | '('! expr ')'!;
fragment
Digit : '0'..'9';
PIPE : '|' ;
AMP : '&';
DDT : 'DDT';
INT : Digit Digit*;
produces an AST (abstract syntax tree) for the input (DDT1 | DDT2) & (DDT3 | DDT4) & DDT5.
That syntax tree (a CommonTree) can be walked in the intended order (optionally using StringTemplates) to obtain the desired result.
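For instance, the walk could look like the sketch below. This is only an illustration, assuming the ANTLR 3 runtime's Tree interface and the PIPE/AMP token constants of the generated DDTParser; it yields a nested binary min/max form that is equivalent to the flat one in the question.

import org.antlr.runtime.tree.Tree;

public class DdtTreeRenderer {
    public static String render(Tree t) {
        switch (t.getType()) {
            case DDTParser.PIPE: // '|' node -> max(left, right)
                return "max(" + render(t.getChild(0)) + "," + render(t.getChild(1)) + ")";
            case DDTParser.AMP:  // '&' node -> min(left, right)
                return "min(" + render(t.getChild(0)) + "," + render(t.getChild(1)) + ")";
            default:             // INT leaf; the DDT token itself was excluded by DDT!
                return "x(DDT" + t.getText() + ")";
        }
    }
}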
A full-blown parser for such a small grammar could be overkill, especially when the OP obviously has no prior experience with them. Even using parser generators like ANTLR or JavaCC doesn't seem like a good idea.
It's not easy to elaborate more with the current information. OP, please provide the information requested in the comments to your question.
Tentative grammar:
maxExpr ::= maxExpr '|' '(' minExpr ')'
maxExpr ::= '(' minExpr ')'
minExpr ::= minExpr '&' ITEM
minExpr ::= ITEM
ITEM ::= 'DDT\d{4}'
I realized that, true, the grammar is excessive for a regex, but only for a single regex. Nobody says we can't use more than one. In fact, even the simplest regex substitution can be regarded as a step in a Turing machine, and thus the problem is solvable using them. So...
str= str.replaceAll("\\s+", "" ) ;
str= str.replaceAll("&", "," ) ;
str= str.replaceAll("\\([^)]+\\)", "-$0" ) ;
str= str.replaceAll("\\|", "," ) ;
str= str.replaceAll(".+", "+($0)" ) ;
str= str.replaceAll("\\w+", "x($0)" ) ;
str= str.replaceAll("\\+", "max" ) ;
str= str.replaceAll("-", "min" ) ;
I didn't take many shortcuts. The general idea is that "+" equates to a production of maxExpr and "-" to one of minExpr.
I tested this with input
str= "(DDT1453 & DDT1454 & DDT1111) | (DDT3524 & DDT3523 & DDT3522 & DDT3520)" ;
Output is:
max(min(x(DDT1453),x(DDT1454),x(DDT1111)),min(x(DDT3524),x(DDT3523),x(DDT3522),x(DDT3520)))
Back to the idea of a grammar, it's easy to recognize that the significant elements of it really are the ITEMs and '|'. All the rest (parentheses and '&') is just decoration.
Simplified grammar:
maxExpr ::= maxExpr '|' minExpr
maxExpr ::= minExpr
minExpr ::= minExpr ITEM
minExpr ::= ITEM
ITEM ::= 'DDT\d{4}'
From here, a very simple finite automaton:
<start>
maxExpr= new List() ;
minExpr= new List() ;
"Expecting ITEM" (BEFORE_ITEM):
ITEM -> minExpr.add(ITEM) ; move to "Expecting ITEM, |, or END"
"Expecting ITEM, |, or END" (AFTER_ITEM):
ITEM -> minExpr.add(ITEM) ; move to "Expecting ITEM, |, or END"
| -> maxExpr.add(minExpr); minExpr= new List(); move to "Expecting ITEM"
END -> maxExpr.add(minExpr); move to <finish>
... and the corresponding implementation:
static Pattern pattern= Pattern.compile("(\\()|(\\))|(\\&)|(\\|)|(\\w+)|(\\s+)") ;
static enum TokenType { OPEN, CLOSE, MIN, MAX, ITEM, SPACE, _END_, _ERROR_ };
static enum State { BEFORE_ITEM, AFTER_ITEM, END }
public static class Token {
TokenType type;
String value;
public Token(TokenType type, String value) {
this.type= type ;
this.value= value ;
}
}
public static class Lexer {
Scanner scanner;
public Lexer(String input) {
this.scanner= new Scanner(input) ;
}
public Token getNext() {
String tokenValue= scanner.findInLine(pattern) ;
TokenType tokenType;
if( tokenValue == null ) tokenType= TokenType._END_ ;
else if( tokenValue.matches("\\s+") ) tokenType= TokenType.SPACE ;
else if( "(".equals(tokenValue) ) tokenType= TokenType.OPEN ;
else if( ")".equals(tokenValue) ) tokenType= TokenType.CLOSE ;
else if( "&".equals(tokenValue) ) tokenType= TokenType.MIN ;
else if( "|".equals(tokenValue) ) tokenType= TokenType.MAX ;
else if( tokenValue.matches("\\w+") ) tokenType= TokenType.ITEM ;
else tokenType= TokenType._ERROR_ ;
return new Token(tokenType,tokenValue) ;
}
public void close() {
scanner.close();
}
}
public static String formatColl(String pre,Collection<?> coll,String sep,String post) {
StringBuilder result= new StringBuilder() ;
result.append(pre);
boolean first= true ;
for(Object item: coll ) {
if( ! first ) result.append(sep);
result.append(item);
first= false ;
}
result.append(post);
return result.toString() ;
}
public static void main(String... args) {
String str= "(DDT1453 & DDT1454) | (DDT3524 & DDT3523 & DDT3522 & DDT3520)" ;
Lexer lexer= new Lexer(str) ;
State currentState= State.BEFORE_ITEM ;
List<List<String>> maxExpr= new LinkedList<List<String>>() ;
List<String> minExpr= new LinkedList<String>() ;
while( currentState != State.END ) {
Token token= lexer.getNext() ;
switch( currentState ) {
case BEFORE_ITEM:
switch( token.type ) {
case ITEM:
minExpr.add("x("+token.value+")") ;
currentState= State.AFTER_ITEM ;
break;
case _END_:
maxExpr.add(minExpr) ;
currentState= State.END ;
break;
default:
// Ignore; preserve currentState, of course
break;
}
break;
case AFTER_ITEM:
switch( token.type ) {
case ITEM:
minExpr.add("x("+token.value+")") ;
currentState= State.AFTER_ITEM ;
break;
case MAX:
maxExpr.add(minExpr) ;
minExpr= new LinkedList<String>() ;
currentState= State.BEFORE_ITEM ;
break;
case _END_:
maxExpr.add(minExpr) ;
currentState= State.END ;
break;
default:
// Ignore; preserve currentState, of course
break;
}
break;
}
}
lexer.close();
System.out.println(maxExpr);
List<String> maxResult= new LinkedList<String>() ;
for(List<String> minItem: maxExpr ) {
maxResult.add( formatColl("min(",minItem,",",")") ) ;
}
System.out.println( formatColl("max(",maxResult,",",")") );
}
Regex is not the best choice to do this - or to say it right away: it's not possible (in Java).
Regex might be able to change the formatting of a given String using backreferences, but it cannot generate content-aware backreferences. In other words: you would require some kind of recursion (or an iterative solution) to resolve an arbitrary depth of nested parentheses.
Therefore, you would need to write your own parser that is able to handle your input.
While replacing the DDT1234 Strings with the appropriate x(DDT1234) representation is easily doable (it's a single backreference for ALL occurrences), you need to take care of correct nesting on your own.
For parsing nested expressions, you may want to have a look at this example:
Parsing an Infix Expression with Parentheses (like ((2*4-6/3)*(3*5+8/4))-(2+3))
http://www.smccd.net/accounts/hasson/C++2Notes/ArithmeticParsing.html
It's just a (verbal) example of how to handle such a string.
