I have a string I would like to rewrite. The string contains substrings that look like "DDT" plus four digits. I'll call these blocks. It also contains connectives like "&" and "|", where | represents "or", as well as parentheses.
Now I would like to rewrite this string so that blocks separated by & are written as "min(x(block1), x(block2), etc.)", whereas blocks separated by | are written as "max(x(block1), x(block2), etc.)".
Looking at an example should help:
public class Test {
    public static void main(String[] args) throws Exception {
        String str = "(DDT1453 & DDT1454) | (DDT3524 & DDT3523 & DDT3522 & DDT3520)";
        System.out.println(str.replaceAll("DDT\\d+", "x($0)"));
    }
}
My desired output is:
max(min(x(DDT1453),x(DDT1454)),min(x(DDT3524),x(DDT3523),x(DDT3522),x(DDT3520)))
As you can see, I performed an initial substitution to include the x(block) part of the output, but I cannot get the rest. Any ideas on how to achieve my desired output?
Just doing string substitution is the wrong way to go about this. Use recursive-descent parsing instead.
First you want to define what symbols produce what, for example:
program -> LiteralArg|fn(x)|program
LiteralArg -> LiteralArg
LiteralArg&LiteralArg -> fn(LiteralArg) & fn'(LiteralArg)
fn(x) -> fn(x)
fn(x) |fn(y) -> fn(x),fn(y)
From there you write functions that recursively parse your data, expecting certain things to happen. For example:
String finalResult = "";
function parse(baseString) {
if(baseString.isLiteralArg)
{
if(peekAheadToCheckForAmpersand())
{
expectAnotherLiteralArgAfterAmpersandOtherwiseThrowError();
finalResult += fn(LiteralArg) & fn'(LiteralArg)
parse(baseString - recentToken);
}
else
{
finalResult += literalArg;
parse(baseString - recentToken);
}
}
else if(baseString.isFunction())
{
if(peekAheadToCheckForPipe())
{
expectAnotherFunctionAfterPipeOtherwiseThrowError();
finalResult += fn(x),fn(y)
parse(baseString - recentToken);
}
else
{
finalResult += fn(x)
parse(baseString - recentToken);
}
}
}
As you find tokens, take them off the string and call the parse function on the remaining string.
This is a rough example, based on a project I did years ago. Here is the relevant lecture:
http://faculty.ycp.edu/~dhovemey/fall2009/cs340/lecture/lecture7.html
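For reference, here is a minimal self-contained sketch of that idea in plain Java, written directly against the grammar of the question (an or-expression of and-expressions of atoms). The class, method names and token handling are my own, not the pseudocode above made literal:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MinMaxRewriter {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    MinMaxRewriter(String input) {
        // tokenize: DDT blocks, '&', '|', '(' and ')'
        Matcher m = Pattern.compile("DDT\\d+|[&|()]").matcher(input);
        while (m.find()) tokens.add(m.group());
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

    // orExpr: andExpr ('|' andExpr)*  ->  max(...)
    private String orExpr() {
        List<String> parts = new ArrayList<>();
        parts.add(andExpr());
        while ("|".equals(peek())) { pos++; parts.add(andExpr()); }
        return parts.size() == 1 ? parts.get(0) : "max(" + String.join(",", parts) + ")";
    }

    // andExpr: atom ('&' atom)*  ->  min(...)
    private String andExpr() {
        List<String> parts = new ArrayList<>();
        parts.add(atom());
        while ("&".equals(peek())) { pos++; parts.add(atom()); }
        return parts.size() == 1 ? parts.get(0) : "min(" + String.join(",", parts) + ")";
    }

    // atom: '(' orExpr ')' | DDT block
    private String atom() {
        String t = peek();
        if ("(".equals(t)) {
            pos++;                 // consume '('
            String inner = orExpr();
            pos++;                 // consume ')' (no error handling in this sketch)
            return inner;
        }
        pos++;
        return "x(" + t + ")";
    }

    public static void main(String[] args) {
        String str = "(DDT1453 & DDT1454) | (DDT3524 & DDT3523 & DDT3522 & DDT3520)";
        System.out.println(new MinMaxRewriter(str).orExpr());
        // prints: max(min(x(DDT1453),x(DDT1454)),min(x(DDT3524),x(DDT3523),x(DDT3522),x(DDT3520)))
    }
}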
If you insist on using regex substitutions, then the following code seems to work:
str = str.replaceAll("\\([^)]*\\)", "min$0");
str = str.replaceAll("DDT\\d+","x($0)");
str = str.replaceAll("&|\\|",",");
str = "max(" + str + ")";
However, I would consider what the others suggest: using parsing logic instead.
This way you can extend your grammar easily in the future, and you'll also be able to validate the input and report meaningful error messages.
--EDIT--
The solution above assumes there's no nesting. If nesting is legal, then you definitely can't use the regex solution.
If you are interested in learning and using ANTLR, the following ANTLR (v3) grammar:
grammar DDT;
options {
output = AST;
ASTLabelType = CommonTree;
}
tokens { DDT; AMP; PIPE;}
@members {}
expr : op1=amp (oper=PIPE^ op2=amp)*;
amp : op1=atom (oper=AMP^ op2=atom)*;
atom : DDT! INT | '('! expr ')'!;
fragment
Digit : '0'..'9';
PIPE : '|' ;
AMP : '&';
DDT : 'DDT';
INT : Digit Digit*;
produces an AST (abstract syntax tree) for input such as (DDT1 | DDT2) & (DDT3 | DDT4) & DDT5.
That syntax tree (a CommonTree) can then be walked in the intended order (optionally using StringTemplates) to obtain the desired result.
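A hedged sketch of such a walk, assuming the classes ANTLR 3 generates from the grammar above are called DDTLexer and DDTParser and that the ANTLR 3 runtime is on the classpath (note that chained & or | operators come out as nested, but semantically equivalent, min/max calls):
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.tree.CommonTree;

public class DDTWalker {
    static String walk(CommonTree t) {
        // collect the already-converted children, comma separated
        StringBuilder children = new StringBuilder();
        for (int i = 0; i < t.getChildCount(); i++) {
            if (i > 0) children.append(",");
            children.append(walk((CommonTree) t.getChild(i)));
        }
        switch (t.getType()) {
            case DDTParser.PIPE: return "max(" + children + ")";
            case DDTParser.AMP:  return "min(" + children + ")";
            default:             return "x(DDT" + t.getText() + ")"; // INT leaf node
        }
    }

    public static void main(String[] args) throws RecognitionException {
        DDTLexer lexer = new DDTLexer(new ANTLRStringStream("(DDT1 | DDT2) & (DDT3 | DDT4) & DDT5"));
        DDTParser parser = new DDTParser(new CommonTokenStream(lexer));
        CommonTree tree = (CommonTree) parser.expr().getTree();
        System.out.println(walk(tree));
    }
}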
A full-blown parser for such a small grammar could be overkill, especially when the OP obviously has no prior experience with them. Even using parser generators like ANTLR or JavaCC doesn't seem like a good idea.
It's not easy to elaborate more with the current information. OP, please provide the information requested in the comments on your question.
Tentative grammar:
maxExpr ::= maxExpr '|' '(' minExpr ')'
maxExpr ::= '(' minExpr ')'
minExpr ::= minExpr '&' ITEM
minExpr ::= ITEM
ITEM ::= 'DDT\d{4}'
I realize that, true, the grammar is excessive for a regex, but only for a single regex. Nobody says we can't use more than one. In fact, even the simplest regex substitution can be regarded as a step of a Turing machine, so the problem is solvable with a sequence of them. So...
str= str.replaceAll("\\s+", "" ) ;
str= str.replaceAll("&", "," ) ;
str= str.replaceAll("\\([^)]+\\)", "-$0" ) ;
str= str.replaceAll("\\|", "," ) ;
str= str.replaceAll(".+", "+($0)" ) ;
str= str.replaceAll("\\w+", "x($0)" ) ;
str= str.replaceAll("\\+", "max" ) ;
str= str.replaceAll("-", "min" ) ;
I didn't take many shortcuts. The general idea is that "+" equates to a production of maxExpr and "-" to one of minExpr.
I tested this with input
str= "(DDT1453 & DDT1454 & DDT1111) | (DDT3524 & DDT3523 & DDT3522 & DDT3520)" ;
Output is:
max(min(x(DDT1453),x(DDT1454),x(DDT1111)),min(x(DDT3524),x(DDT3523),x(DDT3522),x(DDT3520)))
Back to the idea of a grammar, it's easy to recognize that its significant elements really are the ITEMs and '|'. Everything else (parentheses and '&') is just decoration.
Simplified grammar:
maxExpr ::= maxExpr '|' minExpr
maxExpr ::= minExpr
minExpr ::= minExpr ITEM
minExpr ::= ITEM
ITEM ::= 'DDT\d{4}'
From here, a very simple finite automaton:
<start>
maxExpr= new List() ;
minExpr= new List() ;
"Expecting ITEM" (BEFORE_ITEM):
ITEM -> minExpr.add(ITEM) ; move to "Expecting ITEM, |, or END"
"Expecting ITEM, |, or END" (AFTER_ITEM):
ITEM -> minExpr.add(ITEM) ; move to "Expecting ITEM, |, or END"
| -> maxExpr.add(minExpr); minExpr= new List(); move to "Expecting ITEM"
END -> maxExpr.add(minExpr); move to <finish>
... and the corresponding implementation (these are static members of a single class; the imports needed are java.util.Collection, java.util.LinkedList, java.util.List, java.util.Scanner and java.util.regex.Pattern):
static Pattern pattern= Pattern.compile("(\\()|(\\))|(\\&)|(\\|)|(\\w+)|(\\s+)") ;
static enum TokenType { OPEN, CLOSE, MIN, MAX, ITEM, SPACE, _END_, _ERROR_ };
static enum State { BEFORE_ITEM, AFTER_ITEM, END }
public static class Token {
TokenType type;
String value;
public Token(TokenType type, String value) {
this.type= type ;
this.value= value ;
}
}
public static class Lexer {
Scanner scanner;
public Lexer(String input) {
this.scanner= new Scanner(input) ;
}
public Token getNext() {
String tokenValue= scanner.findInLine(pattern) ;
TokenType tokenType;
if( tokenValue == null ) tokenType= TokenType._END_ ;
else if( tokenValue.matches("\\s+") ) tokenType= TokenType.SPACE ;
else if( "(".equals(tokenValue) ) tokenType= TokenType.OPEN ;
else if( ")".equals(tokenValue) ) tokenType= TokenType.CLOSE ;
else if( "&".equals(tokenValue) ) tokenType= TokenType.MIN ;
else if( "|".equals(tokenValue) ) tokenType= TokenType.MAX ;
else if( tokenValue.matches("\\w+") ) tokenType= TokenType.ITEM ;
else tokenType= TokenType._ERROR_ ;
return new Token(tokenType,tokenValue) ;
}
public void close() {
scanner.close();
}
}
public static String formatColl(String pre,Collection<?> coll,String sep,String post) {
StringBuilder result= new StringBuilder() ;
result.append(pre);
boolean first= true ;
for(Object item: coll ) {
if( ! first ) result.append(sep);
result.append(item);
first= false ;
}
result.append(post);
return result.toString() ;
}
public static void main(String... args) {
String str= "(DDT1453 & DDT1454) | (DDT3524 & DDT3523 & DDT3522 & DDT3520)" ;
Lexer lexer= new Lexer(str) ;
State currentState= State.BEFORE_ITEM ;
List<List<String>> maxExpr= new LinkedList<List<String>>() ;
List<String> minExpr= new LinkedList<String>() ;
while( currentState != State.END ) {
Token token= lexer.getNext() ;
switch( currentState ) {
case BEFORE_ITEM:
switch( token.type ) {
case ITEM:
minExpr.add("x("+token.value+")") ;
currentState= State.AFTER_ITEM ;
break;
case _END_:
maxExpr.add(minExpr) ;
currentState= State.END ;
break;
default:
// Ignore; preserve currentState, of course
break;
}
break;
case AFTER_ITEM:
switch( token.type ) {
case ITEM:
minExpr.add("x("+token.value+")") ;
currentState= State.AFTER_ITEM ;
break;
case MAX:
maxExpr.add(minExpr) ;
minExpr= new LinkedList<String>() ;
currentState= State.BEFORE_ITEM ;
break;
case _END_:
maxExpr.add(minExpr) ;
currentState= State.END ;
break;
default:
// Ignore; preserve currentState, of course
break;
}
break;
}
}
lexer.close();
System.out.println(maxExpr);
List<String> maxResult= new LinkedList<String>() ;
for(List<String> minItem: maxExpr ) {
maxResult.add( formatColl("min(",minItem,",",")") ) ;
}
System.out.println( formatColl("max(",maxResult,",",")") );
}
Regex is not the best choice for this, or, to say it right away: it's not possible (in Java).
A regex can change the formatting of a given String using backreferences, but it cannot generate content-aware replacements for arbitrary nesting. In other words: you would need some kind of recursion (or an iterative solution) to resolve an unbounded depth of nested parentheses, and Java regexes have no recursion construct.
Therefore, you would need to write your own parser, that is able to handle your input.
While replacing the DDT1234 Strings with the appropriate x(DDT1234) representation is easily doable (it's a single backreference for ALL occurrences), you need to take care of correct nesting on your own.
For parsing nested expressions, you may want to have a look at this example:
Parsing an Infix Expression with Parentheses (like ((2*4-6/3)*(3*5+8/4))-(2+3))
http://www.smccd.net/accounts/hasson/C++2Notes/ArithmeticParsing.html
It's just a (verbal) example of how to handle such a string.
Related
I have a use case where I have a line of text containing nesting tokens (like { and }), and I wish to transform certain substrings nested at specific depths.
Example, capitalize the word moo at depth 1:
moo [moo [moo moo]] moo ->
moo [MOO [moo moo]] moo
Achieved by:
replaceTokens(input, 1, "[", "]", "moo", String::toUpperCase);
Or a real-world example: wrap any "--option" not already colored in the cyan color sequence:
#|blue --ignoreLog|# works, but --ignoreOutput silences everything. ->
#|blue --ignoreLog|# works, but #|cyan --ignoreOutput|# silences everything.
Achieved by:
replaceTokens(input, 0, "#|", "|#", "--\\w*", s -> format("#|cyan %s|#", s));
I have implemented this logic and though I feel pretty good about it (except performance probably), I also feel I reinvented the wheel. Here's how I implemented it:
set currentPos to zero
while (input line not fully consumed) {
take the remaining line
if the open token is matched, add to output, increase counter and advance pos accordingly
else if the close token is matched, add to output, decrease counter and advance pos accordingly
else if the counter matches provided depth and given regex matches, invoke replacer function and advance pos accordingly
else just record the next character and advance pos by 1
}
Here's the actual implementation:
// Note: this relies on static imports (java.util.regex.Pattern.compile and .quote,
// String.format, and presumably a JUnit-style assumeTrue for the sanity checks).
public static String replaceNestedTokens(String lineWithTokens, int nestingDepth, String tokenOpen, String tokenClose, String tokenRegexToReplace, Function<String, String> tokenReplacer) {
final Pattern startsWithOpen = compile(quote(tokenOpen));
final Pattern startsWithClose = compile(quote(tokenClose));
final Pattern startsWithTokenToReplace = compile(format("(?<token>%s)", tokenRegexToReplace));
final StringBuilder lineWithTokensReplaced = new StringBuilder();
int countOpenTokens = 0;
int pos = 0;
while (pos < lineWithTokens.length()) {
final String remainingLine = lineWithTokens.substring(pos);
if (startsWithOpen.matcher(remainingLine).lookingAt()) {
countOpenTokens++;
lineWithTokensReplaced.append(tokenOpen);
pos += tokenOpen.length();
} else if (startsWithClose.matcher(remainingLine).lookingAt()) {
countOpenTokens--;
lineWithTokensReplaced.append(tokenClose);
pos += tokenClose.length();
} else if (countOpenTokens == nestingDepth) {
Matcher startsWithTokenMatcher = startsWithTokenToReplace.matcher(remainingLine);
if (startsWithTokenMatcher.lookingAt()) {
String matchedToken = startsWithTokenMatcher.group("token");
lineWithTokensReplaced.append(tokenReplacer.apply(matchedToken));
pos += matchedToken.length();
} else {
lineWithTokensReplaced.append(lineWithTokens.charAt(pos++));
}
} else {
lineWithTokensReplaced.append(lineWithTokens.charAt(pos++));
}
assumeTrue(countOpenTokens >= 0, "Unbalanced token sets: closed token without open token\n\t" + lineWithTokens);
}
assumeTrue(countOpenTokens == 0, "Unbalanced token sets: open token without closed token\n\t" + lineWithTokens);
return lineWithTokensReplaced.toString();
}
I couldn't make it work with a regex-based solution like this or this (or with Scanner), but I feel I'm reinventing the wheel and could solve this with out-of-the-box (vanilla Java) classes and less code. Also, I'm pretty sure this is a performance nightmare with all the inline patterns/matcher instances and substrings.
Suggestions?
You could use a parser generator like ANTLR to create a grammar describing your language or syntax, then use a listener or visitor to interpret the tokens.
A sample grammar would look like this (as far as I can infer from your code):
grammar Expr;
prog: (expr NEWLINE)* ;
expr: id '[' expr ']'
| '#|' expr '|#'
| '--ignoreLog' expr
| '--ignoreOutput' expr
| string
;
string: [a-zA-Z0-9];
NEWLINE : [\r\n]+ ;
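A minimal sketch of the wiring, assuming the grammar is called Expr and was generated with ANTLR 4's -visitor option (so ExprLexer, ExprParser and ExprBaseVisitor exist); the actual transformation logic is left out:
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class Interpreter {
    public static void main(String[] args) {
        ExprLexer lexer = new ExprLexer(CharStreams.fromString("moo [moo [moo moo]] moo\n"));
        ExprParser parser = new ExprParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.prog();

        new ExprBaseVisitor<Void>() {
            @Override
            public Void visitExpr(ExprParser.ExprContext ctx) {
                // inspect ctx.getText() here and decide how to transform this node
                return visitChildren(ctx);
            }
        }.visit(tree);
    }
}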
With the help of this SO question How to create AST with ANTLR4? I was able to create the AST Nodes, but I'm stuck at coding the BuildAstVisitor as depicted in the accepted answer's example.
I have a grammar that starts like this:
mini: (constDecl | varDef | funcDecl | funcDef)* ;
And I can't assign a label to the block (antlr4 complains: label X assigned to a block which is not a set), and I have no idea how to visit the next node.
public Expr visitMini(MiniCppParser.MiniContext ctx) {
return visitConstDecl(ctx.constDecl());
}
I have the following problems with the code above: I don't know how to decide whether it's a constDecl, a varDef or any other option, and ctx.constDecl() returns a List<ConstDeclContext>, whereas I only need a single element for the visitConstDecl function.
edit:
More grammar rules:
mini: (constDecl | varDef | funcDecl | funcDef)* ;
//--------------------------------------------------
constDecl: 'const' type ident=ID init ';' ;
init: '=' ( value=BOOLEAN | sign=('+' | '-')? value=NUMBER ) ;
// ...
//--------------------------------------------------
OP_ADD: '+';
OP_SUB: '-';
OP_MUL: '*';
OP_DIV: '/';
OP_MOD: '%';
BOOLEAN : 'true' | 'false' ;
NUMBER : '-'? INT ;
fragment INT : '0' | [1-9] [0-9]* ;
ID : [a-zA-Z]+ ;
// ...
I'm still not entirely sure how to implement the BuildAstVisitor. I now have something along the lines of the following, but it certainly doesn't look right to me...
@Override
public Expr visitMini(MiniCppParser.MiniContext ctx) {
for (MiniCppParser.ConstDeclContext constDeclCtx : ctx.constDecl()) {
visit(constDeclCtx);
}
return null;
}
@Override
public Expr visitConstDecl(MiniCppParser.ConstDeclContext ctx) {
visit(ctx.type());
return visit(ctx.init());
}
If you want to get the individual subrules, then implement the visitXXX functions for them (visitConstDecl(), visitVarDef() etc.) instead of the visitMini() function. They will only be called if there really is a match for them in the input, hence you don't need to do any checks for occurrences.
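A hedged sketch of what that could look like for the grammar above, assuming ANTLR generated MiniCppBaseVisitor<Expr>; the ConstDeclExpr and VarDefExpr AST classes are hypothetical placeholders:
public class BuildAstVisitor extends MiniCppBaseVisitor<Expr> {

    @Override
    public Expr visitConstDecl(MiniCppParser.ConstDeclContext ctx) {
        // called once per constDecl that actually appears in the input
        String name = ctx.ident.getText();      // the ident=ID label from the grammar
        Expr init = visit(ctx.init());
        return new ConstDeclExpr(name, init);   // hypothetical AST node
    }

    @Override
    public Expr visitVarDef(MiniCppParser.VarDefContext ctx) {
        // likewise, only invoked when a varDef was matched
        return new VarDefExpr(ctx.getText());   // hypothetical AST node
    }
}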
Currently, I'm using (part of) this ANTLR4 grammar in order to get strings and numbers.
Consider this summarized grammar:
gramm
: expr SCOL
;
expr
: literal #LiteralExpression
;
literal
: NUMERIC_LITERAL
| STRING_LITERAL
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;
fragment DIGIT : [0-9];
So, I'm implementing a GrammBaseVisitor<Void>.
I can't quite figure out how to check whether a literal is a NUMERIC_LITERAL or a STRING_LITERAL.
As far as I've been able to get, I've overridden visitLiteral() and visitLiteralExpression():
@Override
public Void visitLiteral(LiteralContext ctx) {
// TODO What should I do here in order to check whether
// ctx contains an STRING_LITERAL or a NUMBER_LITERAL?
return super.visitLiteral(ctx);
}
@Override
public Void visitLiteralExpression(LiteralExpressionContext ctx) {
return super.visitLiteralExpression(ctx);
}
What's the difference between visitLiteral and visitLiteralExpression()?
Your literal production consists of two possible terminals, the numeric and the string literal. You can determine which one the parsed input contains with null checks inside visitLiteral.
@Override
public Object visitLiteral(LiteralContext ctx) {
TerminalNode numeric = ctx.NUMERIC_LITERAL();
TerminalNode string = ctx.STRING_LITERAL();
if (numeric != null) {
System.out.println(numeric.getSymbol().getType());
} else if (string != null) {
System.out.println(string.getSymbol().getType());
}
return super.visitLiteral(ctx);
}
You can visit all terminals by overriding visitTerminal.
@Override
public Object visitTerminal(TerminalNode node) {
int type = node.getSymbol().getType(); // matches a constant in your parser
switch (type) {
case GrammParser.NUMERIC_LITERAL:
System.out.println("numeric literal");
break;
case GrammParser.STRING_LITERAL:
System.out.println("string literal");
break;
}
System.out.println(node.getSymbol().getText());
return super.visitTerminal(node);
}
What's the difference between visitLiteral and visitLiteralExpression()?
The former represents your literal production and the latter represents your expr production. Note that the # symbol has special meaning in ANTLR 4 syntax: it introduces a label, a name for an alternative inside a production. It is not a comment. Since your expr has only one alternative, it becomes visitLiteralExpression. Try commenting the label out (//) and watch your generated code change.
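For illustration, if expr had a second (hypothetical) alternative, each label would get its own visit method:
expr
    : literal         # LiteralExpression   // -> visitLiteralExpression()
    | expr '+' expr   # AddExpression       // hypothetical; -> visitAddExpression()
    ;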
What pattern can I use to split a string like this:
f.id AS id, CONCAT(a1.id, a2.id, a3.id) AS cnp, SUM(A3.nr) AS sum
in such a way that the result is an array of 3 groups like this:
f.id AS id
CONCAT(a1.id, a2.id, a3.id) AS cnp
SUM(A3.nr) AS sum
Can I match a comma that is not enclosed by parentheses?
The pattern appears to always take the format ... AS ... and you can just use a regular expression to match that:
Pattern p = Pattern.compile("(.*? as .*?)(,|$)", Pattern.CASE_INSENSITIVE );
String query = "f.id AS id, CONCAT(a1.id, a2.id, a3.id) AS cnp, SUM(A3.nr) AS sum";
Matcher m = p.matcher( query );
while ( m.find() ){
System.out.println( m.group(1) );
}
As long as you are not expecting correlated sub-queries nested in your select values (or other edge cases, such as a string literal containing ' as error,' AS id, ...), this ought to work for inputs similar to your format.
Probably there is a killer regular expression for this, but a more maintainable approach would be to:
Temporarily set placeholders on blocks between parentheses
Split the result on the desired separator
Replace the placeholders with their original values
To make step 1 more general, you should insert placeholders at sections where the separator should not function. As long as you are able to accurately determine what those sections are, you could apply this recipe.
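A minimal sketch of that recipe in Java, assuming parentheses are never nested and that the NUL character (used here as the placeholder marker) cannot occur in the input; the marker format and names are my own:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSplit {
    public static void main(String[] args) {
        String input = "f.id AS id, CONCAT(a1.id, a2.id, a3.id) AS cnp, SUM(A3.nr) AS sum";

        // 1. replace every (...) block with a numbered placeholder and remember it
        List<String> saved = new ArrayList<>();
        Matcher m = Pattern.compile("\\([^)]*\\)").matcher(input);
        StringBuffer masked = new StringBuffer();
        while (m.find()) {
            saved.add(m.group());
            m.appendReplacement(masked, "\u0000" + (saved.size() - 1) + "\u0000");
        }
        m.appendTail(masked);

        // 2. split on the separator, which can no longer appear inside a block
        String[] parts = masked.toString().split("\\s*,\\s*");

        // 3. put the original blocks back
        for (int i = 0; i < parts.length; i++) {
            for (int j = 0; j < saved.size(); j++) {
                parts[i] = parts[i].replace("\u0000" + j + "\u0000", saved.get(j));
            }
        }
        System.out.println(Arrays.asList(parts));
        // [f.id AS id, CONCAT(a1.id, a2.id, a3.id) AS cnp, SUM(A3.nr) AS sum]
    }
}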
Using an actual SQL Parser, as suggested by @KevinEsche, is probably the most robust choice.
However, if you don't require parsing of all SQL expressions, I would just use plain old char matching: go through the string a character at a time, counting how deeply nested in the brackets you are:
String str = "f.id AS id, CONCAT(a1.id, a2.id, a3.id) AS cnp, SUM(A3.nr) AS sum";
List<String> parts = new ArrayList<>();
int i = 0;
int depth = 0;
while (i < str.length()) {
int start = i;
while (i < str.length()) {
char ch = str.charAt(i);
if (ch == '(') {
depth++;
} else if (ch == ')') {
depth--;
} else if (ch == ',' && depth == 0) {
break;
}
i++;
}
// Maybe check that depth == 0 here.
parts.add(str.substring(start, i));
i++; // To skip the comma.
}
Thank you for your answers. I tried to vote but I can't yet.
I used a lookahead pattern to solve the problem:
String pattern = ",(?!([^(]*\\)))";
String str = "f.id AS id, CONCAT(a1.id, a2.id, a3.id) AS cnp, SUM(A3.nr) AS sum";
String strg [] = str.split(pattern);
for(int i=0;i<strg.length;i++) {
System.err.println("Group "+i+" is "+strg[i]);
}
And the result is:
Group 0 is f.id AS id
Group 1 is CONCAT(a1.id, a2.id, a3.id) AS cnp
Group 2 is SUM(A3.nr) AS sum
In the end, it was too complicated to write a SQL parser myself, so I decided to use ANTLR4.
I used the example from here and it works fine.
https://github.com/bkiers/sqlite-parser
But I don't know how to extract only some parts of the query (select, joins, order...) and I can't find any examples online. Can someone show how this is done?
Thank you.
I'm starting with ANTLR, but I get some errors and I really don't understand why.
Here is my really simple grammar:
grammar Expr;
options {backtrack=true;}
@header {}
@members {}
expr returns [String s]
: (LETTER SPACE DIGIT | TKDC) {$s = $DIGIT.text + $TKDC.text;}
;
// TOKENS
SPACE : ' ' ;
LETTER : 'd' ;
DIGIT : '0'..'9' ;
TKDC returns [String s] : 'd' SPACE 'C' {$s = "d C";} ;
This is the Java source, where I only ask for the "expr" result:
import org.antlr.runtime.*;
class Testantlr {
public static void main(String[] args) throws Exception {
ExprLexer lex = new ExprLexer(new ANTLRFileStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream(lex);
ExprParser parser = new ExprParser(tokens);
try {
System.out.println(parser.expr());
} catch (RecognitionException e) {
e.printStackTrace();
}
}
}
The problem comes when my input file has the following content: d 9.
I get the following error:
x line 1:2 mismatched character '9' expecting 'C'
x line 1:3 no viable alternative at input '<EOF>'
Does anyone know the problem here?
There are a few things wrong with your grammar:
lexer rules can only return Tokens, so returns [String s] is ignored after TKDC;
backtrack=true in your options section does not apply to lexer rules, that is why you get mismatched character '9' expecting 'C' (no backtracking there!);
the contents of your expr rule: (LETTER SPACE DIGIT | TKDC) {$s = $DIGIT.text + $TKDC.text;} doesn't make much sense (to me). You either want to match LETTER SPACE DIGIT or TKDC, yet you're trying to grab the text of both choices: $DIGIT.text and $TKDC.text.
It looks to me like TKDC needs to be "promoted" to a parser rule instead.
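A hedged sketch of what that promotion could look like, still in ANTLR 3 syntax and assuming the intent is to match either "d <digit>" or "d C":
expr returns [String s]
    : LETTER SPACE DIGIT { $s = $DIGIT.text; }
    | tkdc               { $s = $tkdc.s; }
    ;

tkdc returns [String s]
    : LETTER SPACE C { $s = "d C"; }
    ;

// TOKENS
SPACE  : ' ' ;
LETTER : 'd' ;
C      : 'C' ;
DIGIT  : '0'..'9' ;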
I think you dumbed down your example a bit too much to illustrate the problem you were facing. Perhaps it's a better idea to explain your actual problem instead: what are you trying to parse exactly?