I need to find occurrences of chained function calls, but must include the case where more than one parameter is passed (programming in Java):
Tower.getType(i,j).initialPrice(f,g);
Tower.getType().initialPrice();
So far I have only managed to make the regex work when there is one parameter or none:
[\w]+([.]+[\w]+[(]+[\w]*+[)]){2,}+[;]
Like:
Tower.getType().initialPrice();
object.function().function2().function3().function4().function5();
I tried this, but it's not working:
[\w] + ([\.] + [\w] + [(] + [\w]* + ([\,] + [\w])* + [)]) {2,} + [;]
My code:
public static void checksMessageChain(String s) {
    if (s != null && s.matches("[\\w]+([\\.]+[\\w]+[(]+[\\w]*+[)]){2,}+[;]")) {
        System.out.println("\nIts Message Chain for " + s + "\n");
        splitMessageChain(s); // {0,} is equivalent to *
    } else if (s != null && s.matches("[\\w] + ([\\.] + [\\w] + [(] + [\\w]* + ([\\,] + [\\w])* + [)]) {2,} + [;]")) {
        System.out.println("\nIts Message Chain for " + s + "\n");
        splitMessageChain(s);
    } else {
        System.out.println("\nIts not Message Chain for " + s + "\n");
    }
}
I wouldn't bother with specifying the parameters and would just use this:
\\w+(?:\\.\\w+\\([^)]*\\)){2,};
Note that this will fail if one String parameter contains a closing parenthesis, and it doesn't handle multi-line expressions.
As is usual for such a complex language, you'd better use a specialized parser. I have never used one, but a quick search reveals javaparser, available on GitHub.
You can try it on regex101.
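For illustration, here is a minimal sketch of the suggested pattern in use; String.matches anchors the regex to the whole input, so no ^/$ is needed:

public class ChainRegexDemo {
    private static final String CHAIN = "\\w+(?:\\.\\w+\\([^)]*\\)){2,};";

    public static void main(String[] args) {
        System.out.println("Tower.getType(i,j).initialPrice(f,g);".matches(CHAIN)); // true
        System.out.println("Tower.getType().initialPrice();".matches(CHAIN));       // true
        System.out.println("Tower.getType();".matches(CHAIN));                      // false: only one call
    }
}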
Do not try to do this using regex: you will get a headache and a program that will not cover all possible cases.
Just use JavaParser: it is open source, it is lightweight, and it has no dependencies.
You can either parse the whole Java file and then navigate the Abstract Syntax Tree to find your function calls or you can directly parse the single expression you are interested into, if you have already identified that piece of source code.
In the first case you want to call parse, otherwise parseExpression.
Once you get your MethodCallExpr, if its parent is another MethodCallExpr then you have chained calls.
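As a minimal sketch of that check (assuming a recent JavaParser version where StaticJavaParser is available; older releases expose the same parsing through JavaParser methods):

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.expr.Expression;
import com.github.javaparser.ast.expr.MethodCallExpr;

public class ChainedCallDetector {
    public static void main(String[] args) {
        Expression expr = StaticJavaParser.parseExpression("Tower.getType(i, j).initialPrice(f, g)");
        // A method call whose parent node is itself a method call is part of a chain.
        expr.findAll(MethodCallExpr.class).forEach(call ->
                call.getParentNode().ifPresent(parent -> {
                    if (parent instanceof MethodCallExpr) {
                        System.out.println("Chained call: " + parent);
                    }
                }));
    }
}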
Disclaimer: I am a JavaParser contributor
This is my current error handling function:
public void yyerror(String error) {
    System.err.println("Error: " + error);
}
This is the default error function I found on the BYACC/J homepage. I can't find any way to add the line number. My question is similar to this question, but the solution there doesn't work here.
For my lexer I am using a JFlex file.
It's not that different from the bison/flex solution proposed in the question you link. At least, the principle is the same. Only the details differ.
The key fact is that it is the scanner, not the parser, which needs to count lines, because it is the scanner which converts the input text into tokens. The parser knows nothing about the original text; it just receives a sequence of nicely-processed tokens.
So we have to scour the documentation for JFlex to figure out how to get it to track line numbers, and then we find the following in the section on options and declarations:
%line
Turns line counting on. The int member variable yyline contains the number of lines (starting with 0) from the beginning of input to the beginning of the current token.
The JFlex manual doesn't mention that yyline is a private member variable, so in order to get at it from the parser you need to add something like the following to your JFlex file:
%line
%{
    public int GetLine() { return yyline + 1; }
    // ...
%}
You can then add a call to GetLine in the error function:
public void yyerror(String error) {
    System.err.println("Error at line " + lexer.GetLine() + ": " + error);
}
That will sometimes produce confusing error messages, because by the time yyerror is called, the parser has already requested the lookahead token, which may be on the line following the error or even separated from the error by several lines of comments. (This problem often shows up when the error is a missing statement terminator.) But it's a good start.
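For completeness, here is a hedged sketch of the usual BYACC/J glue that gives the parser the lexer reference used above. The names Yylex and Parser follow the BYACC/J sample; your setup may differ. This goes in the code section of the .y file:

%{
    private Yylex lexer; // the JFlex-generated scanner

    public Parser(java.io.Reader r) {
        lexer = new Yylex(r);
    }

    /* Called by the generated parser to fetch the next token. */
    private int yylex() {
        try {
            return lexer.yylex();
        } catch (java.io.IOException e) {
            System.err.println("IO error: " + e);
            return -1;
        }
    }

    public void yyerror(String error) {
        System.err.println("Error at line " + lexer.GetLine() + ": " + error);
    }
%}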
I would like to use the Nashorn engine as a general computation engine. It is powerful, fast, has plenty of built-in functions, and new functions are very easy to add using @FunctionalInterface or static methods. Even better, it also provides value-adds like cyclic dependency checking, syntax checking, etc.
However I need to automatically update "output" variables when a dependency changes.
The general idea is that in Java, I'll have something like:
class CalculationEngine {
    Data addData(String name, Number value){
        ...
    }
    Data addData(String name, String formula){
        ...
    }
    String getScript(){
        ...
    }
}
CalculationEngine engine = new CalculationEngine();
Data datum1 = engine.addData("datum1", 1); // Constant integer 1
Data datum2 = engine.addData("datum2", 2); // Constant integer 2
Data datum3 = engine.addData("datum3", "datum1*10");
Data datum4 = engine.addData("datum4", "datum3+datum2");
The CalculationEngine service class knows how to use Nashorn to create a script string out of the Data objects that looks like this:
final String script = engine.getScript(); // "var datum1=1; var datum2=2; var datum3=datum1*10; var datum4=datum3+datum2;"
I know I can parse the script with the Nashorn Parser:
final CompilationUnitTree tree = parser.parse("test", script, null);
But how do I extract the dependencies:
List<Data> whatDependsOn(Data input){
    // Process the parsed tree
    return list;
}
such that whatDependsOn(datum2) returns [datum4] and whatDependsOn(datum1) returns [datum3, datum4] ?
Or the inverse function getReferencedVariables such that getReferencedVariables(datum3) returns [datum1] and getReferencedVariables(datum4) returns [datum2, datum3] (and I can recursively query getReferencedVariables until all referenced variables have been found).
Basically, when the "value" of one of my Data objects changes (due to an external event), how do I determine which of my script formulae are affected and need to be recomputed?
I know that the Nashorn script can be parsed, but I cannot figure out how to use the SimpleTreeVisitorES6 to build up a variable dependency graph:
final CompilationUnitTree tree = parser.parse("test", script, null);
if (tree != null) {
    tree.accept(new SimpleTreeVisitorES6<Void, Void>() {
        @Override
        public Void visitVariable(VariableTree tree, Void v) {
            final Kind kind = tree.getKind();
            System.out.println("Found a variable: " + kind);
            System.out.println("  name: " + kind.toString());
            IdentifierTree binding = (IdentifierTree) tree.getBinding();
            System.out.println("  kind: " + binding.getKind().name());
            System.out.println("  name: " + binding.getName());
            System.out.println("  val:  " + kind.name());
            return null;
        }
    }, null);
}
One of the Nashorn devs here. What you are trying to do is compute the so-called def-use relations on source code (well, more likely their transitive closure, but I digress). That's a well-understood compiler theory concept. The good news is that CompilationUnitTree and friends should give you enough information to implement an algorithm for computing this information. The bad news is you'll have to roll up your sleeves and roll your own implementation, I'm afraid. You'll basically have to gather this information and produce merges at control flow join points (back edges and exits of loops, ends of if statements; you'll also have to handle more exotic stuff like switch/case with its fallthrough semantics, and try/catch/finally, which is the least fun of these, as control can basically transfer from anywhere in a try block to a catch block). Your algorithm will also have to repeatedly evaluate loop bodies until the static information you're gathering reaches a fixpoint.
FWIW, while writing Nashorn I had to implement these kinds of things a few times using Nashorn's internal parser API (which is different from, but similar to, the public one). If you want some inspiration, you can look into the source code of the Nashorn static type analyzer, which infers the types of local variables in a JavaScript function; I wrote it some years ago. If nothing else, it'll give you an idea of how to walk an AST and keep track of control flow edges and partially computed static analysis data at the edges.
I wish there were an easier way to do this… FWIW, a generalized static analyzer that helps you with the bookkeeping of control flow could be possible. Good luck.
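That said, for the straight-line scripts in the question (a flat list of var declarations with no loops or branches) the bookkeeping collapses to a single pass, so a much simpler visitor yields the getReferencedVariables map. A hedged sketch, assuming plain "var name = expr;" declarations and the jdk.nashorn.api.tree API already used above; inverting the resulting map gives whatDependsOn, and its transitive closure gives everything to recompute:

import java.util.*;
import jdk.nashorn.api.tree.*;

public class DependencyExtractor {
    // For each "var name = expr;", record which identifiers expr reads.
    public static Map<String, Set<String>> referencedVariables(String script) {
        CompilationUnitTree tree = Parser.create().parse("script", script, null);
        Map<String, Set<String>> deps = new LinkedHashMap<>();
        if (tree == null) return deps;
        tree.accept(new SimpleTreeVisitorES6<Void, Void>() {
            @Override
            public Void visitVariable(VariableTree node, Void v) {
                String name = ((IdentifierTree) node.getBinding()).getName();
                Set<String> reads = new LinkedHashSet<>();
                if (node.getInitializer() != null) {
                    // The default visit methods descend into sub-expressions,
                    // so this collects every identifier in the initializer.
                    node.getInitializer().accept(new SimpleTreeVisitorES6<Void, Void>() {
                        @Override
                        public Void visitIdentifier(IdentifierTree id, Void u) {
                            reads.add(id.getName());
                            return null;
                        }
                    }, null);
                }
                deps.put(name, reads);
                return null;
            }
        }, null);
        return deps;
    }

    public static void main(String[] args) {
        System.out.println(referencedVariables(
                "var datum1=1; var datum2=2; var datum3=datum1*10; var datum4=datum3+datum2;"));
        // {datum1=[], datum2=[], datum3=[datum1], datum4=[datum3, datum2]}
    }
}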
Input:
String ipXmlString = "<root>"
+ "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
+ "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
+ "</root>";
I tried the following to mask the values:
String op = ipXmlString.replaceAll("<accntNo>(.+?)</accntNo>", "######");
But the above code replaces the entire element, tags and all:
<root><accntNoGrp>######</accntNoGrp><accntNoGrp>######</accntNoGrp></root>
Expected Output:
<root><accntNoGrp><accntNo>#####67</accntNo></accntNoGrp><accntNoGrp><accntNo>#####23</accntNo></accntNoGrp></root>
How can I achieve this using Java regex? Could someone help?
Your replacement is wrong; you need to include the <accntNo> tag in the actual replacement. Also, it appears that you want to show the last two characters of the account number. In this case, we can capture this information during the match and use it in the replacement.
Code:
String op = ipXmlString.replaceAll("<accntNo>(?:.+?)(.{2})</accntNo>", "<accntNo>######$1</accntNo>");
Explanation:
<accntNo> match an opening tag
(?:.+?) match, but do not capture, everything up to the last two characters
(.{2}) two characters before closing tag (and capture this)
</accntNo> match a closing tag
Note here that by using ?: inside the parentheses in the pattern, we tell the regex engine not to capture that group. There is no point in capturing anything before the last two characters of the account number because we don't want to use it.
The $1 in the replacement refers to the first capture group, which here holds the last two characters of the account number. Hence, we build the replacement string you want this way.
Demo here:
Rextester
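In case the demo link goes stale, here is the same replacement as a self-contained snippet, with its actual output in the comment:

public class MaskDemo {
    public static void main(String[] args) {
        String ipXmlString = "<root>"
                + "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
                + "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
                + "</root>";
        // Keep the captured last two characters, mask the rest.
        String op = ipXmlString.replaceAll("<accntNo>(?:.+?)(.{2})</accntNo>",
                "<accntNo>######$1</accntNo>");
        System.out.println(op);
        // <root><accntNoGrp><accntNo>######67</accntNo></accntNoGrp><accntNoGrp><accntNo>######23</accntNo></accntNoGrp></root>
    }
}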
Try this code:
public static void main(String[] args) {
    String ipXmlString = "<root>"
            + "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
            + "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
            + "</root>";
    String replaceAll = ipXmlString.replaceAll("\\d+", "######");
    System.out.println(replaceAll);
}
Prints:
<root><accntNoGrp><accntNo>######</accntNo></accntNoGrp><accntNoGrp><accntNo>######</accntNo></accntNoGrp></root>
I have to code a Lexer in Java for a dialect of BASIC.
I grouped all the TokenTypes in an enum:
public enum TokenType {
    INT("-?[0-9]+"),
    BOOLEAN("(TRUE|FALSE)"),
    PLUS("\\+"),
    MINUS("\\-"),
    // others.....
}
The name is the TokenType name, and inside the brackets is the regex that I use to match that type.
If I want to match the INT type I use "-?[0-9]+".
But now I have a problem. I put all the TokenType regexes into a StringBuffer with this:
private String pattern() {
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (TokenType token : TokenType.values())
        tokenPatternsBuffer.append("|(?<" + token.name() + ">" + token.getPattern() + ")");
    String tokenPatternsString = tokenPatternsBuffer.toString().substring(1);
    return tokenPatternsString;
}
So it returns a String like:
(?<INT>-?[0-9]+)|(?<BOOLEAN>(TRUE|FALSE))|(?<PLUS>\+)|(?<MINUS>\-)|(?<PRINT>PRINT)....
Now I use this string to create a Pattern:
Pattern pattern = Pattern.compile(STRING);
Then I create a Matcher
Matcher match = pattern.matcher("line of code");
Now I want to match all the TokenTypes and group them into an ArrayList of Token. If the code syntax is correct it returns an ArrayList of Tokens (token name, value).
But I don't know how to exit the while-loop if the syntax is incorrect and then print an error.
This is a piece of code used to create the ArrayList of Token.
private void lex() {
    ArrayList<Token> tokens = new ArrayList<Token>();
    int tokenSize = TokenType.values().length;
    int counter = 0;
    // Iterate over arrayLinee (ArrayList of String) to get matches of pattern
    for (String linea : arrayLinee) {
        counter = 0;
        Matcher match = pattern.matcher(linea);
        while (match.find()) {
            System.out.println(match.group(1));
            counter = 0;
            for (TokenType token : TokenType.values()) {
                counter++;
                if (match.group(token.name()) != null) {
                    tokens.add(new Token(token, match.group(token.name())));
                    counter = 0;
                    continue;
                }
            }
            if (counter == tokenSize) {
                System.out.println("Syntax Error in line : " + linea);
                break;
            }
        }
        tokenList.add("EOL");
    }
}
The code doesn't break out if the for-loop iterates over all TokenTypes without matching any regex. How can I return an error if the syntax isn't correct?
Or do you know where I can find information on developing a lexer?
All you need to do is add an extra "INVALID" token at the end of your enum type with a regex like ".+" (match everything). Because the regexes are evaluated in order, it will only match if no other token was found. You then check to see if the last token in your list was the INVALID token.
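A sketch of that approach, reusing the names from the question (the Token class is assumed to exist as in the original code):

public enum TokenType {
    INT("-?[0-9]+"),
    BOOLEAN("(TRUE|FALSE)"),
    PLUS("\\+"),
    MINUS("\\-"),
    // ... other token types ...
    INVALID(".+"); // keep this last: it matches only when nothing above did

    private final String pattern;
    TokenType(String pattern) { this.pattern = pattern; }
    public String getPattern() { return pattern; }
}

Inside the matching loop, exactly one named group is non-null per find(), so the counter bookkeeping disappears:

while (match.find()) {
    for (TokenType token : TokenType.values()) {
        String value = match.group(token.name());
        if (value == null) continue; // this alternative didn't match
        if (token == TokenType.INVALID) {
            System.out.println("Syntax Error in line : " + linea);
            return; // or throw an exception
        }
        tokens.add(new Token(token, value));
        break;
    }
}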
If you are working in Java, I recommend trying out ANTLR 4 for creating your lexer. The grammar syntax is much cleaner than regular expressions, and the lexer generated from your grammar will automatically support reporting syntax errors.
If you are writing a full lexer, I'd recommend using an existing grammar builder. ANTLR is one solution, but I personally recommend parboiled instead, which allows you to write grammars in pure Java.
Not sure if this was answered, or whether you came to an answer, but a lexer is broken into two distinct phases: the scanning phase and the parsing phase. You can combine them into one single pass (regex matching), but you'll find that a single-pass lexer has weaknesses if you need to do anything more than the most basic string translations.
In the scanning phase you're breaking the character sequence apart based on specific tokens that you've specified. It would have helped to include an example of the text you were trying to parse, but Wikipedia has a great example of a simple text lexer that turns a sentence into tokens (e.g. str.split(' ')). So with the scanner you're going to tokenize the block of text into chunks by spaces (this should almost always be the first action), and then you're going to tokenize further based on other tokens, such as what you're attempting to match.
Then the parsing/evaluation phase will iterate over each token and decide what to do with it depending on the business logic, syntax rules, etc. that you set. This could be expressing some sort of math function to perform (e.g. max(3,2)), or a more common example is query language building. You might make a web app that has a specific query language (SOLR comes to mind, as well as any SQL/NoSQL DB) that is translated into another language to make requests against a datasource. Lexers are commonly used in IDEs for code hinting and auto-completion as well.
This isn't a code-based answer, but it's an answer that should give you an idea on how to tackle the problem.
I'm trying to implement a tool for merging different versions of some source code. Given two versions of the same source code, the idea is to parse them, generate the respective Abstract Syntax Trees (ASTs), and finally merge them into a single output source while keeping grammatical consistency - the lexer and parser are those of the question ANTLR: How to skip multiline comments.
I know the class ParserRuleReturnScope helps... but getStop() and getStart() always return null :-(
Here is a snippet that illustrates how I modified my parser to get rules printed:
parser grammar CodeTableParser;

options {
    tokenVocab = CodeTableLexer;
    backtrack = true;
    output = AST;
}

@header {
    package ch.bsource.ice.parsers;
}

@members {
    private void log(ParserRuleReturnScope rule) {
        System.out.println("Rule: " + rule.getClass().getName());
        System.out.println("  getStart(): " + rule.getStart());
        System.out.println("  getStop(): " + rule.getStop());
        System.out.println("  getTree(): " + rule.getTree());
    }
}

parse
    : codeTabHeader codeTable endCodeTable eof { log(retval); }
    ;

codeTabHeader
    : comment CodeTabHeader^ { log(retval); }
    ;

...
Assuming you have the ASTs (often difficult to get in the first place, parsing real languages is often harder than it looks), you first have to determine what they have in common, and build a mapping collecting that information. That's not as easy as it looks; do you count a block of code that has moved, but is the same exact subtree, as "common"? What about two subtrees that are the same except for consistent renaming of an identifier? What about changed comments? (most ASTs lose the comments; most programmers will think this is a really bad idea).
You can build a variation of the "Longest Common Substring" algorithm to compare trees. I've used that in tools that I have built.
Finally, after you've merged the trees, you need to regenerate the text, ideally preserving most of the layout of the original code. (Programmers hate it when you change the layout they so lovingly produced.) So your ASTs need to capture position information, and your regeneration has to honor that where it can.
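As a toy illustration of the "what do they have in common" step, the sketch below fingerprints every subtree of two hypothetical ASTs and intersects the sets. The Node class is made up for the example; a real merge tool also needs positions, comments, and tolerance for renamings, but the shape of the computation is the same:

import java.util.*;

public class CommonSubtrees {
    static class Node {
        final String label;
        final List<Node> children;
        Node(String label, Node... kids) { this.label = label; this.children = Arrays.asList(kids); }

        // Canonical string for structural equality; fine for small trees.
        String fingerprint() {
            StringBuilder sb = new StringBuilder(label).append('(');
            for (Node c : children) sb.append(c.fingerprint()).append(',');
            return sb.append(')').toString();
        }
    }

    static Set<String> fingerprints(Node n) {
        Set<String> out = new HashSet<>();
        out.add(n.fingerprint());
        for (Node c : n.children) out.addAll(fingerprints(c));
        return out;
    }

    public static void main(String[] args) {
        Node v1 = new Node("block", new Node("assign", new Node("x"), new Node("1")));
        Node v2 = new Node("block", new Node("assign", new Node("x"), new Node("1")),
                                    new Node("print", new Node("x")));
        Set<String> common = fingerprints(v1);
        common.retainAll(fingerprints(v2));
        System.out.println(common); // subtrees present in both versions
    }
}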
The call to log(retval) in your parser code looks like it's going to happen at the end of the rule, but it's not. You'll want to move the call into an @after block.
I changed log to spit out a message as well as the scope information and added calls to it to my own grammar like so:
script
    @init  {log("@init", retval);}
    @after {log("@after", retval);}
    : statement* EOF {log("after last rule reference", retval);}
    -> ^(STMTS statement*)
    ;
Parsing test input produced the following output:
Logging from @init
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): null
getTree(): null
Logging from after last rule reference
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): null
getTree(): null
Logging from @after
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): [#4,15:15='<EOF>',<-1>,1:15]
getTree(): STMTS
The call in the @after block has both the stop and tree fields populated.
I can't say whether this will help you with your merging tool, but I think this will at least get you past the problem with the half-populated scope object.