How to merge two ASTs? - java

I'm trying to implement a tool for merging different versions of some source code. Given two versions of the same source code, the idea is to parse them, generate the respective abstract syntax trees (ASTs), and finally merge them into a single output source while keeping grammatical consistency. The lexer and parser are those from the question "ANTLR: How to skip multiline comments".
I know there is class ParserRuleReturnScope that helps... but getStop() and getStart() always return null :-(
Here is a snippet that illustrates how I modified my parser to get rules printed:
parser grammar CodeTableParser;
options {
tokenVocab = CodeTableLexer;
backtrack = true;
output = AST;
}
@header {
package ch.bsource.ice.parsers;
}
@members {
private void log(ParserRuleReturnScope rule) {
System.out.println("Rule: " + rule.getClass().getName());
System.out.println(" getStart(): " + rule.getStart());
System.out.println(" getStop(): " + rule.getStop());
System.out.println(" getTree(): " + rule.getTree());
}
}
parse
: codeTabHeader codeTable endCodeTable eof { log(retval); }
;
codeTabHeader
: comment CodeTabHeader^ { log(retval); }
;
...

Assuming you have the ASTs (often difficult to get in the first place, parsing real languages is often harder than it looks), you first have to determine what they have in common, and build a mapping collecting that information. That's not as easy as it looks; do you count a block of code that has moved, but is the same exact subtree, as "common"? What about two subtrees that are the same except for consistent renaming of an identifier? What about changed comments? (most ASTs lose the comments; most programmers will think this is a really bad idea).
You can build a variation of the "Longest Common Substring" algorithm to compare trees. I've used that in tools that I have built.
Finally, after you've merged the trees, you need to regenerate the text, ideally preserving most of the layout of the original code. (Programmers hate it when you change the layout they so lovingly produced.) So your ASTs need to capture position information, and your regeneration has to honor that where it can.
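The answer does not say which tree comparison variant its author used. As one possible illustration, here is a minimal sketch of the classic "simple tree matching" scheme, which applies a longest-common-subsequence style recurrence to the children of each pair of nodes. The Node class and all names are hypothetical stand-ins for a real AST, and the sketch ignores moved subtrees, renamed identifiers and comments.

```java
import java.util.List;

class TreeMatch {
    /** Hypothetical stand-in for an AST node: a label plus ordered children. */
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = List.of(children);
        }
    }

    /**
     * Number of node pairs in the largest order-preserving, top-down match
     * between the trees rooted at a and b (0 if the roots differ).
     */
    static int match(Node a, Node b) {
        if (!a.label.equals(b.label)) {
            return 0;
        }
        int m = a.children.size();
        int n = b.children.size();
        // best[i][j]: best match using the first i children of a and first j of b.
        int[][] best = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int paired = best[i - 1][j - 1]
                        + match(a.children.get(i - 1), b.children.get(j - 1));
                best[i][j] = Math.max(paired, Math.max(best[i - 1][j], best[i][j - 1]));
            }
        }
        return 1 + best[m][n]; // the matching roots count as one pair
    }

    public static void main(String[] args) {
        Node v1 = new Node("block",
                new Node("assign", new Node("x"), new Node("1")),
                new Node("print", new Node("x")));
        Node v2 = new Node("block",
                new Node("assign", new Node("x"), new Node("2")),
                new Node("call", new Node("f")),
                new Node("print", new Node("x")));
        System.out.println(match(v1, v2)); // prints 5
    }
}
```

Recording which child pairs produced the maximum, rather than only the count, would give the mapping of common nodes that a merge step needs.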

The call to log(retval) in your parser code looks like it's going to happen at the end of the rule, but it doesn't: at that point the rule's stop token and tree have not been set yet. You'll want to move the call into an @after block.
I changed log to spit out a message as well as the scope information and added calls to it to my own grammar like so:
script
@init {log("@init", retval);}
@after {log("@after", retval);}
: statement* EOF {log("after last rule reference", retval);}
-> ^(STMTS statement*)
;
Parsing test input produced the following output:
Logging from @init
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): null
getTree(): null
Logging from after last rule reference
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): null
getTree(): null
Logging from @after
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): [#4,15:15='<EOF>',<-1>,1:15]
getTree(): STMTS
The call in the after block has both the stop and tree fields populated.
I can't say whether this will help you with your merging tool, but I think this will at least get you past the problem with the half-populated scope object.

Related

XMLUnit - compare xml and ignore few tags based on a condition

I have a couple of XML files that need to be compared with a different set of similar XML files, and while comparing I need to ignore tags based on a condition, for example:
personal.xml - ignore fullname
address.xml - ignore zipcode
contact.xml - ignore homephone
Here is the code:
Diff documentDiff=DiffBuilder
.compare(actualxmlfile)
.withTest(expectedxmlfile)
.withNodeFilter(node -> !node.getNodeName().equals("FullName"))
.ignoreWhitespace()
.build();
How can I add conditions at ".withNodeFilter(node -> !node.getNodeName().equals("FullName"))", or is there a smarter way to do this?
You can join multiple conditions together using "and" (&&):
private static void doDemo1(File actual, File expected) {
Diff docDiff = DiffBuilder
.compare(actual)
.withTest(expected)
.withNodeFilter(
node -> !node.getNodeName().equals("FullName")
&& !node.getNodeName().equals("ZipCode")
&& !node.getNodeName().equals("HomePhone")
)
.ignoreWhitespace()
.build();
System.out.println(docDiff.toString());
}
If you want to keep your builder tidy, you can move the node filter to a separate method:
private static void doDemo2(File actual, File expected) {
Diff docDiff = DiffBuilder
.compare(actual)
.withTest(expected)
.withNodeFilter(node -> testNode(node))
.ignoreWhitespace()
.build();
System.out.println(docDiff.toString());
}
private static boolean testNode(Node node) {
return !node.getNodeName().equals("FullName")
&& !node.getNodeName().equals("ZipCode")
&& !node.getNodeName().equals("HomePhone");
}
The risk with this is you may have element names which appear in more than one type of file - where that node needs to be filtered from one type of file, but not any others.
In this case, you would also need to take into account the type of file you are handling. For example, you can use the file names (if they follow a suitable naming convention) or use the root elements (assuming they are different) - such as <Personal>, <Address>, <Contact> - or whatever they are, in your case.
However, if you need to distinguish between XML file types, for this reason, you may be better off using that information to have separate DiffBuilder objects, with different filters. That may result in clearer code.
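A minimal sketch of that idea, assuming the file name identifies the file type: one filter per file type, looked up from a table. The NodeFilters class and its method names are hypothetical; the element names are the ones used in the examples above.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;
import org.w3c.dom.Node;

class NodeFilters {
    // Assumed mapping from file name to the element names to skip for that file type.
    private static final Map<String, Set<String>> IGNORED = Map.of(
            "personal.xml", Set.of("FullName"),
            "address.xml", Set.of("ZipCode"),
            "contact.xml", Set.of("HomePhone"));

    /** True if an element with this name should take part in the comparison. */
    static boolean keep(String fileName, String elementName) {
        return !IGNORED.getOrDefault(fileName, Set.of()).contains(elementName);
    }

    /** Node filter for one file type, in the shape withNodeFilter expects. */
    static Predicate<Node> forFile(String fileName) {
        return node -> keep(fileName, node.getNodeName());
    }

    public static void main(String[] args) {
        System.out.println(keep("address.xml", "ZipCode"));  // false: filtered out
        System.out.println(keep("personal.xml", "ZipCode")); // true: still compared
    }
}
```

With this in place, each builder would call .withNodeFilter(NodeFilters.forFile(actual.getName())), so a ZipCode element is skipped in address.xml but still compared in the other file types.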
I provided a separate method for the check !node.getNodeName().equals("FullName") (which you are using in your code) in the answer linked below. With that method you can pass in an array of the node names you want to ignore and see the results. If you need other conditions for your requirements, you can add them to that method.
https://stackoverflow.com/a/68099435/13451711

How do I track variable dependencies in Nashorn?

I would like to use the Nashorn engine as a general computation engine. It is powerful and fast, has plenty of built-in functions, and new functions are very easy to add using @FunctionalInterface or static methods. Even better, it also provides value-adds like cyclic dependency checking, syntax checking, etc.
However I need to automatically update "output" variables when a dependency changes.
The general idea is that in Java, I'll have something like:
class CalculationEngine {
Data addData(String name, Number value){
...
}
Data addData(String name, String formula){
...
}
String getScript(){
...
}
}
CalculationEngine engine = new CalculationEngine();
Data datum1 = engine.addData("datum1", 1); // Constant integer 1
Data datum2 = engine.addData("datum2", 2); // Constant integer 2
Data datum3 = engine.addData("datum3", "datum1*10");
Data datum4 = engine.addData("datum4", "datum3+datum2");
The CalculationEngine service class knows how to use Nashorn to create a script string out of the Data objects that looks like this:
final String script = engine.getScript(); // "var datum1=1; var datum2=2; var datum3=datum1*10; var datum4=datum3+datum2;"
I know I can parse the script with the Nashorn Parser:
final CompilationUnitTree tree = parser.parse("test", script, null);
But how do I extract the dependencies:
List<Data> whatDependsOn(Data input){
// Process the parsed tree
return list;
}
such that whatDependsOn(datum2) returns [datum4] and whatDependsOn(datum1) returns [datum3, datum4] ?
Or the inverse function getReferencedVariables such that getReferencedVariables(datum3) returns [datum1] and getReferencedVariables(datum4) returns [datum2, datum3] (and I can recursively query getReferencedVariables until all referenced variables have been found).
Basically, when the "value" of one of my Data objects changes (due to an external event), how do I determine which of my script formulae are affected and need to be recomputed?
I know that the Nashorn script can be parsed but I can not figure out how to use the SimpleTreeVisitorES6 to build up a variable dependency graph:
final CompilationUnitTree tree = parser.parse("test", script, null);
if (tree != null) {
tree.accept(new SimpleTreeVisitorES6<Void, Void>() {
@Override
public Void visitVariable(VariableTree tree, Void v) {
final Kind kind = tree.getKind();
System.out.println("Found a variable: " + kind);
System.out.println(" name: " + kind.toString());
IdentifierTree binding = (IdentifierTree) tree.getBinding();
System.out.println(" kind: " + binding.getKind().name());
System.out.println(" name: " + binding.getName());
System.out.println(" val: " + kind.name());
return null;
}
}, null);
}
One of the Nashorn devs here. What you are trying to do is compute the so-called def-use relations on source code (well, more likely their transitive closure, but I digress). That's a well-understood compiler theory concept. The good news is that CompilationUnitTree and friends should give you enough information to implement an algorithm for computing this information. The bad news is you'll have to roll up your sleeves and roll your own implementation, I'm afraid. You'll basically have to gather this information and produce merges at control flow join points (back edges and exits of loops, ends of if statements, but you'll also have to handle more exotic stuff like switch/case with its fallthrough semantics and also try/catch/finally, which is the least fun of these, as control can basically transfer from anywhere in the try block to a catch block). Your algorithm will also have to repeatedly evaluate loop bodies until the static information you're gathering reaches a fixpoint.
FWIW, while writing Nashorn I had to implement these kinds of things a few times using Nashorn's internal parser API (which is different from but similar to the public one). If you want some inspiration, you can look into the source code of the Nashorn static type analyzer for inferring types of local variables in a JavaScript function, which is something I wrote some years ago. If nothing else, it'll give you an idea of how to walk an AST and keep track of control flow edges and partially computed static analysis data at the edges.
I wish there were an easier way to do this… FWIW, a generalized static analyzer that helps you with the bookkeeping of control flow could be possible. Good luck.
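For the restricted case shown in the question (straight-line scripts where each variable is assigned once and there are no branches or loops), the general flow analysis described above reduces to reachability in a graph. The sketch below covers only that special case. It assumes the identifiers read by each formula have already been collected, for instance by a tree visitor, and the DependencyGraph class and its method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DependencyGraph {
    // variable -> variables whose formulas read it directly
    private final Map<String, Set<String>> dependents = new HashMap<>();

    /** Records that the formula of 'variable' reads each name in 'references'. */
    void define(String variable, String... references) {
        for (String ref : references) {
            dependents.computeIfAbsent(ref, k -> new TreeSet<>()).add(variable);
        }
    }

    /** All variables to recompute when 'changed' changes (transitive closure). */
    Set<String> whatDependsOn(String changed) {
        Set<String> result = new TreeSet<>();
        Deque<String> pending = new ArrayDeque<>(List.of(changed));
        while (!pending.isEmpty()) {
            for (String dependent : dependents.getOrDefault(pending.pop(), Set.of())) {
                if (result.add(dependent)) { // visit each variable once, even with cycles
                    pending.push(dependent);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        DependencyGraph graph = new DependencyGraph();
        graph.define("datum3", "datum1");           // datum3 = datum1*10
        graph.define("datum4", "datum3", "datum2"); // datum4 = datum3+datum2
        System.out.println(graph.whatDependsOn("datum1")); // [datum3, datum4]
        System.out.println(graph.whatDependsOn("datum2")); // [datum4]
    }
}
```

As soon as formulas can contain conditionals, loops or reassignment, this simple graph is no longer sufficient and the join-point merging described in the answer is needed.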

Using Java code in antlr grammar g4 file

I would like to define a grammar that should parse words that are related to units of measure e.g. for kilograms: 'kg', 'KG', 'kilogram', 'kilograms', 'l', 'liters', 'litres' etc.
I am already doing something similar using a Java enum class to validate input strings supposed to represent a unit of measure.
I was wondering if it's possible to reuse the already defined units of measure in the enum class inside the ANTLR grammar file. Basically I would like to set a lexer in a .g4 grammar file like:
UNITS: UnitMeasures.values()
Where the .values() method returns the enum values inside the UnitMeasures enum Java class, this "should be equivalent" to ANTLR grammar lexer:
UNITS: ('kg' | 'KG' | 'kilograms' | 'l' | 'litres' | 'liters' );
The reasons why I am trying to do this are:
I would like to avoid code duplication between the enum Java class and the ANTLR grammar file;
I can not use only ANTLR and delete the enum Java class as it is already used in many different places;
Now I am trying to use the units of measure in a more complex scenario where I need to parse amounts, units of measures and other related stuff, so I decided to use ANTLR.
Is it possible to avoid this code duplication somehow?
If the enums were not already present in your program, I would suggest generating runtime artifacts based on the grammar itself.
Since you already have the enums defined, let's implement unit recognition after parsing is complete with an AbstractParseTreeVisitor.
1)
Add a unit parser rule and generalize your UNITS lexer rule into a generic ID rule:
...
unit : ID
;
...
ID: [a-zA-Z_0-9]+ ; // whatever you want/need
...
Now your grammar does not duplicate any code, but your rule for units is too general. We will solve this on the Java side of things.
2)
Generate a visitor and override visitUnit(UnitContext).
@Override
public Object visitUnit(UnitContext ctx) {
// ctx.ID() returns a TerminalNode; getText() yields the matched characters.
String unitId = ctx.ID().getText();
try {
// Next line will throw an exception if unitId is not
// the name of one of your enum constants.
UnitMeasures unit = UnitMeasures.valueOf(unitId);
// do something maybe?
} catch (IllegalArgumentException e) {
throw new RuntimeException("Invalid unit: " + unitId);
}
return super.visitUnit(ctx);
}
This will eliminate any code duplication. Now, any time you add a new enum to UnitMeasures, you don't have to alter your grammar. You won't even need to regenerate your parser.
Another option:
This will add a Java dependency to your grammar, but you could add a small action right after your unit rule that responds appropriately if the ID is not a valid unit according to your enum.
unit : ID
{
try {
UnitMeasures.valueOf($unit.text);
}
catch(IllegalArgumentException e) {
//report invalid unit
}
}
;
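Both variants above rely on UnitMeasures.valueOf, which only matches the exact name of an enum constant. If the input uses several spellings per unit ('kg', 'KG', 'kilograms' and so on, as in the question), the enum itself can carry those spellings. The following is a sketch under that assumption: the real UnitMeasures class is not shown in the question, so the constants, the spellings field and fromText are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical shape for the enum; the real UnitMeasures class is not shown.
enum UnitMeasures {
    KILOGRAM("kg", "KG", "kilogram", "kilograms"),
    LITRE("l", "liters", "litres");

    private final List<String> spellings;

    // Lookup table from every accepted spelling to its unit.
    private static final Map<String, UnitMeasures> BY_SPELLING = new HashMap<>();
    static {
        for (UnitMeasures unit : values()) {
            for (String spelling : unit.spellings) {
                BY_SPELLING.put(spelling, unit);
            }
        }
    }

    UnitMeasures(String... spellings) {
        this.spellings = List.of(spellings);
    }

    /** Returns the unit for this spelling, or an empty Optional if unknown. */
    static Optional<UnitMeasures> fromText(String text) {
        return Optional.ofNullable(BY_SPELLING.get(text));
    }

    public static void main(String[] args) {
        System.out.println(fromText("kilograms")); // Optional[KILOGRAM]
        System.out.println(fromText("furlong"));   // Optional.empty
    }
}
```

The visitor or grammar action would then call UnitMeasures.fromText(unitId) instead of valueOf and report an invalid unit when the result is empty, which still keeps the list of units in one place.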

Is there any Checkstyle/PMD/Findbugs rule to force "else if" to be on the same line?

In our project, for chained if/else/if we would like to have the following formatting:
if (flag1) {
// Do something 1
} else if (flag2) {
// Do something 2
} else if (flag3) {
// Do something 3
}
And forbid the following one:
if (flag1) {
// Do something 1
} else {
if (flag2) {
// Do something 2
} else {
if (flag3) {
// Do something 3
}
}
}
Is there a predefined rule in any of the static code analysis tools listed above to enforce this code style? If not, I know that all of those tools support custom rules; which one would you suggest for implementing such a rule (I'm not really familiar with writing custom rules in any of them)?
It can be done with CheckStyle, but you'll have to code a custom check.
Using a custom check allows you to completely ignore comments. The line number that a token is on can be determined by calling getLineNo() on the DetailAST. Here's what the AST looks like, with line number information (red circles):
The custom check's code will likely be quite short. You basically register for LITERAL_ELSE tokens and see if LITERAL_IF is their only child. Also remember to handle SLISTs. In those cases, LITERAL_IF and RCURLY should be the only children. Both cases are illustrated in the above picture.
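To make the decision logic concrete without depending on the Checkstyle library, here is a sketch that models it on a tiny stand-in tree. The Ast class is hypothetical and only mimics the parts of DetailAST that matter here (token type, line number, children). The reading of the rule is also an assumption: an else whose only child is an if is accepted when both are on the same line, and an else whose block contains nothing but an if and the closing brace is flagged.

```java
import java.util.List;

class ElseIfRule {
    /** Hypothetical stand-in for Checkstyle's DetailAST. */
    static final class Ast {
        final String type;
        final int line;
        final List<Ast> children;

        Ast(String type, int line, Ast... children) {
            this.type = type;
            this.line = line;
            this.children = List.of(children);
        }
    }

    /** True if this LITERAL_ELSE node breaks the "else if on one line" style. */
    static boolean violates(Ast elseNode) {
        if (elseNode.children.size() != 1) {
            return false;
        }
        Ast only = elseNode.children.get(0);
        // "else if": acceptable only when the if is on the same line as the else.
        if (only.type.equals("LITERAL_IF")) {
            return only.line != elseNode.line;
        }
        // "else { if ... }": a block holding just the if and the closing brace.
        return only.type.equals("SLIST")
                && only.children.size() == 2
                && only.children.get(0).type.equals("LITERAL_IF")
                && only.children.get(1).type.equals("RCURLY");
    }

    public static void main(String[] args) {
        Ast chained = new Ast("LITERAL_ELSE", 3, new Ast("LITERAL_IF", 3));
        Ast nested = new Ast("LITERAL_ELSE", 3,
                new Ast("SLIST", 3, new Ast("LITERAL_IF", 4), new Ast("RCURLY", 6)));
        System.out.println(violates(chained)); // false
        System.out.println(violates(nested));  // true
    }
}
```

In a real check the same tests would be made on DetailAST nodes inside the check's visitToken method, using getLineNo() as mentioned above, and a violation would be logged instead of returning a boolean.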
Alternative using a RegExp check
For the record, I originally thought one could also configure a regex match using else[ \t{]*[\r\n]+[ \t{]*if\b for the format property (based on this post).
Here's the mentioned regex as a railroad diagram:
This turned out not to be feasible, because it produces false negatives when there are comments between else and if. Worse, it also produces false positives when the nested if is followed by unrelated code (like else { if() {...} <block of code>}). Thanks @Anatoliy for pointing this out! Since comments, and matching braces mixed with comments, cannot be reliably handled by regexes, these problems rule out the RegExp approach.
This post says you can't do it in Checkstyle.
In PMD you definitely can. The AST (abstract syntax tree) is different.
For the pattern you don't want
if (true) {
String a;
} else {
if (true) {
String b;
}
}
The tree looks like:
<IfStatement>
<Expression>...</Expression>
<Statement>...</Statement>
<Statement>
<Block>
<BlockStatement>
<IfStatement>...
For the pattern you do want
if (true) {
String a;
} else if (true) {
String b;
}
The tree looks like:
<IfStatement>
<Expression>...</Expression>
<Statement>...</Statement>
<Statement>
<IfStatement>...
In PMD 4 (which I used to make these trees), you write a rule by writing a XPath expression matching the pattern you don't want to occur.

ANTLR: "missing attribute access on rule scope" problem

I am trying to build an ANTLR grammar that parses tagged sentences such as:
DT The NP cat VB ate DT a NP rat
and have the grammar:
fragment TOKEN : (('A'..'Z') | ('a'..'z'))+;
fragment WS : (' ' | '\t')+;
WSX : WS;
DTTOK : ('DT' WS TOKEN);
NPTOK : ('NP' WS TOKEN);
nounPhrase: (DTTOK WSX NPTOK);
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase+")");};
The grammar generator generates the "missing attribute access on rule scope: nounPhrase" in the last line.
[I am still new to ANTLR and although some grammars work it's still trial and error. I also frequently get an "OutOfMemory" error when running grammars as small as this - any help welcome.]
I am using ANTLRWorks 1.3 to generate the code and am running under Java 1.6.
"missing attribute access" means that you've referenced a scope ($nounPhrase) rather than an attribute of the scope (such as $nounPhrase.text).
In general, a good way to troubleshoot problems with attributes is to look at the generated parser method for the rule in question.
For example, my initial attempt at creating a new rule when I was a little rusty:
multiple_names returns [List<Name> names]
@init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add($a.value); names.add($b.value); };
resulted in "unknown attribute for rule fullname". So I tried
multiple_names returns [List<Name> names]
@init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add($a); names.add($b); };
which results in "missing attribute access". Looking at the generated parser method made it clear what I needed to do though. While there are some cryptic pieces, the parts relevant to scopes (variables) are easily understood:
public final List<Name> multiple_names() throws RecognitionException {
List<Name> names = null; // based on "returns" clause of rule definition
Name a = null; // based on scopes declared in rule definition
Name b = null; // based on scopes declared in rule definition
names = new ArrayList<Name>(4); // snippet inserted from `@init` block
try {
pushFollow(FOLLOW_fullname_in_multiple_names42);
a=fullname();
state._fsp--;
match(input,189,FOLLOW_189_in_multiple_names44);
pushFollow(FOLLOW_fullname_in_multiple_names48);
b=fullname();
state._fsp--;
names.add($a); names.add($b);// code inserted from {...} block
}
catch (RecognitionException re) {
reportError(re);
recover(input,re);
}
finally {
// do for sure before leaving
}
return names; // based on "returns" clause of rule definition
}
After looking at the generated code, it's easy to see that the fullname rule is returning instances of the Name class, so what I needed in this case was simply:
multiple_names returns [List<Name> names]
@init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add(a); names.add(b); };
The version you need in your situation may be different, but you'll generally be able to figure it out pretty easily by looking at the generated code.
In the original grammar, why not include the attribute it is asking for, most likely:
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase.text+")");};
Each of your rules (chunker being the one I can spot quickly) has attributes (extra information) associated with it. You can find a quick list of the attributes for the different types of rules at http://www.antlr.org/wiki/display/ANTLR3/Attribute+and+Dynamic+Scopes. It would be nice if that page described each attribute; for example, the start and stop attributes of parser rules refer to tokens from your lexer, which lets you get back to the line number and position.
I think your chunker rule should just be changed slightly, instead of $nounPhrase you should use $nounPhrase.text. text is an attribute for your nounPhrase rule.
You might want to do a little other formatting as well: usually the parser rules (which start with a lowercase letter) appear before the lexer rules (which start with an uppercase letter).
PS. When I type in the box the chunker rule is starting on a new line but in my original answer it didn't start on a new line.
If you accidentally do something silly like $thing.$attribute where you mean $thing.attribute, you will also see the missing attribute access on rule scope error message. (I know this question was answered a long time ago, but this bit of trivia might help someone else who sees the error message!)
Answering question after having found a better way...
WS : (' '|'\t')+;
TOKEN : (('A'..'Z') | ('a'..'z'))+;
dttok : 'DT' WS TOKEN;
nntok : 'NN' WS TOKEN;
nounPhrase : (dttok WS nntok);
chunker : nounPhrase ;
The problem was I was getting muddled between the lexer and the parser (this is apparently very common). The uppercase items are lexical, the lowercase in the parser. This now seems to work. (NB I have changed NP to NN).
