Using Java code in antlr grammar g4 file - java

I would like to define a grammar that should parse words that are related to units of measure e.g. for kilograms: 'kg', 'KG', 'kilogram', 'kilograms', 'l', 'liters', 'litres' etc.
I am already doing something similar using a Java enum class to validate input strings supposed to represent a unit of measure.
I was wondering if it's possible to reuse the already defined units of measure in the enum class inside the ANTLR grammar file. Basically I would like to set a lexer in a .g4 grammar file like:
UNITS: UnitMeasures.values()
Where the .values() method returns the enum values inside the UnitMeasures enum Java class, this "should be equivalent" to ANTLR grammar lexer:
UNITS: ('kg' | 'KG' | 'kilograms' | 'l' | 'litres' | 'liters' );
The reasons why I am trying to do this are:
I would like to avoid code duplication between the enum Java class and the ANTLR grammar file;
I can not use only ANTLR and delete the enum Java class as it is already used in many different places;
Now I am trying to use the units of measure in a more complex scenario where I need to parse amounts, units of measures and other related stuff, so I decided to use ANTLR.
Is it possible to avoid this code duplication somehow?

If the enums were not already present in your program, I would suggest generating runtime artifacts based on the grammar itself.
Since you already have the enums defined, let's implement unit recognition after parsing is complete with an AbstractParseTreeVisitor.
1)
Add a units parser rule and generalize your UNITS lexer rule:
...
unit : ID
;
...
ID: [a-zA-Z_0-9]+ ; // whatever you want/need
...
Now your grammar does not duplicate any code, but your rule for units is too general. We will solve this on the java side of things.
2)
Generate a visitor and override visitUnit(UnitContext).
#Override
public Object visitUnit(UnitContext ctx) {
String unitId = ctx.ID();
try{
// Next line will throw exception if unitId is not
// the name of one of your enums.
UnitMeasures unit = UnitMeasures.valueOf(unitId);
// do something maybe?
} catch (IllegalArgumentException(e) {
throw new RuntimeException("Invalid unit: " + unitId);
}
return super.visitUnit(ctx);
}
This will eliminate any code duplication. Now, any time you add a new enum to UnitMeasures, you don't have to alter your grammar. You won't even need to regenerate your parser.
Another option:
This will add a java dependency to your grammar, but you could add a little action right after your unit rule which could respond appropriately if the ID was not a valid unit based on your enum.
unit : ID
{
try {
UnitMeasures.valueOf($unit.text);
}
catch(IllegalArgumentException e) {
//report invalid unit
}
}
;

Related

Java expression parsing with ANTLR

I'm writing a toolkit in Java that uses Java expression parsing. I thought I'd try using ANTLR since
It seems to be used ubiquitously for this sort of thing
There don't seem to be a lot of open source alternatives
I actually tried to write my own generalized parser a while back and gave up. That stuff's hard.
I have to say, after what I feel is a lot of reading and trying different things (more than I had expected to spend, anyway), ANTLR seems incredibly difficult to use. The API is very unintuitive--I'm never quite sure whether I'm calling it right.
Although ANTLR tutorials and examples abound, I haven't had luck finding any examples that involve parsing Java "expressions" -- everyone else seems to want to parse whole java files.
I started off calling it like this:
Java8Lexer lexer = new Java8Lexer(CharStreams.fromString(text));
CommonTokenStream tokens = new CommonTokenStream(lexer);
Java8Parser parser = new Java8Parser(tokens);
ParseTree result = parser.expression();
but that wouldn't parse the whole expression. E.g. with text "a.b" it would return a result that only consisted of the "a" part, just quitting after the first thing it could parse.
Fine. So I changed to:
String input = "return " + text + ";";
Java8Lexer lexer = new Java8Lexer(CharStreams.fromString(input));
CommonTokenStream tokens = new CommonTokenStream(lexer);
Java8Parser parser = new Java8Parser(tokens);
ParseTree result = parser.returnStatement();
result = result.getChild(1);
thinking this would force it to parse the entire expression, then I could just extract the part I cared about. That worked for name expressions like "a.b", but if I try to parse a method expression like "a.b.c(d)" it gives an error:
line 1:12 mismatched input '(' expecting '.'
Interestingly, a(), a.b(), and a.b.c parse fine, but a.b.c() also dies with the same error.
Is there an ANTLR expert here who might have an idea what I'm doing wrong?
Separately, it bothers me quite a bit that the error above is printed to stderr, but I can't find it in the result object anywhere. I'd like to be able to present that error message (vague as it is) to the user that entered the expression--they may not be looking at a console, and even if they are, there's no context there. Is there a way to find that information in the result I get back?
Any help is greatly appreciated.
For a rule like expression, ANTLR will stop parsing once it recognizes an expression.
You can force it to continue by adding an `EOF to you start rule.
(You don’t want to modify the actual `expressions rule, but you can add a rule like this:
expressionStart: expressions EOF;
Then you can use:
ParseTree result = parser.expressionStart();
This will force ANTLR to continue parsing you’re input until it reaches the end of you input.
re: returnStatement
When i run return a.b.c(); through the ANTLR Preview in IntelliJ, I get this parse tree:
A little bit of following the grammar rules, and I stumble across these rules:
typeName: Identifier | packageOrTypeName '.' Identifier;
packageOrTypeName
: Identifier
| packageOrTypeName '.' Identifier
;
That both rules include an alternative for packageOrTypeName '.' Identifier looks problematic to me.
In the tree, we see primaryNoNewArray_lfno_primary:2 which indicates a match of the second alternative in this rule:
primaryNoNewArray_lfno_primary
: literal
| typeName ('[' ']')* '.' 'class' // <-- trying to match this rule
| unannPrimitiveType ('[' ']')* '.' 'class'
| 'void' '.' 'class'
| 'this'
| typeName '.' 'this'
| '(' expression ')'
| classInstanceCreationExpression_lfno_primary
| fieldAccess_lfno_primary
| arrayAccess_lfno_primary
| methodInvocation_lfno_primary
| methodReference_lfno_primary
;
I'm out of time at the moment, but will keep looking at it. It seems pretty unlikely there's this obvious a bug in the Java8Parser.g4, but it certainly seems like a bug at the moment. I'm not sure what about the context would change how this is parsed (by context, meaning where returnStatement is natively called in the grammar.)
I tried this input (starting with the compilationUnit rule:
class Test {
class A {
public B b;
}
class B {
String c() {
return "";
}
}
String test() {
A a = new A();
return a.b.c();
}
}
And it parses correctly (so, we've not found a major bug in the Java8Parser grammar 😔):
Still, this doesn't seem right.
Getting closer:
If I start with the block rule, and wrap in curly braces ({return a.b.c();}), it parses fine.
I'm going to go with the theory that ANTLR needs a bit more lookahead to resolve an "ambiguity".

How do I track variable dependencies in Nashorn?

I would like to use the Nashorn engine as a general computation engine. It is powerful, fast has plenty of built-in functions and new functions are very easy to add, using #FunctionalInterface or static methods. Even better, it also provides value-adds like cyclic dependency checking, syntax checking, etc.
However I need to automatically update "output" variables when a dependency changes.
The general idea is that in Java, I'll have something like:
class CalculationEngine {
Data addData(String name, Number value){
...
}
Data addData(String name, String formula){
...
}
String getScript(){
...
}
}
CalculationEngine engine = new CalculationEngine();
Data datum1 = engine.addData("datum1", 1); // Constant integer 1
Data datum2 = engine.addData("datum2", 2); // Constant integer 2
Data datum3 = engine.addData("datum3", "datum1*10");
Data datum4 = engine.addData("datum4", "datum3+datum2");
The CalculationEngine service class knows how to use Nashorn to create a script string out of the Data objects that looks like this:
final String script = engine.getScript(); // "var datum1=1; var datum2=2; var datum3=datum1*10; var datum4=datum3+datum2;"
I know I can parse the script with the Nashorn Parser:
final CompilationUnitTree tree = parser.parse("test", script, null);
But how do I extract the dependencies:
List<Data> whatDependsOn(Data input){
// Process the parsed tree
return list;
}
such that whatDependsOn(datum2) returns [datum4] and whatDependsOn(datum1) returns [datum3, datum4] ?
Or the inverse function getReferencedVariables such that getReferencedVariables(datum3) returns [datum1] and getReferencedVariables(datum4) returns [datum2, datum3] (and I can recursively query getReferencedVariables until all referenced variables have been found).
Basically, when the "value" of one of my Data objects change (due to an external event), how I determine which of my script formulae are affected and need to be recomputed?
I know that the Nashorn script can be parsed but I can not figure out how to use the SimpleTreeVisitorES6 to build up a variable dependency graph:
final CompilationUnitTree tree = parser.parse("test", script, null);
if (tree != null) {
tree.accept(new SimpleTreeVisitorES6<Void, Void>() {
#Override
public Void visitVariable(VariableTree tree, Void v) {
final Kind kind = tree.getKind();
System.out.println("Found a variable: " + kind);
System.out.println(" name: " + kind.toString());
IdentifierTree binding = (IdentifierTree) tree.getBinding();
System.out.println(" kind: " + binding.getKind().name());
System.out.println(" name: " + binding.getName());
System.out.println(" val: " + kind.name());
return null;
}
}, null);
}
one of Nashorn devs here. What you are trying to do is compute the so called def-use relations on source code (well, more likely their transitive closure, but I digress). That's a well-understood compiler theory concept. The good news is that CompilationUnitTree and friends should give you enough information to implement an algorithm for computing this information. The bad news is you'll have to roll up your sleeves and roll your own implementation, I'm afraid. You'll basically have to gather this information, produce merges at control flow join points (back edges and exits of loops, ends of if statements, but you'll also have to handle more exotic stuff like switch/case with their fallthrough semantics and also try/catch/finally, which is the least fun of these as basically control can transfer from anywhere in try block to a catch block.) Your algorithm will also have to repeatedly evaluate loop bodies until the static information you're gathering reaches a fixpoint.
FWIW, while writing Nashorn I had to implement these kinds of things few times using Nashorn's internal parser API (which is different but similar to the public one). If you want some inspiration, you can look into the source code for Nashorn static type analyzer for inferring types of local variables in a JavaScript function which is something I wrote some years ago. If nothing else, it'll give you an idea how to walk an AST tree and keep track of control flow edges and partially computed static analysis data at the edges.
I wish there were an easier way to do this… FWIW, a generalized static analyzer that helps you with bookeeping of flow control could be possible. Good luck.

Is there any Checkstyle/PMD/Findbugs rule to force "else if" to be on the same line?

In our project for chained if/else/if we would like to have following formatting:
if (flag1) {
// Do something 1
} else if (flag2) {
// Do something 2
} else if (flag3) {
// Do something 3
}
And forbid following one:
if (flag1) {
// Do something 1
} else {
if (flag2) {
// Do something 2
} else {
if (flag3) {
// Do something 3
}
}
}
Is there some predefined rule in either of listed above static code analysis tools to force this code style? If no - I know there is an ability to write custom rules in all of those tools, which one would you suggest to implement such a rule (not really familiar with writing custom rules in either of them)?
It can be done with CheckStyle, but you'll have to code a custom check.
Using a custom check allows you to completely ignore comments. The line number that a token is on can be determined by calling getLineNo() on the DetailAST. Here's what the AST looks like, with line number information (red circles):
The custom check's code will likely be quite short. You basically register for LITERAL_ELSE tokens and see if LITERAL_IF is their only child. Also remember to handle SLISTs. In those cases, LITERAL_IF and RCURLY should be the only children. Both cases are illustrated in the above picture.
Alternative using a RegExp check
For the record, I originally thought one could also configure a regex match using else[ \t{]*[\r\n]+[ \t{]*if\b for the format property (based on this post).
Here's the mentioned regex as a railroad diagram:
This turned out not to be feasible, because it produces false negatives when there are comments between between else and if. Worse, it also produces false positives when the nested if is followed by unrelated code (like else { if() {...} <block of code>}. Thanks #Anatoliy for pointing this out! Since comments and matching braces which are mixed with comments cannot be reliably grasped by regexes, these problems obsolete the RegExp approach.
This post says you can't do it in Checkstyle.
In PMD you definitely can. The AST (abstract syntax tree) is different.
For the pattern you don't want
if (true) {
String a;
} else {
if (true) {
String b;
}
}
The tree looks like:
<IfStatement>
<Expression>...</Expression>
<Statement>...</Statement>
<Statement>
<Block>
<BlockStatement>
<IfStatement>...
For the pattern you do want
if (true) {
String a;
} else if (true) {
String b;
}
The tree looks like:
<IfStatement>
<Expression>...</Expression>
<Statement>...</Statement>
<Statement>
<IfStatement>...
In PMD 4 (which I used to make these trees), you write a rule by writing a XPath expression matching the pattern you don't want to occur.

How to merge two ASTs?

I'm trying to implement a tool for merging different versions of some source code. Given two versions of the same source code, the idea would be to parse them, generate the respective Abstract Source Trees (AST), and finally merge them into a single output source keeping grammatical consistency - the lexer and parser are those of question ANTLR: How to skip multiline comments.
I know there is class ParserRuleReturnScope that helps... but getStop() and getStart() always return null :-(
Here is a snippet that illustrates how I modified my perser to get rules printed:
parser grammar CodeTableParser;
options {
tokenVocab = CodeTableLexer;
backtrack = true;
output = AST;
}
#header {
package ch.bsource.ice.parsers;
}
#members {
private void log(ParserRuleReturnScope rule) {
System.out.println("Rule: " + rule.getClass().getName());
System.out.println(" getStart(): " + rule.getStart());
System.out.println(" getStop(): " + rule.getStop());
System.out.println(" getTree(): " + rule.getTree());
}
}
parse
: codeTabHeader codeTable endCodeTable eof { log(retval); }
;
codeTabHeader
: comment CodeTabHeader^ { log(retval); }
;
...
Assuming you have the ASTs (often difficult to get in the first place, parsing real languages is often harder than it looks), you first have to determine what they have in common, and build a mapping collecting that information. That's not as easy as it looks; do you count a block of code that has moved, but is the same exact subtree, as "common"? What about two subtrees that are the same except for consistent renaming of an identifier? What about changed comments? (most ASTs lose the comments; most programmers will think this is a really bad idea).
You can build a variation of the "Longest Common Substring" algorithm to compare trees. I've used that in tools that I have built.
Finally, after you've merged the trees, now you need to regenerate the text, ideally preserving most of the layout of the original code. (Programmers hate when you change the layout they so loving produced). So your ASTs need to capture position information, and your regeneration has to honor that where it can.
The call to log(retval) in your parser code looks like it's going to happen at the end of the rule, but it's not. You'll want to move the call into an #after block.
I changed log to spit out a message as well as the scope information and added calls to it to my own grammar like so:
script
#init {log("#init", retval);}
#after {log("#after", retval);}
: statement* EOF {log("after last rule reference", retval);}
-> ^(STMTS statement*)
;
Parsing test input produced the following output:
Logging from #init
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): null
getTree(): null
Logging from after last rule reference
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): null
getTree(): null
Logging from #after
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): [#4,15:15='<EOF>',<-1>,1:15]
getTree(): STMTS
The call in the after block has both the stop and tree fields populated.
I can't say whether this will help you with your merging tool, but I think this will at least get you past the problem with the half-populated scope object.

ANTLR: "missing attribute access on rule scope" problem

I am trying to build an ANTLR grammar that parses tagged sentences such as:
DT The NP cat VB ate DT a NP rat
and have the grammar:
fragment TOKEN : (('A'..'Z') | ('a'..'z'))+;
fragment WS : (' ' | '\t')+;
WSX : WS;
DTTOK : ('DT' WS TOKEN);
NPTOK : ('NP' WS TOKEN);
nounPhrase: (DTTOK WSX NPTOK);
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase+")");};
The grammar generator generates the "missing attribute access on rule scope: nounPhrase" in the last line.
[I am still new to ANTLR and although some grammars work it's still trial and error. I also frequently get an "OutOfMemory" error when running grammars as small as this - any help welcome.]
I am using ANTLRWorks 1.3 to generate the code and am running under Java 1.6.
"missing attribute access" means that you've referenced a scope ($nounPhrase) rather than an attribute of the scope (such as $nounPhrase.text).
In general, a good way to troubleshoot problems with attributes is to look at the generated parser method for the rule in question.
For example, my initial attempt at creating a new rule when I was a little rusty:
multiple_names returns [List<Name> names]
#init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add($a.value); names.add($b.value); };
resulted in "unknown attribute for rule fullname". So I tried
multiple_names returns [List<Name> names]
#init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add($a); names.add($b); };
which results in "missing attribute access". Looking at the generated parser method made it clear what I needed to do though. While there are some cryptic pieces, the parts relevant to scopes (variables) are easily understood:
public final List<Name> multiple_names() throws RecognitionException {
List<Name> names = null; // based on "returns" clause of rule definition
Name a = null; // based on scopes declared in rule definition
Name b = null; // based on scopes declared in rule definition
names = new ArrayList<Name>(4); // snippet inserted from `#init` block
try {
pushFollow(FOLLOW_fullname_in_multiple_names42);
a=fullname();
state._fsp--;
match(input,189,FOLLOW_189_in_multiple_names44);
pushFollow(FOLLOW_fullname_in_multiple_names48);
b=fullname();
state._fsp--;
names.add($a); names.add($b);// code inserted from {...} block
}
catch (RecognitionException re) {
reportError(re);
recover(input,re);
}
finally {
// do for sure before leaving
}
return names; // based on "returns" clause of rule definition
}
After looking at the generated code, it's easy to see that the fullname rule is returning instances of the Name class, so what I needed in this case was simply:
multiple_names returns [List<Name> names]
#init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add(a); names.add(b); };
The version you need in your situation may be different, but you'll generally be able to figure it out pretty easily by looking at the generated code.
In the original grammer, why not include the attribute it is asking for, most likely:
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase.text+")");};
Each of your rules (chunker being the one I can spot quickly) have attributes (extra information) associated with them. You can find a quick list of the different attributes for the different types of rules at http://www.antlr.org/wiki/display/ANTLR3/Attribute+and+Dynamic+Scopes, would be nice if descriptions were put on the web page for each of those attributes (like for the start and stop attribute for the parser rules refer to tokens from your lexer - which would allow you to get back to your line number and position).
I think your chunker rule should just be changed slightly, instead of $nounPhrase you should use $nounPhrase.text. text is an attribute for your nounPhrase rule.
You might want to do a little other formating as well, usually the parser rules (start with lowercase letter) appear before the lexer rules (start with uppercase letter)
PS. When I type in the box the chunker rule is starting on a new line but in my original answer it didn't start on a new line.
If you accidentally do something silly like $thing.$attribute where you mean $thing.attribute, you will also see the missing attribute access on rule scope error message. (I know this question was answered a long time ago, but this bit of trivia might help someone else who sees the error message!)
Answering question after having found a better way...
WS : (' '|'\t')+;
TOKEN : (('A'..'Z') | ('a'..'z'))+;
dttok : 'DT' WS TOKEN;
nntok : 'NN' WS TOKEN;
nounPhrase : (dttok WS nntok);
chunker : nounPhrase ;
The problem was I was getting muddled between the lexer and the parser (this is apparently very common). The uppercase items are lexical, the lowercase in the parser. This now seems to work. (NB I have changed NP to NN).

Categories

Resources