I'm writing a toolkit in Java that needs to parse Java expressions. I thought I'd try using ANTLR, since it seems to be used ubiquitously for this sort of thing and there don't seem to be many open-source alternatives.
I actually tried to write my own generalized parser a while back and gave up. That stuff's hard.
I have to say, after what I feel is a lot of reading and trying different things (more than I had expected to spend, anyway), ANTLR seems incredibly difficult to use. The API is very unintuitive--I'm never quite sure whether I'm calling it right.
Although ANTLR tutorials and examples abound, I haven't had luck finding any that involve parsing Java "expressions" -- everyone else seems to want to parse whole Java files.
I started off calling it like this:
Java8Lexer lexer = new Java8Lexer(CharStreams.fromString(text));
CommonTokenStream tokens = new CommonTokenStream(lexer);
Java8Parser parser = new Java8Parser(tokens);
ParseTree result = parser.expression();
but that wouldn't parse the whole expression. E.g. with text "a.b" it would return a result that only consisted of the "a" part, just quitting after the first thing it could parse.
Fine. So I changed to:
String input = "return " + text + ";";
Java8Lexer lexer = new Java8Lexer(CharStreams.fromString(input));
CommonTokenStream tokens = new CommonTokenStream(lexer);
Java8Parser parser = new Java8Parser(tokens);
ParseTree result = parser.returnStatement();
result = result.getChild(1);
thinking this would force it to parse the entire expression, then I could just extract the part I cared about. That worked for name expressions like "a.b", but if I try to parse a method expression like "a.b.c(d)" it gives an error:
line 1:12 mismatched input '(' expecting '.'
Interestingly, a(), a.b(), and a.b.c parse fine, but a.b.c() also dies with the same error.
Is there an ANTLR expert here who might have an idea what I'm doing wrong?
Separately, it bothers me quite a bit that the error above is printed to stderr, but I can't find it in the result object anywhere. I'd like to be able to present that error message (vague as it is) to the user that entered the expression--they may not be looking at a console, and even if they are, there's no context there. Is there a way to find that information in the result I get back?
Any help is greatly appreciated.
For a rule like expression, ANTLR will stop parsing once it recognizes an expression.
You can force it to continue by adding an EOF token to your start rule.
(You don't want to modify the actual `expression` rule; instead, add a rule like this:)
expressionStart : expression EOF ;
Then you can use:
ParseTree result = parser.expressionStart();
This will force ANTLR to continue parsing your input until it reaches the end of your input.
re: returnStatement
When I run return a.b.c(); through the ANTLR Preview in IntelliJ, I get this parse tree:
A little bit of following the grammar rules, and I stumble across these rules:
typeName: Identifier | packageOrTypeName '.' Identifier;
packageOrTypeName
: Identifier
| packageOrTypeName '.' Identifier
;
That both rules include an alternative for packageOrTypeName '.' Identifier looks problematic to me.
In the tree, we see primaryNoNewArray_lfno_primary:2 which indicates a match of the second alternative in this rule:
primaryNoNewArray_lfno_primary
: literal
| typeName ('[' ']')* '.' 'class' // <-- trying to match this rule
| unannPrimitiveType ('[' ']')* '.' 'class'
| 'void' '.' 'class'
| 'this'
| typeName '.' 'this'
| '(' expression ')'
| classInstanceCreationExpression_lfno_primary
| fieldAccess_lfno_primary
| arrayAccess_lfno_primary
| methodInvocation_lfno_primary
| methodReference_lfno_primary
;
I'm out of time at the moment, but will keep looking at it. It seems pretty unlikely there's this obvious a bug in Java8Parser.g4, but it certainly looks like one at the moment. I'm not sure what about the context would change how this is parsed (by context, I mean where returnStatement is natively called in the grammar).
I tried this input (starting with the compilationUnit rule):
class Test {
    class A {
        public B b;
    }

    class B {
        String c() {
            return "";
        }
    }

    String test() {
        A a = new A();
        return a.b.c();
    }
}
And it parses correctly (so, we've not found a major bug in the Java8Parser grammar 😔):
Still, this doesn't seem right.
Getting closer:
If I start with the block rule, and wrap in curly braces ({return a.b.c();}), it parses fine.
I'm going to go with the theory that ANTLR needs a bit more lookahead to resolve an "ambiguity".
Related
I have an ANTLR4 grammar designed for a domain-specific language that is embedded into a text template.
There are two modes:
Text (whitespace should be preserved)
Code (whitespace should be ignored)
Sample grammar part:
template
    : '{' templateBody '}'
    ;

templateBody
    : templateChunk*
    ;

templateChunk
    : code # codeChunk // dsl code, ignore whitespace
    | text # textChunk // any text, preserve whitespace
    ;
The rule for code may contain a nested reference to the template rule. So the parser must support nesting whitespace/non-whitespace sections.
Maybe lexer modes can help - with some drawbacks:
the code sections must be parsed in another compiler pass
I doubt that nested sections could be mapped correctly
Yet the most promising approach seems to be the manipulation of hidden channels.
My question: Is there a best practice for meeting these requirements? Is there an example grammar that has already solved similar problems?
Appendix:
The rest of the grammar could look like the following:
code
    : '#' function
    ;

function
    : Identifier '(' argument ')'
    ;

argument
    : function
    | template
    ;

text
    : Whitespace+
    | Identifier
    | .+
    ;

Identifier
    : LETTER (LETTER|DIGIT)*
    ;

Whitespace
    : [ \t\n\r] -> channel(HIDDEN)
    ;

fragment LETTER
    : [a-zA-Z]
    ;

fragment DIGIT
    : [0-9]
    ;
In this example, code has a dummy implementation pointing out that it can contain nested code/template sections. Actually, code should support:
multiple arguments
primitive type Arguments (ints, strings, ...)
maps and lists
function evaluation
...
This is how I solved the problem in the end:
The idea is to enable/disable whitespace in a parser rule:
templateBody : {enableWs();} templateChunk* {disableWs();};
So we will have to define enableWs and disableWs in our parser base class:
public void enableWs() {
    if (_input instanceof MultiChannelTokenStream) {
        ((MultiChannelTokenStream) _input).enable(HIDDEN);
    }
}

public void disableWs() {
    if (_input instanceof MultiChannelTokenStream) {
        ((MultiChannelTokenStream) _input).disable(HIDDEN);
    }
}
Now what is this MultiChannelTokenStream?
ANTLR4 defines a CommonTokenStream, which is a token stream that reads from only one channel.
MultiChannelTokenStream is a token stream that reads from all enabled channels. For the implementation, I took the source code of CommonTokenStream and replaced each reference to the single channel with a set of channels (the equality comparison becomes a contains comparison).
An example implementation with the grammar above can be found at antlr4multichannel
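To make the channel-switching idea concrete, here is a tiny self-contained model of it. MiniToken, MiniMultiChannelStream, and sample() are hypothetical stand-ins invented for this sketch, not the real ANTLR types:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hand-rolled sketch of the multi-channel idea.
public class MiniMultiChannelStream {
    static final int DEFAULT = 0, HIDDEN = 1;

    record MiniToken(String text, int channel) {}

    private final List<MiniToken> tokens;
    private final Set<Integer> enabled = new HashSet<>(List.of(DEFAULT));

    MiniMultiChannelStream(List<MiniToken> tokens) {
        this.tokens = tokens;
    }

    void enable(int channel)  { enabled.add(channel); }
    void disable(int channel) { enabled.remove(channel); }

    // The "contains comparison" that replaces CommonTokenStream's
    // single-channel equality check: a token is visible only if its
    // channel is currently in the enabled set.
    String visibleText() {
        StringBuilder sb = new StringBuilder();
        for (MiniToken t : tokens) {
            if (enabled.contains(t.channel())) {
                sb.append(t.text());
            }
        }
        return sb.toString();
    }

    // Sample stream: "a" and "b" on the default channel, a space hidden.
    static MiniMultiChannelStream sample() {
        return new MiniMultiChannelStream(List.of(
                new MiniToken("a", DEFAULT),
                new MiniToken(" ", HIDDEN),
                new MiniToken("b", DEFAULT)));
    }
}
```

With the hidden channel disabled the space disappears; enabling it makes the whitespace visible again, which is exactly what the enableWs()/disableWs() actions toggle during parsing.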
I would like to define a grammar that should parse words related to units of measure, e.g. 'kg', 'KG', 'kilogram', 'kilograms' for kilograms, or 'l', 'liters', 'litres' for liters, etc.
I am already doing something similar using a Java enum class to validate input strings supposed to represent a unit of measure.
I was wondering if it's possible to reuse the units of measure already defined in the enum class inside the ANTLR grammar file. Basically, I would like to define a lexer rule in a .g4 grammar file like:
UNITS: UnitMeasures.values()
where the .values() method returns the enum values inside the UnitMeasures enum Java class. This "should be equivalent" to the ANTLR lexer rule:
UNITS: ('kg' | 'KG' | 'kilograms' | 'l' | 'litres' | 'liters' );
The reasons why I am trying to do this are:
I would like to avoid code duplication between the enum Java class and the ANTLR grammar file;
I can not use only ANTLR and delete the enum Java class as it is already used in many different places;
Now I am trying to use the units of measure in a more complex scenario where I need to parse amounts, units of measure, and other related things, so I decided to use ANTLR.
Is it possible to avoid this code duplication somehow?
If the enums were not already present in your program, I would suggest generating runtime artifacts based on the grammar itself.
Since you already have the enums defined, let's implement unit recognition after parsing is complete with an AbstractParseTreeVisitor.
1)
Add a units parser rule and generalize your UNITS lexer rule:
...
unit : ID
;
...
ID: [a-zA-Z_0-9]+ ; // whatever you want/need
...
Now your grammar does not duplicate any code, but your rule for units is too general. We will solve this on the Java side of things.
2)
Generate a visitor and override visitUnit(UnitContext).
@Override
public Object visitUnit(UnitContext ctx) {
    String unitId = ctx.ID().getText();
    try {
        // Next line will throw an exception if unitId is not
        // the name of one of your enums.
        UnitMeasures unit = UnitMeasures.valueOf(unitId);
        // do something maybe?
    } catch (IllegalArgumentException e) {
        throw new RuntimeException("Invalid unit: " + unitId);
    }
    return super.visitUnit(ctx);
}
This will eliminate any code duplication. Now, any time you add a new enum to UnitMeasures, you don't have to alter your grammar. You won't even need to regenerate your parser.
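The valueOf check at the heart of this approach can be sketched in isolation. UnitMeasures and isValidUnit here are stand-ins for the asker's enum and the visitor's logic, reduced to a plain static method:

```java
// Hypothetical stand-in for the asker's UnitMeasures enum, showing the
// valueOf-based check the visitor performs.
public class UnitCheck {
    enum UnitMeasures { kg, KG, kilograms, l, litres, liters }

    // Returns true when the token text names one of the enum constants,
    // mirroring what visitUnit does with the ID token's text.
    static boolean isValidUnit(String text) {
        try {
            UnitMeasures.valueOf(text);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

Note that valueOf is case-sensitive, which is why the enum lists both 'kg' and 'KG' as separate constants.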
Another option:
This will add a java dependency to your grammar, but you could add a little action right after your unit rule which could respond appropriately if the ID was not a valid unit based on your enum.
unit : ID
       {
         try {
           UnitMeasures.valueOf($ID.text);
         }
         catch (IllegalArgumentException e) {
           // report invalid unit
         }
       }
     ;
I'm developing a small IDE for some language using ANTLR4 and need to underline erroneous characters when the lexer fails to match them. The built in org.antlr.v4.runtime.ANTLRErrorListener implementation outputs a message to stderr in such cases, similar to this:
line 35:25 token recognition error at: 'foo\n'
I have no problem understanding how information about line and column of the error is obtained (passed as arguments to syntaxError callback), but how do I get the 'foo\n' string inside the callback?
When a parser is the source of the error, it passes the offending token as the second argument of syntaxError callback, so it becomes trivial to extract information about the start and stop offsets of the erroneous input and this is also explained in the reference book. But what about the case when the source is a lexer? The second argument in the callback is null in this case, presumably since the lexer failed to form a token.
I need the length of the unmatched characters to know how much to underline, but while debugging my listener implementation I could not find this information anywhere in the supplied callback arguments (other than extracting it from the supplied error message through string manipulation, which would be just wrong). The 'foo\n' string may clearly be obtained somehow, so what am I missing?
I suspect that I might be looking in the wrong place and that I should be looking at extending DefaultErrorStrategy where error messages get formed.
You should write your lexer such that a syntax error is impossible. In ANTLR 4, it is easy to do this by simply adding the following as the last rule of your lexer:
ErrorChar : . ;
By doing this, your errors are moved from the lexer to the parser.
In some cases, you can take additional steps to help users while they edit code in your IDE. For example, suppose your language supports double-quoted strings of the following form, which cannot span multiple lines:
StringLiteral : '"' ~[\r\n"]* '"';
You can improve error reporting in your IDE by using the following pair of rules:
StringLiteral : '"' ~[\r\n"]* '"';
UnterminatedStringLiteral : '"' ~[\r\n"]*;
You can then override the emit() method to treat the UnterminatedStringLiteral in a special way. As a result, the user sees a great error message and the parser sees a single StringLiteral token that it can generally handle well.
@Override
public Token emit() {
    switch (getType()) {
        case UnterminatedStringLiteral:
            setType(StringLiteral);
            Token result = super.emit();
            // you'll need to define this method
            reportError(result, "Unterminated string literal");
            return result;
        default:
            return super.emit();
    }
}
I have to code a Lexer in Java for a dialect of BASIC.
I grouped all the TokenTypes in an enum:
public enum TokenType {
    INT("-?[0-9]+"),
    BOOLEAN("(TRUE|FALSE)"),
    PLUS("\\+"),
    MINUS("\\-"),
    //others.....
}
The name is the TokenType name, and in the brackets is the regex I use to match that type.
If I want to match the INT type, I use "-?[0-9]+".
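As a quick sanity check of that pattern with java.util.regex (a standalone snippet, not part of the lexer itself):

```java
import java.util.regex.Pattern;

public class IntPatternDemo {
    // Same pattern as the INT token type above.
    static final Pattern INT = Pattern.compile("-?[0-9]+");

    static boolean isInt(String s) {
        // matches() requires the whole string to match the pattern.
        return INT.matcher(s).matches();
    }
}
```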
But now I have a problem. I put all the TokenType regexes into a StringBuffer with this:
private String pattern() {
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (TokenType token : TokenType.values())
        tokenPatternsBuffer.append("|(?<" + token.name() + ">" + token.getPattern() + ")");
    String tokenPatternsString = tokenPatternsBuffer.toString().substring(1);
    return tokenPatternsString;
}
So it returns a String like:
(?<INT>-?[0-9]+)|(?<BOOLEAN>(TRUE|FALSE))|(?<PLUS>\+)|(?<MINUS>\-)|(?<PRINT>PRINT)....
Now I use this string to create a Pattern:
Pattern pattern = Pattern.compile(STRING);
Then I create a Matcher
Matcher match = pattern.matcher("line of code");
Now I want to match all the TokenTypes and group them into an ArrayList of Tokens. If the code syntax is correct, it returns an ArrayList of Tokens (token name, value).
But I don't know how to exit the while loop if the syntax is incorrect, so I can print an error.
This is a piece of code used to create the ArrayList of Token.
private void lex() {
    ArrayList<Token> tokens = new ArrayList<Token>();
    int tokenSize = TokenType.values().length;
    int counter = 0;
    // Iterate over the arrayLinee (ArrayList of String) to get matches of pattern
    for (String linea : arrayLinee) {
        counter = 0;
        Matcher match = pattern.matcher(linea);
        while (match.find()) {
            System.out.println(match.group(1));
            counter = 0;
            for (TokenType token : TokenType.values()) {
                counter++;
                if (match.group(token.name()) != null) {
                    tokens.add(new Token(token, match.group(token.name())));
                    counter = 0;
                    continue;
                }
            }
            if (counter == tokenSize) {
                System.out.println("Syntax Error in line : " + linea);
                break;
            }
        }
        tokenList.add("EOL");
    }
}
The code doesn't break out if the for loop iterates over all the TokenTypes without matching any of their regexes. How can I report an error if the syntax isn't correct?
Or do you know where I can find information on developing a lexer?
All you need to do is add an extra "INVALID" token at the end of your enum type with a regex like ".+" (match everything). Because the regexes are evaluated in order, it will only match if no other token was found. You then check whether the last token in your list was the INVALID token.
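A minimal self-contained sketch of that idea (the enum constants, the whitespace handling, and the list-of-names output format are invented for illustration; the asker's Token class is omitted):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyLexer {
    enum TokenType {
        INT("-?[0-9]+"),
        PLUS("\\+"),
        WHITESPACE("[ \t]+"),
        INVALID(".+");          // last: matches only when nothing else did

        final String pattern;
        TokenType(String pattern) { this.pattern = pattern; }
    }

    // Build the combined named-group pattern, exactly as in the question.
    static final Pattern PATTERN;
    static {
        StringBuilder sb = new StringBuilder();
        for (TokenType t : TokenType.values())
            sb.append("|(?<").append(t.name()).append(">").append(t.pattern).append(")");
        PATTERN = Pattern.compile(sb.substring(1));
    }

    // Returns the token type names found in the line; an INVALID entry
    // marks the point where the syntax broke.
    static List<String> lex(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PATTERN.matcher(line);
        while (m.find()) {
            for (TokenType t : TokenType.values()) {
                if (m.group(t.name()) != null) {
                    if (t != TokenType.WHITESPACE) tokens.add(t.name());
                    break;
                }
            }
        }
        return tokens;
    }
}
```

Because regex alternation tries the branches left to right at each position, INVALID can only win when every real token type has failed, so its presence in the output is a reliable syntax-error signal.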
If you are working in Java, I recommend trying out ANTLR 4 for creating your lexer. The grammar syntax is much cleaner than regular expressions, and the lexer generated from your grammar will automatically support reporting syntax errors.
If you are writing a full lexer, I'd recommend using an existing grammar builder. ANTLR is one solution, but I personally recommend parboiled instead, which allows you to write grammars in pure Java.
Not sure if this was answered, or you came to an answer, but lexical analysis is normally broken into two distinct phases: the scanning phase and the parsing phase. You can combine them into one single pass (regex matching), but you'll find that a single-pass lexer has weaknesses if you need to do anything more than the most basic of string translations.
In the scanning phase you're breaking the character sequence apart based on the specific tokens that you've specified. It would have helped to include an example of the text you were trying to parse, but Wikipedia has a great example of a simple text lexer that turns a sentence into tokens (e.g. str.split(' ')). So with the scanner you tokenize the block of text into chunks by spaces (this should almost always be the first action), and then you tokenize even further based on other tokens, such as those you're attempting to match.
Then the parsing/evaluation phase iterates over each token and decides what to do with it depending on the business logic, syntax rules, etc. that you set up. This could be evaluating some sort of math function (e.g. max(3,2)); a more common example is query-language building. You might make a web app with a specific query language (SOLR comes to mind, as well as any SQL/NoSQL DB) that is translated into another language to make requests against a data source. Lexers are commonly used in IDEs for code hinting and auto-completion as well.
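The first (scanning) step described above is, at its simplest, just a whitespace split; SimpleScanner is a hypothetical name for this sketch:

```java
public class SimpleScanner {
    // First-pass tokenization by whitespace, as in the Wikipedia example;
    // a real scanner would then classify each chunk against its token types.
    static String[] scan(String sentence) {
        return sentence.trim().split("\\s+");
    }
}
```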
This isn't a code-based answer, but it's an answer that should give you an idea on how to tackle the problem.
I am trying to build an ANTLR grammar that parses tagged sentences such as:
DT The NP cat VB ate DT a NP rat
and have the grammar:
fragment TOKEN : (('A'..'Z') | ('a'..'z'))+;
fragment WS : (' ' | '\t')+;
WSX : WS;
DTTOK : ('DT' WS TOKEN);
NPTOK : ('NP' WS TOKEN);
nounPhrase: (DTTOK WSX NPTOK);
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase+")");};
The grammar generator generates the "missing attribute access on rule scope: nounPhrase" in the last line.
[I am still new to ANTLR and although some grammars work it's still trial and error. I also frequently get an "OutOfMemory" error when running grammars as small as this - any help welcome.]
I am using ANTLRWorks 1.3 to generate the code and am running under Java 1.6.
"missing attribute access" means that you've referenced a scope ($nounPhrase) rather than an attribute of the scope (such as $nounPhrase.text).
In general, a good way to troubleshoot problems with attributes is to look at the generated parser method for the rule in question.
For example, my initial attempt at creating a new rule when I was a little rusty:
multiple_names returns [List<Name> names]
@init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add($a.value); names.add($b.value); };
resulted in "unknown attribute for rule fullname". So I tried
multiple_names returns [List<Name> names]
@init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add($a); names.add($b); };
which results in "missing attribute access". Looking at the generated parser method made it clear what I needed to do though. While there are some cryptic pieces, the parts relevant to scopes (variables) are easily understood:
public final List<Name> multiple_names() throws RecognitionException {
    List<Name> names = null; // based on "returns" clause of rule definition
    Name a = null; // based on scopes declared in rule definition
    Name b = null; // based on scopes declared in rule definition
    names = new ArrayList<Name>(4); // snippet inserted from `@init` block
    try {
        pushFollow(FOLLOW_fullname_in_multiple_names42);
        a = fullname();
        state._fsp--;
        match(input, 189, FOLLOW_189_in_multiple_names44);
        pushFollow(FOLLOW_fullname_in_multiple_names48);
        b = fullname();
        state._fsp--;
        names.add($a); names.add($b); // code inserted from {...} block
    }
    catch (RecognitionException re) {
        reportError(re);
        recover(input, re);
    }
    finally {
        // do for sure before leaving
    }
    return names; // based on "returns" clause of rule definition
}
After looking at the generated code, it's easy to see that the fullname rule is returning instances of the Name class, so what I needed in this case was simply:
multiple_names returns [List<Name> names]
@init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add(a); names.add(b); };
The version you need in your situation may be different, but you'll generally be able to figure it out pretty easily by looking at the generated code.
In the original grammar, why not include the attribute it is asking for? Most likely:
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase.text+")");};
Each of your rules (chunker being the one I can spot quickly) has attributes (extra information) associated with it. You can find a quick list of the different attributes for the different types of rules at http://www.antlr.org/wiki/display/ANTLR3/Attribute+and+Dynamic+Scopes. It would be nice if that page had descriptions for each attribute; for example, the start and stop attributes of parser rules refer to tokens from your lexer, which would allow you to get back to your line number and position.
I think your chunker rule should just be changed slightly, instead of $nounPhrase you should use $nounPhrase.text. text is an attribute for your nounPhrase rule.
You might want to do a little other formatting as well; usually the parser rules (which start with a lowercase letter) appear before the lexer rules (which start with an uppercase letter).
If you accidentally do something silly like $thing.$attribute where you mean $thing.attribute, you will also see the missing attribute access on rule scope error message. (I know this question was answered a long time ago, but this bit of trivia might help someone else who sees the error message!)
Answering my own question after having found a better way...
WS : (' '|'\t')+;
TOKEN : (('A'..'Z') | ('a'..'z'))+;
dttok : 'DT' WS TOKEN;
nntok : 'NN' WS TOKEN;
nounPhrase : (dttok WS nntok);
chunker : nounPhrase ;
The problem was I was getting muddled between the lexer and the parser (this is apparently very common). The uppercase items are lexical, the lowercase in the parser. This now seems to work. (NB I have changed NP to NN).