ANTLR4 Lexer error reporting (length of offending characters)

ANTLR4 Lexer error reporting (length of offending characters) - java

I'm developing a small IDE for some language using ANTLR4 and need to underline erroneous characters when the lexer fails to match them. The built in org.antlr.v4.runtime.ANTLRErrorListener implementation outputs a message to stderr in such cases, similar to this:
line 35:25 token recognition error at: 'foo\n'
I have no problem understanding how information about line and column of the error is obtained (passed as arguments to syntaxError callback), but how do I get the 'foo\n' string inside the callback?
When a parser is the source of the error, it passes the offending token as the second argument of syntaxError callback, so it becomes trivial to extract information about the start and stop offsets of the erroneous input and this is also explained in the reference book. But what about the case when the source is a lexer? The second argument in the callback is null in this case, presumably since the lexer failed to form a token.
I need the length of unmatched characters to know how much to underline, but while debugging my listener implementation I could not find this information anywhere in the supplied callback arguments (other than extracting it from the supplied error message though string manipulation, which would be just wrong). The 'foo\n' string may clearly be obtained somehow, so what am I missing?
I suspect that I might be looking in the wrong place and that I should be looking at extending DefaultErrorStrategy where error messages get formed.

You should write your lexer such that a syntax error is impossible. In ANTLR 4, it is easy to do this by simply adding the following as the last rule of your lexer:
ErrorChar : . ;
By doing this, your errors are moved from the lexer to the parser.
In some cases, you can take additional steps to help users while they edit code in your IDE. For example, suppose your language supports double-quoted strings of the following form, which cannot span multiple lines:
StringLiteral : '"' ~[\r\n"]* '"';
You can improve error reporting in your IDE by using the following pair of rules:
StringLiteral : '"' ~[\r\n"]* '"';
UnterminatedStringLiteral : '"' ~[\r\n"]*;
You can then override the emit() method to treat the UnterminatedStringLiteral in a special way. As a result, the user sees a great error message and the parser sees a single StringLiteral token that it can generally handle well.
#Override
public Token emit() {
switch (getType()) {
case UnterminatedStringLiteral:
setType(StringLiteral);
Token result = super.emit();
// you'll need to define this method
reportError(result, "Unterminated string literal");
return result;
default:
return super.emit();
}
}

Related

Java expression parsing with ANTLR

I'm writing a toolkit in Java that uses Java expression parsing. I thought I'd try using ANTLR since
It seems to be used ubiquitously for this sort of thing
There don't seem to be a lot of open source alternatives
I actually tried to write my own generalized parser a while back and gave up. That stuff's hard.
I have to say, after what I feel is a lot of reading and trying different things (more than I had expected to spend, anyway), ANTLR seems incredibly difficult to use. The API is very unintuitive--I'm never quite sure whether I'm calling it right.
Although ANTLR tutorials and examples abound, I haven't had luck finding any examples that involve parsing Java "expressions" -- everyone else seems to want to parse whole java files.
I started off calling it like this:
Java8Lexer lexer = new Java8Lexer(CharStreams.fromString(text));
CommonTokenStream tokens = new CommonTokenStream(lexer);
Java8Parser parser = new Java8Parser(tokens);
ParseTree result = parser.expression();
but that wouldn't parse the whole expression. E.g. with text "a.b" it would return a result that only consisted of the "a" part, just quitting after the first thing it could parse.
Fine. So I changed to:
String input = "return " + text + ";";
Java8Lexer lexer = new Java8Lexer(CharStreams.fromString(input));
CommonTokenStream tokens = new CommonTokenStream(lexer);
Java8Parser parser = new Java8Parser(tokens);
ParseTree result = parser.returnStatement();
result = result.getChild(1);
thinking this would force it to parse the entire expression, then I could just extract the part I cared about. That worked for name expressions like "a.b", but if I try to parse a method expression like "a.b.c(d)" it gives an error:
line 1:12 mismatched input '(' expecting '.'
Interestingly, a(), a.b(), and a.b.c parse fine, but a.b.c() also dies with the same error.
Is there an ANTLR expert here who might have an idea what I'm doing wrong?
Separately, it bothers me quite a bit that the error above is printed to stderr, but I can't find it in the result object anywhere. I'd like to be able to present that error message (vague as it is) to the user that entered the expression--they may not be looking at a console, and even if they are, there's no context there. Is there a way to find that information in the result I get back?
Any help is greatly appreciated.

For a rule like expression, ANTLR will stop parsing once it recognizes an expression.
You can force it to continue by adding an `EOF to you start rule.
(You don’t want to modify the actual `expressions rule, but you can add a rule like this:
expressionStart: expressions EOF;
Then you can use:
ParseTree result = parser.expressionStart();
This will force ANTLR to continue parsing you’re input until it reaches the end of you input.
re: returnStatement
When i run return a.b.c(); through the ANTLR Preview in IntelliJ, I get this parse tree:
A little bit of following the grammar rules, and I stumble across these rules:
typeName: Identifier | packageOrTypeName '.' Identifier;
packageOrTypeName
: Identifier
| packageOrTypeName '.' Identifier
;
That both rules include an alternative for packageOrTypeName '.' Identifier looks problematic to me.
In the tree, we see primaryNoNewArray_lfno_primary:2 which indicates a match of the second alternative in this rule:
primaryNoNewArray_lfno_primary
: literal
| typeName ('[' ']')* '.' 'class' // <-- trying to match this rule
| unannPrimitiveType ('[' ']')* '.' 'class'
| 'void' '.' 'class'
| 'this'
| typeName '.' 'this'
| '(' expression ')'
| classInstanceCreationExpression_lfno_primary
| fieldAccess_lfno_primary
| arrayAccess_lfno_primary
| methodInvocation_lfno_primary
| methodReference_lfno_primary
;
I'm out of time at the moment, but will keep looking at it. It seems pretty unlikely there's this obvious a bug in the Java8Parser.g4, but it certainly seems like a bug at the moment. I'm not sure what about the context would change how this is parsed (by context, meaning where returnStatement is natively called in the grammar.)
I tried this input (starting with the compilationUnit rule:
class Test {
class A {
public B b;
}
class B {
String c() {
return "";
}
}
String test() {
A a = new A();
return a.b.c();
}
}
And it parses correctly (so, we've not found a major bug in the Java8Parser grammar 😔):
Still, this doesn't seem right.
Getting closer:
If I start with the block rule, and wrap in curly braces ({return a.b.c();}), it parses fine.
I'm going to go with the theory that ANTLR needs a bit more lookahead to resolve an "ambiguity".

ANTLR4 sending a null context attribute to listener

I'm creating a simple language compiler, and I'm facing an unexpected behavior. I simplified the grammar as follows:
grammar Language;
program : (varDecl)* (funcDecl)* EOF;
varDecl : type IDENTIFIER ('=' expression)? ';';
funcDecl : type IDENTIFIER '(' ')' statementBlock;
type : 'int' # IntType
;
statementBlock : '{' (statement)* '}';
statement : varDecl ;
expression : IDENTIFIER '(' (expression (',' expression)*)? ')' # FuncCallExpression
;
IDENTIFIER : ('a'..'z')+;
WHITE_SPACE : [ \t\u000C\n\r]+ -> skip;
As statementBlock is a mandatory rule inside the funcDecl rule, I would expect that, inside a listener, the FuncDeclContext always contains a non-null funcDecl. The problem is I'm getting a null statementBlock for the following input:
int b() {
}
i nt a() {
int x = b();
}
As far as I understand, when facing an invalid input, ANTLR inserts special nodes that represent the expected matching (like the example from page 163 of the book), but somehow that's not what's happening here (is it a bug?). When I use the following listener, I get "Oh no!":
public class DummyListener extends LanguageBaseListener {
#Override
public void exitFuncDecl(LanguageParser.FuncDeclContext ctx) {
super.exitFuncDecl(ctx);
if (ctx.statementBlock() == null) {
System.out.println("Oh, no :(");
}
}
}
What is the reason of this behavior?
Further investigation
I discovered an interesting behavior.
I changed the funcDecl rule to include an action:
funcDecl : type IDENTIFIER '(' ')' statementBlock { System.out.println("ID: " + $IDENTIFIER.text + ", text is: " + $statementBlock.text); };
and modified exitFuncDecl from the listener to print the identifier too:
System.out.println("Listener: id " + ctx.IDENTIFIER().getText());
if (ctx.statementBlock() == null) {
System.out.println("Oh, no :(");
} else {
System.out.println("content is " + ctx.statementBlock().getText());
}
The output was:
line 3:0 extraneous input 'i' expecting {<EOF>, 'int'}
ID: b, text is: {}
line 4:7 mismatched input '=' expecting '('
Listener: id b
content is {}
Listener: id x
Oh, no :(
It appears that ANTLR is calling exitFuncDecl but not the rule action. I think the rule action behavior is the right one here, as "x" is causing the null statementBlock. I still don't understand why this is happening.

This problem is probably related to ANTLR4 Error recovery. I do not know exactly how it works, but from former debugging sessions I know that the Parser:
inserts expected tokens
deletes tokens until an expected token occurs
From your error messages it seems probable, that recovery rewrites the token stream as following:
int b() {
}
/*deleted: i nt a() {*/
int x /*deleted = b();*/(){
}
Yet the insertion of (){ does not produce a statement block but an error node. So the function declaration will be visitable (although it starts with int x instead of int a) but the statement block does not exist (= is an error node).
The recovery strategy maybe documented in the ANTLR4 book, otherwise you will have to debug DefaultErrorStrategy. You can alter the error strategy if you are not content with this one.
And why does this happen for listener but not for the rule action?
The action of funcDecl is not executed because it has never been parsed, but synthesized by the parsers error recovery. The error recovery cannot take semantic predicates or actions into account.
Now why is the result of the parse a funcDecl node, although it is not parsed? The answer is: If a single error break the building of a parent node, then always the top most node of the tree would be an error node. Breaking the complete tree on an error is not common understanding of error recovery.
I was wondering how should I handle this in my listener code. Checking nulls everywhere?
The listener is the wrong location to handle errors.
If you want to repair errors:
Use another error strategy (you can inherit the default strategy and add your code, I have done this once with ANTLR3):
You could implement it in a way that a broken funcDecl is detached
you could also try to repair it
you could simply report the error or throw an exception in this case
If you want to report errors:
Check if the error strategy has reported errors. If so, then do not apply the visitor an report the errors to the user (possibly rewriting the text to be user friendly). Do not apply the visitor if the parse tree contains errors.

BASIC Lexer with regex written in Java

I have to code a Lexer in Java for a dialect of BASIC.
I group all the TokenType in Enum
public enum TokenType {
INT("-?[0-9]+"),
BOOLEAN("(TRUE|FALSE)"),
PLUS("\\+"),
MINUS("\\-"),
//others.....
}
The name is the TokenType name and into the brackets there is the regex that I use to match the Type.
If i want to match the INT type i use "-?[0-9]+".
But now i have a problem. I put into a StringBuffer all the regex of the TokenType with this:
private String pattern() {
StringBuffer tokenPatternsBuffer = new StringBuffer();
for(TokenType token : TokenType.values())
tokenPatternsBuffer.append("|(?<" + token.name() + ">" + token.getPattern() + ")");
String tokenPatternsString = tokenPatternsBuffer.toString().substring(1);
return tokenPatternsString;
}
So it returns a String like:
(?<INT>-?[0-9]+)|(?<BOOLEAN>(TRUE|FALSE))|(?<PLUS>\+)|(?<MINUS>\-)|(?<PRINT>PRINT)....
Now i use this string to create a Pattern
Pattern pattern = Pattern.compile(STRING);
Then I create a Matcher
Matcher match = pattern.match("line of code");
Now i want to match all the TokenType and group them into an ArrayList of Token. If the code syntax is correct it returns an ArrayList of Token (Token name, value).
But i don't know how to exit the while-loop if the syntax is incorrect and then Print an Error.
This is a piece of code used to create the ArrayList of Token.
private void lex() {
ArrayList<Token> tokens = new ArrayList<Token>();
int tokenSize = TokenType.values().length;
int counter = 0;
//Iterate over the arrayLinee (ArrayList of String) to get matches of pattern
for(String linea : arrayLinee) {
counter = 0;
Matcher match = pattern.matcher(linea);
while(match.find()) {
System.out.println(match.group(1));
counter = 0;
for(TokenType token : TokenType.values()) {
counter++;
if(match.group(token.name()) != null) {
tokens.add(new Token(token , match.group(token.name())));
counter = 0;
continue;
}
}
if(counter==tokenSize) {
System.out.println("Syntax Error in line : " + linea);
break;
}
}
tokenList.add("EOL");
}
}
The code doesn't break if the for-loop iterate over all TokenType and doesn't match any regex of TokenType. How can I return an Error if the Syntax isn't correct?
Or do you know where I can find information on developing a lexer?

All you need to do is add an extra "INVALID" token at the end of your enum type with a regex like ".+" (match everything). Because the regexs are evaluated in order, it will only match if no other token was found. You then check to see if the last token in your list was the INVALID token.

If you are working in Java, I recommend trying out ANTLR 4 for creating your lexer. The grammar syntax is much cleaner than regular expressions, and the lexer generated from your grammar will automatically support reporting syntax errors.

If you are writing a full lexer, I'd recommend use an existing grammar builder. Antlr is one solution but I personally recommend parboiled instead, which allows to write grammars in pure Java.

Not sure if this was answered, or you came to an answer, but a lexer is broken into two distinct phases, the scanning phase and the parsing phase. You can combine them into one single pass (regex matching) but you'll find that a single pass lexer has weaknesses if you need to do anything more than the most basic of string translations.
In the scanning phase you're breaking the character sequence apart based on specific tokens that you've specified. What you should have done was included an example of the text you were trying to parse. But Wiki has a great example of a simple text lexer that turns a sentence into tokens (eg. str.split(' ')). So with the scanner you're going to tokenize the block of text into chunks by spaces(this should be the first action almost always) and then you're going to tokenize even further based on other tokens, such as what you're attempting to match.
Then the parsing/evaluation phase will iterate over each token and decide what to do with each token depending on the business logic, syntax rules etc., whatever you set it. This could be expressing some sort of math function to perform (eg. max(3,2)), or a more common example is for query language building. You might make a web app that has a specific query language (SOLR comes to mind, as well as any SQL/NoSQL DB) that is translated into another language to make requests against a datasource. Lexers are commonly used in IDE's for code hinting and auto-completion as well.
This isn't a code-based answer, but it's an answer that should give you an idea on how to tackle the problem.

How to merge two ASTs?

I'm trying to implement a tool for merging different versions of some source code. Given two versions of the same source code, the idea would be to parse them, generate the respective Abstract Source Trees (AST), and finally merge them into a single output source keeping grammatical consistency - the lexer and parser are those of question ANTLR: How to skip multiline comments.
I know there is class ParserRuleReturnScope that helps... but getStop() and getStart() always return null :-(
Here is a snippet that illustrates how I modified my perser to get rules printed:
parser grammar CodeTableParser;
options {
tokenVocab = CodeTableLexer;
backtrack = true;
output = AST;
}
#header {
package ch.bsource.ice.parsers;
}
#members {
private void log(ParserRuleReturnScope rule) {
System.out.println("Rule: " + rule.getClass().getName());
System.out.println(" getStart(): " + rule.getStart());
System.out.println(" getStop(): " + rule.getStop());
System.out.println(" getTree(): " + rule.getTree());
}
}
parse
: codeTabHeader codeTable endCodeTable eof { log(retval); }
;
codeTabHeader
: comment CodeTabHeader^ { log(retval); }
;
...

Assuming you have the ASTs (often difficult to get in the first place, parsing real languages is often harder than it looks), you first have to determine what they have in common, and build a mapping collecting that information. That's not as easy as it looks; do you count a block of code that has moved, but is the same exact subtree, as "common"? What about two subtrees that are the same except for consistent renaming of an identifier? What about changed comments? (most ASTs lose the comments; most programmers will think this is a really bad idea).
You can build a variation of the "Longest Common Substring" algorithm to compare trees. I've used that in tools that I have built.
Finally, after you've merged the trees, now you need to regenerate the text, ideally preserving most of the layout of the original code. (Programmers hate when you change the layout they so loving produced). So your ASTs need to capture position information, and your regeneration has to honor that where it can.

The call to log(retval) in your parser code looks like it's going to happen at the end of the rule, but it's not. You'll want to move the call into an #after block.
I changed log to spit out a message as well as the scope information and added calls to it to my own grammar like so:
script
#init {log("#init", retval);}
#after {log("#after", retval);}
: statement* EOF {log("after last rule reference", retval);}
-> ^(STMTS statement*)
;
Parsing test input produced the following output:
Logging from #init
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): null
getTree(): null
Logging from after last rule reference
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): null
getTree(): null
Logging from #after
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): [#4,15:15='<EOF>',<-1>,1:15]
getTree(): STMTS
The call in the after block has both the stop and tree fields populated.
I can't say whether this will help you with your merging tool, but I think this will at least get you past the problem with the half-populated scope object.

ANTLR: "missing attribute access on rule scope" problem

I am trying to build an ANTLR grammar that parses tagged sentences such as:
DT The NP cat VB ate DT a NP rat
and have the grammar:
fragment TOKEN : (('A'..'Z') | ('a'..'z'))+;
fragment WS : (' ' | '\t')+;
WSX : WS;
DTTOK : ('DT' WS TOKEN);
NPTOK : ('NP' WS TOKEN);
nounPhrase: (DTTOK WSX NPTOK);
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase+")");};
The grammar generator generates the "missing attribute access on rule scope: nounPhrase" in the last line.
[I am still new to ANTLR and although some grammars work it's still trial and error. I also frequently get an "OutOfMemory" error when running grammars as small as this - any help welcome.]
I am using ANTLRWorks 1.3 to generate the code and am running under Java 1.6.

"missing attribute access" means that you've referenced a scope ($nounPhrase) rather than an attribute of the scope (such as $nounPhrase.text).
In general, a good way to troubleshoot problems with attributes is to look at the generated parser method for the rule in question.
For example, my initial attempt at creating a new rule when I was a little rusty:
multiple_names returns [List<Name> names]
#init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add($a.value); names.add($b.value); };
resulted in "unknown attribute for rule fullname". So I tried
multiple_names returns [List<Name> names]
#init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add($a); names.add($b); };
which results in "missing attribute access". Looking at the generated parser method made it clear what I needed to do though. While there are some cryptic pieces, the parts relevant to scopes (variables) are easily understood:
public final List<Name> multiple_names() throws RecognitionException {
List<Name> names = null; // based on "returns" clause of rule definition
Name a = null; // based on scopes declared in rule definition
Name b = null; // based on scopes declared in rule definition
names = new ArrayList<Name>(4); // snippet inserted from `#init` block
try {
pushFollow(FOLLOW_fullname_in_multiple_names42);
a=fullname();
state._fsp--;
match(input,189,FOLLOW_189_in_multiple_names44);
pushFollow(FOLLOW_fullname_in_multiple_names48);
b=fullname();
state._fsp--;
names.add($a); names.add($b);// code inserted from {...} block
}
catch (RecognitionException re) {
reportError(re);
recover(input,re);
}
finally {
// do for sure before leaving
}
return names; // based on "returns" clause of rule definition
}
After looking at the generated code, it's easy to see that the fullname rule is returning instances of the Name class, so what I needed in this case was simply:
multiple_names returns [List<Name> names]
#init {
names = new ArrayList<Name>(4);
}
: a=fullname ' AND ' b=fullname { names.add(a); names.add(b); };
The version you need in your situation may be different, but you'll generally be able to figure it out pretty easily by looking at the generated code.

In the original grammer, why not include the attribute it is asking for, most likely:
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase.text+")");};
Each of your rules (chunker being the one I can spot quickly) have attributes (extra information) associated with them. You can find a quick list of the different attributes for the different types of rules at http://www.antlr.org/wiki/display/ANTLR3/Attribute+and+Dynamic+Scopes, would be nice if descriptions were put on the web page for each of those attributes (like for the start and stop attribute for the parser rules refer to tokens from your lexer - which would allow you to get back to your line number and position).
I think your chunker rule should just be changed slightly, instead of $nounPhrase you should use $nounPhrase.text. text is an attribute for your nounPhrase rule.
You might want to do a little other formating as well, usually the parser rules (start with lowercase letter) appear before the lexer rules (start with uppercase letter)
PS. When I type in the box the chunker rule is starting on a new line but in my original answer it didn't start on a new line.

If you accidentally do something silly like $thing.$attribute where you mean $thing.attribute, you will also see the missing attribute access on rule scope error message. (I know this question was answered a long time ago, but this bit of trivia might help someone else who sees the error message!)

Answering question after having found a better way...
WS : (' '|'\t')+;
TOKEN : (('A'..'Z') | ('a'..'z'))+;
dttok : 'DT' WS TOKEN;
nntok : 'NN' WS TOKEN;
nounPhrase : (dttok WS nntok);
chunker : nounPhrase ;
The problem was I was getting muddled between the lexer and the parser (this is apparently very common). The uppercase items are lexical, the lowercase in the parser. This now seems to work. (NB I have changed NP to NN).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.