parsing simple text with Antlr3.3

parsing simple text with Antlr3.3 - java

I'm working on a parser in Antlr3.3 that parse a string like 'play bob marley' or 'search bob marley'.
The parser should return which keyword ('play', 'search', ...) I using and return the artist I give. Currently it returns in my Interpreter 'NoViableAltException' where the artist should stand.
Sample.g :
grammar Sample;
#header {
package a.b.c;
import java.util.HashMap;
}
#lexer::header {
package a.b.c;
}
#members {
}
text returns [String s] :
wordExp SPACE name
;
wordExp :
'play' | 'search'
;
fragment name :
( TEXT | DIGIT)*
;
fragment TEXT : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
At the moment it shows that (input: 'play weezer'):
I try to have an output like this:
It has been a while I worked with it and I know there have to be a little loop inside but I have no idea right now.
Do you have any idea how that could work?

Parser rules cannot be fragments: remove fragment from your name rule.
Seems to me you're trying to do something like this:
text
: wordExp name
;
name
: WORD WORD? // one ore two words
;
wordExp
: PLAY
| SEARCH
;
// Keywords definition _before_ the `WORD` rule!
PLAY : 'play';
SEARCH : 'search';
WORD : ( 'a'..'z' | 'A'..'Z' )+; // digits in here?
SPACES : ( ' ' | '\t' | '\r' | '\n' )+ {skip();};
Be aware that parsing such short sentences would be fine, but parsing stuff that resembles English more closely will be difficult to do in ANTLR (or any parser generator using some (E)BNF notation). In that case, google for NLTK.

Related

What does cause Antlr to create a large tokenstream resulting in out of memory

One of our web applications regulary dies because its out of memory. The sparse data we gathered from memory dumps suggests there is an issue in our antlr parsing implementation. What we see is a antlr tokenstream containing more than a million items. The input text which causes this has yet to be found.
Is it possible this is somehow related to an zero width item beeing matched?
Could there be another issue in the grammer resulting in excessive memory usage?
Here is the current grammar we use:
grammar AdvancedQueries;
options {
language = Java;
output = AST;
ASTLabelType=CommonTree;
}
tokens {
FOR;
END;
FIELDSEARCH;
TARGETFIELD;
RELATION;
NOTNODE;
ANDNODE;
NEARDISTANCE;
OUTOFPLACE;
}
#header {
package de.bsmo.fast.parsing;
}
#lexer::header {
package de.bsmo.fast.parsing;
}
startExpression : orEx;
expressionLevel4
: LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression | outofplace;
expressionLevel3
: (fieldExpression) | expressionLevel4 ;
expressionLevel2
: (nearExpression) | expressionLevel3 ;
expressionLevel1
: (countExpression) | expressionLevel2 ;
notEx : NOT^? a=expressionLevel1 ;
andEx : (notEx -> notEx)
(AND? a=notEx -> ^(ANDNODE $andEx $a))*;
orEx : andEx (OR^ andEx)*;
countExpression : COUNT LPARENTHESIS countSub RPARENTHESIS RELATION NUMBERS -> ^(COUNT countSub RELATION NUMBERS);
countSub
: orEx;
nearExpression : NEAR LPARENTHESIS (WORD|PHRASE) MULTIPLESEPERATOR (WORD|PHRASE) MULTIPLESEPERATOR NUMBERS RPARENTHESIS -> ^(NEAR WORD* PHRASE* ^(NEARDISTANCE NUMBERS));
fieldExpression : WORD PROPERTYSEPERATOR fieldSub -> ^(FIELDSEARCH ^(TARGETFIELD WORD) fieldSub );
fieldSub
: WORD | PHRASE | LPARENTHESIS! orEx RPARENTHESIS!;
atomicExpression
: WORD
| PHRASE
| NUMBERS
;
//Out of place are elements captured that may be in the parseable input but need to be ommited from output later
//Those unwanted elements are captured here.
//MULTIPLESEPERATOR capture unwanted ","
outofplace
: MULTIPLESEPERATOR -> ^(OUTOFPLACE ^(MULTIPLESEPERATOR));
fragment NUMBER : ('0'..'9');
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'?');
fragment QUOTE : ('"');
fragment LESSTHEN : '<';
fragment MORETHEN: '>';
fragment EQUAL: '=';
fragment SPACE : ('\u0009'|'\u0020'|'\u000C'|'\u00A0');
fragment WORDMATTER: ('!'|'0'..'9'|'\u0023'..'\u0027'|'*'|'+'|'\u002D'..'\u0039'|'\u003F'..'\u007E'|'\u00A1'..'\uFFFE');
LPARENTHESIS : '(';
RPARENTHESIS : ')';
AND : ('A'|'a')('N'|'n')('D'|'d');
OR : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT : ('N'|'n')('O'|'o')('T'|'t');
COUNT:('C'|'c')('O'|'o')('U'|'u')('N'|'n')('T'|'t');
NEAR:('N'|'n')('E'|'e')('A'|'a')('R'|'r');
PROPERTYSEPERATOR : ':';
MULTIPLESEPERATOR : ',';
WS : (SPACE) { $channel=HIDDEN; };
NUMBERS : (NUMBER)+;
RELATION : (LESSTHEN | MORETHEN)? EQUAL // '<=', '>=', or '='
| (LESSTHEN | MORETHEN); // '<' or '>'
PHRASE : (QUOTE)(.)*(QUOTE);
WORD : WORDMATTER* ;

The most common cause of this is a token that can have length 0. There can be an infinite number of such a token between any two other tokens in the file. Defining a token like this now results in a compiler warning in ANTLR 4.
The following rule can match the empty string:
WORD : WORDMATTER*;
Perhaps you meant to use the following instead?
WORD : WORDMATTER+;

Antlr MismatchedSetException During Interpretation

I am new to Antlr and I have defined a basic grammar using Antlr 3. The grammar compiles and ANTLRWorks generates the Parser and Lexer code without any problems.
The grammar can be seen below:
grammar i;
#header {
package i;
}
module : 'Module1'| 'Module2';
object : 'I';
objectType : 'Name';
filters : EMPTY | 'WHERE' module;
table : module object objectType;
STRING : ('a'..'z'|'A'..'Z')+;
EMPTY : ' ';
The problem is that when I interpret the table Parser I get a MismatchedSetException. This is due to having the EMPTY. As soon as I remove EMPTY from the grammar, the interpretation works. I have looked on the Antlr website and some other examples and the Empty space is ' '. I am not sure what to do. I need this EMPTY.
When it interprets, I get the following Exception:
Interpreting...
[11:02:14] problem matching token at 1:4 NoViableAltException(' '#[1:1: Tokens : ( T__4 | T__5 | T__6 | T__7 | T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | T__15 );])
[11:02:14] problem matching token at 1:9 NoViableAltException(' '#[1:1: Tokens : ( T__4 | T__5 | T__6 | T__7 | T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | T__15 );])
As soon as I change the EMPTY to be the following:
EMPTY : '';
instead of:
EMPTY : ' ';
It actually interprets it. However, I am getting the following Exception:
Interpreting...
[10:57:23] problem matching token at 1:4 NoViableAltException(' '#[1:1: Tokens : ( T__4 | T__5 | T__6 | T__7 | T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | T__15 | T__16 );])
[10:57:23] problem matching token at 1:9 NoViableAltException(' '#[1:1: Tokens : ( T__4 | T__5 | T__6 | T__7 | T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | T__15 | T__16 );])
However, ANLTWorks still generates the Lexer and Parser code.
I hope you can help.
EDIT:
grammar i;
#header {
package i;
}
select : 'SELECT *' 'FROM' table filters';';
filters : EMPTY | 'WHERE' conditions;
conditions : STRING operator value;
operator : '=' | '!=';
true : 'true';
value : true;
STRING : ('a'..'z'|'A'..'Z')+;
EMPTY : ' ';

I'm still a bit unsure about usage, but I think we're talking about the same thing when we say "empty input". Here's an answer to get the ball rolling, starting with a modified grammar.
grammar i;
#header {
package i;
}
module : 'Module1'| 'Module2';
object : 'I';
objectType : 'Name';
filters : | 'WHERE' module;
table : module object objectType filters;
STRING : ('a'..'z'|'A'..'Z')+;
WS : (' '|'\t'|'\f'|'\n'|'\r')+ {skip();}; //ignore whitespace
Note that I tacked filters onto the end of the table rule to explain what I'm talking about.
This grammar accepts the following input (starting with rule table) as it did before:
Module1 I Name
It works because filters matches even though nothing follows the text Name: it matches on empty input using the first alternative.
The grammar also accepts this:
Module1 I Name WHERE Module2
The filters rule is satisfied with the text WHERE Module2 matching the second alternative (defined as 'WHERE' module in the grammar).
A cleaner approach would be to change filters and table to the following rules (recognizing, of course, that I changed table in the first place).
filters : 'WHERE' module; //no more '|'
table : module object objectType filters?; //added '?'
The grammar matches the same input as before, but the terms a little clearer: instead of saying "filters is required in table and filters matches on empty", we now say "filters are optional in table and filters doesn't match on empty".
It amounts to the same thing in this case. Matching on empty (foo: | etc;) is perfectly valid, but I've run into more problems using it than I have with matching optional (foo?) rules.
Update following your update.
I'm taking a step back here to get us out of the theoretical and into the practical. Here is an updated grammar, Java test code that invokes it, test input, and test output. Please give it a run.
Grammar
Altered for testing but follows the same idea as before.
grammar i;
#header {
package i;
}
selects : ( //test rule to allow processing multiple select calls. Don't worry about the details.
{System.out.println(">>select");}
select
{System.out.println("<<select");}
)+
;
select : 'SELECT *' 'FROM' table filters? ';'
{System.out.println("\tFinished select.");} //test output
;
module : 'Module1'| 'Module2';
object : 'I';
objectType : 'Name';
filters : 'WHERE' conditions
{System.out.println("\tFinished filters.");} //test output
;
table : module object objectType
{System.out.println("\tFinished table.");} //test output
;
conditions : STRING operator value
{System.out.println("\tCondition test on " + $STRING.text);}
;
operator : '=' | '!=';
true_ : 'true'; //changed so that Java code could be generated
value : true_;
STRING : ('a'..'z'|'A'..'Z')+;
WS : (' '|'\t'|'\f'|'\n'|'\r')+ {skip();}; //ignore whitespace
TestiGrammar.java
package i;
import java.io.InputStream;
import org.antlr.runtime.ANTLRInputStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
public class TestiGrammar {
public static void main(String[] args) throws Exception {
InputStream resource = TestiGrammar.class.getResourceAsStream("itest.txt");
CharStream input = new ANTLRInputStream(resource);
resource.close();
iLexer lexer = new iLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
iParser parser = new iParser(tokens);
parser.selects();
}
}
itest.txt Test input file
SELECT * FROM Module2 I Name;
SELECT * FROM Module2 I Name WHERE foobar = true;
SELECT * FROM Module2 I Name WHERE dingdong != true;
test output
>>select
Finished table.
Finished select.
<<select
>>select
Finished table.
Condition test on foobar
Finished filters.
Finished select.
<<select
>>select
Finished table.
Condition test on dingdong
Finished filters.
Finished select.
<<select

Match String to Lexer : 'expecting' error

I have a small problem with my grammar.
I am trying to detect whether my string is a date comparison or not.
But the DATE lexer I created seem not to be recognized by antlr, and I get an error that I cannot solve.
Here is my input expression :
"FDT > '2007/10/09 12:00:0.0'"
I simply expect such a tree as output :
COMP_OP
FDT my_DATE
Here is my grammar :
// Aiming at parsing a complete BQS formed Query
grammar Logic;
options {
output=AST;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
// precedence order is (low to high): or, and, not, [comp_op, geo_op, rel_geo_op, like, not like, exists], ()
parse
: expression EOF -> expression
; // ommit the EOF token
expression
: query
;
query
: atom (COMP_OP^ DATE)*
;
//END BIG PART
atom
: ID
| | '(' expression ')' -> expression
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// GENERAL OPERATORS:
DATE : '\'' YEAR '/' MONTH '/' DAY (' ' HOUR ':' MINUTE ':' SECOND)? '\'';
ID : (CHARACTER|DIGIT|','|'.'|'\''|':'|'/')+;
COMP_OP : '=' | '<' | '>' | '<>' | '<=' | '>=';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment YEAR : DIGIT DIGIT DIGIT DIGIT;
fragment MONTH : DIGIT DIGIT;
fragment DAY : DIGIT DIGIT;
fragment HOUR : DIGIT DIGIT;
fragment MINUTE : DIGIT DIGIT;
fragment SECOND : DIGIT DIGIT ('.' (DIGIT)+)?;
fragment DIGIT : '0'..'9' ;
fragment DIGIT_SEQ :(DIGIT)+;
fragment CHARACTER : ('a'..'z' | 'A'..'Z');
As an output error, I get :
line 1:25 mismatched character '.' expecting set null
line 1:27 mismatched input ''' expecting DATE
I also tried to remove the ' ' from my Date (thinking that it was perhaps the problem, as I remove them in the grammar.)
In this case, I get this error :
line 1:6 mismatched input ''2007/10/09' expecting DATE
Can anyone explain me why I get such an error, and how I could solve it ?
This question is q subset of my complete task, where I have to differentiate lots of omparisons (dates, geographic, strings, . . .). I would thus really need to be able to give 'tags' to my atoms.
Thank you very much !
As a complement, here is my current Java code :
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// the expression
String src = "FDT > '2007/10/09 12:00:0.0'";
// create a lexer & parser
//LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
//LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
// invoke the entry point of the parser (the parse() method) and get the AST
CommonTree tree = (CommonTree)parser.parse().getTree();
// print the DOT representation of the AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}

I finally got it,
Seems like even though I remove whitespaces, I still have to include them in expressions that contain one.
In addition, there was a small error in second definition, as the second digit was optional.
The grammar is thus slightly modified :
fragment SECOND : DIGIT (DIGIT)? ('.' (DIGIT)+)?;
which gives this output :
I know hope this will still work in my more complete grammar :)
Hope it helps someone.

lexer that takes "not" but not "not like"

I need a small trick to get my parser completely working.
I use antlr to parse boolean queries.
a query is composed of elements, linked together by ands, ors and nots.
So I can have something like :
"(P or not Q or R) or (( not A and B) or C)"
Thing is, an element can be long, and is generally in the form :
a an_operator b
for example :
"New-York matches NY"
Trick, one of the an_operator is "not like"
So I would like to modify my lexer so that the not checks that there is no like after it, to avoid parsing elements containing "not like" operators.
My current grammar is here :
// save it in a file called Logic.g
grammar Logic;
options {
output=AST;
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: orexp
;
orexp
: andexp ('or'^ andexp)* // make `or` the root
;
andexp
: notexp ('and'^ notexp)* // make `and` the root
;
notexp
: 'not'^ atom // make `not` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` andexp `)`
;
// lexer/terminal rules start with an upper case letter
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
Any help would be appreciated.
Thanks !

Here's a possible solution:
grammar Logic;
options {
output=AST;
}
tokens {
NOT_LIKE;
}
parse
: expression EOF!
;
expression
: orexp
;
orexp
: andexp (Or^ andexp)*
;
andexp
: fuzzyexp (And^ fuzzyexp)*
;
fuzzyexp
: (notexp -> notexp) ( Matches e=notexp -> ^(Matches $fuzzyexp $e)
| Not Like e=notexp -> ^(NOT_LIKE $fuzzyexp $e)
| Like e=notexp -> ^(Like $fuzzyexp $e)
)?
;
notexp
: Not^ atom
| atom
;
atom
: ID
| '('! expression ')'!
;
And : 'and';
Or : 'or';
Not : 'not';
Like : 'like';
Matches : 'matches';
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
which will parse the input "A not like B or C like D and (E or not F) and G matches H" into the following AST:

The following alternatives can never be reached: 2

I'm trying to create a very simple grammar to learn to use ANTLR but I get the following message:
"The following alternatives can never be reached: 2"
This is my grammar attempt:
grammar Robot;
file : command+;
command : ( delay|type|move|click|rclick) ;
delay : 'wait' number ';';
type : 'type' id ';';
move : 'move' number ',' number ';';
click : 'click' ;
rclick : 'rlick' ;
id : ('a'..'z'|'A'..'Z')+ ;
number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n' ) { skip();} ;
I'm using ANTLRWorks plugin for IDEA:

The .. (range) inside parser rules means something different than inside lexer rules. Inside lexer rules, it means: "from char X to char Y", and inside parser rule it matches "from token M to token N". And since you made number a parser rule, it does not do what you think it does (and are therefor receiving an obscure error message).
The solution: make number a lexer rule instead (so, capitalize it: Number):
grammar Robot;
file : command+;
command : (delay | type | move | Click | RClick) ;
delay : 'wait' Number ';';
type : 'type' Id ';';
move : 'move' Number ',' Number ';';
Click : 'click' ;
RClick : 'rlick' ;
Id : ('a'..'z'|'A'..'Z')+ ;
Number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n') { skip();} ;
And as you can see, I also made id, click and rclick lexer rules instead. If you're not sure what the difference is between parser- and lexer rules, please say so and I'll add an explanation to this answer.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

parsing simple text with Antlr3.3 - java

Related

What does cause Antlr to create a large tokenstream resulting in out of memory

Antlr MismatchedSetException During Interpretation

Match String to Lexer : 'expecting' error

lexer that takes "not" but not "not like"

The following alternatives can never be reached: 2

Categories

Resources