I'm trying to parse some messages using an ANTLR grammar. Messages carry the structure:
:20:REF123456
:72:Some narrative text which
may contain new lines and
occassionally other : characters
:80A:Another field
The target output is a table containing the text between the colons as a 'key' and the text until the next key as the value of that key. For example:
Key | Values
--------------------------------------
20 | REF123456
72 | Some narrative text which
may contain new lines and
occassionally other : characters
80 | Another field
I'm able to write a grammar to do this, as long as colons are not allowed in the value field based on the following reference http://danielveselka.blogspot.fr/2011/02/antlr-swift-fields-parser.html
Can anyone offer guidance on how to approach this problem?
I'd skip v3 and go with ANTLR v4. A quick demo of how to do this in v4 would look like this:
grammar Swift;
parse
: entries? EOF
;
entries
: entry ( LINE_BREAK entry )*
;
entry
: key value
;
key
: ':' DATA ':'
;
value
: line ( LINE_BREAK line )*
;
line
: ( DATA | SPACES ) ( COLON | DATA | SPACES )*
;
LINE_BREAK
: '\r'? '\n'
| '\r'
;
COLON
: ':'
;
DATA
: ~[\r\n: \t]+
;
SPACES
: [ \t]+
;
Now all you need to do now is attach a listener to a tree-walker and listen for enterEntry occurrences and capture the key and value text. Here's how to do that:
public class Main {
public static void main(String[] args) throws Exception {
String input = ":20:REF123456\n" +
":72:Some narrative text which\n" +
"may contain new lines and\n" +
"occassionally other : characters\n" +
":80A:Another field";
SwiftLexer lexer = new SwiftLexer(new ANTLRInputStream(input));
SwiftParser parser = new SwiftParser(new CommonTokenStream(lexer));
ParseTreeWalker.DEFAULT.walk(new SwiftBaseListener(){
#Override
public void enterEntry(#NotNull SwiftParser.EntryContext ctx) {
String key = ctx.key().getText().replace(":", "");
String value = ctx.value().getText().replaceAll("\\s+", " ");
System.out.printf("key -> %s\nvalue -> %s\n================\n", key, value);
}
}, parser.parse());
}
}
Running the demo above will print the following on your console:
key -> 20
value -> REF123456
================
key -> 72
value -> Some narrative text which may contain new lines and occassionally other : characters
================
key -> 80A
value -> Another field
================
Related
I'm designing a language that allows you to make predicates on data. Here is my lexer.
lexer grammar Studylexer;
fragment LETTER : [A-Za-z];
fragment DIGIT : [0-9];
fragment TWODIGIT : DIGIT DIGIT;
fragment MONTH: ('0' [1-9] | '1' [0-2]);
fragment DAY: ('0' [1-9] | '1' [1-9] | '2' [1-9] | '3' [0-1]);
TIMESTAMP: TWODIGIT ':' TWODIGIT; // représentation de la timestamp
DATE : TWODIGIT TWODIGIT MONTH DAY; // représentation de la date
ID : LETTER+; // match identifiers
STRING : '"' ( ~ '"' )* '"' ; // match string content
NEWLINE:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace
LIST: ( LISTSTRING | LISTDATE | LISTTIMESTAMP ) ; // list of variabels;
// list of operators
GT: '>';
LT: '<';
GTEQ: '>=';
LTEQ:'<=';
EQ: '=';
IN: 'in';
fragment LISTSTRING: STRING ',' STRING (',' STRING)*; // list of strings
fragment LISTDATE : DATE ',' DATE (',' DATE)*; // list of dates
fragment LISTTIMESTAMP:TIMESTAMP ',' TIMESTAMP (',' TIMESTAMP )*; // list of timestamps
NAMES: 'filename' | 'timestamp' | 'tso' | 'region' | 'processType' | 'businessDate' | 'lastModificationDate'; // name of variables in the where block
KEY: ID '[' NAMES ']' | ID '.' NAMES; // predicat key
and here is a part of my grammar.
expr: KEY op = ('>' | '<') value = ( DATE | TIMESTAMP ) NEWLINE # exprGTORLT
| KEY op = ('>='| '<=') value = ( DATE | TIMESTAMP ) NEWLINE # exprGTEQORLTEQ
| KEY '=' value = ( STRING | DATE | TIMESTAMP ) NEWLINE # exprEQ
| KEY 'in' LIST NEWLINE #exprIn
When I make a predicate for example.
tab [key] in "value1", "value2"
ANTLR generates an error.
no viable alternative at input tab [key] in
What can I do to resolve this problem?
First tab [key] does not produce a KEY token like you want it to for two reasons:
It contains spaces and KEY doesn't allow any spaces. The best way to fix that would be to remove the KEY rule from your lexer and instead turn it into a parser rule (meaning you also need to turn [ and ] into their own tokens). Then the white space in your input would be between tokens and thus successfully skipped.
key is not actually one of the words listed in NAMES.
Then another issue is that in is recognized as an ID token, not an IN token. That's because both ID and IN would produce a match of the same length and in cases like that the rule that's listed first takes precedence. So you should define ID after all of the keywords because otherwise the keywords will never be matched.
I have a problem with my antlr grammar or(lexer). In my case I need to parse a string with custom text and find functions in it. Format of function $foo($bar(3),'strArg'). I found solution in this post ANTLR Nested Functions and little bit improved it for my needs. But while testing different cases I found one that brakes parser: $foo($3,'strArg'). This will throw IncorectSyntax exception. I tried many variants(for example not to skip $ and include it in parsing tree) but it all these attempts were unsuccessfully
Lexer
lexer grammar TLexer;
TEXT
: ~[$]
;
FUNCTION_START
: '$' -> pushMode(IN_FUNCTION), skip
;
mode IN_FUNCTION;
FUNTION_NESTED : '$' -> pushMode(IN_FUNCTION), skip;
ID : [a-zA-Z_]+;
PAR_OPEN : '(';
PAR_CLOSE : ')' -> popMode;
NUMBER : [0-9]+;
STRING : '\'' ( ~'\'' | '\'\'' )* '\'';
COMMA : ',';
SPACE : [ \t\r\n]-> skip;
Parser
options {
tokenVocab=TLexer;
}
parse
: atom* EOF
;
atom
: text
| function
;
text
: TEXT+
;
function
: ID params
;
params
: PAR_OPEN ( param ( COMMA param )* )? PAR_CLOSE
;
param
: NUMBER
| STRING
| function
;
The parser does not fail on $foo($3,'strArg'), because when it encounters the second $ it is already in IN_FUNCTION mode and it is expecting a parameter. It skips the character and reads a NUMBER.
If you want it to fail you need to unskip the dollar signs in the Lexer:
FUNCTION_START : '$' -> pushMode(IN_FUNCTION);
mode IN_FUNCTION;
FUNTION_START : '$' -> pushMode(IN_FUNCTION);
and modify the function rule:
function : FUNCTION_START ID params;
I have a simple grammar as follows:
grammar SampleConfig;
line: ID (WS)* '=' (WS)* string;
ID: [a-zA-Z]+;
string: '"' (ESC|.)*? '"' ;
ESC : '\\"' | '\\\\' ; // 2-char sequences \" and \\
WS: [ \t]+ -> skip;
The spaces in the input are completely ignored, including those in the string literal.
final String input = "key = \"value with spaces in between\"";
final SampleConfigLexer l = new SampleConfigLexer(new ANTLRInputStream(input));
final SampleConfigParser p = new SampleConfigParser(new CommonTokenStream(l));
final LineContext context = p.line();
System.out.println(context.getChildCount() + ": " + context.getText());
This prints the following output:
3: key="valuewithspacesinbetween"
But, I expected the white spaces in the string literal to be retained, i.e.
3: key="value with spaces in between"
Is it possible to correct the grammar to achieve this behavior or should I just override CommonTokenStream to ignore whitespace during the parsing process?
You shouldn't expect any spaces in parser rules since you're skipping them in your lexer.
Either remove the skip command or make string a lexer rule:
STRING : '"' ( '\\' [\\"] | ~[\\"\r\n] )* '"';
One of our web applications regulary dies because its out of memory. The sparse data we gathered from memory dumps suggests there is an issue in our antlr parsing implementation. What we see is a antlr tokenstream containing more than a million items. The input text which causes this has yet to be found.
Is it possible this is somehow related to an zero width item beeing matched?
Could there be another issue in the grammer resulting in excessive memory usage?
Here is the current grammar we use:
grammar AdvancedQueries;
options {
language = Java;
output = AST;
ASTLabelType=CommonTree;
}
tokens {
FOR;
END;
FIELDSEARCH;
TARGETFIELD;
RELATION;
NOTNODE;
ANDNODE;
NEARDISTANCE;
OUTOFPLACE;
}
#header {
package de.bsmo.fast.parsing;
}
#lexer::header {
package de.bsmo.fast.parsing;
}
startExpression : orEx;
expressionLevel4
: LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression | outofplace;
expressionLevel3
: (fieldExpression) | expressionLevel4 ;
expressionLevel2
: (nearExpression) | expressionLevel3 ;
expressionLevel1
: (countExpression) | expressionLevel2 ;
notEx : NOT^? a=expressionLevel1 ;
andEx : (notEx -> notEx)
(AND? a=notEx -> ^(ANDNODE $andEx $a))*;
orEx : andEx (OR^ andEx)*;
countExpression : COUNT LPARENTHESIS countSub RPARENTHESIS RELATION NUMBERS -> ^(COUNT countSub RELATION NUMBERS);
countSub
: orEx;
nearExpression : NEAR LPARENTHESIS (WORD|PHRASE) MULTIPLESEPERATOR (WORD|PHRASE) MULTIPLESEPERATOR NUMBERS RPARENTHESIS -> ^(NEAR WORD* PHRASE* ^(NEARDISTANCE NUMBERS));
fieldExpression : WORD PROPERTYSEPERATOR fieldSub -> ^(FIELDSEARCH ^(TARGETFIELD WORD) fieldSub );
fieldSub
: WORD | PHRASE | LPARENTHESIS! orEx RPARENTHESIS!;
atomicExpression
: WORD
| PHRASE
| NUMBERS
;
//Out of place are elements captured that may be in the parseable input but need to be ommited from output later
//Those unwanted elements are captured here.
//MULTIPLESEPERATOR capture unwanted ","
outofplace
: MULTIPLESEPERATOR -> ^(OUTOFPLACE ^(MULTIPLESEPERATOR));
fragment NUMBER : ('0'..'9');
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'?');
fragment QUOTE : ('"');
fragment LESSTHEN : '<';
fragment MORETHEN: '>';
fragment EQUAL: '=';
fragment SPACE : ('\u0009'|'\u0020'|'\u000C'|'\u00A0');
fragment WORDMATTER: ('!'|'0'..'9'|'\u0023'..'\u0027'|'*'|'+'|'\u002D'..'\u0039'|'\u003F'..'\u007E'|'\u00A1'..'\uFFFE');
LPARENTHESIS : '(';
RPARENTHESIS : ')';
AND : ('A'|'a')('N'|'n')('D'|'d');
OR : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT : ('N'|'n')('O'|'o')('T'|'t');
COUNT:('C'|'c')('O'|'o')('U'|'u')('N'|'n')('T'|'t');
NEAR:('N'|'n')('E'|'e')('A'|'a')('R'|'r');
PROPERTYSEPERATOR : ':';
MULTIPLESEPERATOR : ',';
WS : (SPACE) { $channel=HIDDEN; };
NUMBERS : (NUMBER)+;
RELATION : (LESSTHEN | MORETHEN)? EQUAL // '<=', '>=', or '='
| (LESSTHEN | MORETHEN); // '<' or '>'
PHRASE : (QUOTE)(.)*(QUOTE);
WORD : WORDMATTER* ;
The most common cause of this is a token that can have length 0. There can be an infinite number of such a token between any two other tokens in the file. Defining a token like this now results in a compiler warning in ANTLR 4.
The following rule can match the empty string:
WORD : WORDMATTER*;
Perhaps you meant to use the following instead?
WORD : WORDMATTER+;
I'm trying to write a program in ANTLR (Java) concerning simplifying regular expression. I have already written some code (grammar file contents below)
grammar Regexp_v7;
options{
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens{
DOT;
REPEAT;
RANGE;
NULL;
}
fragment
ZERO
: '0'
;
fragment
DIGIT
: '1'..'9'
;
fragment
EPSILON
: '#'
;
fragment
FI
: '%'
;
ID
: EPSILON
| FI
| 'a'..'z'
| 'A'..'Z'
;
NUMBER
: ZERO
| DIGIT (ZERO | DIGIT)*
;
WHITESPACE
: ('\r' | '\n' | ' ' | '\t' ) + {$channel = HIDDEN;}
;
list
: (reg_exp ';'!)*
;
term
: ID -> ID
| '('! reg_exp ')'!
;
repeat_exp
: term ('{' range_exp '}')+ -> ^(REPEAT term (range_exp)+)
| term -> term
;
range_exp
: NUMBER ',' NUMBER -> ^(RANGE NUMBER NUMBER)
| NUMBER (',') -> ^(RANGE NUMBER NULL)
| ',' NUMBER -> ^(RANGE NULL NUMBER)
| NUMBER -> ^(RANGE NUMBER NUMBER)
;
kleene_exp
: repeat_exp ('*'^)*
;
concat_exp
: kleene_exp (kleene_exp)+ -> ^(DOT kleene_exp (kleene_exp)+)
| kleene_exp -> kleene_exp
;
reg_exp
: concat_exp ('|'^ concat_exp)*
;
My next goal is to write down tree grammar code, which is able to simplify regular expressions (e.g. a|a -> a , etc.). I have done some coding (see text below), but I have troubles with defining rule that treats nodes as subtrees (in order to simplify following kind of expressions e.g.: (a|a)|(a|a) to a, etc.)
tree grammar Regexp_v7Walker;
options{
language = Java;
tokenVocab = Regexp_v7;
ASTLabelType = CommonTree;
output=AST;
backtrack = true;
}
tokens{
NULL;
}
bottomup
: ^('*' ^('*' e=.)) -> ^('*' $e) //a** -> a*
| ^('|' i=.* j=.* {$i.tree.toStringTree() == $j.tree.toStringTree()} )
-> $i // There are 3 errors while this line is up and running:
// 1. CommonTree cannot be resolved,
// 2. i.tree cannot be resolved or is not a field,
// 3. i cannot be resolved.
;
Small driver class:
public class Regexp_Test_v7 {
public static void main(String[] args) throws RecognitionException {
CharStream stream = new ANTLRStringStream("a***;a|a;(ab)****;ab|ab;ab|aa;");
Regexp_v7Lexer lexer = new Regexp_v7Lexer(stream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
Regexp_v7Parser parser = new Regexp_v7Parser(tokenStream);
list_return list = parser.list();
CommonTree t = (CommonTree) list.getTree();
System.out.println("Original tree: " + t.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
Regexp_v7Walker s = new Regexp_v7Walker(nodes);
t = (CommonTree)s.downup(t);
System.out.println("Simplified tree: " + t.toStringTree());
Can anyone help me with solving this case?
Thanks in advance and regards.
Now, I'm no expert, but in your tree grammar:
add filter=true
change the second line of bottomup rule to:
^('|' i=. j=. {i.toStringTree().equals(j.toStringTree()) }? ) -> $i }
If I'm not mistaken by using i=.* you're allowing i to be non-existent and you'll get a NullPointerException on conversion to a String.
Both i and j are of type CommonTree because you've set it up this way: ASTLabelType = CommonTree, so you should call i.toStringTree().
And since it's Java and you're comparing Strings, use equals().
Also to make the expression in curly brackets a predicate, you need a question mark after the closing one.