disambiguate tokens without using tokenizer state - java

I cannot get JavaCC to properly disambiguate tokens by their place in a grammar. I have the following JJTree file (I'll call it bug.jjt):
options
{
LOOKAHEAD = 3;
CHOICE_AMBIGUITY_CHECK = 2;
OTHER_AMBIGUITY_CHECK = 1;
SANITY_CHECK = true;
FORCE_LA_CHECK = true;
}
PARSER_BEGIN(MyParser)
import java.util.*;
public class MyParser {
public static void main(String[] args) throws ParseException {
MyParser parser = new MyParser(new java.io.StringReader(args[0]));
SimpleNode root = parser.production();
root.dump("");
}
}
PARSER_END(MyParser)
SKIP:
{
" "
}
TOKEN:
{
<STATE: ("state")>
|<PROD_NAME: (["a"-"z"])+ >
}
SimpleNode production():
{}
{
(
<PROD_NAME>
<STATE>
<EOF>
)
{return jjtThis;}
}
Generate the parser code with the following:
java -cp C:\path\to\javacc.jar jjtree bug.jjt
java -cp C:\path\to\javacc.jar javacc bug.jj
Now, after compiling this, you can run MyParser from the command line with a string to parse as the argument. It prints production if successful and spews an error if it fails.
I tried two simple inputs: foo state and state state. The first one parses, but the second one does not, since both state strings are tokenized as <STATE>. As I set LOOKAHEAD to 3, I expected it to use the grammar and see that one string state must be <STATE> and the other must be <PROD_NAME>. However, no such luck. I have tried changing the various lookahead parameters to no avail. I am also not able to use tokenizer states (where you define different tokens allowable in different states), as this example is part of a more complicated system that will probably have a lot of these types of ambiguities.
Can anyone tell me how to make JavaCC properly disambiguate these tokens, without using tokenizer states?

This is covered in the FAQ under question 4.19.
There are three strategies outlined there:
Putting choices in the grammar. See Bart Kiers's answer.
Using semantic lookahead. For this approach you get rid of the token definition for STATE and write your grammar like this:
SimpleNode production():
{}
{
(
<PROD_NAME>
( LOOKAHEAD({getToken(1).kind == PROD_NAME && getToken(1).image.equals("state")})
<PROD_NAME>
...
|
...other choices...
)
)
{return jjtThis;}
}
If there are no other choices, then:
SimpleNode production():
{}
{
(
<PROD_NAME>
( LOOKAHEAD({getToken(1).kind == PROD_NAME && getToken(1).image.equals("state")})
<PROD_NAME>
...
|
{ int[][] expTokSeqs = new int[][] { new int[] {STATE } } ;
throw new ParseException(token, expTokSeqs, tokenImage) ; }
)
)
{return jjtThis;}
}
But, in this case, you still need a token definition for STATE, as it is mentioned in the initialization of expTokSeqs:
< DUMMY > TOKEN : { < STATE : "state" > }
where DUMMY is a lexical state that is never entered.
Using lexical states. The title of the OP's question suggests he doesn't want to do this, but not why. It can be done if the state switching can be contained in the token manager. Suppose a file is a sequence of productions and each production looks like this:
name state : "a" | "b" name ;
That is, it starts with a name, then the keyword "state", a colon, some tokens, and finally a semicolon. (I'm just making this up as I have no idea what sort of language the OP is trying to parse.) Then you can use three lexical states: DEFAULT, S0, and S1.
In the DEFAULT state, any sequence of letters (including "state") is a PROD_NAME. In DEFAULT, recognizing a PROD_NAME switches the state to S0.
In S0, any sequence of letters except "state" is a PROD_NAME, and "state" is a STATE. In S0, recognizing a STATE token causes the tokenizer to switch to state S1.
In S1, any sequence of letters (including "state") is a PROD_NAME. In S1, recognizing a SEMICOLON switches the state to DEFAULT.
So our example is tokenized like this:
name state : "a" | "b" name ;
|__||____||_________________||____
DEF-  S0          S1         DEFAULT
AULT
The productions are written like this:
<*> SKIP: { " " }
<S0> TOKEN: { <STATE: "state"> : S1 }
<DEFAULT> TOKEN:{ <PROD_NAME: (["a"-"z"])+ > : S0 }
<S0,S1> TOKEN:{ <PROD_NAME: (["a"-"z"])+ > }
<S1> TOKEN: { <SEMICOLON : ";" > : DEFAULT }
<S0, DEFAULT> TOKEN : { <SEMICOLON : ";" > }
<*> TOKEN : {
<COLON : ":">
| ...etc...
}
It is possible for the parser to send state switching commands back to the tokenizer, but it is tricky to get right and fragile. See question 3.12 of the FAQ.
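As a minimal sketch of what such a parser-driven switch can look like (assuming a lexical state S0 is declared, and using the token_source field of the generated parser; fragile precisely because lookahead may already have buffered tokens in the old state):
SimpleNode production() :
{}
{
  <PROD_NAME> { token_source.SwitchTo(S0); }
  <STATE>
  <EOF>
  {return jjtThis;}
}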

Lookahead does not concern the lexer while it assembles characters into tokens. It is used by the parser when it matches non-terminals composed from terminals (tokens).
If you define "state" to result in a token STATE, well, then that's what it is.
I agree with you that tokenizer states aren't a good solution for permitting keywords to be used as identifiers. Is this really necessary? There's a good reason for HLLs not to permit this.
OTOH, if you can rewrite your grammar using just <PROD_NAME>s, you might postpone the recognition of keywords until semantic analysis.
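For example, a sketch of that idea: lex everything as <PROD_NAME> and let an action (or a later semantic pass) decide which images are keywords:
SimpleNode production() :
{ Token t; }
{
  t=<PROD_NAME>
  { if (t.image.equals("state")) { /* treat t as the 'state' keyword here */ } }
  <EOF>
  {return jjtThis;}
}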

The LOOKAHEAD option only applies to the parser (production rules). The tokenizer is not affected by this: it will produce tokens without worrying what a production rule is trying to match. The input "state" will always be tokenized as a STATE, even if the parser is trying to match a PROD_NAME.
You could do something like this (untested, pseudo-ish grammar code ahead!):
SimpleNode production():
{}
{
(
prod_name_or_state()
<STATE>
<EOF>
)
{return jjtThis;}
}
SimpleNode prod_name_or_state():
{}
{
(
<PROD_NAME>
| <STATE>
)
{return jjtThis;}
}
which would match both "foo state" and "state state".
Or the equivalent, but more compact:
SimpleNode production():
{}
{
(
( <PROD_NAME> | <STATE> )
<STATE>
<EOF>
)
{return jjtThis;}
}

Related

Regular expression for parsing JSON arrays in Java [duplicate]

Is it possible to write a regular expression that matches a nested pattern that occurs an unknown number of times? For example, can a regular expression match an opening and closing brace when there are an unknown number of open/close braces nested within the outer braces?
For example:
public MyMethod()
{
if (test)
{
// More { }
}
// More { }
} // End
Should match:
{
if (test)
{
// More { }
}
// More { }
}
No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.
You can match nested/paired elements up to a fixed depth, where the depth is only limited by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e. a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: O(n^3) vs. O(n), with n = length(input).
There are many parser generators available, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.
For more background: Automata Theory at Wikipedia
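To illustrate the push-down idea in plain Java, here is a minimal sketch that extracts the outermost balanced {...} blocks, with a depth counter standing in for the stack:
import java.util.ArrayList;
import java.util.List;

public class BraceMatcher {
    // Collects every top-level balanced {...} block; nested braces only
    // move the depth counter up and down.
    public static List<String> outerBlocks(String s) {
        List<String> blocks = new ArrayList<>();
        int depth = 0, start = -1;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                if (depth++ == 0) start = i;
            } else if (c == '}' && depth > 0) {
                if (--depth == 0) blocks.add(s.substring(start, i + 1));
            }
        }
        return blocks; // an unclosed block is silently dropped in this sketch
    }
}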
Using regular expressions to check for nested patterns is very easy.
'/(\((?>[^()]+|(?1))*\))/'
Probably working Perl solution, if the string is on one line:
my $NesteD ;
$NesteD = qr/ \{( [^{}] | (??{ $NesteD }) )* \} /x ;
if ( $Stringy =~ m/\b( \w+$NesteD )/x ) {
print "Found: $1\n" ;
}
HTH
EDIT: check:
http://dev.perl.org/perl6/rfc/145.html
ruby information: http://www.ruby-forum.com/topic/112084
more perl: http://www.perlmonks.org/?node_id=660316
even more perl: https://metacpan.org/pod/Text::Balanced
perl, perl, perl: http://perl.plover.com/yak/regex/samples/slide083.html
And one more thing by Torsten Marek (who had correctly pointed out that it's not a regex anymore):
http://coding.derkeiler.com/Archive/Perl/comp.lang.perl.misc/2008-03/msg01047.html
The Pumping lemma for regular languages is the reason why you can't do that.
The generated automaton will have a finite number of states, say k, so a string of k+1 opening braces is bound to have a state repeated somewhere (as the automaton processes the characters). The part of the string between the same state can be duplicated infinitely many times and the automaton will not know the difference.
In particular, if it accepts k+1 opening braces followed by k+1 closing braces (which it should), it will also accept the pumped number of opening braces followed by the unchanged k+1 closing braces (which it shouldn't).
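For reference, the lemma in its standard form: for every regular language L there is a pumping length p >= 1 such that any s in L with |s| >= p can be split as s = xyz with |xy| <= p, |y| >= 1, and x y^i z in L for all i >= 0. Taking s to be p opening braces followed by p closing braces forces y to consist of opening braces only, so pumping y produces unbalanced strings that the automaton still accepts.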
Yes, if it is the .NET regex engine. The .NET engine supports a finite state machine supplied with an external stack (balancing groups); see details.
Proper Regular expressions would not be able to do it as you would leave the realm of Regular Languages to land in the Context Free Languages territories.
Nevertheless the "regular expression" packages that many languages offer are strictly more powerful.
For example, Lua regular expressions have the "%b()" recognizer that will match balanced parentheses. In your case you would use "%b{}".
Another sophisticated tool similar to sed is gema, where you will match balanced curly braces very easily with {#}.
So, depending on the tools you have at your disposal your "regular expression" (in a broader sense) may be able to match nested parenthesis.
YES
...assuming that there is some maximum number of nestings you'd be happy to stop at.
Let me explain.
@torsten-marek is right that a regular expression cannot check for nested patterns like this, BUT it is possible to define a nested regex pattern which will allow you to capture nested structures like this up to some maximum depth. I created one to capture EBNF-style comments (try it out here), like:
(* This is a comment (* this is nested inside (* another level! *) hey *) yo *)
The regex (for single-depth comments) is the following:
m{1} = \(+\*+(?:[^*(]|(?:\*+[^)*])|(?:\(+[^*(]))*\*+\)+
This could easily be adapted for your purposes by replacing the \(+\*+ and \*+\)+ with { and } and replacing everything in between with a simple [^{}]:
p{1} = \{(?:[^{}])*\}
(Here's the link to try that out.)
To nest, just allow this pattern within the block itself:
p{2} = \{(?:(?:p{1})|(?:[^{}]))*\}
...or...
p{2} = \{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\}
To find triple-nested blocks, use:
p{3} = \{(?:(?:p{2})|(?:[^{}]))*\}
...or...
p{3} = \{(?:(?:\{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\})|(?:[^{}]))*\}
A clear pattern has emerged. To find comments nested to a depth of N, simply use the regex:
p{N} = \{(?:(?:p{N-1})|(?:[^{}]))*\}
where N > 1 and
p{1} = \{(?:[^{}])*\}
A script could be written to recursively generate these regexes, but that's beyond the scope of what I need this for. (This is left as an exercise for the reader. 😉)
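A sketch of such a generator in Java, following the scheme above:
public class NestedBracePattern {
    // p{1} = \{(?:[^{}])*\}  and  p{N} = \{(?:(?:p{N-1})|(?:[^{}]))*\}
    static String braces(int n) {
        if (n == 1) return "\\{(?:[^{}])*\\}";
        return "\\{(?:(?:" + braces(n - 1) + ")|(?:[^{}]))*\\}";
    }

    public static void main(String[] args) {
        System.out.println(braces(3)); // prints the p{3} pattern shown above
    }
}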
Using the recursive matching in the PHP regex engine is massively faster than procedural matching of brackets, especially with longer strings.
http://php.net/manual/en/regexp.reference.recursive.php
e.g.
$patt = '!\( (?: (?: (?>[^()]+) | (?R) )* ) \)!x';
preg_match_all( $patt, $str, $m );
vs.
matchBrackets( $str );
function matchBrackets ( $str, $offset = 0 ) {
$matches = array();
list( $opener, $closer ) = array( '(', ')' );
// Return early if there's no match
if ( false === ( $first_offset = strpos( $str, $opener, $offset ) ) ) {
return $matches;
}
// Step through the string one character at a time storing offsets
$paren_score = -1;
$inside_paren = false;
$match_start = 0;
$offsets = array();
for ( $index = $first_offset; $index < strlen( $str ); $index++ ) {
$char = $str[ $index ];
if ( $opener === $char ) {
if ( ! $inside_paren ) {
$paren_score = 1;
$match_start = $index;
}
else {
$paren_score++;
}
$inside_paren = true;
}
elseif ( $closer === $char ) {
$paren_score--;
}
if ( 0 === $paren_score ) {
$inside_paren = false;
$paren_score = -1;
$offsets[] = array( $match_start, $index + 1 );
}
}
while ( $offset = array_shift( $offsets ) ) {
list( $start, $finish ) = $offset;
$match = substr( $str, $start, $finish - $start );
$matches[] = $match;
}
return $matches;
}
as zsolt mentioned, some regex engines support recursion -- of course, these are typically the ones that use a backtracking algorithm so it won't be particularly efficient. example: /(?>[^{}]*){(?>[^{}]*)(?R)*(?>[^{}]*)}/sm
No, you are getting into the realm of Context Free Grammars at that point.
This seems to work: /(\{(?:\{.*\}|[^\{])*\})/m

Positive lookahead using token types in lexer rules

I am migrating a grammar that I initially wrote using GrammarKit (GrammarKit uses Flex as its lexer).
I am struggling to find the best way to write a positive lookahead using token types in lexer rules.
Below is my first experiment with a (very) simplified version of my problem, using lookahead based on characters in the stream:
grammar PossitiveLookAheadCharacters;
@header {
package lookahead;
}
@lexer::members {
private boolean isChar(int charPosition, char testChar) {
return _input.LA(charPosition) == testChar;
}
}
r : CONS | DOT | LEFT_PAR | RIGHT_PAR;
LEFT_PAR : '(';
RIGHT_PAR : ')';
CONS : DOT {isChar(1, '(')}? {isChar(2, ')')}?;
DOT : '.';
WS : [ \t\r\n]+ -> skip ;
This works ok because the lookahead is just based on character comparison.
If I test this using the test rig, I obtain the following expected output:
> grun lookahead.PossitiveLookAheadCharacters r -tokens
.()
[#0,0:0='.',<CONS>,1:0]
[#1,1:1='(',<'('>,1:1]
[#2,2:2=')',<')'>,1:2]
[#3,4:3='<EOF>',<EOF>,2:0]
But I cannot make this work correctly if I want to write the lookahead based on token types and not on characters from the stream (as I can easily do in Flex). After some trial and error, this is the closest I have arrived at:
grammar PossitiveLookAheadTokenType;
@header {
package lookahead;
}
@lexer::members {
private boolean isToken(int tokenPosition, int tokenId) {
int tokenAtPosition = new UnbufferedTokenStream(this).LA(tokenPosition);
System.out.println("LA(" + tokenPosition + ") = " + tokenAtPosition);
return tokenAtPosition == tokenId;
}
}
r : CONS | DOT | LEFT_PAR | RIGHT_PAR;
LEFT_PAR : '(';
RIGHT_PAR : ')';
CONS : DOT {isToken(1, LEFT_PAR)}? {isToken(2, RIGHT_PAR)}?;
DOT : '.';
WS : [ \t\r\n]+ -> skip ;
If I test this using the test rig, I see that the test expressions are evaluated correctly (in short, this predicate is true: LA(1) == LEFT_PAR && LA(2) == RIGHT_PAR). But the first recognized token is not [#0,0:0='.',<CONS>,1:0] as expected, but [#0,2:2=')',<')'>,1:2] instead. Below is the full output of my test:
? grun lookahead.PossitiveLookAheadTokenType r -tokens
.()
LA(1) = 1
LA(2) = 2
[#0,2:2=')',<')'>,1:2]
[#1,1:1='(',<'('>,1:1]
[#2,2:2=')',<')'>,1:2]
[#3,4:3='<EOF>',<EOF>,2:0]
I thought that the problem might be that the input stream was no longer in the right position, so I tried resetting it, as illustrated by this new version of the isToken method:
private boolean isToken(int tokenPosition, int tokenId) {
int streamPosition = _input.index();
int tokenAtPosition = new UnbufferedTokenStream(this).LA(tokenPosition);
_input.seek(streamPosition);
return tokenAtPosition == tokenId;
}
But this did not help.
So my ANTLR4 question is: what is the recommended way to write a positive lookahead in lexer rules using token types and not plain characters?
In Flex this is fully possible and it is as simple as writing something like:
{MY_MATCH}/{TOKEN_TO_THE_RIGHT}
What I love about the Flex approach here is that it is fully declarative and based on token types, not chars. I am wondering if something similar is possible in ANTLR4.
This cannot work the way you imagine it because what you are trying to do is to use a token (which is the result of a lexer rule) in an ongoing lexer rule. That means the lexer is in the process of determining the current token and therefore cannot determine another token at the same time.
What you probably want is a parser rule. In this scenario the lexer has done all the work already and you can easily look around for other tokens.
cons: DOT {isToken(1, LEFT_PAR) && isToken(2, RIGHT_PAR)}?;
r : cons | DOT | LEFT_PAR | RIGHT_PAR;
@parser::members {
private boolean isToken(int position, int tokenType) {
return _input.LT(position).getType() == tokenType;
}
}
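For completeness, a hypothetical driver for this version (the class names assume ANTLR generated them from the PossitiveLookAheadTokenType grammar above):
import org.antlr.v4.runtime.*;

public class Demo {
    public static void main(String[] args) {
        CharStream input = CharStreams.fromString(".()");
        PossitiveLookAheadTokenTypeLexer lexer = new PossitiveLookAheadTokenTypeLexer(input);
        PossitiveLookAheadTokenTypeParser parser =
                new PossitiveLookAheadTokenTypeParser(new CommonTokenStream(lexer));
        parser.r(); // the predicate in cons sees LEFT_PAR and RIGHT_PAR after DOT
    }
}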

ANTLR4 Java - Is it possible to extract fragments from the lexer at the parser listener point?

I have a Lexer Rule as follows:
PREFIX : [abcd]'_';
EXTRA : ('xyz' | 'XYZ' );
SUFFIX : [ab];
TCHAN : PREFIX EXTRA? DIGIT+ SUFFIX?;
and a parser rule:
tpin : TCHAN
;
In the exit_tpin() listener method, is there a syntax where I can extract the DIGIT component of the token? Right now I can get the ctx.TCHAN() element, but this is a string. I just want the digit portion of TCHAN.
Or should I remove TCHAN as a token and move that rule to be tpin (i.e.)
tpin : PREFIX EXTRA? DIGIT+ SUFFIX?
Where I know how to extract DIGIT from the listener.
My guess is that by the time the token is presented to the parser it is too late to deconstruct it... but I was wondering if some ANTLR gurus out there knew of a technique.
If I re-write my tokenizer, there is a possibility that TCHAN tokens will be missed for INT/ID tokens (I think that's why I ended up parsing as I do).
I can always do some regexp work in the listener method... but that seemed like bad form, as I had the individual components earlier. I'm just lazy, and was wondering if a technique other than refactoring the parsing grammar was possible.
In The Definitive ANTLR Reference you can find examples of complex lexers where much of the work is done. But when learning ANTLR, I would advise considering the lexer mostly for its function of splitting the input stream into small tokens, and doing the big work in the parser. In the present case I would do:
grammar Question;
/* extract digit */
question
: tpin EOF
;
tpin
// : PREFIX EXTRA? DIGIT+ SUFFIX?
// {System.out.println("The only useful information is " + $DIGIT.text);}
: PREFIX EXTRA? number SUFFIX?
{System.out.println("The only useful information is " + $number.text);}
;
number
: DIGIT+
;
PREFIX : [abcd]'_';
EXTRA : ('xyz' | 'XYZ' );
DIGIT : [0-9] ;
SUFFIX : [ab];
WS : [ \t\r\n]+ -> skip ;
Say the input is d_xyz123456b. With the first version
: PREFIX EXTRA? DIGIT+ SUFFIX?
you get
$ grun Question question -tokens data.txt
[#0,0:1='d_',<PREFIX>,1:0]
[#1,2:4='xyz',<EXTRA>,1:2]
[#2,5:5='1',<DIGIT>,1:5]
[#3,6:6='2',<DIGIT>,1:6]
[#4,7:7='3',<DIGIT>,1:7]
[#5,8:8='4',<DIGIT>,1:8]
[#6,9:9='5',<DIGIT>,1:9]
[#7,10:10='6',<DIGIT>,1:10]
[#8,11:11='b',<SUFFIX>,1:11]
[#9,13:12='<EOF>',<EOF>,2:0]
The only useful information is 6
Because the parsing of DIGIT+ translates to a loop which reuses DIGIT:
setState(12);
_errHandler.sync(this);
_la = _input.LA(1);
do {
{
{
setState(11);
((TpinContext)_localctx).DIGIT = match(DIGIT);
}
}
setState(14);
_errHandler.sync(this);
_la = _input.LA(1);
} while ( _la==DIGIT );
and $DIGIT.text translates to ((TpinContext)_localctx).DIGIT.getText(), only the last digit is retained. That's why I define a subrule number
: PREFIX EXTRA? number SUFFIX?
which makes it easy to capture the value:
[#0,0:1='d_',<PREFIX>,1:0]
[#1,2:4='xyz',<EXTRA>,1:2]
[#2,5:5='1',<DIGIT>,1:5]
[#3,6:6='2',<DIGIT>,1:6]
[#4,7:7='3',<DIGIT>,1:7]
[#5,8:8='4',<DIGIT>,1:8]
[#6,9:9='5',<DIGIT>,1:9]
[#7,10:10='6',<DIGIT>,1:10]
[#8,11:11='b',<SUFFIX>,1:11]
[#9,13:12='<EOF>',<EOF>,2:0]
The only useful information is 123456
You can even make it simpler:
tpin
: PREFIX EXTRA? INT SUFFIX?
{System.out.println("The only useful information is " + $INT.text);}
;
PREFIX : [abcd]'_';
EXTRA : ('xyz' | 'XYZ' );
INT : [0-9]+ ;
SUFFIX : [ab];
WS : [ \t\r\n]+ -> skip ;
$ grun Question question -tokens data.txt
[#0,0:1='d_',<PREFIX>,1:0]
[#1,2:4='xyz',<EXTRA>,1:2]
[#2,5:10='123456',<INT>,1:5]
[#3,11:11='b',<SUFFIX>,1:11]
[#4,13:12='<EOF>',<EOF>,2:0]
The only useful information is 123456
In the listener you have direct access to these values through the rule context TpinContext:
public static class TpinContext extends ParserRuleContext {
public Token INT;
public TerminalNode PREFIX() { return getToken(QuestionParser.PREFIX, 0); }
public TerminalNode INT() { return getToken(QuestionParser.INT, 0); }
public TerminalNode EXTRA() { return getToken(QuestionParser.EXTRA, 0); }
public TerminalNode SUFFIX() { return getToken(QuestionParser.SUFFIX, 0); }
// ... (rest of the generated class)
}
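For instance, a hypothetical listener built on that context (assuming ANTLR generated QuestionBaseListener from the grammar named Question above, in its simpler INT version):
public class DigitListener extends QuestionBaseListener {
    @Override
    public void exitTpin(QuestionParser.TpinContext ctx) {
        if (ctx.INT() != null) {
            System.out.println("digits: " + ctx.INT().getText()); // e.g. 123456
        }
    }
}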

How to get line number in ANTLR3 tree-parser @init action

In ANTLR, version 3, how can the line number be obtained in the @init action of a high-level tree-parser rule?
For example, in the @init action below, I'd like to push the line number along with the sentence text.
sentence
@init { myNodeVisitor.pushScriptContext( new MyScriptContext( $sentence.text )); }
: assignCommand
| actionCommand;
finally {
myNodeVisitor.popScriptContext();
}
I need to push the context before the execution of the actions associated with symbols in the rules.
Some things that don't work:
Using $sentence.line -- it's not defined, even though $sentence.text is.
Moving the paraphrase push into the rule actions. Placed before the rule, no token in the rule is available. Placed after the rule, the action happens after the actions associated with the rule symbols.
Using this expression in the @init action, which compiles but returns the value 0: getTreeNodeStream().getTreeAdaptor().getToken( $sentence.start ).getLine(). EDIT: Actually, this does work, if $sentence.start is either a real token or an imaginary token with a reference -- see Bart Kiers's answer below.
It seems like if I can easily get, in the @init action, the matched text and the first matched token, there should be an easy way to get the line number as well.
You can look 1 step ahead in the token/tree-stream of a tree grammar using the following: CommonTree ahead = (CommonTree)input.LT(1), which you can place in the @init section.
Every CommonTree (the default Tree implementation in ANTLR) has a getToken() method which returns the Token associated with this tree. And each Token has a getLine() method, which, not surprisingly, returns the line number of this token.
So, if you do the following:
sentence
@init {
CommonTree ahead = (CommonTree)input.LT(1);
int line = ahead.getToken().getLine();
System.out.println("line=" + line);
}
: assignCommand
| actionCommand
;
you should be able to see some correct line numbers being printed. I say some, because this won't go as planned in all cases. Let me demonstrate using a simple example grammar:
grammar ASTDemo;
options {
output=AST;
}
tokens {
ROOT;
ACTION;
}
parse
: sentence+ EOF -> ^(ROOT sentence+)
;
sentence
: assignCommand
| actionCommand
;
assignCommand
: ID ASSIGN NUMBER -> ^(ASSIGN ID NUMBER)
;
actionCommand
: action ID -> ^(ACTION action ID)
;
action
: START
| STOP
;
ASSIGN : '=';
START : 'start';
STOP : 'stop';
ID : ('a'..'z' | 'A'..'Z')+;
NUMBER : '0'..'9'+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
whose tree grammar looks like:
tree grammar ASTDemoWalker;
options {
output=AST;
tokenVocab=ASTDemo;
ASTLabelType=CommonTree;
}
walk
: ^(ROOT sentence+)
;
sentence
@init {
CommonTree ahead = (CommonTree)input.LT(1);
int line = ahead.getToken().getLine();
System.out.println("line=" + line);
}
: assignCommand
| actionCommand
;
assignCommand
: ^(ASSIGN ID NUMBER)
;
actionCommand
: ^(ACTION action ID)
;
action
: START
| STOP
;
And if you run the following test class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "\n\n\nABC = 123\n\nstart ABC";
ASTDemoLexer lexer = new ASTDemoLexer(new ANTLRStringStream(src));
ASTDemoParser parser = new ASTDemoParser(new CommonTokenStream(lexer));
CommonTree root = (CommonTree)parser.parse().getTree();
ASTDemoWalker walker = new ASTDemoWalker(new CommonTreeNodeStream(root));
walker.walk();
}
}
you will see the following being printed:
line=4
line=0
As you can see, "ABC = 123" produced the expected output (line 4), but "start ABC" didn't (line 0). This is because the root of the action rule is a ACTION token and this token is never defined in the lexer, only in the tokens{...} block. And because it doesn't really exist in the input, by default the line 0 is attached to it. If you want to change the line number, you need to provide a "reference" token as a parameter to this so called imaginary ACTION token which it uses to copy attributes into itself.
So, if you change the actionCommand rule in the combined grammar into:
actionCommand
: ref=action ID -> ^(ACTION[$ref.start] action ID)
;
the line number would be as expected (line 6).
Note that every parser rule has a start and end attribute which is a reference to the first and last token, respectively. If action was a lexer rule (say FOO), then you could have omitted the .start from it:
actionCommand
: ref=FOO ID -> ^(ACTION[$ref] action ID)
;
Now the ACTION token has copied all attributes from whatever $ref is pointing to, except the type of the token, which is, of course, ACTION. But this also means that it copied the text attribute, so in my example, the AST created by ref=action ID -> ^(ACTION[$ref.start] action ID) could look like:
           [text=START,type=ACTION]
                 /          \
                /            \
               /              \
[text=START,type=START]   [text=ABC,type=ID]
Of course, it's a proper AST because the types of the nodes are unique, but it makes debugging confusing since ACTION and START share the same .text attribute.
You can copy all attributes to an imaginary token except the .text and .type by providing a second string parameter, like this:
actionCommand
: ref=action ID -> ^(ACTION[$ref.start, "Action"] action ID)
;
And if you now run the same test class again, you will see the following printed:
line=4
line=6
And if you inspect the tree that is generated, it'll look like this:
[type=ROOT, text='ROOT']
  [type=ASSIGN, text='=']
    [type=ID, text='ABC']
    [type=NUMBER, text='123']
  [type=ACTION, text='Action']
    [type=START, text='start']
    [type=ID, text='ABC']

Java regex: Repeating capturing groups

An item is a comma delimited list of one or more strings of numbers or characters e.g.
"12"
"abc"
"12,abc,3"
I'm trying to match a bracketed list of zero or more items in Java e.g.
""
"(12)"
"(abc,12)"
"(abc,12),(30,asdf)"
"(qqq,pp),(abc,12),(30,asdf,2),"
which should return the following matching groups respectively for the last example
qqq,pp
abc,12
30,asdf,2
I've come up with the following (incorrect) pattern:
\((.+?)\)(?:,\((.+?)\))*
which matches only the following for the last example
qqq,pp
30,asdf,2
Tips? Thanks
That's right. You can't have a "variable" number of capturing groups in a Java regular expression. Your Pattern has two groups:
\((.+?)\)(?:,\((.+?)\))*
  |___|        |___|
 group 1      group 2
Each group will contain the content of the last match for that group. I.e., abc,12 will get overridden by 30,asdf,2.
Related question:
Regular expression with variable number of groups?
The solution is to use one expression (something like \((.+?)\)) and use matcher.find to iterate over the matches.
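For example, a minimal sketch of that approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIterator {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\((.+?)\\)")
                           .matcher("(qqq,pp),(abc,12),(30,asdf,2),");
        while (m.find()) {
            System.out.println(m.group(1)); // qqq,pp then abc,12 then 30,asdf,2
        }
    }
}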
You can use a regular expression like ([^,]+) in a loop, or just str.split(",") to get all elements at once. This version, str.split("\\s*,\\s*"), even allows spaces.
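For example:
String[] items = "qqq , pp , abc".split("\\s*,\\s*"); // ["qqq", "pp", "abc"]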
(^|\s+)(\S*)(($|\s+)\2)+ with ignore case option /i
She left LEft leFT now
example here - https://regex101.com/r/FEmXui/2
Match 1
Full match 3-23 ` left LEft leFT LEFT`
Group 1. 3-4 ` `
Group 2. 4-8 `left`
Group 3. 18-23 ` LEFT`
Group 4. 18-19 ` `
Using an ANTLR grammar can solve this problem. This is really beyond the reasonable capabilities of regular expressions, although I believe some newer versions of Microsoft's implementation in .NET support this behavior. See this other SO question. If you're stuck with anything but .NET, your best option is going to be a parser generator (you don't have to use ANTLR, that's just my personal preference). Going through the ANTLR4 GitHub page can help get someone started on matching more complex expressions with things like repeating match groups. Another option that doesn't require a whole lot of new learning is to tokenize the string input that you're wanting to match on and pull out the pieces that you want, but this can prove to be extremely messy and create nightmarish chunks of parsing code that are better suited to a generated parser.
This may be the solution:
package com.drl.fw.sch;
import java.util.regex.Pattern;
public class AngularJSMatcher extends SimpleStringMatcher {
Matcher delegate;
public AngularJSMatcher(String lookFor){
super(lookFor);
// ng-repeat
int ind = lookFor.indexOf('-');
if(ind >= 0 ){
StringBuilder sb = new StringBuilder();
boolean first = true;
for (String s : lookFor.split("-")){
if(first){
sb.append(s);
first = false;
}else{
if(s.length() >1){
sb.append(s.substring(0,1).toUpperCase());
sb.append(s.substring(1));
}else{
sb.append(s.toUpperCase());
}
}
}
delegate = new SimpleStringMatcher(sb.toString());
}else {
String words[] = lookFor.split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])");
if(words.length > 1 ){
StringBuilder sb = new StringBuilder();
for (int i=0;i < words.length;i++) {
sb.append(words[i].toLowerCase());
if(i < words.length-1) sb.append("-");
}
delegate = new SimpleStringMatcher(sb.toString());
}
}
}
@Override
public boolean match(String in) {
if(super.match(in)) return true;
if(delegate != null && delegate.match(in)) return true;
return false;
}
public static void main(String[] args){
String lookfor="ngRepeatStart";
Matcher matcher = new AngularJSMatcher(lookfor);
System.out.println(matcher.match( "<header ng-repeat-start=\"item in items\">"));
System.out.println(matcher.match( "var ngRepeatStart=\"item in items\">"));
}
}
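The answer assumes two helper types that are not shown. Hypothetical minimal versions so the example compiles (the real SimpleStringMatcher may well do something more sophisticated):
interface Matcher {
    boolean match(String in);
}

class SimpleStringMatcher implements Matcher {
    private final String lookFor;
    public SimpleStringMatcher(String lookFor) { this.lookFor = lookFor; }
    public boolean match(String in) { return in != null && in.contains(lookFor); }
}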
