how to write ANTLR grammar for parsing plain text file

I am very new to ANTLR and need help writing grammar rules for parsing a plain text file and converting it to an equivalent .xml file, using Java.
Could anyone help me with this?
I tried the grammar below, based on my understanding. It works for a single line (the line parser rule) but not for the full configList (parser rule).
Below is my grammar (.g4):
grammar MyTest;
acl : 'acl number' INT configList ('#' configList)* ;
configList : config ('\n' config)*;
config : line ('\n' line)* ;
line : line WORD INT (WORD)+ ((SOURCE_LOW_IP)* |(WORD)* |(SOURCE_LOW_IP)*)+
|WORD INT (WORD)+
;
fragment
DIGIT : ('0'..'9');
INT : [0-9]+ ; // Define token INT as one or more digits
//WORD : [A-Za-z][A-Za-z_\-]* ;
WORD : [A-Za-z][A-Za-z_\-]* ;
NEWLINE:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t\r\n]+ -> skip ; // toss out whitespace
SOURCE_LOW_IP : INT '.' INT '.' INT '.' INT ; // match IPs in parser
sample input of config list:
acl number 3001
rule 0 permit ip source any
rule 1 permit ip source 172.16.10.1
#
rule 2 permit ip source 172.16.10.2 0.0.0.255
rule 3 deny destination any
rule 4 deny destination 172.16.10.4
rule 5 deny destination 172.16.10.5 0.0.0.255
#
rule 6 permit ip source any destination 172.16.10.6 0.0.0.255
rule 7 permit ip source 172.16.10.7 0.0.0.255 destination 172.16.11.7
#
The expected output format is below (this will be handled in Java once ANTLR generates the .java and other files):
<filterRuleLists>
<filterRuleList id='3001'>
<filterRule action='ALLOW' protocol='ANY'>
<sourceIPRange low='0.0.0.0' high='255.255.255.255' />
<destinationIPRange low='0.0.0.0' high='255.255.255.255' />
<fileLine file='config' startLine='4' stopLine='4' />
</filterRule>
<filterRule action='ALLOW' protocol='ANY'>
<sourceIPRange low='172.16.10.1' high='172.16.10.1' />
<destinationIPRange low='0.0.0.0' high='255.255.255.255' />
<fileLine file='config' startLine='5' stopLine='5' />
</filterRule>
</filterRuleList>
</filterRuleLists>

I am familiar with parser generators, but not ANTLR4 specifically, so this is a best guess. I strongly suspect the grammar rules
configList : config ('\n' config)*;
config : line ('\n' line)* ;
should be rewritten as
configList : config (NEWLINE config)*;
config : line (NEWLINE line)* ;
because the lexer rule
NEWLINE:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
will cause any '\n' characters to be processed into NEWLINE tokens.

Related

ANTLR4 idle org.antlr.v4.gui.TestRig doesn’t output result

I compiled this grammar:
grammar Hello; // Define a grammar called Hello
r : 'hello' ID ; // match keyword hello followed by an identifier
ID : [a-z]+ ; // match lower-case identifiers
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines, \r (Windows)
The command to generate the parser sources (notice I'm generating them with -package hellogrammer):
java -jar antlr-4.9.2-complete.jar -package hellogrammer -o c:\Dev\my\java\ANTLR\test_project\core\src\main\java\hellogrammer c:\Dev\my\java\ANTLR\test_project\core\src\main\java\Hello.g4
It creates the files just fine. Then I compile them, and the result looks like this:
c:\Dev\my\java\ANTLR\test_project\core\target\classes\hellogrammer>ls -1
HelloBaseListener.class
HelloLexer.class
HelloListener.class
'HelloParser$RContext.class'
HelloParser.class
Now when I try to execute the TestRig command, I get no response from the command line:
c:\Dev\my\java\ANTLR\test_project\core\target\classes>java -cp ".;C:/Dev/my/java/ANTLR/antlr-4.9.2-complete.jar" org.antlr.v4.gui.TestRig hellogrammer.Hello -tree
It just hangs, with no error or any other response.

TestRig requires two separate arguments: first the grammar name, then the start rule name. It then begins parsing input from standard input, so you can either type the input (ending with Ctrl-D, or Ctrl-Z followed by Enter on Windows, to signal EOF), or you can redirect a file to stdin with <.
Try:
c:\Dev\my\java\ANTLR\test_project\core\target\classes>java -cp ".;C:/Dev/my/java/ANTLR/antlr-4.9.2-complete.jar" org.antlr.v4.gui.TestRig hellogrammer.Hello r -tree < "your input file"
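The apparent hang is what any program that reads standard input to the end looks like when nothing has been typed yet. A minimal sketch of that behaviour, using only the JDK (the class and method names are made up for illustration and are not part of ANTLR; readAllBytes needs Java 9 or later):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not ANTLR code: shows a read that only returns at EOF.
class StdinEcho {
    // Reads the whole stream. The call returns only when the stream reports
    // end-of-file, so on an interactive console it blocks until EOF is signalled.
    static String readAll(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling StdinEcho.readAll(System.in) from a console waits in the same way until EOF is typed or a file is redirected with <.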

ANTLR4 Token recognition at whitespace

I am new to working with ANTLR parser.
Here is my grammar:
grammar Commands;
file_ : expression EOF;
expression : Command WhiteSpace Shape ;
WhiteSpace : [\t]+ -> skip;
NewLine : ('\r'?'\n'|'\r') -> skip;
Shape : ('square'|'triangle'|'circle'|'hexagon'|'line');
Command : ('fill'|'draw'|'delete');
I am trying to parse a list of sentences such as:
draw circle;
draw triangle;
delete circle;
I'm getting
token recognition error at:' '
Can anyone tell me what is the problem?
PS: I'm working in java 15
UPDATE
file_ : expressions EOF;
expressions
: expressions expression
| expression
;
expression : Command WhiteSpace Shape NewLine ;
WhiteSpace : [\t]+ -> skip;
NewLine : ('\r'?'\n'|'\r') -> skip;
Shape : ('square'|'triangle'|'circle'|'hexagon'|'line');
Command : ('fill'|'draw'|'delete');
Added support for multiple expressions.
I'm getting the same error.
UPDATE
grammar Commands;
file_ : expressions EOF;
expressions
: expressions expression
| expression
;
expression : Command Shape;
WhiteSpace : [\t]+ -> skip;
NewLine : ('\r'?'\n'|'\r') -> skip;
Shape : ('square'|'triangle'|'circle'|'hexagon'|'line');
Command : ('fill'|'draw'|'delete');
Even if I don't include WhiteSpace, I get the same token recognition error.

OK, the errors:
line 3:6 token recognition error at: ' '
line 3:13 token recognition error at: ';'
mean that the lexer encountered a white space char (or semi colon), but there is no lexer rule that matches any of these characters. You must include them in your grammar. Let's say you add them like this (note: still incorrect!):
Semi : ';';
WhiteSpace : [ \t]+ -> skip;
When trying with the rules above, you'd get the error:
line 1:5 missing WhiteSpace at 'circle'
This means the parser cannot match the rule expression : Command WhiteSpace Shape ; to the input draw circle;. This is because inside the lexer, you're skipping all white space characters. This means these tokens will not be available inside a parser rule. Remove them from your parser.
You'll also see the error:
line 1:11 mismatched input ';' expecting <EOF>
which means the input contains a Semi token, and the parser did not expect that. Include the Semi token in your expression rule:
grammar Commands;
file_ : expression EOF;
expression : Command Shape Semi;
Semi : ';';
WhiteSpace : [ \t]+ -> skip;
NewLine : ('\r'?'\n'|'\r') -> skip;
Shape : ('square'|'triangle'|'circle'|'hexagon'|'line');
Command : ('fill'|'draw'|'delete');
The grammar above will work for single expressions. If you want to match multiple expressions, you could do:
expressions
: expressions expression
| expression
;
but given that ANTLR generates LL parsers (not LR as the name ANTLR suggests), it is easier (and makes the parse tree easier to traverse later on) to do this:
expressions
: expression+
;
If you're going to skip all white space chars, you might as well remove the NewLine rule and do this:
WhiteSpace : [ \t\r\n]+ -> skip;
One more thing, the lexer now creates Shape and Command tokens which all have the same type. I'd do something like this instead:
shape : Square | Triangle | ...;
Square : 'square';
Triangle : 'triangle';
...
which will make your life easier while traversing the parse tree when you want to evaluate the input (if that is what you're going to do).
I'd go for something like this:
grammar Commands;
file_ : expressions EOF;
expressions : expression+;
expression : command shape Semi;
shape : Square | Triangle | Circle | Hexagon | Line;
command : Fill | Draw | Delete;
Semi : ';';
WhiteSpace : [ \t\r\n]+ -> skip;
Square : 'square';
Triangle : 'triangle';
Circle : 'circle';
Hexagon : 'hexagon';
Line : 'line';
Fill : 'fill';
Draw : 'draw';
Delete : 'delete';
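To make the two kinds of error above more concrete, here is a hand-written toy lexer in plain Java. It is only an illustration of the idea, not what ANTLR generates: the class name, the token text format and the error strings are invented, and java.util.regex alternation picks the first alternative that matches rather than the longest match as ANTLR does. Whitespace is matched and dropped, known words become tokens, and any character that no rule covers is reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy stand-in for a generated lexer; names and output format are assumptions.
class MiniLexer {
    // One capture group per "lexer rule", in the same order as NAMES.
    private static final String[] NAMES = {"WhiteSpace", "Semi", "Command", "Shape"};
    private static final Pattern RULES = Pattern.compile(
        "([ \\t\\r\\n]+)|(;)|(fill|draw|delete)|(square|triangle|circle|hexagon|line)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RULES.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                // No rule matches this character: record an error and move on.
                tokens.add("token recognition error at: '" + input.charAt(pos) + "'");
                pos++;
                continue;
            }
            // Group 1 is whitespace: matched but never emitted (like "-> skip").
            for (int g = 2; g <= NAMES.length; g++) {
                if (m.group(g) != null) {
                    tokens.add(NAMES[g - 1] + "(" + m.group(g) + ")");
                }
            }
            pos = m.end();
        }
        return tokens;
    }
}
```

Because the whitespace group never produces a token, a parser rule that asks for a WhiteSpace token can never be satisfied, which is the "missing WhiteSpace" error described above.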

Your whitespace token rule WhiteSpace only allows tabs. Add a space to it:
WhiteSpace : [ \t]+ -> skip;
(Usually there's more to a whitespace rule than that, but it should solve your immediate problem.)
You also haven't accounted for the ';' in your input. Either add it to a rule or remove it from your test input temporarily.
expression : Command Shape ';' ;
This would fix it, but seems like it might not be what you really need.

antlr4 rule not ignoring standalone open bracket

The situation:
rule : block+ ;
block : '[' String ']' ;
String : ([a-z] | '[' | '\\]')+ ;
The trick is that String can contain [ without a backslash escape and ] with a backslash escape, so in this example:
[hello\]world][hello[[world]
The first block is parsed correctly, but for the second one the parser tries to find a ] for every [. Is there a way to tell the ANTLR parser to ignore these standalone [ characters? I can't change the format, so I need to find some workaround with ANTLR.
PS: Without ANTLR there is an algorithm to avoid this, something like: collect [ in a queue until the first ] is found, and use only the head of the queue. But I really need ANTLR =_=

You can use Lexer modes.
Lexical modes allow us to split a single lexer grammar into multiple
sublexers. The lexer can only return tokens matched by rules from the
current mode.
You can read more about lexer rules in the ANTLR documentation.
First you will need to split your grammar into a separate lexer and parser. Then just switch to another mode after you see an open bracket.
Parser grammar:
parser grammar TestParser;
options { tokenVocab=TestLexer; }
rul : block+ ;
block : LBR STRING RBR ;
Lexer grammar:
lexer grammar TestLexer;
LBR: '[' -> pushMode(InString);
mode InString;
STRING : ([a-z] | '\\]' | '[')+ ;
RBR: ']' -> popMode;
Working example is here.
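The effect of the two modes can be imitated by hand with a small mode stack. This is only a sketch of the idea in plain Java, not code produced by ANTLR; the class, enum and method names are invented, and for brevity it accepts any character inside a block, whereas the STRING rule above only allows [a-z], '[' and '\]'.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hand-rolled imitation of pushMode/popMode; all names are assumptions.
class BracketBlocks {
    enum Mode { DEFAULT, IN_STRING }

    // Returns the text between each opening '[' and its closing unescaped ']'.
    static List<String> blocks(String input) {
        List<String> result = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (modes.peek() == Mode.DEFAULT) {
                if (c != '[') {
                    throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
                }
                modes.push(Mode.IN_STRING);          // like LBR -> pushMode(InString)
            } else if (c == '\\' && i + 1 < input.length() && input.charAt(i + 1) == ']') {
                current.append("\\]");               // escaped ']' stays in the string
                i++;
            } else if (c == ']') {
                result.add(current.toString());      // like RBR -> popMode
                current.setLength(0);
                modes.pop();
            } else {
                current.append(c);                   // a bare '[' is ordinary text here
            }
        }
        return result;
    }
}
```

Inside IN_STRING a '[' has no special meaning, which is why the second block of the example input no longer needs a matching ']' for every '['.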

Youtube complete Java Regex

I need to parse several pages to get all of their Youtube IDs.
I found many regular expressions on the web, but the Java ones are not complete (they either give me garbage in addition to the IDs, or they miss some IDs).
The one that I found that seems to be complete is hosted here. But it is written in JavaScript and PHP. Unfortunately I couldn't translate them into Java.
Can somebody help me rewrite this PHP regex or the following JavaScript one in Java?
'~
https?:// # Required scheme. Either http or https.
(?:[0-9A-Z-]+\.)? # Optional subdomain.
(?: # Group host alternatives.
youtu\.be/ # Either youtu.be,
| youtube\.com # or youtube.com followed by
\S* # Allow anything up to VIDEO_ID,
[^\w\-\s] # but char before ID is non-ID char.
) # End host alternatives.
([\w\-]{11}) # $1: VIDEO_ID is exactly 11 chars.
(?=[^\w\-]|$) # Assert next char is non-ID or EOS.
(?! # Assert URL is not pre-linked.
[?=&+%\w]* # Allow URL (query) remainder.
(?: # Group pre-linked alternatives.
[\'"][^<>]*> # Either inside a start tag,
| </a> # or inside <a> element text contents.
) # End recognized pre-linked alts.
) # End negative lookahead assertion.
[?=&+%\w]* # Consume any URL (query) remainder.
~ix'
/https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube\.com\S*[^\w\-\s])([\w\-]{11})(?=[^\w\-]|$)(?![?=&+%\w]*(?:['"][^<>]*>|<\/a>))[?=&+%\w]*/ig;

First of all, you need to insert an extra backslash \ for each backslash in the old regex; otherwise Java thinks you are escaping some other special character in the string literal, which you are not.
https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*
Next when you compile your pattern you need to add the CASE_INSENSITIVE flag. Here's an example:
String pattern = "https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*";
Pattern compiledPattern = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = compiledPattern.matcher(link);
while(matcher.find()) {
System.out.println(matcher.group());
}
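The loop above prints the whole matched URL. Since the question asks for the IDs, a self-contained variant might collect capture group 1 instead. This is a sketch using the same pattern; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from the answer above.
class YoutubeIds {
    private static final Pattern VIDEO_URL = Pattern.compile(
        "https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])"
        + "([\\w\\-]{11})(?=[^\\w\\-]|$)"
        + "(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*",
        Pattern.CASE_INSENSITIVE);

    // Returns the 11-character video IDs found in the text, in order of appearance.
    static List<String> extractIds(String page) {
        List<String> ids = new ArrayList<>();
        Matcher m = VIDEO_URL.matcher(page);
        while (m.find()) {
            ids.add(m.group(1));   // group 1 is the VIDEO_ID part of the pattern
        }
        return ids;
    }
}
```

For example, extractIds("Watch https://www.youtube.com/watch?v=dQw4w9WgXcQ and http://youtu.be/abcdefghijk today") returns the two IDs dQw4w9WgXcQ and abcdefghijk.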

Marcus above has a good regex, but I found that it doesn't recognize YouTube links that have "www" but no "http(s)" in them,
for example www.youtube....
I have an update:
^(?:https?:\\/\\/)?(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*
It's the same except for the start: the scheme is optional, and the leading ^ anchors the match to the beginning of the input, so it suits checking a single link rather than scanning a whole page.

Need regex to format file in php

I have a java file that I want to post online. I am using php to format the file.
Does anyone know the regex to turn the comments blue?
INPUT:
/*****
*This is the part
*I want to turn blue
*for my class
*******************/
class MyClass{
String s;
}
Thanks.

Naive version (the s modifier lets . match newlines, so comments spanning several lines are matched):
$formatted = preg_replace('|(/\*.*?\*/)|s', '<span class="blue">$1</span>', $java_code_here);
... not tested, YMMV, etc...

In general, you won't be able to parse specific parts of a Java file using only regular expressions - Java is not a regular language. If your file has additional structure (such as "it always begins with a comment followed by a newline, followed by a class definition"), you can generate a regular expression for such a case. For instance, you'd match /\*+(.*?)\*+/$, where . is assumed to match multiple lines, and $ matches the end of a line.
In general, to make a regex work, you first define what patterns you want to find (rigorously, but in spoken language), and then translate that to standard regular expression notation.
Good luck.

A regex that can parse simple quoted strings should be able to find comments in C/C++-style languages.
I assume Java is of that type.
This is a Perl FAQ sample by someone else, although I added the part about // style comments (with or without line continuation) and reformatted it.
It basically does a global search and replace. Data is copied verbatim if it is not a comment; otherwise the comment is replaced with your color formatting tags.
You should be able to adapt this to PHP, and it is expanded for clarity (maybe too much clarity though).
s{
## Comments, group 1:
(
/\* ## Start of /* ... */ comment
[^*]*\*+ ## Non-* followed by 1-or-more *'s
(?:
[^/*][^*]*\*+
)* ## 0-or-more things which don't start with /
## but do end with '*'
/ ## End of /* ... */ comment
|
// ## Start of // ... comment
(?:
[^\\] ## Any Non-Continuation character ^\
| ## OR
\\\n? ## Any Continuation character followed by 0-1 newline \n
)*? ## To be done 0-many times, stopping at the first end of comment
\n ## End of // comment
)
| ## OR, various things which aren't comments, group 2:
(
" (?: \\. | [^"\\] )* " ## Double quoted text
|
' (?: \\. | [^'\\] )* ' ## Single quoted text
|
. ## Any other char
[^/"'\\]* ## Chars which doesn't start a comment, string, escape
) ## or continuation (escape + newline)
}
{defined $2 ? $2 : "<some color>$1</some color>"}gxse;
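The same idea (match comments and quoted literals in one alternation, and only rewrite the comments) can be sketched in Java as well. This is a simplified port, not a translation of the Perl above: it ignores line continuations in // comments, and the class name and the span markup (borrowed from the naive answer) are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified, hypothetical port of the comment-vs-literal alternation.
class CommentHighlighter {
    private static final Pattern TOKEN = Pattern.compile(
        "(/\\*.*?\\*/|//[^\\n]*)"                                 // group 1: comments
        + "|(\"(?:\\\\.|[^\"\\\\])*\"|'(?:\\\\.|[^'\\\\])*')",    // group 2: quoted literals
        Pattern.DOTALL);

    static String highlight(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Wrap comments; copy quoted literals back unchanged so that
            // comment-like text inside a string is never colored.
            String replacement = m.group(1) != null
                ? "<span class=\"blue\">" + m.group(1) + "</span>"
                : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Text that matches neither alternative is copied through by appendReplacement and appendTail, which plays the role of the "any other char" branch in the Perl version.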
