I am parsing an InputStream for certain patterns to extract values from it, e.g. I would have something like
<span class="filename">foo
I don't want to use a full-fledged HTML parser, as I am not interested in the document structure but only in some well-defined bits of information (only their order is important).
Currently I am using a very simple approach: I have an object for each pattern that contains a char[] for the opening and closing 'tag' (in the example, the opening would be <span class="filename"><a href=" and the closing " to get the URL) and a position marker. For each character read from the InputStream, I iterate over all patterns and call a match(char) function that returns true once the opening pattern matches. From then on I collect the following chars in a StringBuilder until the now-active pattern's match() returns true again. I then call a function with the ID of the pattern and the string read, to process it further.
While this works fine in most cases, I wanted to include regular expressions in the pattern, so I could also match something like
<span class="filename" id="234217">foo
At this point I was sure I would reinvent the wheel as this most certainly would have been done before, and I don't really want to write my own regex parser to begin with. However, I could not find anything that would do what I was looking for.
Unfortunately the Scanner class only matches one pattern, not a list of patterns. What alternatives could I use? It should be lightweight and work on Android.
You mean you want to match any <span> element with a given class attribute, irrespective of other attributes it may have? That's easy enough:
Scanner sc = new Scanner(new File("test.txt"), "UTF-8");
Pattern p = Pattern.compile(
"<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
);
while (sc.findWithinHorizon(p, 0) != null)
{
MatchResult m = sc.match();
System.out.println(m.group(1));
}
The file "test.txt" contains the text of your question, and the output is:
http://example.com/foo
and closing
http://example.com/foo
The Scanner.useDelimiter(Pattern) API seems to be what you're looking for. You would have to use an OR (|) separated pattern string.
This pattern can get complicated very quickly, though.
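To make the OR-separated idea concrete, here is a minimal sketch that combines it with the findWithinHorizon approach from the other answer. The second pattern (class="size") and the class name MultiPatternScan are made up for illustration; the only real point is that each alternative gets its own capturing group, so the group index can serve as the pattern ID.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

final class MultiPatternScan {
    // Two patterns joined with '|'. The "size" pattern is hypothetical.
    // Each alternative has exactly one capturing group, so the index of
    // the non-null group tells us which pattern matched.
    private static final Pattern COMBINED = Pattern.compile(
        "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""   // group 1: URL
        + "|<span[^>]*class=\"size\"[^>]*>([^<]*)<");                       // group 2: size text

    /** Returns "id:value" entries in the order they occur in the stream. */
    static List<String> scan(InputStream in) {
        List<String> hits = new ArrayList<>();
        try (Scanner sc = new Scanner(in, "UTF-8")) {
            // Horizon 0 means "no bound"; the scanner buffers input as needed.
            while (sc.findWithinHorizon(COMBINED, 0) != null) {
                MatchResult m = sc.match();
                for (int id = 1; id <= m.groupCount(); id++) {
                    if (m.group(id) != null) {
                        hits.add(id + ":" + m.group(id));
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String html =
            "<span class=\"filename\" id=\"234217\"><a href=\"http://example.com/foo\">foo</a></span>"
            + "<span class=\"size\">12 kB</span>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        for (String hit : scan(in)) {
            System.out.println(hit);
        }
    }
}
```

Keep in mind that with a horizon of 0 the Scanner may buffer a large part of the input while it searches, which could matter on a memory-constrained device.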
You are right to think this has all been done before :) What you are talking about is a problem of tokenizing and parsing and I therefore suggest you consider JavaCC.
There is something of a learning curve with JavaCC as you learn to understand its grammar, so below is an implementation to get you started.
The grammar is a chopped down version of the standard JavaCC grammar for HTML. You can add more productions for matching other patterns.
options {
JDK_VERSION = "1.5";
static = false;
}
PARSER_BEGIN(eg1)
import java.util.*;
public class eg1 {
private String currentTag;
private String currentSpanClass;
private String currentHref;
public static void main(String args []) throws ParseException {
System.out.println("Starting parse");
eg1 parser = new eg1(System.in);
parser.parse();
System.out.println("Finishing parse");
}
}
PARSER_END(eg1)
SKIP :
{
< ( " " | "\t" | "\n" | "\r" )+ >
| < "<!" ( ~[">"] )* ">" >
}
TOKEN :
{
<STAGO: "<" > : TAG
| <ETAGO: "</" > : TAG
| <PCDATA: ( ~["<"] )+ >
}
<TAG> TOKEN [IGNORE_CASE] :
{
<A: "a" > : ATTLIST
| <SPAN: "span" > : ATTLIST
| <DONT_CARE: (["a"-"z"] | ["0"-"9"])+ > : ATTLIST
}
<ATTLIST> SKIP :
{
< " " | "\t" | "\n" | "\r" >
| < "--" > : ATTCOMM
}
<ATTLIST> TOKEN :
{
<TAGC: ">" > : DEFAULT
| <A_EQ: "=" > : ATTRVAL
| <#ALPHA: ["a"-"z","A"-"Z","_","-","."] >
| <#NUM: ["0"-"9"] >
| <#ALPHANUM: <ALPHA> | <NUM> >
| <A_NAME: <ALPHA> ( <ALPHANUM> )* >
}
<ATTRVAL> TOKEN :
{
<CDATA: "'" ( ~["'"] )* "'"
| "\"" ( ~["\""] )* "\""
| ( ~[">", "\"", "'", " ", "\t", "\n", "\r"] )+
> : ATTLIST
}
<ATTCOMM> SKIP :
{
< ( ~["-"] )+ >
| < "-" ( ~["-"] )+ >
| < "--" > : ATTLIST
}
void attribute(Map<String,String> attrs) :
{
Token n, v = null;
}
{
n=<A_NAME> [ <A_EQ> v=<CDATA> ]
{
String attval;
if (v == null) {
attval = "#DEFAULT";
} else {
attval = v.image;
if( attval.startsWith("\"") && attval.endsWith("\"") ) {
attval = attval.substring(1,attval.length()-1);
} else if( attval.startsWith("'") && attval.endsWith("'") ) {
attval = attval.substring(1,attval.length()-1);
}
}
if( attrs!=null ) attrs.put(n.image.toLowerCase(),attval);
}
}
void attList(Map<String,String> attrs) : {}
{
( attribute(attrs) )+
}
void tagAStart() : {
Map<String,String> attrs = new HashMap<String,String>();
}
{
<STAGO> <A> [ attList(attrs) ] <TAGC>
{
currentHref=attrs.get("href");
if( currentHref != null && "filename".equals(currentSpanClass) )
{
System.out.println("Found URL: "+currentHref);
}
}
}
void tagAEnd() : {}
{
<ETAGO> <A> <TAGC>
{
currentHref=null;
}
}
void tagSpanStart() : {
Map<String,String> attrs = new HashMap<String,String>();
}
{
<STAGO> <SPAN> [ attList(attrs) ] <TAGC>
{
currentSpanClass=attrs.get("class");
}
}
void tagSpanEnd() : {}
{
<ETAGO> <SPAN> <TAGC>
{
currentSpanClass=null;
}
}
void tagDontCareStart() : {}
{
<STAGO> <DONT_CARE> [ attList(null) ] <TAGC>
}
void tagDontCareEnd() : {}
{
<ETAGO> <DONT_CARE> <TAGC>
}
void parse() : {}
{
(
LOOKAHEAD(2) tagAStart() |
LOOKAHEAD(2) tagAEnd() |
LOOKAHEAD(2) tagSpanStart() |
LOOKAHEAD(2) tagSpanEnd() |
LOOKAHEAD(2) tagDontCareStart() |
LOOKAHEAD(2) tagDontCareEnd() |
<PCDATA>
)*
}
I am new to JavaCC and have read multiple lookahead tutorials. However, testing lookahead on a simple grammar file has left me puzzled. In this grammar file I made just two parsing rules: one for doubles and one for integers.
The program is supposed to choose one of them, depending on which one the input matches.
options
{
STATIC =false;
debug_parser = true;
debug_lookahead = true;
}
PARSER_BEGIN(testin)
public class testin
{
public void parse() throws ParseException
{
testin b = new testin(System.in);
b.go();
}
}
PARSER_END(testin)
TOKEN:
{
//testing lookahead
<NUMBER : (["0"-"9"])+>|
<DOT : ".">
}
void go()
:
{}
{
LOOKAHEAD(2) doub() | number()
}
void doub()
:
{
Token bool;
}
{
bool = <NUMBER><DOT><NUMBER>
{
System.out.println(Double.parseDouble(bool.image));
}
}
void number()
:
{
Token mo;
}
{
mo = <NUMBER>
{
System.out.println(Integer.parseInt(mo.image));
}
}
When I test this code, typing a decimal number works, but typing an integer does not: nothing is output. Here is the debug output:
Call: go
Call: doub(LOOKING AHEAD...)
75
Visited token: <<NUMBER>: "75" at line 1 column 1>; Expected token:<<NUMBER>>
Try to specify the whitespace characters:
SKIP : {
< " " | "\t" | "\r" | "\n" >
}
Also, try to specify where the <EOF> is expected:
void go()
:
{}
{
(LOOKAHEAD(2) doub() | number())
<EOF>
}
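To see what the two suggestions buy you, here is a hand-written plain-Java sketch of the same decision (not JavaCC output; the class LookaheadSketch and its methods are invented for illustration). Whitespace is skipped, an explicit end-of-input marker is appended, and the choice between the two rules is made by looking at the first two tokens. Note also that in the question's doub() the variable bool only holds the first <NUMBER> token, so this sketch joins both parts before calling Double.parseDouble.

```java
import java.util.ArrayList;
import java.util.List;

final class LookaheadSketch {
    /** Splits input into number and "." tokens, skipping whitespace; appends "<EOF>". */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++;                                   // plays the role of the SKIP rule
            } else if (c == '.') {
                tokens.add(".");
                i++;
            } else if (c >= '0' && c <= '9') {
                int start = i;
                while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        tokens.add("<EOF>");
        return tokens;
    }

    static boolean isNumber(String s) {
        return !s.isEmpty() && s.chars().allMatch(ch -> ch >= '0' && ch <= '9');
    }

    /** Mirrors go(): two tokens of lookahead pick the rule, then end of input is required. */
    static String parse(String input) {
        List<String> t = tokenize(input);
        String result;
        int next;
        if (isNumber(t.get(0)) && t.get(1).equals(".")) {      // the LOOKAHEAD(2) check
            if (!isNumber(t.get(2))) {
                throw new IllegalArgumentException("Expected digits after '.'");
            }
            result = "double " + Double.parseDouble(t.get(0) + "." + t.get(2));
            next = 3;
        } else if (isNumber(t.get(0))) {
            result = "int " + Integer.parseInt(t.get(0));
            next = 1;
        } else {
            throw new IllegalArgumentException("Expected a number");
        }
        if (!t.get(next).equals("<EOF>")) {
            throw new IllegalArgumentException("Expected end of input");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("75\n"));
        System.out.println(parse("7.5\n"));
    }
}
```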
I'm implementing a Python interpreter using ANTLR4 as the lexer and parser generator. I used the grammar defined at this link:
https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4.
However, the implementation of indentation with the INDENT and DEDENT tokens in the lexer::members section does not work when I define a compound statement.
For example, if I define the following statements:
x=10
while x>2 :
print("hello")
x=x-3
On the line where I reassign the value of the variable x, I should get an indentation error, which I don't get in my current state.
Should I edit something in the lexer code, or something else?
This is the grammar that I'm using, with the lexer::members and the NEWLINE rule defined in the above link.
grammar python;
tokens { INDENT, DEDENT }
@lexer::members {
// A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>();
// The stack that keeps track of the indentation level.
private java.util.Stack<Integer> indents = new java.util.Stack<>();
// The amount of opened braces, brackets and parenthesis.
private int opened = 0;
// The most recently produced token.
private Token lastToken = null;
@Override
public void emit(Token t) {
super.setToken(t);
tokens.offer(t);
}
@Override
public Token nextToken() {
// Check if the end-of-file is ahead and there are still some DEDENTS expected.
if (_input.LA(1) == EOF && !this.indents.isEmpty()) {
// Remove any trailing EOF tokens from our buffer.
for (int i = tokens.size() - 1; i >= 0; i--) {
if (tokens.get(i).getType() == EOF) {
tokens.remove(i);
}
}
// First emit an extra line break that serves as the end of the statement.
this.emit(commonToken(pythonParser.NEWLINE, "\n"));
// Now emit as much DEDENT tokens as needed.
while (!indents.isEmpty()) {
this.emit(createDedent());
indents.pop();
}
// Put the EOF back on the token stream.
this.emit(commonToken(pythonParser.EOF, "<EOF>"));
//throw new Exception("indentazione inaspettata in riga "+this.getLine());
}
Token next = super.nextToken();
if (next.getChannel() == Token.DEFAULT_CHANNEL) {
// Keep track of the last token on the default channel.
this.lastToken = next;
}
return tokens.isEmpty() ? next : tokens.poll();
}
private Token createDedent() {
CommonToken dedent = commonToken(pythonParser.DEDENT, "");
dedent.setLine(this.lastToken.getLine());
return dedent;
}
private CommonToken commonToken(int type, String text) {
int stop = this.getCharIndex() - 1;
int start = text.isEmpty() ? stop : stop - text.length() + 1;
return new CommonToken(this._tokenFactorySourcePair, type, DEFAULT_TOKEN_CHANNEL, start, stop);
}
// Calculates the indentation of the provided spaces, taking the
// following rules into account:
//
// "Tabs are replaced (from left to right) by one to eight spaces
// such that the total number of characters up to and including
// the replacement is a multiple of eight [...]"
//
// -- https://docs.python.org/3.1/reference/lexical_analysis.html#indentation
static int getIndentationCount(String spaces) {
int count = 0;
for (char ch : spaces.toCharArray()) {
switch (ch) {
case '\t':
count += 8 - (count % 8);
break;
default:
// A normal space char.
count++;
}
}
return count;
}
boolean atStartOfInput() {
return super.getCharPositionInLine() == 0 && super.getLine() == 1;
}
}
parse
:( NEWLINE parse
| block ) EOF
;
block
: (statement NEWLINE?| functionDecl)*
;
statement
: assignment
| functionCall
| ifStatement
| forStatement
| whileStatement
| arithmetic_expression
;
assignment
: IDENTIFIER indexes? '=' expression
;
functionCall
: IDENTIFIER OPAREN exprList? CPAREN #identifierFunctionCall
| PRINT OPAREN? exprList? CPAREN? #printFunctionCall
;
arithmetic_expression
: expression
;
ifStatement
: ifStat elifStat* elseStat?
;
ifStat
: IF expression COLON NEWLINE INDENT block DEDENT
;
elifStat
: ELIF expression COLON NEWLINE INDENT block DEDENT
;
elseStat
: ELSE COLON NEWLINE INDENT block DEDENT
;
functionDecl
: DEF IDENTIFIER OPAREN idList? CPAREN COLON NEWLINE INDENT block DEDENT
;
forStatement
: FOR IDENTIFIER IN expression COLON NEWLINE INDENT block DEDENT elseStat?
;
whileStatement
: WHILE expression COLON NEWLINE INDENT block DEDENT elseStat?
;
idList
: IDENTIFIER (',' IDENTIFIER)*
;
exprList
: expression (COMMA expression)*
;
expression
: '-' expression #unaryMinusExpression
| '!' expression #notExpression
| expression '**' expression #powerExpression
| expression '*' expression #multiplyExpression
| expression '/' expression #divideExpression
| expression '%' expression #modulusExpression
| expression '+' expression #addExpression
| expression '-' expression #subtractExpression
| expression '>=' expression #gtEqExpression
| expression '<=' expression #ltEqExpression
| expression '>' expression #gtExpression
| expression '<' expression #ltExpression
| expression '==' expression #eqExpression
| expression '!=' expression #notEqExpression
| expression '&&' expression #andExpression
| expression '||' expression #orExpression
| expression '?' expression ':' expression #ternaryExpression
| expression IN expression #inExpression
| NUMBER #numberExpression
| BOOL #boolExpression
| NULL #nullExpression
| functionCall indexes? #functionCallExpression
| list indexes? #listExpression
| IDENTIFIER indexes? #identifierExpression
| STRING indexes? #stringExpression
| '(' expression ')' indexes? #expressionExpression
| INPUT '(' STRING? ')' #inputExpression
;
list
: '[' exprList? ']'
;
indexes
: ('[' expression ']')+
;
PRINT : 'print';
INPUT : 'input';
DEF : 'def';
IF : 'if';
ELSE : 'else';
ELIF : 'elif';
RETURN : 'return';
FOR : 'for';
WHILE : 'while';
IN : 'in';
NULL : 'null';
OR : '||';
AND : '&&';
EQUALS : '==';
NEQUALS : '!=';
GTEQUALS : '>=';
LTEQUALS : '<=';
POW : '**';
EXCL : '!';
GT : '>';
LT : '<';
ADD : '+';
SUBTRACT : '-';
MULTIPLY : '*';
DIVIDE : '/';
MODULE : '%';
OBRACE : '{' {opened++;};
CBRACE : '}' {opened--;};
OBRACKET : '[' {opened++;};
CBRACKET : ']' {opened--;};
OPAREN : '(' {opened++;};
CPAREN : ')' {opened--;};
SCOLON : ';';
ASSIGN : '=';
COMMA : ',';
QMARK : '?';
COLON : ':';
BOOL
: 'true'
| 'false'
;
NUMBER
: INT ('.' DIGIT*)?
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]*
;
STRING
: ["] (~["\r\n] | '\\\\' | '\\"')* ["]
| ['] (~['\r\n] | '\\\\' | '\\\'')* [']
;
SKIPS
: ( SPACES | COMMENT | LINE_JOINING ){firstLine();} -> skip
;
NEWLINE
: ( {atStartOfInput()}? SPACES
| ( '\r'? '\n' | '\r' | '\f' ) SPACES?
)
{
String newLine = getText().replaceAll("[^\r\n\f]+", "");
String spaces = getText().replaceAll("[\r\n\f]+", "");
int next = _input.LA(1);
if (opened > 0 || next == '\r' || next == '\n' || next == '\f' || next == '#') {
// If we're inside a list or on a blank line, ignore all indents,
// dedents and line breaks.
skip();
}
else {
emit(commonToken(NEWLINE, newLine));
int indent = getIndentationCount(spaces);
int previous = indents.isEmpty() ? 0 : indents.peek();
if (indent == previous) {
// skip indents of the same size as the present indent-size
skip();
}
else if (indent > previous) {
indents.push(indent);
emit(commonToken(pythonParser.INDENT, spaces));
}
else {
// Possibly emit more than 1 DEDENT token.
while(!indents.isEmpty() && indents.peek() > indent) {
this.emit(createDedent());
indents.pop();
}
}
}
}
;
fragment INT
: [1-9] DIGIT*
| '0'
;
fragment DIGIT
: [0-9]
;
fragment SPACES
: [ \t]+
;
fragment COMMENT
: '#' ~[\r\n\f]*
;
fragment LINE_JOINING
: '\\' SPACES? ( '\r'? '\n' | '\r' | '\f' )
;
No, this should not be handled in the grammar. The lexer should simply emit the (faulty) INDENT token. The parser should, at runtime, produce an error. Something like this:
String source = "x=10\n" +
"while x>2 :\n" +
" print(\"hello\")\n" +
" x=x-3\n";
Python3Lexer lexer = new Python3Lexer(CharStreams.fromString(source));
Python3Parser parser = new Python3Parser(new CommonTokenStream(lexer));
// Remove default error-handling
parser.removeErrorListeners();
// Add custom error-handling
parser.addErrorListener(new BaseErrorListener() {
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object o, int i, int i1, String s, RecognitionException e) {
CommonToken token = (CommonToken) o;
if (token.getType() == Python3Parser.INDENT) {
// The parser encountered an unexpected INDENT token
// TODO throw your exception
}
// TODO handle other errors
}
});
// Trigger the error
parser.file_input();
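To illustrate why the lexer is not the place to reject this input, here is a small plain-Java sketch of the indentation stack on its own (no ANTLR involved; IndentSketch is an invented name, and the indentation depths of 2 and 4 spaces are an assumption, since the exact whitespace of the example is not visible here). The lexer logic happily produces a second INDENT after the print line; only the parser knows that an INDENT is not allowed after a simple statement.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class IndentSketch {
    /** Same tab rule as getIndentationCount in the grammar: tabs advance to a multiple of 8. */
    static int indentationCount(String line) {
        int count = 0;
        for (char ch : line.toCharArray()) {
            if (ch == '\t') {
                count += 8 - (count % 8);
            } else if (ch == ' ') {
                count++;
            } else {
                break;                       // first non-whitespace character ends the indent
            }
        }
        return count;
    }

    /** Emits INDENT / DEDENT markers and one LINE(...) entry per non-blank line. */
    static List<String> tokens(String source) {
        List<String> out = new ArrayList<>();
        Deque<Integer> indents = new ArrayDeque<>();   // stack of open indentation levels
        for (String line : source.split("\n")) {
            if (line.trim().isEmpty()) {
                continue;                    // blank lines do not affect indentation
            }
            int indent = indentationCount(line);
            int previous = indents.isEmpty() ? 0 : indents.peek();
            if (indent > previous) {
                indents.push(indent);
                out.add("INDENT");
            } else {
                while (!indents.isEmpty() && indents.peek() > indent) {
                    indents.pop();
                    out.add("DEDENT");
                }
            }
            out.add("LINE(" + line.trim() + ")");
        }
        while (!indents.isEmpty()) {         // close whatever is still open at end of input
            indents.pop();
            out.add("DEDENT");
        }
        return out;
    }

    public static void main(String[] args) {
        String source = "x=10\n"
                      + "while x>2 :\n"
                      + "  print(\"hello\")\n"
                      + "    x=x-3\n";
        for (String token : tokens(source)) {
            System.out.println(token);
        }
    }
}
```

The second INDENT in the output is the token the error listener above would be called for.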
Here is a short javaCC code:
PARSER_BEGIN(TestParser)
public class TestParser
{
}
PARSER_END(TestParser)
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}
TOKEN : /* LITERALS */
{
<VOID: "void">
| <LPAR: "("> | <RPAR: ")">
| <LBRAC: "{"> | <RBRAC: "}">
| <COMMA: ",">
| <DATATYPE: "int">
| <#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <DOUBLE_QUOTE_LITERAL: "\"" (~["\""])*"\"" >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
| <VARIABLE: "$"<IDENTIFIER> >
}
public void input():{} { (statement())+ <EOF> }
private void statement():{}
{
<VOID> <IDENTIFIER> <LPAR> (<DATATYPE> <IDENTIFIER> (<COMMA> <DATATYPE> <IDENTIFIER>)*)? <RPAR>
<LBRAC>
<RBRAC>
}
I'd like this parser to handle the following kind of input with a "grammar-free" section (character '}' would be the end of the section ):
void fun(int i, int j)
{
Hello world the value of i is ${i}
and j=${j}.
}
the grammar-free section would return a
java.util.List<String_or_VariableReference>
How should I modify my javacc parser to handle this section ?
Thanks.
If I understand the question correctly, you want to allow essentially arbitrary input for a while and then switch back to your language. If you can decide when to make the switch based purely on tokens, then this is easy to do using two lexical states. Use the DEFAULT state for your programming language. When a "{" is seen in the DEFAULT state, switch to the other state:
TOKEN: { <LBRACE : "{" > : FREE }
In the FREE state, when a "}" is seen, switch back to the DEFAULT state; when any other character is seen, pass it on to the parser.
<FREE> TOKEN : { <RBRACE : "}" > : DEFAULT }
<FREE> TOKEN : { <OTHER : ~["}"] > : FREE }
In the parser you can have
void freeSection() : {} { <LBRACE> (<OTHER>)* <RBRACE> }
If you want to do something with all those OTHER characters, see question 5.2 in the FAQ. http://www.engr.mun.ca/~theo/JavaCC-FAQ
If you want to capture variable references such as "${i}" in the FREE state, you can do that too. Add
<FREE> TOKEN : { <VARREF : "${" (["a"-"z"]|["A"-"Z"])* "}" > }
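Once the parser has collected the text of the free section, turning it into the list the question asks for is plain string work. Below is a minimal sketch in ordinary Java; the Part class stands in for the question's String_or_VariableReference and, like FreeSection itself, is a made-up name. The identifier pattern is also an assumption (letters, digits and underscores).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FreeSection {
    /** Hypothetical element type: either literal text or a variable reference. */
    static final class Part {
        final boolean isVariable;
        final String text;

        Part(boolean isVariable, String text) {
            this.isVariable = isVariable;
            this.text = text;
        }

        @Override
        public String toString() {
            return (isVariable ? "VAR:" : "TEXT:") + text;
        }
    }

    // Matches ${name}; group 1 is the variable name.
    private static final Pattern VARREF =
        Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    /** Splits the body of a free section into text and variable-reference parts. */
    static List<Part> split(String body) {
        List<Part> parts = new ArrayList<>();
        Matcher m = VARREF.matcher(body);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                parts.add(new Part(false, body.substring(last, m.start())));
            }
            parts.add(new Part(true, m.group(1)));
            last = m.end();
        }
        if (last < body.length()) {
            parts.add(new Part(false, body.substring(last)));   // trailing text
        }
        return parts;
    }

    public static void main(String[] args) {
        for (Part part : split("Hello world the value of i is ${i}\nand j=${j}.")) {
            System.out.println(part);
        }
    }
}
```

The same splitting could instead be done token by token in the grammar with the VARREF and OTHER tokens; doing it afterwards in Java just keeps the grammar small.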
I have an assignment to use JavaCC to make a Top-Down Parser with Semantic Analysis for a language supplied by the lecturer. I have the production rules written out and no errors.
I'm completely stuck on how to use JJTree with my code, and hours of scouring the internet for tutorials haven't gotten me anywhere.
Could anyone take some time to explain how to use JJTree in the code?
Or if there's a hidden step-by-step tutorial out there somewhere that would be a great help!
Here are some of my production rules in case they help.
Thanks in advance!
void program() : {}
{
(decl())* (function())* main_prog()
}
void decl() #void : {}
{
(
var_decl() | const_decl()
)
}
void var_decl() #void : {}
{
<VAR> ident_list() <COLON> type()
(<COMMA> ident_list() <COLON> type())* <SEMIC>
}
void const_decl() #void : {}
{
<CONSTANT> identifier() <COLON> type() <EQUAL> expression()
( <COMMA> identifier() <COLON> type() <EQUAL > expression())* <SEMIC>
}
void function() #void : {}
{
type() identifier() <LBR> param_list() <RBR>
<CBL>
(decl())*
(statement() <SEMIC> )*
returnRule() (expression() | {} )<SEMIC>
<CBR>
}
Creating an AST using JavaCC looks a lot like creating a "normal" parser (defined in a jj file). If you already have a working grammar, it's (relatively) easy :)
Here are the steps needed to create an AST:
1. rename your jj grammar file to jjt
2. decorate it with root labels (my own terminology)
3. invoke jjtree on your jjt grammar, which will generate a jj file for you
4. invoke javacc on your generated jj grammar
5. compile the generated java source files
6. test it
Here's a quick step-by-step tutorial, assuming you're using MacOS or *nix, have the javacc.jar file in the same directory as your grammar file(s) and java and javac are on your system's PATH:
1
Assuming your jj grammar file is called TestParser.jj, rename it:
mv TestParser.jj TestParser.jjt
2
Now the tricky part: decorating your grammar so that the proper AST structure is created. You decorate a production rule (and thereby the AST node it creates) by adding a # followed by an identifier after it (and before the :). In your original question, you have a lot of #void in different productions. In JJTree, #void means that no node is created for that production at all, so none of these rules show up in the tree: this is not what you want.
If you don't decorate your production, the name of the production is used as the type of the node (so, you can remove the #void):
void decl() :
{}
{
var_decl()
| const_decl()
}
Now the rule simply returns whatever AST the rule var_decl() or const_decl() returned.
Let's now have a look at the (simplified) var_decl rule:
void var_decl() #VAR :
{}
{
<VAR> id() <COL> id() <EQ> expr() <SCOL>
}
void id() #ID :
{}
{
<ID>
}
void expr() #EXPR :
{}
{
<ID>
}
which I decorated with the #VAR type. This now means that this rule will return the following tree structure:
      VAR
     / | \
    /  |  \
  ID  ID  EXPR
As you can see, the terminals are discarded from the AST! This also means that the id and expr rules lose the text their <ID> terminal matched. Of course, this is not what you want. For the rules that need to keep the inner text the terminal matched, you need to explicitly set the .value of the tree to the .image of the matched terminal:
void id() #ID :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void expr() #EXPR :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
causing the input "var x : int = i;" to look like this:
             VAR
              |
        .-----+------.
       /      |       \
      /       |        \
 ID["x"]  ID["int"]  EXPR["i"]
This is how you create a proper structure for your AST. Below follows a small grammar that is a very simple version of your own grammar including a small main method to test it all:
// TestParser.jjt
PARSER_BEGIN(TestParser)
public class TestParser {
public static void main(String[] args) throws ParseException {
TestParser parser = new TestParser(new java.io.StringReader(args[0]));
SimpleNode root = parser.program();
root.dump("");
}
}
PARSER_END(TestParser)
TOKEN :
{
< OPAR : "(" >
| < CPAR : ")" >
| < OBR : "{" >
| < CBR : "}" >
| < COL : ":" >
| < SCOL : ";" >
| < COMMA : "," >
| < VAR : "var" >
| < EQ : "=" >
| < CONST : "const" >
| < ID : ("_" | <LETTER>) ("_" | <ALPHANUM>)* >
}
TOKEN :
{
< #DIGIT : ["0"-"9"] >
| < #LETTER : ["a"-"z","A"-"Z"] >
| < #ALPHANUM : <LETTER> | <DIGIT> >
}
SKIP : { " " | "\t" | "\r" | "\n" }
SimpleNode program() #PROGRAM :
{}
{
(decl())* (function())* <EOF> {return jjtThis;}
}
void decl() :
{}
{
var_decl()
| const_decl()
}
void var_decl() #VAR :
{}
{
<VAR> id() <COL> id() <EQ> expr() <SCOL>
}
void const_decl() #CONST :
{}
{
<CONST> id() <COL> id() <EQ> expr() <SCOL>
}
void function() #FUNCTION :
{}
{
type() id() <OPAR> params() <CPAR> <OBR> /* ... */ <CBR>
}
void type() #TYPE :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void id() #ID :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void params() #PARAMS :
{}
{
(param() (<COMMA> param())*)?
}
void param() #PARAM :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void expr() #EXPR :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
3
Let the jjtree class (included in javacc.jar) create a jj file for you:
java -cp javacc.jar jjtree TestParser.jjt
4
The previous step has created the file TestParser.jj (if everything went okay). Let javacc (also present in javacc.jar) process it:
java -cp javacc.jar javacc TestParser.jj
5
To compile all source files, do:
javac -cp .:javacc.jar *.java
(on Windows, do: javac -cp .;javacc.jar *.java)
6
The moment of truth has arrived: let's see if everything actually works! To let the parser process the input:
var n : int = I;
const x : bool = B;
double f(a,b,c)
{
}
execute the following:
java -cp . TestParser "var n : int = I; const x : bool = B; double f(a,b,c) { }"
and you should see the following being printed on your console:
PROGRAM
 decl
  VAR
   ID
   ID
   EXPR
 decl
  CONST
   ID
   ID
   EXPR
 FUNCTION
  TYPE
  ID
  PARAMS
   PARAM
   PARAM
   PARAM
Note that you don't see the text the ID's matched, but believe me, they're there. The method dump() simply does not show it.
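If you want to see the stored text, you can write your own dump that prints the value next to the node name. The sketch below shows the idea on a tiny stand-in tree so that it runs on its own: the Node class here is hypothetical and only mimics what the generated node class gives you (a name, the value set via jjtThis.value, and the children). With the generated classes you would walk the real nodes in the same recursive way.

```java
import java.util.ArrayList;
import java.util.List;

final class DumpSketch {
    /** Hypothetical stand-in for a generated JJTree node. */
    static final class Node {
        final String name;
        final Object value;                      // what jjtThis.value was set to, or null
        final List<Node> children = new ArrayList<>();

        Node(String name, Object value) {
            this.name = name;
            this.value = value;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Like dump(prefix), but also shows the value when one was stored. */
    static String dump(Node node, String prefix) {
        StringBuilder sb = new StringBuilder();
        sb.append(prefix).append(node.name);
        if (node.value != null) {
            sb.append("[\"").append(node.value).append("\"]");
        }
        sb.append('\n');
        for (Node child : node.children) {
            sb.append(dump(child, prefix + " "));   // one extra space per tree level
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The tree for the input "var x : int = i;" from the explanation above.
        Node var = new Node("VAR", null)
            .add(new Node("ID", "x"))
            .add(new Node("ID", "int"))
            .add(new Node("EXPR", "i"));
        System.out.print(dump(var, ""));
    }
}
```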
HTH
EDIT
For a working grammar including expressions, you could have a look at the following expression evaluator of mine: https://github.com/bkiers/Curta (the grammar is in src/grammar). You might want to have a look at how to create root-nodes in case of binary expressions.
Here is an example that uses JJTree
http://anandsekar.github.io/writing-an-interpretter-using-javacc/
There are two styles of comments, C-style and C++-style. How can I recognize them?
/* comments */
// comments
I am free to use any method or third-party library.
To reliably find all comments in a Java source file, I wouldn't use regex, but a real lexer (aka tokenizer).
Two popular choices for Java are:
JFlex: http://jflex.de
ANTLR: http://www.antlr.org
Contrary to popular belief, ANTLR can also be used to create only a lexer without the parser.
Here's a quick ANTLR demo. You need the following files in the same directory:
antlr-3.2.jar
JavaCommentLexer.g (the grammar)
Main.java
Test.java (a valid (!) java source file with exotic comments)
JavaCommentLexer.g
lexer grammar JavaCommentLexer;
options {
filter=true;
}
SingleLineComment
: FSlash FSlash ~('\r' | '\n')*
;
MultiLineComment
: FSlash Star .* Star FSlash
;
StringLiteral
: DQuote
( (EscapedDQuote)=> EscapedDQuote
| (EscapedBSlash)=> EscapedBSlash
| Octal
| Unicode
| ~('\\' | '"' | '\r' | '\n')
)*
DQuote {skip();}
;
CharLiteral
: SQuote
( (EscapedSQuote)=> EscapedSQuote
| (EscapedBSlash)=> EscapedBSlash
| Octal
| Unicode
| ~('\\' | '\'' | '\r' | '\n')
)
SQuote {skip();}
;
fragment EscapedDQuote
: BSlash DQuote
;
fragment EscapedSQuote
: BSlash SQuote
;
fragment EscapedBSlash
: BSlash BSlash
;
fragment FSlash
: '/' | '\\' ('u002f' | 'u002F')
;
fragment Star
: '*' | '\\' ('u002a' | 'u002A')
;
fragment BSlash
: '\\' ('u005c' | 'u005C')?
;
fragment DQuote
: '"'
| '\\u0022'
;
fragment SQuote
: '\''
| '\\u0027'
;
fragment Unicode
: '\\u' Hex Hex Hex Hex
;
fragment Octal
: '\\' ('0'..'3' Oct Oct | Oct Oct | Oct)
;
fragment Hex
: '0'..'9' | 'a'..'f' | 'A'..'F'
;
fragment Oct
: '0'..'7'
;
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
JavaCommentLexer lexer = new JavaCommentLexer(new ANTLRFileStream("Test.java"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
for(Object o : tokens.getTokens()) {
CommonToken t = (CommonToken)o;
if(t.getType() == JavaCommentLexer.SingleLineComment) {
System.out.println("SingleLineComment :: " + t.getText().replace("\n", "\\n"));
}
if(t.getType() == JavaCommentLexer.MultiLineComment) {
System.out.println("MultiLineComment :: " + t.getText().replace("\n", "\\n"));
}
}
}
}
Test.java
\u002f\u002a <- multi line comment start
multi
line
comment // not a single line comment
\u002A/
public class Test {
// single line "not a string"
String s = "\u005C" \242 not // a comment \\\" \u002f \u005C\u005C \u0022;
/*
regular multi line comment
*/
char c = \u0027"'; // the " is not the start of a string
char q1 = '\u005c''; // == '\''
char q2 = '\u005c\u0027'; // == '\''
char q3 = \u0027\u005c\u0027\u0027; // == '\''
char c4 = '\047';
String t = "/*";
\u002f\u002f another single line comment
String u = "*/";
}
Now, to run the demo, do:
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp antlr-3.2.jar org.antlr.Tool JavaCommentLexer.g
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ javac -cp antlr-3.2.jar *.java
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp .:antlr-3.2.jar Main
and you'll see the following being printed to the console:
MultiLineComment :: \u002f\u002a <- multi line comment start\nmulti\nline\ncomment // not a single line comment\n\u002A/
SingleLineComment :: // single line "not a string"
SingleLineComment :: // a comment \\\" \u002f \u005C\u005C \u0022;
MultiLineComment :: /*\n regular multi line comment\n */
SingleLineComment :: // the " is not the start of a string
SingleLineComment :: // == '\''
SingleLineComment :: // == '\''
SingleLineComment :: // == '\''
SingleLineComment :: \u002f\u002f another single line comment
EDIT
You can create a sort of lexer with regex yourself, of course. The following demo does not handle Unicode literals inside source files, however:
Test2.java
/* <- multi line comment start
multi
line
comment // not a single line comment
*/
public class Test2 {
// single line "not a string"
String s = "\" \242 not // a comment \\\" ";
/*
regular multi line comment
*/
char c = '"'; // the " is not the start of a string
char q1 = '\''; // == '\''
char c4 = '\047';
String t = "/*";
// another single line comment
String u = "*/";
}
Main2.java
import java.util.*;
import java.io.*;
import java.util.regex.*;
public class Main2 {
private static String read(File file) throws IOException {
StringBuilder b = new StringBuilder();
Scanner scan = new Scanner(file);
while(scan.hasNextLine()) {
String line = scan.nextLine();
b.append(line).append('\n');
}
return b.toString();
}
public static void main(String[] args) throws Exception {
String contents = read(new File("Test2.java"));
String slComment = "//[^\r\n]*";
String mlComment = "/\\*[\\s\\S]*?\\*/";
String strLit = "\"(?:\\\\.|[^\\\\\"\r\n])*\"";
String chLit = "'(?:\\\\.|[^\\\\'\r\n])+'";
String any = "[\\s\\S]";
Pattern p = Pattern.compile(
String.format("(%s)|(%s)|%s|%s|%s", slComment, mlComment, strLit, chLit, any)
);
Matcher m = p.matcher(contents);
while(m.find()) {
String hit = m.group();
if(m.group(1) != null) {
System.out.println("SingleLine :: " + hit.replace("\n", "\\n"));
}
if(m.group(2) != null) {
System.out.println("MultiLine :: " + hit.replace("\n", "\\n"));
}
}
}
}
If you run Main2, the following is printed to the console:
MultiLine :: /* <- multi line comment start\nmulti\nline\ncomment // not a single line comment\n*/
SingleLine :: // single line "not a string"
MultiLine :: /*\n regular multi line comment\n */
SingleLine :: // the " is not the start of a string
SingleLine :: // == '\''
SingleLine :: // another single line comment
EDIT: I've been searching for a while, but here is the real working regex:
String regex = "((//[^\n\r]*)|(/\\*(.+?)\\*/))"; // New Regex
List<String> comments = new ArrayList<String>();
Pattern p = Pattern.compile(regex, Pattern.DOTALL);
Matcher m = p.matcher(code);
// code is the C-style source text in which you want to search
while (m.find())
{
System.out.println(m.group(1));
comments.add(m.group(1));
}
With this input:
import Blah;
//Comment one//
line();
/* Blah */
line2(); // something weird
/* Multiline
another line for the comment
*/
It generates this output:
//Comment one//
/* Blah */
// something weird
/* Multiline
another line for the comment
*/
Notice that the last three lines of the output are one single print.
Have you tried regular expressions? Here is a nice wrap-up with a Java example. It might need some tweaking. However, regular expressions alone won't be sufficient for more complicated structures (nested comments, "comments" inside strings), but they are a nice start.
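To show what "comments" inside strings do to a comment-only regex, here is a small sketch that runs two patterns over the same snippet. The class name CommentRegexPitfall is made up; the second pattern uses the same trick as Main2 above (also match string literals, but only capture comments). Char literals are left out here to keep it short, so this is not a complete solution either.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentRegexPitfall {
    // Comments only: no awareness of string literals.
    static final Pattern NAIVE =
        Pattern.compile("//[^\r\n]*|/\\*[\\s\\S]*?\\*/");

    // Comments are captured in group 1; string literals are matched as well,
    // which consumes them so their contents can never start a comment.
    static final Pattern AWARE = Pattern.compile(
        "(//[^\r\n]*|/\\*[\\s\\S]*?\\*/)|\"(?:\\\\.|[^\\\\\"\r\n])*\"");

    static List<String> naive(String code) {
        List<String> hits = new ArrayList<>();
        Matcher m = NAIVE.matcher(code);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    static List<String> aware(String code) {
        List<String> hits = new ArrayList<>();
        Matcher m = AWARE.matcher(code);
        while (m.find()) {
            if (m.group(1) != null) {        // null means a string literal was matched
                hits.add(m.group(1));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String code = "String t = \"/*\"; // real comment\n"
                    + "String u = \"*/\";";
        System.out.println("naive: " + naive(code));
        System.out.println("aware: " + aware(code));
    }
}
```

The naive pattern starts a "comment" at the /* inside the first string literal and runs to the */ inside the second one, swallowing the real comment on the way; the string-aware pattern reports only the real comment.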