Grammar-free section in JavaCC - java

Here is a short JavaCC grammar:
PARSER_BEGIN(TestParser)
public class TestParser
{
}
PARSER_END(TestParser)
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}
TOKEN : /* LITERALS */
{
<VOID: "void">
| <LPAR: "("> | <RPAR: ")">
| <LBRAC: "{"> | <RBRAC: "}">
| <COMMA: ",">
| <DATATYPE: "int">
| <#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <DOUBLE_QUOTE_LITERAL: "\"" (~["\""])*"\"" >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
| <VARIABLE: "$"<IDENTIFIER> >
}
public void input():{} { (statement())+ <EOF> }
private void statement():{}
{
<VOID> <IDENTIFIER> <LPAR> (<DATATYPE> <IDENTIFIER> (<COMMA> <DATATYPE> <IDENTIFIER>)*)? <RPAR>
<LBRAC>
<RBRAC>
}
I'd like this parser to handle the following kind of input with a "grammar-free" section (the character '}' would mark the end of the section):
void fun(int i, int j)
{
Hello world the value of i is ${i}
and j=${j}.
}
The grammar-free section would return a java.util.List<String_or_VariableReference>.
How should I modify my JavaCC parser to handle this section?
Thanks.

If I understand the question correctly, you want to allow essentially arbitrary input for a while and then switch back to your language. If you can decide when to make the switch based purely on tokens, then this is easy to do using two lexical states. Use the default state for your programming language. When a "{" is seen in the DEFAULT state, switch to the other state
TOKEN: { <LBRACE : "{" > : FREE }
In the FREE state, when a "}" is seen, switch back to the DEFAULT state; when any other character is seen, pass it on to the parser.
<FREE> TOKEN : { <RBRACE : "}" > : DEFAULT }
<FREE> TOKEN : { <OTHER : ~["}"] > : FREE }
In the parser you can have
void freeSection() : {} { <LBRACE> (<OTHER>)* <RBRACE> }
If you want to do something with all those OTHER characters, see question 5.2 in the FAQ. http://www.engr.mun.ca/~theo/JavaCC-FAQ
If you want to capture variable references such as "${i}" in the FREE state, you can do that too. Add
<FREE> TOKEN : { <VARREF : "${" (["a"-"z"]|["A"-"Z"])* "}" > }
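Putting the pieces together, here is a minimal sketch (not part of the original answer) of a free section that builds the java.util.List asked for in the question. VariableReference is a hypothetical class you would define yourself; runs of OTHER characters are joined into plain strings:
java.util.List freeSection() :
{
  java.util.List parts = new java.util.ArrayList();
  StringBuilder sb = new StringBuilder();
  Token t;
}
{
  <LBRACE>
  (
    t=<OTHER>  { sb.append(t.image); }
  | t=<VARREF> {
      if (sb.length() > 0) { parts.add(sb.toString()); sb.setLength(0); }
      // strip the leading "${" and trailing "}" to get the variable name
      parts.add(new VariableReference(t.image.substring(2, t.image.length() - 1)));
    }
  )*
  <RBRACE>
  {
    if (sb.length() > 0) parts.add(sb.toString());
    return parts;
  }
}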

Related

How to parse this grammar?

I want to create a recursive descent parser in Java for the following grammar (I have managed to create the tokens). This is the relevant part of the grammar:
expression ::= numeric_expression | identifier | "null"
identifier ::= "a..z,$,_"
numeric_expression ::= ( ( "-" | "++" | "--" ) expression )
| ( expression ( "++" | "--" ) )
| ( expression ( "+" | "+=" | "-" | "-=" | "*" | "*=" | "/" | "/=" | "%" | "%=" ) expression )
arglist ::= expression { "," expression }
I have written code for parsing numeric_expression (returning null on an invalid token):
NumericAST<? extends OpAST> parseNumericExpr() {
OpAST op;
if (token.getCodes() == Lexer.CODES.UNARY_OP) { //Check for unary operator like "++" or "--" etc
op = new UnaryOpAST(token.getValue());
token = getNextToken();
AST expr = parseExpr(); // Method that returns expression node.
if (expr == null) {
op = null;
return null;
} else {
if (checkSemi()) {
System.out.println("UNARY AST CREATED");
return new NumericAST<OpAST>(expr, op, false);
}
else {
return null;
}
}
} else { // Binary operation like "a+b", where a,b ->expression
AST expr = parseExpr();
if (expr == null) {
return null;
} else {
token = getNextToken();
if (token.getCodes() == Lexer.CODES.UNARY_OP) {
op = new UnaryOpAST(token.getValue());
return new NumericAST<OpAST>(expr, op, true);
} else if (token.getCodes() == Lexer.CODES.BIN_OP) {
op = new BinaryOpAST(token.getValue());
token = getNextToken();
AST expr2 = parseExpr();
if (expr2 == null) {
op = null;
expr = null;
return null;
} else {
if (checkSemi()) {
System.out.println("BINARY AST CREATED");
return new NumericAST<OpAST>(expr, op, expr2);
}
else {
return null;
}
}
} else {
expr = null;
return null;
}
}
}
}
Now, if I get a unary operator like ++ I can directly call this method, but I don't know how to recognize the other grammar rules that start with the same production, like arglist and numeric_expression both having "expression" as their start production.
My question is:
How do I decide whether to call parseNumericExpr() or parseArgList() (method not shown above) if I get an expression token?
In order to write a recursive descent parser, you need an appropriate top-down grammar, normally an LL(1) grammar, although it's common to write the grammar using EBNF operators, as shown in the example grammar on Wikipedia's page on recursive descent grammars.
Unfortunately, your grammar is not LL(1), and the question you raise is a consequence of that fact. An LL(1) grammar has the property that the parser can always determine which production to use by examining only the next input token, which puts some severe constraints on the grammar, including:
No two productions for the same non-terminal can start with the same symbol.
No production can be left-recursive (i.e. the first symbol on the right-hand side is the defining non-terminal); a short example of removing left recursion follows this list.
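For instance (an illustrative sketch, using a rule that appears later in this thread), a left-recursive rule such as
table ::= table "." identifier | identifier
can be rewritten for a top-down parser as
table ::= identifier { "." identifier }
so the parser always starts by consuming an identifier and then decides, one token at a time, whether to continue the repetition.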
Here's a small rearrangement of your grammar which will work:
-- I added number here in order to be explicit.
atom ::= identifier | number | "null" | "(" expression ")"
-- I added function calls here, but it's arguable that this syntax accepts
-- a lot of invalid expressions
primary ::= atom { "++" | "--" | "(" [ arglist ] ")" }
factor ::= [ "-" | "++" | "--" ] primary
term ::= factor { ( "*" | "/" | "%" ) factor }
value ::= term { ( "+" | "-" ) term }
-- This adds the ordinary "=" assignment to the list in case it was
-- omitted by accident. Also, see the note below.
expression ::= { value ( "=" | "+=" | "-=" | "*=" | "/=" | "%=" ) } value
arglist ::= expression { "," expression }
The last expression rule is an attempt to capture the usual syntax of assignment operators (which associate to the right, not to the left), but it suffers from a classic problem addressed by this highly related question. I don't think I have a better answer to this issue than the one I wrote three years ago, so I hope it is still useful.
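To illustrate how the EBNF repetition in a rule like term ::= factor { ("*" | "/" | "%") factor } turns into a hand-written method, here is a rough sketch in the style of your code. It is only an outline: parseFactor(), isMulOp() and a BinaryOpAST constructor taking both operands are assumptions, not code you have shown.
AST parseTerm() {
    AST left = parseFactor();                      // the first factor is mandatory
    // the EBNF repetition { ... } becomes a loop: keep going while the
    // next token is one of the operators listed in the rule
    while (token != null && token.getCodes() == Lexer.CODES.BIN_OP
            && isMulOp(token.getValue())) {
        String op = token.getValue();
        token = getNextToken();
        AST right = parseFactor();                 // operand after the operator
        if (right == null) return null;
        left = new BinaryOpAST(op, left, right);   // builds a left-associative tree
    }
    return left;
}

boolean isMulOp(String s) {
    return s.equals("*") || s.equals("/") || s.equals("%");
}
Each nonterminal in the rearranged grammar gets one such method, and the choice you asked about disappears because every method simply starts by calling the method for the next-lower level.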

Cannot parse System.out.println() in JavaCC

I was trying to parse the System.out.println() statement as an OutputStatement for a Java grammar. Here's the production rule in EBNF:
Statement::=( LabeledStatement | AssertStatement | Block | EmptyStatement | StatementExpression | SwitchStatement | IfStatement | WhileStatement | DoStatement | ForStatement | BreakStatement | ContinueStatement | ReturnStatement | ThrowStatement | SynchronizedStatement | TryStatement|OutputStatement)
OutputStatement::="System.out.print"["ln"]"("Arguments")" ";"
This is strictly according to the Java Grammar as specified in the javacc folder file C:\javacc-6.0\examples\JavaGrammars\Java 1.0.2.jj
Now when I coded the production rule in JavaCC it came as:
OutputStmt OutputStatement():
{
Token tk;
Expression args;
boolean ln=false;
int line;
int column;
}
{
{line=token.beginLine;column=token.beginColumn;args=null;ln=false;}
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("System")})
"."
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("out")})
"."
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("print")})
[
tk=<STRING_LITERAL> LOOKAHEAD({tk.image.equals("ln")})
{
ln=true;
}
]
"("
args=Expression()
")" ";"
{
return new OutputStmt(line,column,token.endLine,token.endColumn,ln,args);
}
}
Now this throws LOOKAHEAD warnings and errors in the generated parser. Can anyone please help?
EDIT: The main problem, it seems, is that JavaCC generates methods that do not initialize Token tk, which gives me the error that tk cannot be resolved.
The following will work.
OutputStmt OutputStatement() :
{
Token tk;
Expression args;
boolean ln;
int line;
int column;
}
{
{line=token.beginLine;column=token.beginColumn;args=null;ln=false;}
LOOKAHEAD({getToken(1).image.equals("System")})
<ID>
"."
LOOKAHEAD({getToken(1).image.equals("out")})
<ID>
"."
LOOKAHEAD({getToken(1).image.equals("println") || getToken(1).image.equals("print") })
tk=<ID> { ln = tk.image.equals("println" ) ; }
"("
args=Expression()
")" ";"
{ return new OutputStmt(line,column,token.endLine,token.endColumn,ln,args); }
}
Note that I changed STRING_LITERAL to the more traditional ID.
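If your grammar does not already define such a token, a typical identifier token (an illustrative sketch, not taken from the Java 1.0.2.jj file) looks something like this:
TOKEN :
{
  < ID : ["a"-"z","A"-"Z","_","$"] ( ["a"-"z","A"-"Z","0"-"9","_","$"] )* >
}
The semantic LOOKAHEADs above then inspect the image of that generic identifier token instead of relying on separate keyword tokens.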

how can I make this JAVACC grammar work with [ ]?

I'm trying to change a grammar in the JSqlParser project, which uses a JavaCC grammar file (.jj) specifying the standard SQL syntax. I had difficulty getting one section to work, so I narrowed it down to the following, much simplified grammar.
Basically I have a definition of Column as [ table "." ] field,
but table itself can also contain the "." character, which causes the confusion.
I think intuitively the following grammar should accept all the following sentences:
select mytable.myfield
select myfield
select mydb.mytable.myfield
but in practice it only accepts the 2nd and 3rd above. Whenever it sees the ".", it commits to the two-dot version of table (i.e. the first derivation rule for table).
How can I make this grammar work?
Thanks a lot
Yang
options{
IGNORE_CASE=true ;
STATIC=false;
DEBUG_PARSER=true;
DEBUG_LOOKAHEAD=true;
DEBUG_TOKEN_MANAGER=false;
// FORCE_LA_CHECK=true;
UNICODE_INPUT=true;
}
PARSER_BEGIN(TT)
import java.util.*;
public class TT {
}
PARSER_END(TT)
///////////////////////////////////////////// main stuff concerned
void Statement() :
{ }
{
<K_SELECT> Column()
}
void Column():
{
}
{
[LOOKAHEAD(3) Table() "." ]
//[
//LOOKAHEAD(2) (
// LOOKAHEAD(5) <S_IDENTIFIER> "." <S_IDENTIFIER>
// |
// LOOKAHEAD(3) <S_IDENTIFIER>
//)
//
//
//
//]
Field()
}
void Field():
{}{
<S_IDENTIFIER>
}
void Table():
{}{
LOOKAHEAD(5) <S_IDENTIFIER> "." <S_IDENTIFIER>
|
LOOKAHEAD(3) <S_IDENTIFIER>
}
////////////////////////////////////////////////////////
SKIP:
{
" "
| "\t"
| "\r"
| "\n"
}
TOKEN: /* SQL Keywords. prefixed with K_ to avoid name clashes */
{
<K_CREATE: "CREATE">
|
<K_SELECT: "SELECT">
}
TOKEN : /* Numeric Constants */
{
< S_DOUBLE: ((<S_LONG>)? "." <S_LONG> ( ["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> "." (["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> ["e","E"] (["+", "-"])? <S_LONG>
)>
| < S_LONG: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}
TOKEN:
{
< S_IDENTIFIER: ( <LETTER> | <ADDITIONAL_LETTERS> )+ ( <DIGIT> | <LETTER> | <ADDITIONAL_LETTERS> | <SPECIAL_CHARS>)* >
| < #LETTER: ["a"-"z", "A"-"Z", "_", "$"] >
| < #SPECIAL_CHARS: "$" | "_" | "#" | "#">
| < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")*>
| < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])+ "\"" | ("`" (~["\n","\r","`"])+ "`") | ( "[" ~["0"-"9","]"] (~["\n","\r","]"])* "]" ) >
/*
To deal with database names (columns, tables) using not only latin base characters, one
can expand the following rule to accept additional letters. Here is the addition of german umlauts.
There seems to be no way to recognize letters by an external function to allow
a configurable addition. One must rebuild JSqlParser with this new "Letterset".
*/
| < #ADDITIONAL_LETTERS: ["ä","ö","ü","Ä","Ö","Ü","ß"] >
}
You could rewrite your grammar like this
Statement --> "select" Column
Column --> Prefix <ID>
Prefix --> (<ID> ".")*
Now the only choice is whether to iterate or not. Assuming a "." can't follow a Column, this is easily done with a lookahead of 2:
Statement --> "select" Column
Column --> Prefix <ID>
Prefix --> (LOOKAHEAD( <ID> ".") <ID> ".")*
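In JavaCC notation that rewrite could look roughly like this (a sketch reusing the question's S_IDENTIFIER token; adjust the names to your grammar):
void Statement() : {}
{
  <K_SELECT> Column()
}
void Column() : {}
{
  Prefix() <S_IDENTIFIER>
}
void Prefix() : {}
{
  ( LOOKAHEAD(<S_IDENTIFIER> ".") <S_IDENTIFIER> "." )*
}
The syntactic LOOKAHEAD(<S_IDENTIFIER> ".") is the two-token lookahead mentioned above: the loop only continues when an identifier is immediately followed by a dot.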
Indeed the following grammar in flex+bison (an LR parser) works fine, recognizing all the following sentences correctly:
create mydb.mytable
create mytable
select mydb.mytable.myfield
select mytable.myfield
select myfield
so it is indeed due to a limitation of LL parsing
%%
statement:
create_sentence
|
select_sentence
;
create_sentence: CREATE table
;
select_sentence: SELECT table '.' ID
|
SELECT ID
;
table : table '.' ID
|
ID
;
%%
If you need Table to be its own nonterminal, you can do this by using a boolean parameter that says whether the table is expected to be followed by a dot.
void Statement():{}{
"select" Column() | "create" "table" Table(false) }
void Column():{}{
[LOOKAHEAD(<ID> ".") Table(true) "."] <ID> }
void Table(boolean expectDot):{}{
<ID> MoreTable(expectDot) }
void MoreTable(boolean expectDot) : {}
{
LOOKAHEAD("." <ID> ".", {expectDot}) "." <ID> MoreTable(expectDot)
|
LOOKAHEAD(".", {!expectDot}) "." <ID> MoreTable(expectDot)
|
{}
}
Doing it this way precludes using Table in any syntactic lookahead specifications either directly or indirectly. E.g. you shouldn't have LOOKAHEAD( Table()) anywhere in your grammar, because semantic lookahead is not used during syntactic lookahead. See the FAQ for more information on that.
Your examples are parsed perfectly well using JSqlParser V0.9.x (https://github.com/JSQLParser/JSqlParser)
CCJSqlParserUtil.parse("SELECT mycolumn");
CCJSqlParserUtil.parse("SELECT mytable.mycolumn");
CCJSqlParserUtil.parse("SELECT mydatabase.mytable.mycolumn");

How to implement JJTree on grammar

I have an assignment to use JavaCC to make a Top-Down Parser with Semantic Analysis for a language supplied by the lecturer. I have the production rules written out and no errors.
I'm completely stuck on how to use JJTree for my code, and my hours of scouring the internet for tutorials haven't gotten me anywhere.
Just wondering, could anyone take some time out to explain how to implement JJTree in the code?
Or if there's a hidden step-by-step tutorial out there somewhere that would be a great help!
Here are some of my production rules in case they help.
Thanks in advance!
void program() : {}
{
(decl())* (function())* main_prog()
}
void decl() #void : {}
{
(
var_decl() | const_decl()
)
}
void var_decl() #void : {}
{
<VAR> ident_list() <COLON> type()
(<COMMA> ident_list() <COLON> type())* <SEMIC>
}
void const_decl() #void : {}
{
<CONSTANT> identifier() <COLON> type() <EQUAL> expression()
( <COMMA> identifier() <COLON> type() <EQUAL > expression())* <SEMIC>
}
void function() #void : {}
{
type() identifier() <LBR> param_list() <RBR>
<CBL>
(decl())*
(statement() <SEMIC> )*
returnRule() (expression() | {} )<SEMIC>
<CBR>
}
Creating an AST using JavaCC looks a lot like creating a "normal" parser (defined in a jj file). If you already have a working grammar, it's (relatively) easy :)
Here are the steps needed to create an AST:
rename your jj grammar file to jjt
decorate it with root-labels (the italic words are my own terminology...)
invoke jjtree on your jjt grammar, which will generate a jj file for you
invoke javacc on your generated jj grammar
compile the generated java source files
test it
Here's a quick step-by-step tutorial, assuming you're using MacOS or *nix, have the javacc.jar file in the same directory as your grammar file(s) and java and javac are on your system's PATH:
1
Assuming your jj grammar file is called TestParser.jj, rename it:
mv TestParser.jj TestParser.jjt
2
Now the tricky part: decorating your grammar so that the proper AST structure is created. You decorate an AST (or node, or production rule (all the same)) by adding a # followed by an identifier after it (and before the :). In your original question, you have a lot of #void in different productions, which means none of those productions create a node of their own: this is not what you want.
If you don't decorate your production, the name of the production is used as the type of the node (so, you can remove the #void):
void decl() :
{}
{
var_decl()
| const_decl()
}
Now the rule simply returns whatever AST the rule var_decl() or const_decl() returned.
Let's now have a look at the (simplified) var_decl rule:
void var_decl() #VAR :
{}
{
<VAR> id() <COL> id() <EQ> expr() <SCOL>
}
void id() #ID :
{}
{
<ID>
}
void expr() #EXPR :
{}
{
<ID>
}
which I decorated with the #VAR type. This now means that this rule will return the following tree structure:
VAR
/ | \
/ | \
ID ID EXPR
As you can see, the terminals are discarded from the AST! This also means that the id and expr rules lose the text their <ID> terminal matched. Of course, this is not what you want. For the rules that need to keep the text their terminal matched, you need to explicitly set the .value of the tree node to the .image of the matched terminal:
void id() #ID :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void expr() #EXPR :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
causing the input "var x : int = i;" to look like this:
VAR
|
.---+------.
/ | \
/ | \
ID["x"] ID["int"] EXPR["i"]
This is how you create a proper structure for your AST. Below follows a small grammar that is a very simple version of your own grammar including a small main method to test it all:
// TestParser.jjt
PARSER_BEGIN(TestParser)
public class TestParser {
public static void main(String[] args) throws ParseException {
TestParser parser = new TestParser(new java.io.StringReader(args[0]));
SimpleNode root = parser.program();
root.dump("");
}
}
PARSER_END(TestParser)
TOKEN :
{
< OPAR : "(" >
| < CPAR : ")" >
| < OBR : "{" >
| < CBR : "}" >
| < COL : ":" >
| < SCOL : ";" >
| < COMMA : "," >
| < VAR : "var" >
| < EQ : "=" >
| < CONST : "const" >
| < ID : ("_" | <LETTER>) ("_" | <ALPHANUM>)* >
}
TOKEN :
{
< #DIGIT : ["0"-"9"] >
| < #LETTER : ["a"-"z","A"-"Z"] >
| < #ALPHANUM : <LETTER> | <DIGIT> >
}
SKIP : { " " | "\t" | "\r" | "\n" }
SimpleNode program() #PROGRAM :
{}
{
(decl())* (function())* <EOF> {return jjtThis;}
}
void decl() :
{}
{
var_decl()
| const_decl()
}
void var_decl() #VAR :
{}
{
<VAR> id() <COL> id() <EQ> expr() <SCOL>
}
void const_decl() #CONST :
{}
{
<CONST> id() <COL> id() <EQ> expr() <SCOL>
}
void function() #FUNCTION :
{}
{
type() id() <OPAR> params() <CPAR> <OBR> /* ... */ <CBR>
}
void type() #TYPE :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void id() #ID :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void params() #PARAMS :
{}
{
(param() (<COMMA> param())*)?
}
void param() #PARAM :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
void expr() #EXPR :
{Token t;}
{
t=<ID> {jjtThis.value = t.image;}
}
3
Let the jjtree class (included in javacc.jar) create a jj file for you:
java -cp javacc.jar jjtree TestParser.jjt
4
The previous step has created the file TestParser.jj (if everything went okay). Let javacc (also present in javacc.jar) process it:
java -cp javacc.jar javacc TestParser.jj
5
To compile all source files, do:
javac -cp .:javacc.jar *.java
(on Windows, do: javac -cp .;javacc.jar *.java)
6
The moment of truth has arrived: let's see if everything actually works! To let the parser process the input:
var n : int = I;
const x : bool = B;
double f(a,b,c)
{
}
execute the following:
java -cp . TestParser "var n : int = I; const x : bool = B; double f(a,b,c) { }"
and you should see the following being printed on your console:
PROGRAM
decl
VAR
ID
ID
EXPR
decl
CONST
ID
ID
EXPR
FUNCTION
TYPE
ID
PARAMS
PARAM
PARAM
PARAM
Note that you don't see the text the IDs matched, but believe me, it's there. The method dump() simply does not show it.
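If you do want to see those values, a small recursive print of your own could be used instead of dump(). This is just a sketch; jjtGetValue(), jjtGetNumChildren() and jjtGetChild() are methods on the generated SimpleNode class:
static void print(SimpleNode node, String indent) {
    Object value = node.jjtGetValue();   // whatever was assigned to jjtThis.value
    System.out.println(indent + node + (value == null ? "" : "[\"" + value + "\"]"));
    for (int i = 0; i < node.jjtGetNumChildren(); i++) {
        print((SimpleNode) node.jjtGetChild(i), indent + "  ");
    }
}
Calling print(root, "") from main instead of root.dump("") would then show ID["x"], ID["int"] and so on.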
HTH
EDIT
For a working grammar including expressions, you could have a look at the following expression evaluator of mine: https://github.com/bkiers/Curta (the grammar is in src/grammar). You might want to have a look at how to create root-nodes in case of binary expressions.
Here is an example that uses JJTree
http://anandsekar.github.io/writing-an-interpretter-using-javacc/

Parse an InputStream for multiple patterns

I am parsing an InputStream for certain patterns to extract values from it, e.g. I would have something like
<span class="filename"><a href="http://example.com/foo">foo
I don't want to use a full fledged html parser as I am not interested in the document structure but only in some well defined bits of information. (Only their order is important)
Currently I am using a very simple approach: I have an Object for each Pattern that contains a char[] of the opening and closing 'tag' (in the example, opening would be <span class="filename"><a href="and closing " to get the url) and a position marker. For each character read from the InputStream, I iterate over all Patterns and call the match(char) function, which returns true once the opening pattern matches; from then on I collect the following chars in a StringBuilder until the now active pattern matches again. I then call a function with the ID of the Pattern and the String read, to process it further.
While this works fine in most cases, I wanted to include regular expressions in the pattern, so I could also match something like
<span class="filename" id="234217"><a href="http://example.com/foo">foo
At this point I was sure I would reinvent the wheel as this most certainly would have been done before, and I don't really want to write my own regex parser to begin with. However, I could not find anything that would do what I was looking for.
Unfortunately the Scanner class only matches one pattern, not a list of patterns. What alternatives could I use? It should not be heavyweight and it has to work on Android.
You mean you want to match any <span> element with a given class attribute, irrespective of other attributes it may have? That's easy enough:
Scanner sc = new Scanner(new File("test.txt"), "UTF-8");
Pattern p = Pattern.compile(
"<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
);
while (sc.findWithinHorizon(p, 0) != null)
{
MatchResult m = sc.match();
System.out.println(m.group(1));
}
The file "test.txt" contains the text of your question, and the output is:
http://example.com/foo
and closing
http://example.com/foo
The Scanner.useDelimiter(Pattern) API seems to be what you're looking for. You would have to use an OR (|) separated pattern string.
This pattern can get really complicated really quickly though.
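For example (an illustrative sketch with made-up sub-patterns), the individual patterns can be joined with | into one compiled Pattern before handing it to the scanner:
// join several patterns into a single alternation
String[] parts = {
    "<span[^>]*class=\"filename\"[^>]*>",
    "<a[^>]*href=\"[^\"]+\""
};
java.util.regex.Pattern combined = java.util.regex.Pattern.compile(String.join("|", parts));
Scanner sc = new Scanner(System.in).useDelimiter(combined);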
You are right to think this has all been done before :) What you are talking about is a problem of tokenizing and parsing and I therefore suggest you consider JavaCC.
There is something of a learning curve with JavaCC as you learn to understand its grammar, so below is an implementation to get you started.
The grammar is a chopped down version of the standard JavaCC grammar for HTML. You can add more productions for matching other patterns.
options {
JDK_VERSION = "1.5";
static = false;
}
PARSER_BEGIN(eg1)
import java.util.*;
public class eg1 {
private String currentTag;
private String currentSpanClass;
private String currentHref;
public static void main(String args []) throws ParseException {
System.out.println("Starting parse");
eg1 parser = new eg1(System.in);
parser.parse();
System.out.println("Finishing parse");
}
}
PARSER_END(eg1)
SKIP :
{
< ( " " | "\t" | "\n" | "\r" )+ >
| < "<!" ( ~[">"] )* ">" >
}
TOKEN :
{
<STAGO: "<" > : TAG
| <ETAGO: "</" > : TAG
| <PCDATA: ( ~["<"] )+ >
}
<TAG> TOKEN [IGNORE_CASE] :
{
<A: "a" > : ATTLIST
| <SPAN: "span" > : ATTLIST
| <DONT_CARE: (["a"-"z"] | ["0"-"9"])+ > : ATTLIST
}
<ATTLIST> SKIP :
{
< " " | "\t" | "\n" | "\r" >
| < "--" > : ATTCOMM
}
<ATTLIST> TOKEN :
{
<TAGC: ">" > : DEFAULT
| <A_EQ: "=" > : ATTRVAL
| <#ALPHA: ["a"-"z","A"-"Z","_","-","."] >
| <#NUM: ["0"-"9"] >
| <#ALPHANUM: <ALPHA> | <NUM> >
| <A_NAME: <ALPHA> ( <ALPHANUM> )* >
}
<ATTRVAL> TOKEN :
{
<CDATA: "'" ( ~["'"] )* "'"
| "\"" ( ~["\""] )* "\""
| ( ~[">", "\"", "'", " ", "\t", "\n", "\r"] )+
> : ATTLIST
}
<ATTCOMM> SKIP :
{
< ( ~["-"] )+ >
| < "-" ( ~["-"] )+ >
| < "--" > : ATTLIST
}
void attribute(Map<String,String> attrs) :
{
Token n, v = null;
}
{
n=<A_NAME> [ <A_EQ> v=<CDATA> ]
{
String attval;
if (v == null) {
attval = "#DEFAULT";
} else {
attval = v.image;
if( attval.startsWith("\"") && attval.endsWith("\"") ) {
attval = attval.substring(1,attval.length()-1);
} else if( attval.startsWith("'") && attval.endsWith("'") ) {
attval = attval.substring(1,attval.length()-1);
}
}
if( attrs!=null ) attrs.put(n.image.toLowerCase(),attval);
}
}
void attList(Map<String,String> attrs) : {}
{
( attribute(attrs) )+
}
void tagAStart() : {
Map<String,String> attrs = new HashMap<String,String>();
}
{
<STAGO> <A> [ attList(attrs) ] <TAGC>
{
currentHref=attrs.get("href");
if( currentHref != null && "filename".equals(currentSpanClass) )
{
System.out.println("Found URL: "+currentHref);
}
}
}
void tagAEnd() : {}
{
<ETAGO> <A> <TAGC>
{
currentHref=null;
}
}
void tagSpanStart() : {
Map<String,String> attrs = new HashMap<String,String>();
}
{
<STAGO> <SPAN> [ attList(attrs) ] <TAGC>
{
currentSpanClass=attrs.get("class");
}
}
void tagSpanEnd() : {}
{
<ETAGO> <SPAN> <TAGC>
{
currentSpanClass=null;
}
}
void tagDontCareStart() : {}
{
<STAGO> <DONT_CARE> [ attList(null) ] <TAGC>
}
void tagDontCareEnd() : {}
{
<ETAGO> <DONT_CARE> <TAGC>
}
void parse() : {}
{
(
LOOKAHEAD(2) tagAStart() |
LOOKAHEAD(2) tagAEnd() |
LOOKAHEAD(2) tagSpanStart() |
LOOKAHEAD(2) tagSpanEnd() |
LOOKAHEAD(2) tagDontCareStart() |
LOOKAHEAD(2) tagDontCareEnd() |
<PCDATA>
)*
}
