There are two styles of comments, C-style and C++-style. How can I recognize them?
/* comments */
// comments
I am free to use any method or third-party library.
To reliably find all comments in a Java source file, I wouldn't use regex, but a real lexer (aka tokenizer).
Two popular choices for Java are:
JFlex: http://jflex.de
ANTLR: http://www.antlr.org
Contrary to popular belief, ANTLR can also be used to create only a lexer without the parser.
Here's a quick ANTLR demo. You need the following files in the same directory:
antlr-3.2.jar
JavaCommentLexer.g (the grammar)
Main.java
Test.java (a valid (!) java source file with exotic comments)
JavaCommentLexer.g
lexer grammar JavaCommentLexer;
options {
filter=true;
}
SingleLineComment
: FSlash FSlash ~('\r' | '\n')*
;
MultiLineComment
: FSlash Star .* Star FSlash
;
StringLiteral
: DQuote
( (EscapedDQuote)=> EscapedDQuote
| (EscapedBSlash)=> EscapedBSlash
| Octal
| Unicode
| ~('\\' | '"' | '\r' | '\n')
)*
DQuote {skip();}
;
CharLiteral
: SQuote
( (EscapedSQuote)=> EscapedSQuote
| (EscapedBSlash)=> EscapedBSlash
| Octal
| Unicode
| ~('\\' | '\'' | '\r' | '\n')
)
SQuote {skip();}
;
fragment EscapedDQuote
: BSlash DQuote
;
fragment EscapedSQuote
: BSlash SQuote
;
fragment EscapedBSlash
: BSlash BSlash
;
fragment FSlash
: '/' | '\\' ('u002f' | 'u002F')
;
fragment Star
: '*' | '\\' ('u002a' | 'u002A')
;
fragment BSlash
: '\\' ('u005c' | 'u005C')?
;
fragment DQuote
: '"'
| '\\u0022'
;
fragment SQuote
: '\''
| '\\u0027'
;
fragment Unicode
: '\\u' Hex Hex Hex Hex
;
fragment Octal
: '\\' ('0'..'3' Oct Oct | Oct Oct | Oct)
;
fragment Hex
: '0'..'9' | 'a'..'f' | 'A'..'F'
;
fragment Oct
: '0'..'7'
;
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
JavaCommentLexer lexer = new JavaCommentLexer(new ANTLRFileStream("Test.java"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
for(Object o : tokens.getTokens()) {
CommonToken t = (CommonToken)o;
if(t.getType() == JavaCommentLexer.SingleLineComment) {
System.out.println("SingleLineComment :: " + t.getText().replace("\n", "\\n"));
}
if(t.getType() == JavaCommentLexer.MultiLineComment) {
System.out.println("MultiLineComment :: " + t.getText().replace("\n", "\\n"));
}
}
}
}
Test.java
\u002f\u002a <- multi line comment start
multi
line
comment // not a single line comment
\u002A/
public class Test {
// single line "not a string"
String s = "\u005C" \242 not // a comment \\\" \u002f \u005C\u005C \u0022;
/*
regular multi line comment
*/
char c = \u0027"'; // the " is not the start of a string
char q1 = '\u005c''; // == '\''
char q2 = '\u005c\u0027'; // == '\''
char q3 = \u0027\u005c\u0027\u0027; // == '\''
char c4 = '\047';
String t = "/*";
\u002f\u002f another single line comment
String u = "*/";
}
Now, to run the demo, do:
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp antlr-3.2.jar org.antlr.Tool JavaCommentLexer.g
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ javac -cp antlr-3.2.jar *.java
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp .:antlr-3.2.jar Main
and you'll see the following being printed to the console:
MultiLineComment :: \u002f\u002a <- multi line comment start\nmulti\nline\ncomment // not a single line comment\n\u002A/
SingleLineComment :: // single line "not a string"
SingleLineComment :: // a comment \\\" \u002f \u005C\u005C \u0022;
MultiLineComment :: /*\n regular multi line comment\n */
SingleLineComment :: // the " is not the start of a string
SingleLineComment :: // == '\''
SingleLineComment :: // == '\''
SingleLineComment :: // == '\''
SingleLineComment :: \u002f\u002f another single line comment
EDIT
You can create a sort of lexer with regex yourself, of course. The following demo does not handle Unicode literals inside source files, however:
Test2.java
/* <- multi line comment start
multi
line
comment // not a single line comment
*/
public class Test2 {
// single line "not a string"
String s = "\" \242 not // a comment \\\" ";
/*
regular multi line comment
*/
char c = '"'; // the " is not the start of a string
char q1 = '\''; // == '\''
char c4 = '\047';
String t = "/*";
// another single line comment
String u = "*/";
}
Main2.java
import java.util.*;
import java.io.*;
import java.util.regex.*;
public class Main2 {
private static String read(File file) throws IOException {
StringBuilder b = new StringBuilder();
Scanner scan = new Scanner(file);
while(scan.hasNextLine()) {
String line = scan.nextLine();
b.append(line).append('\n');
}
return b.toString();
}
public static void main(String[] args) throws Exception {
String contents = read(new File("Test2.java"));
String slComment = "//[^\r\n]*";
String mlComment = "/\\*[\\s\\S]*?\\*/";
String strLit = "\"(?:\\\\.|[^\\\\\"\r\n])*\"";
String chLit = "'(?:\\\\.|[^\\\\'\r\n])+'";
String any = "[\\s\\S]";
Pattern p = Pattern.compile(
String.format("(%s)|(%s)|%s|%s|%s", slComment, mlComment, strLit, chLit, any)
);
Matcher m = p.matcher(contents);
while(m.find()) {
String hit = m.group();
if(m.group(1) != null) {
System.out.println("SingleLine :: " + hit.replace("\n", "\\n"));
}
if(m.group(2) != null) {
System.out.println("MultiLine :: " + hit.replace("\n", "\\n"));
}
}
}
}
If you run Main2, the following is printed to the console:
MultiLine :: /* <- multi line comment start\nmulti\nline\ncomment // not a single line comment\n*/
SingleLine :: // single line "not a string"
MultiLine :: /*\n regular multi line comment\n */
SingleLine :: // the " is not the start of a string
SingleLine :: // == '\''
SingleLine :: // another single line comment
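The Unicode limitation mentioned above could, for instance, be worked around by translating \uXXXX escapes before running the regex. Below is a minimal sketch, assuming the escape rule of the Java Language Specification (a backslash only starts an escape when it is preceded by an even number of backslashes). The class name UnicodeEscapes and the method decode are made up for this illustration and are not part of the demo above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original demo.
final class UnicodeEscapes {
    // A run of backslashes, one or more 'u', then exactly four hex digits.
    private static final Pattern ESCAPE =
        Pattern.compile("(\\\\+)u+([0-9a-fA-F]{4})");

    static String decode(String src) {
        Matcher m = ESCAPE.matcher(src);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(src, last, m.start());
            String slashes = m.group(1);
            if (slashes.length() % 2 == 1) {
                // Odd run: the last backslash starts a real escape.
                out.append(slashes, 0, slashes.length() - 1);
                out.append((char) Integer.parseInt(m.group(2), 16));
            } else {
                // Even run: these are escaped backslashes, keep the text as-is.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(src, last, src.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("\\u002f\\u002f another single line comment"));
    }
}
```

Keep in mind that decoding shifts character offsets, so any match positions reported afterwards refer to the decoded text, not to the original file.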
EDIT: I've been searching for a while, but here is the real working regex:
String regex = "((//[^\n\r]*)|(/\\*(.+?)\\*/))"; // New Regex
List<String> comments = new ArrayList<String>();
Pattern p = Pattern.compile(regex, Pattern.DOTALL);
Matcher m = p.matcher(code);
// code is the C-style source in which you want to search
while (m.find())
{
System.out.println(m.group(1));
comments.add(m.group(1));
}
With this input:
import Blah;
//Comment one//
line();
/* Blah */
line2(); // something weird
/* Multiline
another line for the comment
*/
It generates this output:
//Comment one//
/* Blah */
// something weird
/* Multiline
another line for the comment
*/
Notice that the last three lines of the output are one single print.
Have you tried regular expressions? Here is a nice wrap-up with a Java example. It might need some tweaking. However, regular expressions alone won't be sufficient for more complicated structures (nested comments, "comments" in strings), but they are a nice start.
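To make the "comments in strings" caveat concrete, here is a small sketch that runs a naive comment regex (essentially the one from the previous answer) against a line whose string literal contains //. The class name NaiveCommentRegexDemo and the example.com URL are placeholders chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing where a comment-only regex goes wrong.
final class NaiveCommentRegexDemo {
    // Matches "// ..." up to the end of the line, or "/* ... */".
    private static final Pattern COMMENT =
        Pattern.compile("(//[^\n\r]*)|(/\\*.*?\\*/)", Pattern.DOTALL);

    static List<String> findComments(String code) {
        List<String> hits = new ArrayList<>();
        Matcher m = COMMENT.matcher(code);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The "//" inside the string literal is reported as a comment.
        System.out.println(findComments("String url = \"http://example.com\";"));
    }
}
```

The regex has no notion of being inside a literal, which is why the other answers also match string and char literals and simply discard them.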
I am currently creating a compiler with ANTLR 4 which should allow Java code to be parsed.
How do I allow:
public void =(Integer value) => java { this.value = value; }
so that the code between java { } is not parsed by ANTLR, but still has a visitor in my parser?
Currently I have
javaStatementBody: KWJAVA LCURLY .*? RCURLY
but this obviously does not work, and .*? matches the whole file.
Please do not answer with "use quotes"; that's not going to be my solution, because I want to allow Java code highlighting.
You could create separate lexer and parser grammars so that you can use lexical modes. Whenever the lexer "sees" the input java {, it moves to the JAVA_MODE. When in the Java mode, you tokenise comments and string and char literals. Also, when you encounter a { in this mode, you push the same JAVA_MODE so that the lexer knows it's nested one level deeper. And when you encounter a }, you pop a mode from the stack (resulting in either going back to the default mode, or staying in the Java mode but one level less deep).
A quick demo:
IslandLexer.g4
lexer grammar IslandLexer;
JAVA_START
: 'java' SPACES '{' -> pushMode(JAVA_MODE)
;
OTHER
: .
;
fragment SPACES : [ \t\r\n]+;
mode JAVA_MODE;
JAVA_CHAR : '\'' ( ~[\\'\r\n] | '\\' [tbnrf'\\] ) '\'';
JAVA_STRING : '"' ( ~[\\"\r\n] | '\\' [tbnrf"\\] )* '"';
JAVA_LINE_COMMENT : '//' ~[\r\n]*;
JAVA_BLOCK_COMMENT : '/*' .*? '*/';
JAVA_OPEN_BRACE : '{' -> pushMode(JAVA_MODE);
JAVA_CLOSE_BRACE : '}' -> popMode;
JAVA_OTHER : ~[{}];
IslandParser.g4
parser grammar IslandParser;
options { tokenVocab=IslandLexer; }
parse
: unit* EOF
;
unit
: base_language
| java_language
;
base_language
: OTHER+
;
java_language
: JAVA_START java_atom+
;
java_atom
: JAVA_CHAR
| JAVA_STRING
| JAVA_LINE_COMMENT
| JAVA_BLOCK_COMMENT
| JAVA_OPEN_BRACE
| JAVA_CLOSE_BRACE
| JAVA_OTHER
;
Test it with the following code:
String source = "foo \n" +
"\n" +
"java { \n" +
" char foo() { \n" +
" /* a quote in a comment \\\" */ \n" +
" String s = \"java {...}\"; \n" +
" return '}'; \n" +
" }\n" +
"}\n" +
"\n" +
"bar";
IslandLexer lexer = new IslandLexer(CharStreams.fromString(source));
IslandParser parser = new IslandParser(new CommonTokenStream(lexer));
System.out.println(parser.parse().toStringTree(parser));
which prints the resulting parse tree.
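For comparison, the push/pop bookkeeping that the lexical modes perform can be imitated by hand. The sketch below is not ANTLR code and not a replacement for the grammar above. It is a plain Java illustration (JavaIslandScanner, findBlockEnd and skipQuoted are hypothetical names) of how one might find the } that closes a java { block while ignoring braces inside string literals, char literals and comments. It assumes well-formed input and does not handle Unicode escapes.

```java
// Hypothetical hand-written counterpart to the JAVA_MODE push/pop logic.
final class JavaIslandScanner {

    // Returns the index just past the '}' that closes the block opened at
    // openBrace, or -1 if the block is never closed.
    static int findBlockEnd(String src, int openBrace) {
        int depth = 0;
        int i = openBrace;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '"' || c == '\'') {
                i = skipQuoted(src, i, c);            // string or char literal
            } else if (src.startsWith("//", i)) {
                int nl = src.indexOf('\n', i);        // line comment
                i = nl < 0 ? src.length() : nl;
            } else if (src.startsWith("/*", i)) {
                int end = src.indexOf("*/", i + 2);   // block comment
                if (end < 0) return -1;
                i = end + 2;
            } else {
                if (c == '{') {
                    depth++;                          // like pushMode
                } else if (c == '}' && --depth == 0) {
                    return i + 1;                     // like the final popMode
                }
                i++;
            }
        }
        return -1;
    }

    // Skips a quoted literal starting at 'start', honouring backslash escapes.
    private static int skipQuoted(String src, int start, char quote) {
        int i = start + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\') i += 2;
            else if (c == quote) return i + 1;
            else i++;
        }
        return src.length();
    }

    public static void main(String[] args) {
        String src = "java { char f() { /* } */ return '}'; } } bar";
        int end = findBlockEnd(src, src.indexOf('{'));
        System.out.println(src.substring(0, end));
    }
}
```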
I'm implementing a Python interpreter using ANTLR4 as the lexer and parser generator. I used the grammar defined at this link:
https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4.
However, the implementation of indentation with the INDENT and DEDENT tokens within the lexer::members does not work when I define a compound statement.
For example, if I define the following statement:
x=10
while x>2 :
  print("hello")
    x=x-3
On the line where I reassign the value of the x variable I should get an indentation error, which I don't get in my current state.
Should I edit something in the lexer code, or what?
This is the grammar that I'm using, with the lexer::members and the NEWLINE rules defined in the above link.
grammar python;
tokens { INDENT, DEDENT }
@lexer::members {
// A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>();
// The stack that keeps track of the indentation level.
private java.util.Stack<Integer> indents = new java.util.Stack<>();
// The amount of opened braces, brackets and parenthesis.
private int opened = 0;
// The most recently produced token.
private Token lastToken = null;
@Override
public void emit(Token t) {
super.setToken(t);
tokens.offer(t);
}
@Override
public Token nextToken() {
// Check if the end-of-file is ahead and there are still some DEDENTS expected.
if (_input.LA(1) == EOF && !this.indents.isEmpty()) {
// Remove any trailing EOF tokens from our buffer.
for (int i = tokens.size() - 1; i >= 0; i--) {
if (tokens.get(i).getType() == EOF) {
tokens.remove(i);
}
}
// First emit an extra line break that serves as the end of the statement.
this.emit(commonToken(pythonParser.NEWLINE, "\n"));
// Now emit as much DEDENT tokens as needed.
while (!indents.isEmpty()) {
this.emit(createDedent());
indents.pop();
}
// Put the EOF back on the token stream.
this.emit(commonToken(pythonParser.EOF, "<EOF>"));
//throw new Exception("unexpected indentation on line "+this.getLine());
}
Token next = super.nextToken();
if (next.getChannel() == Token.DEFAULT_CHANNEL) {
// Keep track of the last token on the default channel.
this.lastToken = next;
}
return tokens.isEmpty() ? next : tokens.poll();
}
private Token createDedent() {
CommonToken dedent = commonToken(pythonParser.DEDENT, "");
dedent.setLine(this.lastToken.getLine());
return dedent;
}
private CommonToken commonToken(int type, String text) {
int stop = this.getCharIndex() - 1;
int start = text.isEmpty() ? stop : stop - text.length() + 1;
return new CommonToken(this._tokenFactorySourcePair, type, DEFAULT_TOKEN_CHANNEL, start, stop);
}
// Calculates the indentation of the provided spaces, taking the
// following rules into account:
//
// "Tabs are replaced (from left to right) by one to eight spaces
// such that the total number of characters up to and including
// the replacement is a multiple of eight [...]"
//
// -- https://docs.python.org/3.1/reference/lexical_analysis.html#indentation
static int getIndentationCount(String spaces) {
int count = 0;
for (char ch : spaces.toCharArray()) {
switch (ch) {
case '\t':
count += 8 - (count % 8);
break;
default:
// A normal space char.
count++;
}
}
return count;
}
boolean atStartOfInput() {
return super.getCharPositionInLine() == 0 && super.getLine() == 1;
}
}
parse
:( NEWLINE parse
| block ) EOF
;
block
: (statement NEWLINE?| functionDecl)*
;
statement
: assignment
| functionCall
| ifStatement
| forStatement
| whileStatement
| arithmetic_expression
;
assignment
: IDENTIFIER indexes? '=' expression
;
functionCall
: IDENTIFIER OPAREN exprList? CPAREN #identifierFunctionCall
| PRINT OPAREN? exprList? CPAREN? #printFunctionCall
;
arithmetic_expression
: expression
;
ifStatement
: ifStat elifStat* elseStat?
;
ifStat
: IF expression COLON NEWLINE INDENT block DEDENT
;
elifStat
: ELIF expression COLON NEWLINE INDENT block DEDENT
;
elseStat
: ELSE COLON NEWLINE INDENT block DEDENT
;
functionDecl
: DEF IDENTIFIER OPAREN idList? CPAREN COLON NEWLINE INDENT block DEDENT
;
forStatement
: FOR IDENTIFIER IN expression COLON NEWLINE INDENT block DEDENT elseStat?
;
whileStatement
: WHILE expression COLON NEWLINE INDENT block DEDENT elseStat?
;
idList
: IDENTIFIER (',' IDENTIFIER)*
;
exprList
: expression (COMMA expression)*
;
expression
: '-' expression #unaryMinusExpression
| '!' expression #notExpression
| expression '**' expression #powerExpression
| expression '*' expression #multiplyExpression
| expression '/' expression #divideExpression
| expression '%' expression #modulusExpression
| expression '+' expression #addExpression
| expression '-' expression #subtractExpression
| expression '>=' expression #gtEqExpression
| expression '<=' expression #ltEqExpression
| expression '>' expression #gtExpression
| expression '<' expression #ltExpression
| expression '==' expression #eqExpression
| expression '!=' expression #notEqExpression
| expression '&&' expression #andExpression
| expression '||' expression #orExpression
| expression '?' expression ':' expression #ternaryExpression
| expression IN expression #inExpression
| NUMBER #numberExpression
| BOOL #boolExpression
| NULL #nullExpression
| functionCall indexes? #functionCallExpression
| list indexes? #listExpression
| IDENTIFIER indexes? #identifierExpression
| STRING indexes? #stringExpression
| '(' expression ')' indexes? #expressionExpression
| INPUT '(' STRING? ')' #inputExpression
;
list
: '[' exprList? ']'
;
indexes
: ('[' expression ']')+
;
PRINT : 'print';
INPUT : 'input';
DEF : 'def';
IF : 'if';
ELSE : 'else';
ELIF : 'elif';
RETURN : 'return';
FOR : 'for';
WHILE : 'while';
IN : 'in';
NULL : 'null';
OR : '||';
AND : '&&';
EQUALS : '==';
NEQUALS : '!=';
GTEQUALS : '>=';
LTEQUALS : '<=';
POW : '**';
EXCL : '!';
GT : '>';
LT : '<';
ADD : '+';
SUBTRACT : '-';
MULTIPLY : '*';
DIVIDE : '/';
MODULE : '%';
OBRACE : '{' {opened++;};
CBRACE : '}' {opened--;};
OBRACKET : '[' {opened++;};
CBRACKET : ']' {opened--;};
OPAREN : '(' {opened++;};
CPAREN : ')' {opened--;};
SCOLON : ';';
ASSIGN : '=';
COMMA : ',';
QMARK : '?';
COLON : ':';
BOOL
: 'true'
| 'false'
;
NUMBER
: INT ('.' DIGIT*)?
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]*
;
STRING
: ["] (~["\r\n] | '\\\\' | '\\"')* ["]
| ['] (~['\r\n] | '\\\\' | '\\\'')* [']
;
SKIPS
: ( SPACES | COMMENT | LINE_JOINING ){firstLine();} -> skip
;
NEWLINE
: ( {atStartOfInput()}? SPACES
| ( '\r'? '\n' | '\r' | '\f' ) SPACES?
)
{
String newLine = getText().replaceAll("[^\r\n\f]+", "");
String spaces = getText().replaceAll("[\r\n\f]+", "");
int next = _input.LA(1);
if (opened > 0 || next == '\r' || next == '\n' || next == '\f' || next == '#') {
// If we're inside a list or on a blank line, ignore all indents,
// dedents and line breaks.
skip();
}
else {
emit(commonToken(NEWLINE, newLine));
int indent = getIndentationCount(spaces);
int previous = indents.isEmpty() ? 0 : indents.peek();
if (indent == previous) {
// skip indents of the same size as the present indent-size
skip();
}
else if (indent > previous) {
indents.push(indent);
emit(commonToken(pythonParser.INDENT, spaces));
}
else {
// Possibly emit more than 1 DEDENT token.
while(!indents.isEmpty() && indents.peek() > indent) {
this.emit(createDedent());
indents.pop();
}
}
}
}
;
fragment INT
: [1-9] DIGIT*
| '0'
;
fragment DIGIT
: [0-9]
;
fragment SPACES
: [ \t]+
;
fragment COMMENT
: '#' ~[\r\n\f]*
;
fragment LINE_JOINING
: '\\' SPACES? ( '\r'? '\n' | '\r' | '\f' )
;
No, this should not be handled in the grammar. The lexer should simply emit the (faulty) INDENT token. The parser should, at runtime, produce an error. Something like this:
String source = "x=10\n" +
"while x>2 :\n" +
"    print(\"hello\")\n" +
"        x=x-3\n";
Python3Lexer lexer = new Python3Lexer(CharStreams.fromString(source));
Python3Parser parser = new Python3Parser(new CommonTokenStream(lexer));
// Remove default error-handling
parser.removeErrorListeners();
// Add custom error-handling
parser.addErrorListener(new BaseErrorListener() {
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object o, int i, int i1, String s, RecognitionException e) {
CommonToken token = (CommonToken) o;
if (token.getType() == Python3Parser.INDENT) {
// The parser encountered an unexpected INDENT token
// TODO throw your exception
}
// TODO handle other errors
}
});
// Trigger the error
parser.file_input();
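To see why the lexer hands the parser an unexpected INDENT for this input, the indentation-stack bookkeeping of the NEWLINE rule can be imitated in plain Java. This is only a simplified sketch, not the generated lexer: the class IndentSketch and the LINE(...) pseudo-tokens are invented for this illustration, and line joining, brackets and comments are ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified model of the INDENT/DEDENT logic.
final class IndentSketch {

    // Width of the leading whitespace; a tab advances to the next multiple of 8.
    static int width(String line) {
        int count = 0;
        for (char ch : line.toCharArray()) {
            if (ch == '\t') count += 8 - (count % 8);
            else if (ch == ' ') count++;
            else break;
        }
        return count;
    }

    static List<String> tokens(String source) {
        Deque<Integer> indents = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        for (String line : source.split("\n")) {
            if (line.trim().isEmpty()) continue;          // blank lines are ignored
            int indent = width(line);
            int previous = indents.isEmpty() ? 0 : indents.peek();
            if (indent > previous) {
                indents.push(indent);
                out.add("INDENT");
            } else {
                // Possibly more than one DEDENT.
                while (!indents.isEmpty() && indents.peek() > indent) {
                    indents.pop();
                    out.add("DEDENT");
                }
            }
            out.add("LINE(" + line.trim() + ")");
        }
        while (!indents.isEmpty()) {                      // close open blocks at EOF
            indents.pop();
            out.add("DEDENT");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("x=10\nwhile x>2 :\n  print(\"hello\")\n    x=x-3\n"));
    }
}
```

The second INDENT in the output is the token the parser has no rule for, which is where the error listener gets called.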
I'm trying to write a program in ANTLR (Java) for simplifying regular expressions. I have already written some code (grammar file contents below).
grammar Regexp_v7;
options{
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens{
DOT;
REPEAT;
RANGE;
NULL;
}
fragment
ZERO
: '0'
;
fragment
DIGIT
: '1'..'9'
;
fragment
EPSILON
: '#'
;
fragment
FI
: '%'
;
ID
: EPSILON
| FI
| 'a'..'z'
| 'A'..'Z'
;
NUMBER
: ZERO
| DIGIT (ZERO | DIGIT)*
;
WHITESPACE
: ('\r' | '\n' | ' ' | '\t' ) + {$channel = HIDDEN;}
;
list
: (reg_exp ';'!)*
;
term
: ID -> ID
| '('! reg_exp ')'!
;
repeat_exp
: term ('{' range_exp '}')+ -> ^(REPEAT term (range_exp)+)
| term -> term
;
range_exp
: NUMBER ',' NUMBER -> ^(RANGE NUMBER NUMBER)
| NUMBER (',') -> ^(RANGE NUMBER NULL)
| ',' NUMBER -> ^(RANGE NULL NUMBER)
| NUMBER -> ^(RANGE NUMBER NUMBER)
;
kleene_exp
: repeat_exp ('*'^)*
;
concat_exp
: kleene_exp (kleene_exp)+ -> ^(DOT kleene_exp (kleene_exp)+)
| kleene_exp -> kleene_exp
;
reg_exp
: concat_exp ('|'^ concat_exp)*
;
My next goal is to write a tree grammar that is able to simplify regular expressions (e.g. a|a -> a, etc.). I have done some coding (see below), but I have trouble defining a rule that treats nodes as subtrees (in order to simplify expressions such as (a|a)|(a|a) to a, etc.).
tree grammar Regexp_v7Walker;
options{
language = Java;
tokenVocab = Regexp_v7;
ASTLabelType = CommonTree;
output=AST;
backtrack = true;
}
tokens{
NULL;
}
bottomup
: ^('*' ^('*' e=.)) -> ^('*' $e) //a** -> a*
| ^('|' i=.* j=.* {$i.tree.toStringTree() == $j.tree.toStringTree()} )
-> $i // There are 3 errors while this line is up and running:
// 1. CommonTree cannot be resolved,
// 2. i.tree cannot be resolved or is not a field,
// 3. i cannot be resolved.
;
Small driver class:
public class Regexp_Test_v7 {
public static void main(String[] args) throws RecognitionException {
CharStream stream = new ANTLRStringStream("a***;a|a;(ab)****;ab|ab;ab|aa;");
Regexp_v7Lexer lexer = new Regexp_v7Lexer(stream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
Regexp_v7Parser parser = new Regexp_v7Parser(tokenStream);
list_return list = parser.list();
CommonTree t = (CommonTree) list.getTree();
System.out.println("Original tree: " + t.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
Regexp_v7Walker s = new Regexp_v7Walker(nodes);
t = (CommonTree)s.downup(t);
System.out.println("Simplified tree: " + t.toStringTree());
}
}
Can anyone help me with solving this case?
Thanks in advance and regards.
Now, I'm no expert, but in your tree grammar:
add filter=true
change the second line of bottomup rule to:
^('|' i=. j=. {i.toStringTree().equals(j.toStringTree()) }? ) -> $i
If I'm not mistaken, by using i=.* you're allowing i to be non-existent, and you'll get a NullPointerException on conversion to a String.
Both i and j are of type CommonTree because you've set it up this way: ASTLabelType = CommonTree, so you should call i.toStringTree().
And since it's Java and you're comparing Strings, use equals().
Also, to make the expression in curly brackets a predicate, you need a question mark after the closing bracket.
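The == versus equals() point can be checked in isolation with plain strings. The sketch below uses a made-up class name (StringCompareDemo) and a sample tree string; it does not touch the ANTLR runtime.

```java
// Hypothetical demo: reference comparison versus content comparison.
final class StringCompareDemo {

    // True only if both variables point at the very same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // True if both strings contain the same characters.
    static boolean sameText(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Two distinct String objects with identical content, much like
        // two separate toStringTree() results for equal subtrees.
        String left = new String("(| a a)");
        String right = new String("(| a a)");
        System.out.println("==     : " + sameReference(left, right));
        System.out.println("equals : " + sameText(left, right));
    }
}
```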
Given a String like..
(a+(a+b)), (d*e) :- (e-f)
Note: (d*e) and (e-f) are different expressions. How can I fetch the expressions from this string? I have the grammar defined as..
parse returns [String value]
: addExp {$value=$addExp.value;} EOF
;
addExp returns [String value]
: multExp {$value=$multExp.value;} (('+' | '-' | '*') multExp{$value+= '+' + $multExp.value;})*
;
multExp returns [String value]
: atom {$value=$atom.value;} (('*' | '/') atom {$value+=$atom.value;})*
;
atom returns [String value]
: x=ID {$value=$x.text;}
| '(' addExp ')' {$value='('+$addExp.value+')';}
;
ID : 'a'..'z' | 'A'..'Z';
I tried..
ANTLRStringStream a=new ANTLRStringStream("(a+(a+b)), (d*e) :- (e-f)");
SLexer l=new SLexer(a);
CommonTokenStream c=new CommonTokenStream(l);
SParser p=new SParser(c);
String exp;
while(exp = p.parse())
{
System.out.println(exp);
}
I'm thinking of something like hasNext() and then fetching.
Your lexer rule TEXT possibly matches an empty string, causing the lexer to create an infinite number of tokens. Also, you don't need all those returns clauses on your rules: you can simply grab what a parser (or lexer) rule matched by adding .text after it.
You could let your parser return a List<String>, or let it return a single String and repeatedly invoke that parser rule until EOF is encountered.
A little demo:
grammar T;
@parser::members {
public static void main(String[] args) throws Exception {
String src = "likes(a, b) :- likes(a, X), likes(X, b). hates(a, b) " +
":- hates(a,X), hates(X,b). likes(a,b) :- says(god, likes(a,b)).";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
List<String> statements = parser.parse();
for(String s : statements) {
System.out.println(s);
}
}
}
parse returns [List<String> statements]
@init{$statements = new ArrayList<String>();}
: (statement {$statements.add($statement.text);} ~TEXT+)+ EOF
;
statement
: TEXT OPAR params CPAR
;
params
: (param (COMMA param)*)?
;
param
: TEXT
| statement
;
COMMA : ',';
OPAR : '(';
CPAR : ')';
TEXT : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t') {$channel=HIDDEN;};
OTHER : . ;
Note that ~TEXT+ in the parse rule matches one or more tokens other than TEXT.
If you now create a lexer and parser and run the TParser class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar TParser
or
Windows
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar TParser
you will see the following being printed to your console:
likes(a, b)
likes(a, X)
likes(X, b)
hates(a, b)
hates(a,X)
hates(X,b)
likes(a,b)
says(god, likes(a,b))
EDIT
And here's how to return a single String as opposed to a List<String>:
@parser::members {
public static void main(String[] args) throws Exception {
String src = "likes(a, b) :- likes(a, X), likes(X, b). hates(a, b) " +
":- hates(a,X), hates(X,b). likes(a,b) :- says(god, likes(a,b)).";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
String s;
while((s = parser.parse()) != null) {
System.out.println(s);
}
}
}
parse returns [String s]
: statement ~(TEXT| EOF)* {$s = $statement.text;}
| EOF {$s = null;}
;
You should just be able to call sentence() repeatedly until you hit the end of input.
So I have some string:
//Blah blah blach
// sdfkjlasdf
"Another //thing"
And I'm using java regex to replace all the lines that have double slashes like so:
theString = Pattern.compile("//(.*?)\\n", Pattern.DOTALL).matcher(theString).replaceAll("");
And it works for the most part, but the problem is it removes all the occurrences and I need to find a way to have it not remove the quoted occurrence. How would I go about doing that?
Instead of using a parser that parses an entire Java source file, or writing something yourself that parses only those parts you're interested in, you could use some 3rd party tool like ANTLR.
ANTLR has the ability to define only those tokens you are interested in (and of course the tokens that can mess up your token-stream like multi-line comments and String- and char literals). So you only need to define a lexer (another word for tokenizer) that correctly handles those tokens.
This is called a grammar. In ANTLR, such a grammar could look like this:
lexer grammar FuzzyJavaLexer;
options{filter=true;}
SingleLineComment
: '//' ~( '\r' | '\n' )*
;
MultiLineComment
: '/*' .* '*/'
;
StringLiteral
: '"' ( '\\' . | ~( '"' | '\\' ) )* '"'
;
CharLiteral
: '\'' ( '\\' . | ~( '\'' | '\\' ) )* '\''
;
Save the above in a file called FuzzyJavaLexer.g. Now download ANTLR 3.2 here and save it in the same folder as your FuzzyJavaLexer.g file.
Execute the following command:
java -cp antlr-3.2.jar org.antlr.Tool FuzzyJavaLexer.g
which will create a FuzzyJavaLexer.java source class.
Of course you need to test the lexer, which you can do by creating a file called FuzzyJavaLexerTest.java and copying the code below in it:
import org.antlr.runtime.*;
public class FuzzyJavaLexerTest {
public static void main(String[] args) throws Exception {
String source =
"class Test { \n"+
" String s = \" ... \\\" // no comment \"; \n"+
" /* \n"+
" * also no comment: // foo \n"+
" */ \n"+
" char quote = '\"'; \n"+
" // yes, a comment, finally!!! \n"+
" int i = 0; // another comment \n"+
"} \n";
System.out.println("===== source =====");
System.out.println(source);
System.out.println("==================");
ANTLRStringStream in = new ANTLRStringStream(source);
FuzzyJavaLexer lexer = new FuzzyJavaLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
for(Object obj : tokens.getTokens()) {
Token token = (Token)obj;
if(token.getType() == FuzzyJavaLexer.SingleLineComment) {
System.out.println("Found a SingleLineComment on line "+token.getLine()+
", starting at column "+token.getCharPositionInLine()+
", text: "+token.getText());
}
}
}
}
Next, compile your FuzzyJavaLexer.java and FuzzyJavaLexerTest.java by doing:
javac -cp .:antlr-3.2.jar *.java
and finally execute the FuzzyJavaLexerTest.class file:
// *nix/MacOS
java -cp .:antlr-3.2.jar FuzzyJavaLexerTest
or:
// Windows
java -cp .;antlr-3.2.jar FuzzyJavaLexerTest
after which you'll see the following being printed to your console:
===== source =====
class Test {
String s = " ... \" // no comment ";
/*
* also no comment: // foo
*/
char quote = '"';
// yes, a comment, finally!!!
int i = 0; // another comment
}
==================
Found a SingleLineComment on line 7, starting at column 2, text: // yes, a comment, finally!!!
Found a SingleLineComment on line 8, starting at column 13, text: // another comment
Pretty easy, eh? :)
Use a parser and determine it char by char.
Kickoff example:
StringBuilder builder = new StringBuilder();
boolean quoted = false;
for (String line : string.split("\\n")) {
for (int i = 0; i < line.length(); i++) {
char c = line.charAt(i);
if (c == '"') {
quoted = !quoted;
}
if (!quoted && c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
break;
} else {
builder.append(c);
}
}
builder.append("\n");
}
String parsed = builder.toString();
System.out.println(parsed);
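The kickoff example toggles the quoted flag on every ", so an escaped quote (\") or a '"' char literal would throw it off. A slightly extended sketch that tracks escapes and both kinds of literals could look like this. The class name LineCommentStripper is made up, /* */ comments are deliberately left alone, and Unicode escapes are not handled.

```java
// Hypothetical extension of the char-by-char approach.
final class LineCommentStripper {

    static String strip(String source) {
        StringBuilder out = new StringBuilder();
        char quote = 0; // 0 means: not inside a string or char literal
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (quote != 0) {
                out.append(c);
                if (c == '\\' && i + 1 < source.length()) {
                    out.append(source.charAt(++i));   // copy the escaped char
                } else if (c == quote || c == '\n') {
                    quote = 0;                        // literal ends here
                }
            } else if (c == '"' || c == '\'') {
                quote = c;
                out.append(c);
            } else if (c == '/' && i + 1 < source.length()
                    && source.charAt(i + 1) == '/') {
                // Skip up to, but not including, the line break.
                while (i < source.length() && source.charAt(i) != '\n') {
                    i++;
                }
                i--; // the loop increment moves back onto the '\n'
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(strip("//Blah blah blach\n// sdfkjlasdf\n\"Another //thing\"\n"));
    }
}
```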
(This is in answer to the question @finnw asked in the comment under his answer. It's not so much an answer to the OP's question as an extended explanation of why a regex is the wrong tool.)
Here's my test code:
String r0 = "(?m)^((?:[^\"]|\"(?:[^\"]|\\\")*\")*)//.*$";
String r1 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n]|\\\")*\")*)//.*$";
String r2 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";
String test =
"class Test { \n"+
" String s = \" ... \\\" // no comment \"; \n"+
" /* \n"+
" * also no comment: // but no harm \n"+
" */ \n"+
" /* no comment: // much harm */ \n"+
" char quote = '\"'; // comment \n"+
" // another comment \n"+
" int i = 0; // and another \n"+
"} \n"
.replaceAll(" +$", "");
System.out.printf("%n%s%n", test);
System.out.printf("%n%s%n", test.replaceAll(r0, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r1, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r2, "$1"));
r0 is the edited regex from your answer; it removes only the final comment (// and another), because everything else is matched in group(1). Setting multiline mode ((?m)) is necessary for ^ and $ to work right, but it doesn't solve this problem because your character classes can still match newlines.
r1 deals with the newline problem, but it still incorrectly matches // no comment in the string literal, for two reasons: you didn't include a backslash in the first part of (?:[^\"\r\n]|\\\"); and you only used two of them to match the backslash in the second part.
r2 fixes that, but it makes no attempt to deal with the quote in the char literal, or single-line comments inside the multiline comments. They can probably be handled too, but this regex is already Baby Godzilla; do you really want to see it all grown up?
The following is from a grep-like program I wrote (in Perl) a few years ago. It has an option to strip java comments before processing the file:
# ============================================================================
# ============================================================================
#
# strip_java_comments
# -------------------
#
# Strip the comments from a Java-like file. Multi-line comments are
# replaced with the equivalent number of blank lines so that all text
# left behind stays on the same line.
#
# Comments are replaced by at least one space.
#
# The text for an entire file is assumed to be in $_ and is returned
# in $_
#
# ============================================================================
# ============================================================================
sub strip_java_comments
{
s!( (?: \" [^\"\\]* (?: \\. [^\"\\]* )* \" )
| (?: \' [^\'\\]* (?: \\. [^\'\\]* )* \' )
| (?: \/\/ [^\n] *)
| (?: \/\* .*? \*\/)
)
!
my $x = $1;
my $first = substr($x, 0, 1);
if ($first eq '/')
{
"\n" x ($x =~ tr/\n//);
}
else
{
$x;
}
!esxg;
}
This code does actually work properly and can't be fooled by tricky comment/quote combinations. It will probably be fooled by Unicode escapes (\u0022 etc), but you can easily deal with those first if you want to.
As it's Perl, not Java, the replacement code will have to change. I'll have a quick crack at producing equivalent Java. Stand by...
EDIT: I've just whipped this up. Will probably need work:
// The trick is to search for both comments and quoted strings.
// That way we won't notice a (partial or full) comment within a quoted string
// or a (partial or full) quoted-string within a comment.
// (I may not have translated the back-slashes accurately. You'll figure it out)
Pattern p = Pattern.compile(
"( (?: \" [^\"\\\\]* (?: \\\\. [^\"\\\\]* )* \" )" + // " ... "
" | (?: ' [^'\\\\]* (?: \\\\. [^'\\\\]* )* ' )" + // or ' ... '
" | (?: // [^\\n] * )" + // or // ...
" | (?: /\\* .*? \\* / )" + // or /* ... */
")",
Pattern.DOTALL | Pattern.COMMENTS
);
Matcher m = p.matcher(entireInputFileAsAString);
StringBuilder output = new StringBuilder();
while (m.find())
{
if (m.group(1).startsWith("/"))
{
// This is a comment. Replace it with a space...
m.appendReplacement(output, " ");
// ... or replace it with an equivalent number of newlines
// (exercise for reader)
}
else
{
// We matched a quoted string. Put it back
m.appendReplacement(output, "$1");
}
}
m.appendTail(output);
return output.toString();
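The "equivalent number of newlines" variant left as an exercise could be sketched as follows. This is one possible take, not the author's code: the class name CommentBlanker is invented, the regex is written without the COMMENTS flag, and the output is built by hand rather than with appendReplacement so that no replacement-string escaping is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: blank out comments but keep every line break.
final class CommentBlanker {
    private static final Pattern TOKEN = Pattern.compile(
          "\"(?:\\\\.|[^\"\\\\])*\""   // " ... "
        + "|'(?:\\\\.|[^'\\\\])*'"     // or ' ... '
        + "|//[^\\n]*"                 // or // ...
        + "|/\\*.*?\\*/",              // or /* ... */
        Pattern.DOTALL);

    static String strip(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(source, last, m.start());
            String hit = m.group();
            if (hit.startsWith("/")) {
                // A comment: emit one '\n' per line break it contained,
                // or a single space if it sat on one line.
                int newlines = 0;
                for (int i = 0; i < hit.length(); i++) {
                    if (hit.charAt(i) == '\n') newlines++;
                }
                if (newlines == 0) {
                    out.append(' ');
                } else {
                    for (int i = 0; i < newlines; i++) out.append('\n');
                }
            } else {
                out.append(hit); // a quoted literal: put it back unchanged
            }
            last = m.end();
        }
        out.append(source, last, source.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("int a; /* one\n two */ int b; // tail\nString s = \"// kept\";"));
    }
}
```

Because line breaks inside comments survive, line numbers reported for the remaining code still match the original file.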
You can't tell using regex if you are in a double-quoted string or not. In the end, a regex is just a state machine (sometimes extended a bit). I would use a parser as provided by BalusC or this one.
If you want to know why regexes are limited, read about formal grammars. A Wikipedia article is a good start.