I am currently creating a compiler with ANTLR4 that should allow Java code to be parsed.
How do I make
public void =(Integer value) => java { this.value = value; }
work so that the code between java { } is not parsed by ANTLR, but still gets a visitor in my parser?
Currently I have
javaStatementBody: KWJAVA LCURLY .*? RCURLY
but this obviously does not work: the .*? parses the whole file.
Please do not answer with "use quotes"; that's not going to be my solution, because I want to allow Java code highlighting.
You could create separate lexer and parser grammars so that you can use lexical modes. Whenever the lexer "sees" the input java {, it moves to JAVA_MODE. While in the Java mode, you tokenise comments, string literals and char literals. Also, when in this mode you encounter a {, you push JAVA_MODE again so that the lexer knows it is nested one level deeper. And when you encounter a }, you pop a mode from the stack (resulting in either going back to the default mode, or staying in the Java mode but one level less deep).
A quick demo:
IslandLexer.g4
lexer grammar IslandLexer;
JAVA_START
: 'java' SPACES '{' -> pushMode(JAVA_MODE)
;
OTHER
: .
;
fragment SPACES : [ \t\r\n]+;
mode JAVA_MODE;
JAVA_CHAR : '\'' ( ~[\\'\r\n] | '\\' [tbnrf'\\] ) '\'';
JAVA_STRING : '"' ( ~[\\"\r\n] | '\\' [tbnrf"\\] )* '"';
JAVA_LINE_COMMENT : '//' ~[\r\n]*;
JAVA_BLOCK_COMMENT : '/*' .*? '*/';
JAVA_OPEN_BRACE : '{' -> pushMode(JAVA_MODE);
JAVA_CLOSE_BRACE : '}' -> popMode;
JAVA_OTHER : ~[{}];
IslandParser.g4
parser grammar IslandParser;
options { tokenVocab=IslandLexer; }
parse
: unit* EOF
;
unit
: base_language
| java_language
;
base_language
: OTHER+
;
java_language
: JAVA_START java_atom+
;
java_atom
: JAVA_CHAR
| JAVA_STRING
| JAVA_LINE_COMMENT
| JAVA_BLOCK_COMMENT
| JAVA_OPEN_BRACE
| JAVA_CLOSE_BRACE
| JAVA_OTHER
;
Test it with the following code:
String source = "foo \n" +
"\n" +
"java { \n" +
" char foo() { \n" +
" /* a quote in a comment \\\" */ \n" +
" String s = \"java {...}\"; \n" +
" return '}'; \n" +
" }\n" +
"}\n" +
"\n" +
"bar";
IslandLexer lexer = new IslandLexer(CharStreams.fromString(source));
IslandParser parser = new IslandParser(new CommonTokenStream(lexer));
System.out.println(parser.parse().toStringTree(parser));
which prints the resulting parse tree, showing the whole java { ... } block tokenised separately from the surrounding text.
I am supporting an Open Source project and my ANTLR-based parser is returning a truncated ParseTree. I believe I've provided what is needed to reproduce the problem.
Given a parser created using ANTLR 4.8-1 and configured as follows:
public static Expressions parse(String mappingExpression) throws ParseException, IOException {
// Expressions can include references to properties within an
// application interface ("state"),
// properties within an event, and various operators and functions.
InputStream targetStream = new ByteArrayInputStream(mappingExpression.getBytes());
CharStream input = CharStreams.fromStream(targetStream,Charset.forName("UTF-8"));
MappingExpressionLexer lexer = new MappingExpressionLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
MappingExpressionParser parser = new MappingExpressionParser(tokens);
ParseTree tree = null;
BufferingErrorListener errorListener = new BufferingErrorListener();
try {
// remove the default error listeners which print to stderr
parser.removeErrorListeners();
lexer.removeErrorListeners();
// replace with error listener that buffer errors and allow us to retrieve them
// later
parser.addErrorListener(errorListener);
lexer.addErrorListener(errorListener);
tree = parser.expr();
And I provide the following statement to be parsed:
results.( $y := "test"; $bta := function($x) {( $count($x.billToAccounts) > 1 ? ($contains($join($x.billToAccounts, ','), "super") ? "Super" : "Standard") : ($contains($x.billToAccounts[0], "super") ? "Super" : "Standard") )}; { "users": $filter($, function($v, $i, $a) { $v.status = "PROVISIONED" }) { "firstName": $.profile.firstName, "lastName": $.profile.lastName, "email": $.profile.login, "lastLogin": $.lastLogin, "id" : $.id, "userType": $bta($.profile) } } )
the parse tree returned only holds the "result" token even though all the tokens are parsed (as shown in the _input.tokens array) and all seem to show channel 0.
Where I expect the parser to continue building out the _localCtx, the MappingExpressionParser statement:
_alt = getInterpreter().adaptivePredict(_input,17,_ctx);
returns 2 so no further buildout of the _localCtx occurs and it only holds a TerminalNodeContext with "result".
I've tried rearranging the various rules, and suspect it is related to the location of the parens rule vis-à-vis the expr rule, but I'm missing something.
What causes the adaptivePredict to return 2 so soon?
/**
* (c) Copyright 2018, 2019 IBM Corporation
* 1 New Orchard Road,
* Armonk, New York, 10504-1722
* United States
* +1 914 499 1900
* support: Nathaniel Mills wnm3#us.ibm.com
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/
/* Antlr grammar defining the mapping expression language */
grammar MappingExpression;
/* The start rule; begin parsing here.
operator precedence is implied by the ordering in this list */
// =======================
// = PARSER RULES
// =======================
expr:
ID # id
| '*' ('.' expr)? # field_values
| DESCEND ('.' expr)? # descendant
| DOLLAR (('.' expr) | (ARR_OPEN expr ARR_CLOSE))? # context_ref
| ROOT ('.' expr)? # root_path
| '(' (expr (';' (expr)?)*)? ')' # parens
| ARR_OPEN exprOrSeqList? ARR_CLOSE # array_constructor
| OBJ_OPEN fieldList? OBJ_CLOSE # object_constructor
| expr ARR_OPEN ARR_CLOSE # to_array
| expr '.' expr # path
| expr ARR_OPEN expr ARR_CLOSE # array
| VAR_ID (emptyValues | exprValues) # function_call
| FUNCTIONID varList '{' exprList? '}' # function_decl
| VAR_ID ASSIGN (expr | (FUNCTIONID varList '{' exprList? '}')) # var_assign
| (FUNCTIONID varList '{' exprList? '}') exprValues # function_exec
| op=(TRUE|FALSE) # boolean
| op='-' expr # unary_op
| expr op=('*'|'/'|'%') expr # muldiv_op
| expr op=('+'|'-') expr # addsub_op
| expr '&' expr # concat_op
| expr 'in' expr # membership
| expr 'and' expr # logand
| expr 'or' expr # logor
| expr op=('<'|'<='|'>'|'>='|'!='|'=') expr # comp_op
| expr '?' expr (':' expr)? # conditional
| expr CHAIN expr # fct_chain
| VAR_ID # var_recall
| NUMBER # number
| STRING # string
| 'null' # null
;
fieldList : STRING ':' expr (',' STRING ':' expr)*;
exprList : expr (',' expr)* ;
varList : '(' (VAR_ID (',' VAR_ID)*)* ')' ;
exprValues : '(' exprList ')' ((',' exprOrSeq)* ')')?;
emptyValues : '(' ')' ;
seq : expr '..' expr ;
exprOrSeq : seq | expr ;
exprOrSeqList : exprOrSeq (',' exprOrSeq)* ;
// =======================
// = LEXER RULES
// =======================
TRUE : 'true';
FALSE : 'false';
STRING
: '\'' (ESC | ~['\\])* '\''
| '"' (ESC | ~["\\])* '"'
;
NULL : 'null';
ARR_OPEN : '[';
ARR_CLOSE : ']';
OBJ_OPEN : '{';
OBJ_CLOSE : '}';
DOLLAR : '$';
ROOT : '$$' ;
DESCEND : '**';
NUMBER
: INT '.' [0-9]+ EXP? // 1.35, 1.35E-9, 0.3
| INT EXP // 1e10 3e4
| INT // 3, 45
;
FUNCTIONID : 'function' ;
WS: [ \t\n]+ -> skip ; // ignore whitespace
COMMENT: '/*' .*? '*/' -> skip; // allow comments
// Assign token names used in above grammar
CHAIN : '~>' ;
ASSIGN : ':=' ;
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
REM : '%' ;
EQ : '=' ;
NOT_EQ : '!=' ;
LT : '<' ;
LE : '<=' ;
GT : '>' ;
GE : '>=' ;
CONCAT : '&';
VAR_ID : '$' ID ;
ID
: [a-zA-Z] [a-zA-Z0-9_]*
| BACK_QUOTE ~[`]* BACK_QUOTE;
// =======================
// = LEXER FRAGMENTS
// =======================
fragment ESC : '\\' (["'\\/bfnrt] | UNICODE) ;
fragment UNICODE : ([\u0080-\uFFFF] | 'u' HEX HEX HEX HEX) ;
fragment HEX : [0-9a-fA-F] ;
fragment INT : '0' | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
fragment SINGLE_QUOTE : '\'';
fragment DOUBLE_QUOTE : '"';
fragment BACK_QUOTE : '`';
Although tokens are created for the entire example input, not all are handled by the parser. If you run this:
String mappingExpression = "results.(\n" +
" $y := \"test\"; \n" +
" $bta := function($x) {\n" +
" (\n" +
" $count($x.billToAccounts) > 1 \n" +
" ? ($contains($join($x.billToAccounts, ','), \"super\") ? \"Super\" : \"Standard\")\n" +
" : ($contains($x.billToAccounts[0], \"super\") ? \"Super\" : \"Standard\") \n" +
" )\n" +
" };\n" +
" { \n" +
" \"users\": $filter($, function($v, $i, $a) { \n" +
" $v.status = \"PROVISIONED\" \n" +
" })\n" +
" { \n" +
" \"firstName\": $.profile.firstName, \n" +
" \"lastName\": $.profile.lastName, \n" +
" \"email\": $.profile.login, \n" +
" \"lastLogin\": $.lastLogin, \n" +
" \"id\" : $.id, \n" +
" \"userType\": $bta($.profile) \n" +
" }\n" +
" } \n" +
")";
InputStream targetStream = new ByteArrayInputStream(mappingExpression.getBytes());
MappingExpressionLexer lexer = new MappingExpressionLexer(CharStreams.fromStream(targetStream, StandardCharsets.UTF_8));
MappingExpressionParser parser = new MappingExpressionParser(new CommonTokenStream(lexer));
ParseTree tree = parser.expr();
System.out.println(tree.toStringTree(parser));
the following will be printed:
(expr results)
This means that expr successfully parses the first alternative, an ID, and then stops.
To force the parser to consume all tokens, introduce the following rule:
expr_to_eof
: expr EOF
;
and change:
ParseTree tree = parser.expr();
into:
ParseTree tree = parser.expr_to_eof();
When you run the code snippet I posted again (with the default error listeners!), you will see some error messages on your console (i.e., the parser did not successfully process the input).
If I try to parse the input:
results.(
$y := "test";
$bta := function($x) {
(
$count($x.billToAccounts) > 1
? ($contains($join($x.billToAccounts, ','), "super") ? "Super" : "Standard")
: ($contains($x.billToAccounts[0], "super") ? "Super" : "Standard")
)
};
{
"users": $filter($, function($v, $i, $a) {
$v.status = "PROVISIONED"
})
}
)
then the parser has no problems with it. Inspecting the tree:
{
"users": $filter($, function($v, $i, $a) {
$v.status = "PROVISIONED"
})
}
I see it is recognised as OBJ_OPEN fieldList? OBJ_CLOSE, where fieldList is defined as follows:
fieldList : STRING ':' expr (',' STRING ':' expr)*;
i.e. a list of key-value pairs separated by commas. So if you feed the parser this:
{
"users": $filter($, function($v, $i, $a) {
$v.status = "PROVISIONED"
})
{
"firstName": $.profile.firstName,
"lastName": $.profile.lastName,
"email": $.profile.login,
"lastLogin": $.lastLogin,
"id" : $.id,
"userType": $bta($.profile)
}
}
it cannot parse it properly since:
{
"firstName": $.profile.firstName,
"lastName": $.profile.lastName,
"email": $.profile.login,
"lastLogin": $.lastLogin,
"id" : $.id,
"userType": $bta($.profile)
}
is not a key-value pair itself, and there is no comma separating the two.
This would properly parse it:
{
"users": $filter($, function($v, $i, $a) {
$v.status = "PROVISIONED"
}),
"some-key": {
"firstName": $.profile.firstName,
"lastName": $.profile.lastName,
"email": $.profile.login,
"lastLogin": $.lastLogin,
"id" : $.id,
"userType": $bta($.profile)
}
}
Alternatively, $filter($, function($v, $i, $a) { $v.status = "PROVISIONED" }) would have to be allowed to have { "firstName": ... } directly after it, but I can't see that being valid in your grammar.
I have a problem with my ANTLR grammar (or lexer). In my case I need to parse a string with custom text and find functions in it. The function format is $foo($bar(3),'strArg'). I found a solution in this post ANTLR Nested Functions and improved it a little for my needs. But while testing different cases I found one that breaks the parser: $foo($3,'strArg'). This throws an IncorectSyntax exception. I tried many variants (for example not skipping $ and including it in the parse tree), but all these attempts were unsuccessful.
Lexer
lexer grammar TLexer;
TEXT
: ~[$]
;
FUNCTION_START
: '$' -> pushMode(IN_FUNCTION), skip
;
mode IN_FUNCTION;
FUNTION_NESTED : '$' -> pushMode(IN_FUNCTION), skip;
ID : [a-zA-Z_]+;
PAR_OPEN : '(';
PAR_CLOSE : ')' -> popMode;
NUMBER : [0-9]+;
STRING : '\'' ( ~'\'' | '\'\'' )* '\'';
COMMA : ',';
SPACE : [ \t\r\n] -> skip;
Parser
parser grammar TParser;
options {
tokenVocab=TLexer;
}
parse
: atom* EOF
;
atom
: text
| function
;
text
: TEXT+
;
function
: ID params
;
params
: PAR_OPEN ( param ( COMMA param )* )? PAR_CLOSE
;
param
: NUMBER
| STRING
| function
;
The parser does not fail on $foo($3,'strArg'), because when it encounters the second $ it is already in IN_FUNCTION mode and it is expecting a parameter. It skips the character and reads a NUMBER.
If you want it to fail you need to stop skipping the dollar signs in the lexer (note that the nested token needs a name different from FUNCTION_START, since token names must be unique):
FUNCTION_START : '$' -> pushMode(IN_FUNCTION);
mode IN_FUNCTION;
FUNCTION_NESTED : '$' -> pushMode(IN_FUNCTION);
and modify the function rule so it accepts either token:
function : (FUNCTION_START | FUNCTION_NESTED) ID params;
I have a simple grammar as follows:
grammar SampleConfig;
line: ID (WS)* '=' (WS)* string;
ID: [a-zA-Z]+;
string: '"' (ESC|.)*? '"' ;
ESC : '\\"' | '\\\\' ; // 2-char sequences \" and \\
WS: [ \t]+ -> skip;
The spaces in the input are completely ignored, including those in the string literal.
final String input = "key = \"value with spaces in between\"";
final SampleConfigLexer l = new SampleConfigLexer(new ANTLRInputStream(input));
final SampleConfigParser p = new SampleConfigParser(new CommonTokenStream(l));
final LineContext context = p.line();
System.out.println(context.getChildCount() + ": " + context.getText());
This prints the following output:
3: key="valuewithspacesinbetween"
But, I expected the white spaces in the string literal to be retained, i.e.
3: key="value with spaces in between"
Is it possible to correct the grammar to achieve this behavior or should I just override CommonTokenStream to ignore whitespace during the parsing process?
You shouldn't expect any spaces in parser rules since you're skipping them in your lexer.
Either remove the skip command or make string a lexer rule:
STRING : '"' ( '\\' [\\"] | ~[\\"\r\n] )* '"';
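Putting that together, a corrected version of the grammar could look like this (a sketch; rule names kept from the question):

```antlr
grammar SampleConfig;

line : ID '=' STRING ;

ID     : [a-zA-Z]+ ;
STRING : '"' ( '\\' [\\"] | ~[\\"\r\n] )* '"' ;
WS     : [ \t]+ -> skip ;
```

Since WS is skipped, the explicit (WS)* references in line are no longer needed, and because STRING is now a single token, its text (including the interior spaces) is preserved in the parse tree.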
I'm trying to create a very simple grammar to learn to use ANTLR but I get the following message:
"The following alternatives can never be reached: 2"
This is my grammar attempt:
grammar Robot;
file : command+;
command : ( delay|type|move|click|rclick) ;
delay : 'wait' number ';';
type : 'type' id ';';
move : 'move' number ',' number ';';
click : 'click' ;
rclick : 'rlick' ;
id : ('a'..'z'|'A'..'Z')+ ;
number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n' ) { skip();} ;
I'm using ANTLRWorks plugin for IDEA:
The .. (range) inside parser rules means something different than inside lexer rules. Inside lexer rules it means "from character X to character Y"; inside a parser rule it matches "from token type M to token type N". And since you made number a parser rule, it does not do what you think it does (and you are therefore receiving an obscure error message).
The solution: make number a lexer rule instead (so, capitalize it: Number):
grammar Robot;
file : command+;
command : (delay | type | move | Click | RClick) ;
delay : 'wait' Number ';';
type : 'type' Id ';';
move : 'move' Number ',' Number ';';
Click : 'click' ;
RClick : 'rlick' ;
Id : ('a'..'z'|'A'..'Z')+ ;
Number : ('0'..'9')+ ;
WS : (' ' | '\t' | '\r' | '\n') { skip();} ;
And as you can see, I also made id, click and rclick lexer rules instead. If you're not sure what the difference is between parser and lexer rules, please say so and I'll add an explanation to this answer.
So I have some string:
//Blah blah blach
// sdfkjlasdf
"Another //thing"
And I'm using java regex to replace all the lines that have double slashes like so:
theString = Pattern.compile("//(.*?)\\n", Pattern.DOTALL).matcher(theString).replaceAll("");
And it works for the most part, but the problem is it removes all the occurrences and I need to find a way to have it not remove the quoted occurrence. How would I go about doing that?
Instead of using a parser that parses an entire Java source file, or writing something yourself that parses only those parts you're interested in, you could use some 3rd party tool like ANTLR.
ANTLR has the ability to define only those tokens you are interested in (and of course the tokens that can mess up your token-stream like multi-line comments and String- and char literals). So you only need to define a lexer (another word for tokenizer) that correctly handles those tokens.
This is called a grammar. In ANTLR, such a grammar could look like this:
lexer grammar FuzzyJavaLexer;
options{filter=true;}
SingleLineComment
: '//' ~( '\r' | '\n' )*
;
MultiLineComment
: '/*' .* '*/'
;
StringLiteral
: '"' ( '\\' . | ~( '"' | '\\' ) )* '"'
;
CharLiteral
: '\'' ( '\\' . | ~( '\'' | '\\' ) )* '\''
;
Save the above in a file called FuzzyJavaLexer.g. Now download ANTLR 3.2 here and save it in the same folder as your FuzzyJavaLexer.g file.
Execute the following command:
java -cp antlr-3.2.jar org.antlr.Tool FuzzyJavaLexer.g
which will create a FuzzyJavaLexer.java source class.
Of course you need to test the lexer, which you can do by creating a file called FuzzyJavaLexerTest.java and copying the code below in it:
import org.antlr.runtime.*;
public class FuzzyJavaLexerTest {
public static void main(String[] args) throws Exception {
String source =
"class Test { \n"+
" String s = \" ... \\\" // no comment \"; \n"+
" /* \n"+
" * also no comment: // foo \n"+
" */ \n"+
" char quote = '\"'; \n"+
" // yes, a comment, finally!!! \n"+
" int i = 0; // another comment \n"+
"} \n";
System.out.println("===== source =====");
System.out.println(source);
System.out.println("==================");
ANTLRStringStream in = new ANTLRStringStream(source);
FuzzyJavaLexer lexer = new FuzzyJavaLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
for(Object obj : tokens.getTokens()) {
Token token = (Token)obj;
if(token.getType() == FuzzyJavaLexer.SingleLineComment) {
System.out.println("Found a SingleLineComment on line "+token.getLine()+
", starting at column "+token.getCharPositionInLine()+
", text: "+token.getText());
}
}
}
}
Next, compile your FuzzyJavaLexer.java and FuzzyJavaLexerTest.java by doing:
javac -cp .:antlr-3.2.jar *.java
and finally execute the FuzzyJavaLexerTest.class file:
// *nix/MacOS
java -cp .:antlr-3.2.jar FuzzyJavaLexerTest
or:
// Windows
java -cp .;antlr-3.2.jar FuzzyJavaLexerTest
after which you'll see the following being printed to your console:
===== source =====
class Test {
String s = " ... \" // no comment ";
/*
* also no comment: // foo
*/
char quote = '"';
// yes, a comment, finally!!!
int i = 0; // another comment
}
==================
Found a SingleLineComment on line 7, starting at column 2, text: // yes, a comment, finally!!!
Found a SingleLineComment on line 8, starting at column 13, text: // another comment
Pretty easy, eh? :)
Use a parser, determine it char-by-char.
Kickoff example:
StringBuilder builder = new StringBuilder();
boolean quoted = false;
for (String line : string.split("\\n")) {
for (int i = 0; i < line.length(); i++) {
char c = line.charAt(i);
if (c == '"') {
quoted = !quoted;
}
if (!quoted && c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
break;
} else {
builder.append(c);
}
}
builder.append("\n");
}
String parsed = builder.toString();
System.out.println(parsed);
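Note the kickoff example above can be fooled by an escaped quote (\") inside a string literal, and it carries the quoted flag across lines. A slightly hardened sketch (class name is mine; it still ignores /* */ block comments) resets the flag per line and copies escape sequences verbatim:

```java
public class LineCommentStripper {
    // Strips // comments outside double-quoted strings; block comments are NOT handled.
    static String strip(String input) {
        StringBuilder builder = new StringBuilder();
        for (String line : input.split("\n")) {
            boolean quoted = false; // Java strings don't span lines, so reset per line
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (quoted && c == '\\' && i + 1 < line.length()) {
                    // copy the escape sequence verbatim so \" doesn't toggle the flag
                    builder.append(c).append(line.charAt(++i));
                    continue;
                }
                if (c == '"') {
                    quoted = !quoted;
                }
                if (!quoted && c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
                    break; // the rest of the line is a comment
                }
                builder.append(c);
            }
            builder.append("\n");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.print(strip("int i = 0; // gone\nString s = \"kept // \\\" here\";"));
    }
}
```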
(This is in answer to the question #finnw asked in the comment under his answer. It's not so much an answer to the OP's question as an extended explanation of why a regex is the wrong tool.)
Here's my test code:
String r0 = "(?m)^((?:[^\"]|\"(?:[^\"]|\\\")*\")*)//.*$";
String r1 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n]|\\\")*\")*)//.*$";
String r2 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";
String test =
"class Test { \n"+
" String s = \" ... \\\" // no comment \"; \n"+
" /* \n"+
" * also no comment: // but no harm \n"+
" */ \n"+
" /* no comment: // much harm */ \n"+
" char quote = '\"'; // comment \n"+
" // another comment \n"+
" int i = 0; // and another \n"+
"} \n"
.replaceAll(" +$", "");
System.out.printf("%n%s%n", test);
System.out.printf("%n%s%n", test.replaceAll(r0, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r1, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r2, "$1"));
r0 is the edited regex from your answer; it removes only the final comment (// and another), because everything else is matched in group(1). Setting multiline mode ((?m)) is necessary for ^ and $ to work right, but it doesn't solve this problem because your character classes can still match newlines.
r1 deals with the newline problem, but it still incorrectly matches // no comment in the string literal, for two reasons: you didn't include a backslash in the first part of (?:[^\"\r\n]|\\\"); and you only used two of them to match the backslash in the second part.
r2 fixes that, but it makes no attempt to deal with the quote in the char literal, or single-line comments inside the multiline comments. They can probably be handled too, but this regex is already Baby Godzilla; do you really want to see it all grown up?
The following is from a grep-like program I wrote (in Perl) a few years ago. It has an option to strip java comments before processing the file:
# ============================================================================
# ============================================================================
#
# strip_java_comments
# -------------------
#
# Strip the comments from a Java-like file. Multi-line comments are
# replaced with the equivalent number of blank lines so that all text
# left behind stays on the same line.
#
# Comments are replaced by at least one space .
#
# The text for an entire file is assumed to be in $_ and is returned
# in $_
#
# ============================================================================
# ============================================================================
sub strip_java_comments
{
s!( (?: \" [^\"\\]* (?: \\. [^\"\\]* )* \" )
| (?: \' [^\'\\]* (?: \\. [^\'\\]* )* \' )
| (?: \/\/ [^\n] *)
| (?: \/\* .*? \*\/)
)
!
my $x = $1;
my $first = substr($x, 0, 1);
if ($first eq '/')
{
"\n" x ($x =~ tr/\n//);
}
else
{
$x;
}
!esxg;
}
This code does actually work properly and can't be fooled by tricky comment/quote combinations. It will probably be fooled by unicode escapes (\u0022 etc), but you can easily deal with those first if you want to.
As it's Perl, not java, the replacement code will have to change. I'll have a quick crack at producing equivalent java. Stand by...
EDIT: I've just whipped this up. Will probably need work:
// The trick is to search for both comments and quoted strings.
// That way we won't notice a (partial or full) comment withing a quoted string
// or a (partial or full) quoted-string within a comment.
// (I may not have translated the back-slashes accurately. You'll figure it out)
Pattern p = Pattern.compile(
"( (?: \" [^\"\\\\]* (?: \\\\. [^\"\\\\]* )* \" )" + // " ... "
" | (?: ' [^'\\\\]* (?: \\\\. [^'\\\\]* )* ' )" + // or ' ... '
" | (?: // [^\\n] * )" + // or // ...
" | (?: /\\* .*? \\* / )" + // or /* ... */
")",
Pattern.DOTALL | Pattern.COMMENTS
);
Matcher m = p.matcher(entireInputFileAsAString);
StringBuilder output = new StringBuilder();
while (m.find())
{
if (m.group(1).startsWith("/"))
{
// This is a comment. Replace it with a space...
m.appendReplacement(output, " ");
// ... or replace it with an equivalent number of newlines
// (exercise for reader)
}
else
{
// We matched a quoted string. Put it back
m.appendReplacement(output, "$1");
}
}
m.appendTail(output);
return output.toString();
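For reference, here is a self-contained, runnable version of that sketch (the class name and sample input are mine; every comment is replaced with a single space, and StringBuffer is used because Matcher.appendReplacement only accepts StringBuilder from Java 9 on):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaCommentStripper {
    static String strip(String source) {
        // Match string/char literals OR comments; only the comments get replaced.
        Pattern p = Pattern.compile(
            "( (?: \" [^\"\\\\]* (?: \\\\. [^\"\\\\]* )* \" )" + // " ... "
            " | (?: ' [^'\\\\]* (?: \\\\. [^'\\\\]* )* ' )" +    // or ' ... '
            " | (?: // [^\\n]* )" +                              // or // ...
            " | (?: /\\* .*? \\*/ )" +                           // or /* ... */
            ")",
            Pattern.DOTALL | Pattern.COMMENTS);
        Matcher m = p.matcher(source);
        StringBuffer output = new StringBuffer();
        while (m.find()) {
            if (m.group(1).startsWith("/")) {
                m.appendReplacement(output, " ");  // a comment: replace with a space
            } else {
                m.appendReplacement(output, "$1"); // a literal: put it back untouched
            }
        }
        m.appendTail(output);
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.print(strip(
            "String s = \"keep // this\"; // drop this\nint i = 0; /* and\nthis */"));
    }
}
```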
You can't tell using a regex whether you are in a double-quoted string or not. In the end a regex is just a state machine (sometimes extended a bit). I would use a parser as provided by BalusC or this one.
If you want to know why regexes are limited, read about formal grammars. The Wikipedia article is a good start.