How to match a comment unless it's in a quoted string? - java

So I have some string:
//Blah blah blach
// sdfkjlasdf
"Another //thing"
And I'm using a Java regex to replace all the lines that have double slashes, like so:
theString = Pattern.compile("//(.*?)\\n", Pattern.DOTALL).matcher(theString).replaceAll("");
And it works for the most part, but the problem is it removes all the occurrences and I need to find a way to have it not remove the quoted occurrence. How would I go about doing that?

Instead of using a parser that parses an entire Java source file, or writing something yourself that parses only those parts you're interested in, you could use some 3rd party tool like ANTLR.
ANTLR has the ability to define only those tokens you are interested in (and of course the tokens that can mess up your token-stream like multi-line comments and String- and char literals). So you only need to define a lexer (another word for tokenizer) that correctly handles those tokens.
This is called a grammar. In ANTLR, such a grammar could look like this:
lexer grammar FuzzyJavaLexer;
options{filter=true;}
SingleLineComment
: '//' ~( '\r' | '\n' )*
;
MultiLineComment
: '/*' .* '*/'
;
StringLiteral
: '"' ( '\\' . | ~( '"' | '\\' ) )* '"'
;
CharLiteral
: '\'' ( '\\' . | ~( '\'' | '\\' ) )* '\''
;
Save the above in a file called FuzzyJavaLexer.g. Now download ANTLR 3.2 (antlr-3.2.jar) and save it in the same folder as your FuzzyJavaLexer.g file.
Execute the following command:
java -cp antlr-3.2.jar org.antlr.Tool FuzzyJavaLexer.g
which will create a FuzzyJavaLexer.java source class.
Of course you need to test the lexer, which you can do by creating a file called FuzzyJavaLexerTest.java and copying the code below in it:
import org.antlr.runtime.*;
public class FuzzyJavaLexerTest {
public static void main(String[] args) throws Exception {
String source =
"class Test { \n"+
" String s = \" ... \\\" // no comment \"; \n"+
" /* \n"+
" * also no comment: // foo \n"+
" */ \n"+
" char quote = '\"'; \n"+
" // yes, a comment, finally!!! \n"+
" int i = 0; // another comment \n"+
"} \n";
System.out.println("===== source =====");
System.out.println(source);
System.out.println("==================");
ANTLRStringStream in = new ANTLRStringStream(source);
FuzzyJavaLexer lexer = new FuzzyJavaLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
for(Object obj : tokens.getTokens()) {
Token token = (Token)obj;
if(token.getType() == FuzzyJavaLexer.SingleLineComment) {
System.out.println("Found a SingleLineComment on line "+token.getLine()+
", starting at column "+token.getCharPositionInLine()+
", text: "+token.getText());
}
}
}
}
Next, compile your FuzzyJavaLexer.java and FuzzyJavaLexerTest.java by doing:
javac -cp .:antlr-3.2.jar *.java
and finally execute the FuzzyJavaLexerTest.class file:
// *nix/MacOS
java -cp .:antlr-3.2.jar FuzzyJavaLexerTest
or:
// Windows
java -cp .;antlr-3.2.jar FuzzyJavaLexerTest
after which you'll see the following being printed to your console:
===== source =====
class Test {
String s = " ... \" // no comment ";
/*
* also no comment: // foo
*/
char quote = '"';
// yes, a comment, finally!!!
int i = 0; // another comment
}
==================
Found a SingleLineComment on line 7, starting at column 2, text: // yes, a comment, finally!!!
Found a SingleLineComment on line 8, starting at column 13, text: // another comment
Pretty easy, eh? :)

Use a parser that walks the input character by character and tracks whether it is inside a quoted string.
Kickoff example:
StringBuilder builder = new StringBuilder();
boolean quoted = false;
for (String line : string.split("\\n")) {
for (int i = 0; i < line.length(); i++) {
char c = line.charAt(i);
if (c == '"') {
quoted = !quoted;
}
if (!quoted && c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
break;
} else {
builder.append(c);
}
}
builder.append("\n");
}
String parsed = builder.toString();
System.out.println(parsed);
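The kickoff example toggles on every double quote, so an escaped quote (\") or a quote inside a char literal would flip the state at the wrong moment. Below is a slightly fuller sketch of the same character-by-character idea. The class name LineCommentStripper and its strip method are made up for illustration, and it assumes Unicode escapes such as \u0022 do not occur in the input.

```java
final class LineCommentStripper {

    /** Removes // comments that sit outside string literals, char literals and block comments. */
    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        char quote = 0;           // the quote character that opened the current literal, 0 if none
        boolean inBlock = false;  // true while inside a /* ... */ comment
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : 0;
            if (inBlock) {
                if (c == '*' && next == '/') {
                    out.append("*/");
                    inBlock = false;
                    i += 2;
                } else {
                    out.append(c);
                    i++;
                }
            } else if (quote != 0) {
                if (c == '\\' && next != 0) {
                    out.append(c).append(next);  // an escaped character never ends the literal
                    i += 2;
                } else {
                    if (c == quote || c == '\n') quote = 0;  // closed, or unterminated at end of line
                    out.append(c);
                    i++;
                }
            } else if (c == '/' && next == '/') {
                // Drop everything up to, but not including, the newline.
                while (i < src.length() && src.charAt(i) != '\n') i++;
            } else if (c == '/' && next == '*') {
                out.append("/*");
                inBlock = true;
                i += 2;
            } else {
                if (c == '"' || c == '\'') quote = c;
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The sample from the question: two comment lines and one quoted line.
        String src = "//Blah blah blach\n// sdfkjlasdf\n\"Another //thing\"\n";
        System.out.print(strip(src));
    }
}
```

Block comments are copied through unchanged here; only // comments are removed, which is what the question asked for.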

(This is in answer to the question @finnw asked in the comment under his answer. It's not so much an answer to the OP's question as an extended explanation of why a regex is the wrong tool.)
Here's my test code:
String r0 = "(?m)^((?:[^\"]|\"(?:[^\"]|\\\")*\")*)//.*$";
String r1 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n]|\\\")*\")*)//.*$";
String r2 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";
String test =
"class Test { \n"+
" String s = \" ... \\\" // no comment \"; \n"+
" /* \n"+
" * also no comment: // but no harm \n"+
" */ \n"+
" /* no comment: // much harm */ \n"+
" char quote = '\"'; // comment \n"+
" // another comment \n"+
" int i = 0; // and another \n"+
"} \n"
.replaceAll(" +$", "");
System.out.printf("%n%s%n", test);
System.out.printf("%n%s%n", test.replaceAll(r0, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r1, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r2, "$1"));
r0 is the edited regex from your answer; it removes only the final comment (// and another), because everything else is matched in group(1). Setting multiline mode ((?m)) is necessary for ^ and $ to work right, but it doesn't solve this problem because your character classes can still match newlines.
r1 deals with the newline problem, but it still incorrectly matches // no comment in the string literal, for two reasons: you didn't include a backslash in the first part of (?:[^\"\r\n]|\\\"); and you only used two of them to match the backslash in the second part.
r2 fixes that, but it makes no attempt to deal with the quote in the char literal, or single-line comments inside the multiline comments. They can probably be handled too, but this regex is already Baby Godzilla; do you really want to see it all grown up?

The following is from a grep-like program I wrote (in Perl) a few years ago. It has an option to strip java comments before processing the file:
# ============================================================================
# ============================================================================
#
# strip_java_comments
# -------------------
#
# Strip the comments from a Java-like file. Multi-line comments are
# replaced with the equivalent number of blank lines so that all text
# left behind stays on the same line.
#
# Comments are replaced by at least one space.
#
# The text for an entire file is assumed to be in $_ and is returned
# in $_
#
# ============================================================================
# ============================================================================
sub strip_java_comments
{
s!( (?: \" [^\"\\]* (?: \\. [^\"\\]* )* \" )
| (?: \' [^\'\\]* (?: \\. [^\'\\]* )* \' )
| (?: \/\/ [^\n] *)
| (?: \/\* .*? \*\/)
)
!
my $x = $1;
my $first = substr($x, 0, 1);
if ($first eq '/')
{
"\n" x ($x =~ tr/\n//);
}
else
{
$x;
}
!esxg;
}
This code does actually work properly and can't be fooled by tricky comment/quote combinations. It will probably be fooled by Unicode escapes (\u0022 etc.), but you can easily deal with those first if you want to.
As it's Perl, not Java, the replacement code will have to change. I'll have a quick crack at producing equivalent Java. Stand by...
EDIT: I've just whipped this up. Will probably need work:
// The trick is to search for both comments and quoted strings.
// That way we won't notice a (partial or full) comment within a quoted string
// or a (partial or full) quoted-string within a comment.
// (I may not have translated the back-slashes accurately. You'll figure it out)
Pattern p = Pattern.compile(
"( (?: \" [^\"\\\\]* (?: \\\\. [^\"\\\\]* )* \" )" + // " ... "
" | (?: ' [^'\\\\]* (?: \\\\. [^'\\\\]* )* ' )" + // or ' ... '
" | (?: // [^\\n] * )" + // or // ...
" | (?: /\\* .*? \\* / )" + // or /* ... */
")",
Pattern.DOTALL | Pattern.COMMENTS
);
Matcher m = p.matcher(entireInputFileAsAString);
StringBuilder output = new StringBuilder();
while (m.find())
{
if (m.group(1).startsWith("/"))
{
// This is a comment. Replace it with a space...
m.appendReplacement(output, " ");
// ... or replace it with an equivalent number of newlines
// (exercise for reader)
}
else
{
// We matched a quoted string. Put it back
m.appendReplacement(output, "$1");
}
}
m.appendTail(output);
return output.toString();
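For what it's worth, here is one way the "exercise for reader" could look as a self-contained class. This is only a sketch under a few assumptions: the names CommentBlanker and blank are invented, the pattern is written without Pattern.COMMENTS, and StringBuffer is used so that appendReplacement also compiles on Java versions before 9. A comment is replaced by the newlines it contained, or by a single space if it contained none.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentBlanker {

    // Literals come first so that comment markers inside them are consumed as part of the literal.
    private static final Pattern TOKEN = Pattern.compile(
          "\"(?:\\\\.|[^\"\\\\])*\""   // "..." string literal
        + "|'(?:\\\\.|[^'\\\\])*'"     // '...' char literal
        + "|//[^\\n]*"                 // line comment
        + "|/\\*.*?\\*/",              // block comment
        Pattern.DOTALL);

    static String blank(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String hit = m.group();
            String replacement;
            if (hit.startsWith("/")) {
                // A comment: keep only its newlines so later text stays on the same line number.
                String newlines = hit.replaceAll("[^\\n]", "");
                replacement = newlines.isEmpty() ? " " : newlines;
            } else {
                replacement = hit;  // a quoted literal: put it back untouched
            }
            // quoteReplacement stops '$' and '\' in the text from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String src = "int a = 1; /* x\ny */ int b; // c\nString s = \"// no\"; char c = '\"';";
        System.out.println(blank(src));
    }
}
```

Like the Perl original, this does not look at Unicode escapes, so those would have to be decoded first.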

You can't tell with a regex alone whether you are inside a double-quoted string. In the end, a regex is just a state machine (sometimes extended a bit). I would use a parser such as the one provided by BalusC.
If you want to know why regexes are limited, read about formal grammars. The Wikipedia article is a good start.

Related

How to allow lexer to parse specific code parts from java?

I am currently creating a compiler with ANTLR 4 that should allow Java code to be parsed.
How do I allow this:
public void =(Integer value) => java { this.value = value; }
so that the code between java { } is not parsed by ANTLR, but still gets a visitor in my parser?
Currently I have
javaStatementBody: KWJAVA LCURLY .*? RCURLY
but this obviously does not work and .*? parses the whole file.
Please do not answer with "use quotes"; that's not going to be my solution, because I want to allow Java code highlighting.
You could create separate lexer and parser grammars so that you can use lexical modes. Whenever the lexer "sees" the input java {, it moves to the JAVA_MODE. And when in the Java mode, you tokenise comments, string- and char literals. Also when in this mode, you encounter a {, you push the same JAVA_MODE so that the lexer knows it's nested once. And when you encounter a }, you pop a mode from the stack (resulting in either going back to the default mode, or staying in the Java mode but one level less deep).
A quick demo:
IslandLexer.g4
lexer grammar IslandLexer;
JAVA_START
: 'java' SPACES '{' -> pushMode(JAVA_MODE)
;
OTHER
: .
;
fragment SPACES : [ \t\r\n]+;
mode JAVA_MODE;
JAVA_CHAR : '\'' ( ~[\\'\r\n] | '\\' [tbnrf'\\] ) '\'';
JAVA_STRING : '"' ( ~[\\"\r\n] | '\\' [tbnrf"\\] )* '"';
JAVA_LINE_COMMENT : '//' ~[\r\n]*;
JAVA_BLOCK_COMMENT : '/*' .*? '*/';
JAVA_OPEN_BRACE : '{' -> pushMode(JAVA_MODE);
JAVA_CLOSE_BRACE : '}' -> popMode;
JAVA_OTHER : ~[{}];
IslandParser.g4
parser grammar IslandParser;
options { tokenVocab=IslandLexer; }
parse
: unit* EOF
;
unit
: base_language
| java_language
;
base_language
: OTHER+
;
java_language
: JAVA_START java_atom+
;
java_atom
: JAVA_CHAR
| JAVA_STRING
| JAVA_LINE_COMMENT
| JAVA_BLOCK_COMMENT
| JAVA_OPEN_BRACE
| JAVA_CLOSE_BRACE
| JAVA_OTHER
;
Test it with the following code:
String source = "foo \n" +
"\n" +
"java { \n" +
" char foo() { \n" +
" /* a quote in a comment \\\" */ \n" +
" String s = \"java {...}\"; \n" +
" return '}'; \n" +
" }\n" +
"}\n" +
"\n" +
"bar";
IslandLexer lexer = new IslandLexer(CharStreams.fromString(source));
IslandParser parser = new IslandParser(new CommonTokenStream(lexer));
System.out.println(parser.parse().toStringTree(parser));
which prints the resulting parse tree to the console.
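To see why the nested pushMode/popMode calls find the right closing brace, it can help to look at the same bookkeeping written by hand. The sketch below is not what ANTLR generates; JavaIslandScanner and its methods are hypothetical names, and a plain depth counter stands in for the mode stack. Braces inside string literals, char literals and comments are skipped, exactly the cases the JAVA_MODE tokens take care of.

```java
final class JavaIslandScanner {

    /** Returns the index just past the '}' that closes the block opened at openBrace, or -1. */
    static int endOfBlock(String src, int openBrace) {
        int depth = 0;  // plays the role of the lexer's mode stack depth
        int i = openBrace;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '"' || c == '\'') {
                i = skipLiteral(src, i);
            } else if (src.startsWith("//", i)) {
                int nl = src.indexOf('\n', i);
                i = nl < 0 ? src.length() : nl;
            } else if (src.startsWith("/*", i)) {
                int end = src.indexOf("*/", i + 2);
                i = end < 0 ? src.length() : end + 2;
            } else {
                if (c == '{') {
                    depth++;                       // like pushMode(JAVA_MODE)
                } else if (c == '}' && --depth == 0) {
                    return i + 1;                  // like the final popMode
                }
                i++;
            }
        }
        return -1;  // unbalanced input
    }

    /** Skips a "..." or '...' literal starting at start, honouring backslash escapes. */
    private static int skipLiteral(String src, int start) {
        char quote = src.charAt(start);
        int i = start + 1;
        while (i < src.length() && src.charAt(i) != quote && src.charAt(i) != '\n') {
            i += src.charAt(i) == '\\' ? 2 : 1;
        }
        return Math.min(i + 1, src.length());
    }

    public static void main(String[] args) {
        String src = "foo java { char foo() { String s = \"java {...}\"; return '}'; } } bar";
        int open = src.indexOf('{', src.indexOf("java"));
        System.out.println(src.substring(open, endOfBlock(src, open)));
    }
}
```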

ANTLR4: ignore white spaces in the input but not those in string literals

I have a simple grammar as follows:
grammar SampleConfig;
line: ID (WS)* '=' (WS)* string;
ID: [a-zA-Z]+;
string: '"' (ESC|.)*? '"' ;
ESC : '\\"' | '\\\\' ; // 2-char sequences \" and \\
WS: [ \t]+ -> skip;
The spaces in the input are completely ignored, including those in the string literal.
final String input = "key = \"value with spaces in between\"";
final SampleConfigLexer l = new SampleConfigLexer(new ANTLRInputStream(input));
final SampleConfigParser p = new SampleConfigParser(new CommonTokenStream(l));
final LineContext context = p.line();
System.out.println(context.getChildCount() + ": " + context.getText());
This prints the following output:
3: key="valuewithspacesinbetween"
But, I expected the white spaces in the string literal to be retained, i.e.
3: key="value with spaces in between"
Is it possible to correct the grammar to achieve this behavior or should I just override CommonTokenStream to ignore whitespace during the parsing process?
You shouldn't expect any spaces in parser rules since you're skipping them in your lexer.
Either remove the skip command or make string a lexer rule:
STRING : '"' ( '\\' [\\"] | ~[\\"\r\n] )* '"';
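The reason this works is that the lexer matches the whole literal as one token before the skip rule ever gets a chance at the spaces inside it. A rough java.util.regex analogue of that behaviour is sketched below; it is not ANTLR output, and the ConfigTokens class and its tokenize method are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ConfigTokens {

    // Alternatives are tried in order at each position, much like competing lexer rules.
    private static final Pattern TOKEN = Pattern.compile(
          "\"(?:\\\\[\\\\\"]|[^\\\\\"\\r\\n])*\""  // STRING: '"' ( '\\' [\\"] | ~[\\"\r\n] )* '"'
        + "|[a-zA-Z]+"                             // ID
        + "|="                                     // '='
        + "|[ \\t]+");                             // WS

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String t = m.group();
            // Mimic "-> skip": whitespace tokens are matched but not passed on.
            if (!t.trim().isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;  // characters matched by no alternative are silently ignored in this sketch
    }

    public static void main(String[] args) {
        System.out.println(tokenize("key = \"value with spaces in between\""));
    }
}
```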

How to find all comments in the source code?

There are two styles of comments, C-style and C++-style. How can I recognize them?
/* comments */
// comments
I am free to use any method and any third-party library.
To reliably find all comments in a Java source file, I wouldn't use regex, but a real lexer (aka tokenizer).
Two popular choices for Java are:
JFlex: http://jflex.de
ANTLR: http://www.antlr.org
Contrary to popular belief, ANTLR can also be used to create only a lexer without the parser.
Here's a quick ANTLR demo. You need the following files in the same directory:
antlr-3.2.jar
JavaCommentLexer.g (the grammar)
Main.java
Test.java (a valid (!) java source file with exotic comments)
JavaCommentLexer.g
lexer grammar JavaCommentLexer;
options {
filter=true;
}
SingleLineComment
: FSlash FSlash ~('\r' | '\n')*
;
MultiLineComment
: FSlash Star .* Star FSlash
;
StringLiteral
: DQuote
( (EscapedDQuote)=> EscapedDQuote
| (EscapedBSlash)=> EscapedBSlash
| Octal
| Unicode
| ~('\\' | '"' | '\r' | '\n')
)*
DQuote {skip();}
;
CharLiteral
: SQuote
( (EscapedSQuote)=> EscapedSQuote
| (EscapedBSlash)=> EscapedBSlash
| Octal
| Unicode
| ~('\\' | '\'' | '\r' | '\n')
)
SQuote {skip();}
;
fragment EscapedDQuote
: BSlash DQuote
;
fragment EscapedSQuote
: BSlash SQuote
;
fragment EscapedBSlash
: BSlash BSlash
;
fragment FSlash
: '/' | '\\' ('u002f' | 'u002F')
;
fragment Star
: '*' | '\\' ('u002a' | 'u002A')
;
fragment BSlash
: '\\' ('u005c' | 'u005C')?
;
fragment DQuote
: '"'
| '\\u0022'
;
fragment SQuote
: '\''
| '\\u0027'
;
fragment Unicode
: '\\u' Hex Hex Hex Hex
;
fragment Octal
: '\\' ('0'..'3' Oct Oct | Oct Oct | Oct)
;
fragment Hex
: '0'..'9' | 'a'..'f' | 'A'..'F'
;
fragment Oct
: '0'..'7'
;
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
JavaCommentLexer lexer = new JavaCommentLexer(new ANTLRFileStream("Test.java"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
for(Object o : tokens.getTokens()) {
CommonToken t = (CommonToken)o;
if(t.getType() == JavaCommentLexer.SingleLineComment) {
System.out.println("SingleLineComment :: " + t.getText().replace("\n", "\\n"));
}
if(t.getType() == JavaCommentLexer.MultiLineComment) {
System.out.println("MultiLineComment :: " + t.getText().replace("\n", "\\n"));
}
}
}
}
Test.java
\u002f\u002a <- multi line comment start
multi
line
comment // not a single line comment
\u002A/
public class Test {
// single line "not a string"
String s = "\u005C" \242 not // a comment \\\" \u002f \u005C\u005C \u0022;
/*
regular multi line comment
*/
char c = \u0027"'; // the " is not the start of a string
char q1 = '\u005c''; // == '\''
char q2 = '\u005c\u0027'; // == '\''
char q3 = \u0027\u005c\u0027\u0027; // == '\''
char c4 = '\047';
String t = "/*";
\u002f\u002f another single line comment
String u = "*/";
}
Now, to run the demo, do:
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp antlr-3.2.jar org.antlr.Tool JavaCommentLexer.g
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ javac -cp antlr-3.2.jar *.java
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp .:antlr-3.2.jar Main
and you'll see the following being printed to the console:
MultiLineComment :: \u002f\u002a <- multi line comment start\nmulti\nline\ncomment // not a single line comment\n\u002A/
SingleLineComment :: // single line "not a string"
SingleLineComment :: // a comment \\\" \u002f \u005C\u005C \u0022;
MultiLineComment :: /*\n regular multi line comment\n */
SingleLineComment :: // the " is not the start of a string
SingleLineComment :: // == '\''
SingleLineComment :: // == '\''
SingleLineComment :: // == '\''
SingleLineComment :: \u002f\u002f another single line comment
EDIT
You can create a sort of lexer with regex yourself, of course. The following demo does not handle Unicode literals inside source files, however:
Test2.java
/* <- multi line comment start
multi
line
comment // not a single line comment
*/
public class Test2 {
// single line "not a string"
String s = "\" \242 not // a comment \\\" ";
/*
regular multi line comment
*/
char c = '"'; // the " is not the start of a string
char q1 = '\''; // == '\''
char c4 = '\047';
String t = "/*";
// another single line comment
String u = "*/";
}
Main2.java
import java.util.*;
import java.io.*;
import java.util.regex.*;
public class Main2 {
private static String read(File file) throws IOException {
StringBuilder b = new StringBuilder();
Scanner scan = new Scanner(file);
while(scan.hasNextLine()) {
String line = scan.nextLine();
b.append(line).append('\n');
}
return b.toString();
}
public static void main(String[] args) throws Exception {
String contents = read(new File("Test2.java"));
String slComment = "//[^\r\n]*";
String mlComment = "/\\*[\\s\\S]*?\\*/";
String strLit = "\"(?:\\\\.|[^\\\\\"\r\n])*\"";
String chLit = "'(?:\\\\.|[^\\\\'\r\n])+'";
String any = "[\\s\\S]";
Pattern p = Pattern.compile(
String.format("(%s)|(%s)|%s|%s|%s", slComment, mlComment, strLit, chLit, any)
);
Matcher m = p.matcher(contents);
while(m.find()) {
String hit = m.group();
if(m.group(1) != null) {
System.out.println("SingleLine :: " + hit.replace("\n", "\\n"));
}
if(m.group(2) != null) {
System.out.println("MultiLine :: " + hit.replace("\n", "\\n"));
}
}
}
}
If you run Main2, the following is printed to the console:
MultiLine :: /* <- multi line comment start\nmulti\nline\ncomment // not a single line comment\n*/
SingleLine :: // single line "not a string"
MultiLine :: /*\n regular multi line comment\n */
SingleLine :: // the " is not the start of a string
SingleLine :: // == '\''
SingleLine :: // another single line comment
EDIT: I've been searching for a while, but here is the real working regex:
String regex = "((//[^\n\r]*)|(/\\*(.+?)\\*/))"; // New Regex
List<String> comments = new ArrayList<String>();
Pattern p = Pattern.compile(regex, Pattern.DOTALL);
Matcher m = p.matcher(code);
// code is the C-style code in which you want to search
while (m.find())
{
System.out.println(m.group(1));
comments.add(m.group(1));
}
With this input:
import Blah;
//Comment one//
line();
/* Blah */
line2(); // something weird
/* Multiline
another line for the comment
*/
It generates this output:
//Comment one//
/* Blah */
// something weird
/* Multiline
another line for the comment
*/
Notice that the last three lines of the output are one single print.
Have you tried regular expressions? Here is a nice wrap-up with a Java example. It might need some tweaking. However, regular expressions alone won't be sufficient for more complicated structures (nested comments, "comments" in strings), but they are a nice start.
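A small sketch of the "comments in strings" problem, assuming a comment-only pattern that knows nothing about string literals. The class name NaiveCommentFinder and the example.com URL are placeholders chosen for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NaiveCommentFinder {

    /** Finds "comments" with a pattern that is unaware of string and char literals. */
    static List<String> find(String code) {
        Pattern p = Pattern.compile("//[^\\n\\r]*|/\\*.*?\\*/", Pattern.DOTALL);
        Matcher m = p.matcher(code);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The // inside the URL is taken for the start of a comment.
        String code = "String url = \"http://example.com\"; // real";
        for (String hit : find(code)) {
            System.out.println(hit);
        }
    }
}
```

The single hit starts inside the string literal and swallows the real comment along with the rest of the line, which is why the lexer-style approaches also match string and char literals.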

A regular expression for harvesting include and require directives

I am trying to harvest all inclusion directives from a PHP file using a regular expression (in Java).
The expression should pick up only those which have file names expressed as unconcatenated string literals. Ones with constants or variables are not necessary.
Detection should work for both single and double quotes, include-s and require-s, plus the additional trickery with _once and last but not least, both keyword- and function-style invocations.
A rough input sample:
<?php
require('a.php');
require 'b.php';
require("c.php");
require "d.php";
include('e.php');
include 'f.php';
include("g.php");
include "h.php";
require_once('i.php');
require_once 'j.php';
require_once("k.php");
require_once "l.php";
include_once('m.php');
include_once 'n.php';
include_once("o.php");
include_once "p.php";
?>
And output:
["a.php","b.php","c.php","d.php","e.php","f.php","g.php","h.php","i.php","j.php","k.php","l.php","m.php","n.php","o.php","p.php"]
Any ideas?
Use token_get_all. It's safe and won't give you headaches.
There is also PEAR's PHP_Parser if you require userland code.
To do this accurately, you really need to fully parse the PHP source code. This is because the text sequence: require('a.php'); can appear in places where it is not really an include at all - such as in comments, strings and HTML markup. For example, the following are NOT real PHP includes, but will be matched by the regex:
<?php // Examples where a regex solution gets false positives:
/* PHP multi-line comment with: require('a.php'); */
// PHP single-line comment with: require('a.php');
$str = "double quoted string with: require('a.php');";
$str = 'single quoted string with: require("a.php");';
?>
<p>HTML paragraph with: require('a.php');</p>
That said, if you are happy with getting a few false positives, the following single regex solution will do a pretty good job of scraping all the filenames from all the PHP include variations:
// Get all filenames from PHP include variations and return in array.
function getIncludes($text) {
$count = preg_match_all('/
# Match PHP include variations with single string literal filename.
\b # Anchor to word boundary.
(?: # Group for include variation alternatives.
include # Either "include"
| require # or "require"
) # End group of include variation alternatives.
(?:_once)? # Either one may be the "once" variation.
\s* # Optional whitespace.
( # $1: Optional opening parentheses.
\( # Literal open parentheses,
\s* # followed by optional whitespace.
)? # End $1: Optional opening parentheses.
(?| # "Branch reset" group of filename alts.
\'([^\']+)\' # Either $2: Single quoted filename,
| "([^"]+)" # or $2: Double quoted filename.
) # End branch reset group of filename alts.
(?(1) # If there were opening parentheses,
\s* # then allow optional whitespace
\) # followed by the closing parentheses.
) # End group $1 if conditional.
\s* # End statement with optional whitespace
; # followed by semi-colon.
/ix', $text, $matches);
if ($count > 0) {
$filenames = $matches[2];
} else {
$filenames = array();
}
return $filenames;
}
Addendum (2011-07-24): It turns out the OP wants a solution in Java, not PHP. Here is a tested Java program which is nearly identical. Note that I am not a Java expert and don't know how to dynamically size an array. Thus, the solution below (crudely) uses a fixed-size array (100 entries) to hold the filenames.
import java.util.regex.*;
public class TEST {
// Set maximum size of array of filenames.
public static final int MAX_NAMES = 100;
// Get all filenames from PHP include variations and return in array.
public static String[] getIncludes(String text)
{
int count = 0; // Count of filenames.
String filenames[] = new String[MAX_NAMES];
String filename;
Pattern p = Pattern.compile(
"# Match include variations with single string filename. \n" +
"\\b # Anchor to word boundary. \n" +
"(?: # Group include variation alternatives. \n" +
" include # Either 'include', \n" +
"| require # or 'require'. \n" +
") # End group of include variation alts. \n" +
"(?:_once)? # Either one may have '_once' suffix. \n" +
"\\s* # Optional whitespace. \n" +
"(?: # Group for optional opening paren. \n" +
" \\( # Literal open parentheses, \n" +
" \\s* # followed by optional whitespace. \n" +
")? # Opening parentheses are optional. \n" +
"(?: # Group for filename alternatives. \n" +
" '([^']+)' # $1: Either a single quoted filename, \n" +
"| \"([^\"]+)\" # or $2: a double quoted filename. \n" +
") # End group of filename alternatives. \n" +
"(?: # Group for optional closing paren. \n" +
" \\s* # Optional whitespace, \n" +
" \\) # followed by the closing parentheses. \n" +
")? # Closing parentheses are optional. \n" +
"\\s* # End statement with optional ws, \n" +
"; # followed by a semi-colon. ",
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);
Matcher m = p.matcher(text);
while (m.find() && count < MAX_NAMES) {
// The filename is in either $1 or $2
if (m.group(1) != null) filename = m.group(1);
else filename = m.group(2);
// Add this filename to array of filenames.
filenames[count++] = filename;
}
return filenames;
}
public static void main(String[] args)
{
// Test string full of various PHP include statements.
String text = "<?php\n"+
"\n"+
"require('a.php');\n"+
"require 'b.php';\n"+
"require(\"c.php\");\n"+
"require \"d.php\";\n"+
"\n"+
"include('e.php');\n"+
"include 'f.php';\n"+
"include(\"g.php\");\n"+
"include \"h.php\";\n"+
"\n"+
"require_once('i.php');\n"+
"require_once 'j.php';\n"+
"require_once(\"k.php\");\n"+
"require_once \"l.php\";\n"+
"\n"+
"include_once('m.php');\n"+
"include_once 'n.php';\n"+
"include_once(\"o.php\");\n"+
"include_once \"p.php\";\n"+
"\n"+
"?>\n";
String filenames[] = getIncludes(text);
for (int i = 0; i < MAX_NAMES && filenames[i] != null; i++) {
System.out.print(filenames[i] +"\n");
}
}
}
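On the point about dynamically sized arrays: in Java the usual tool is java.util.ArrayList, which grows as elements are added. A minimal sketch of the same idea with a List<String> follows. The class name IncludeHarvester is made up, and the pattern is a compacted form of the commented one above without Pattern.COMMENTS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IncludeHarvester {

    private static final Pattern INCLUDE = Pattern.compile(
        "\\b(?:include|require)(?:_once)?\\s*(?:\\(\\s*)?"  // keyword, optional opening paren
        + "(?:'([^']+)'|\"([^\"]+)\")"                      // group 1 or 2: the quoted filename
        + "(?:\\s*\\))?\\s*;",                              // optional closing paren, then ';'
        Pattern.CASE_INSENSITIVE);

    /** Returns every filename found; the list grows as needed, so no fixed maximum is required. */
    static List<String> getIncludes(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = INCLUDE.matcher(text);
        while (m.find()) {
            names.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        String text = "<?php\nrequire('a.php');\ninclude_once \"p.php\";\n?>\n";
        System.out.println(getIncludes(text));
    }
}
```

The same caveat about false positives in comments, strings and HTML applies to this version.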
/(?:require|include)(?:_once)?[( ]['"](.*)\.php['"]\)?;/
Should work for all cases you've specified, and captures only the filename without the extension
Test script:
<?php
$text = <<<EOT
require('a.php');
require 'b.php';
require("c.php");
require "d.php";
include('e.php');
include 'f.php';
include("g.php");
include "h.php";
require_once('i.php');
require_once 'j.php';
require_once("k.php");
require_once "l.php";
include_once('m.php');
include_once 'n.php';
include_once("o.php");
include_once "p.php";
EOT;
$re = '/(?:require|include)(?:_once)?[( ][\'"](.*)\.php[\'"]\)?;/';
$result = array();
preg_match_all($re, $text, $result);
var_dump($result);
To get the filenames like you wanted, read $result[1]
I should probably point out that I too am partial to cweiske's answer, and that unless you really just want an exercise in regular expressions (or want to do this, say, using grep), then you should use the tokenizer.
The following should work pretty well:
/^(require|include)(_once)?(\(|\s+)("|')(.*?)("|')(\)|\s*);$/m
You'll want the fifth captured group.
This works for me:
preg_match_all('/\b(require|include|require_once|include_once)\b(\(| )(\'|")(.+)\.php(\'|")\)?;/i', $subject, $result, PREG_PATTERN_ORDER);
$result = $result[4];

RegEx To Ignore Text Between Quotes

I have a Regex, which is [\\.|\\;|\\?|\\!][\\s]
This is used to split a string. But I don't want it to split . ; ? ! if it is in quotes.
I'd not use split but Pattern & Matcher instead.
A demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String text = "start. \"in quotes!\"; foo? \"more \\\" words\"; bar";
String simpleToken = "[^.;?!\\s\"]+";
String quotedToken =
"(?x) # enable inline comments and ignore white spaces in the regex \n" +
"\" # match a double quote \n" +
"( # open group 1 \n" +
" \\\\. # match a backslash followed by any char (other than line breaks) \n" +
" | # OR \n" +
" [^\\\\\r\n\"] # any character other than a backslash, line breaks or double quote \n" +
") # close group 1 \n" +
"* # repeat group 1 zero or more times \n" +
"\" # match a double quote \n";
String regex = quotedToken + "|" + simpleToken;
Matcher m = Pattern.compile(regex).matcher(text);
while(m.find()) {
System.out.println("> " + m.group());
}
}
}
which produces:
> start
> "in quotes!"
> foo
> "more \" words"
> bar
As you can see, it can also handle escaped quotes inside quoted tokens.
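If you want the pieces between the delimiters rather than individual tokens, which is closer to what split would give you, the same trick applies: match quoted strings and delimiters in one pattern and only cut at the delimiters. The sketch below assumes the delimiter is one of . ; ? ! followed by a whitespace character, as in the question. QuoteAwareSplitter is a hypothetical name, and the result is not guaranteed to match String.split in every edge case (trailing empty strings, for instance).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteAwareSplitter {

    // Either a whole quoted string (consumed, never cut) or a delimiter captured in group 1.
    private static final Pattern P = Pattern.compile(
        "\"(?:\\\\.|[^\"\\\\])*\"|([.;?!])\\s");

    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = P.matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.group(1) != null) {  // a delimiter outside quotes
                parts.add(text.substring(start, m.start()));
                start = m.end();
            }
            // Otherwise a quoted string was matched; it stays part of the current piece.
        }
        parts.add(text.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        String text = "start. \"in quotes! really\"; foo? \"more \\\" words\"; bar";
        for (String part : split(text)) {
            System.out.println("> " + part);
        }
    }
}
```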
Here is what I do in order to ignore quotes in matches.
(?:[^\"\']|(?:\".*?\")|(?:\'.*?\'))*? # <-- append the query you wanted to search for - don't use something greedy like .* in the rest of your regex.
To adapt this for your regex, you could do
(?:[^\"\']|(?:\".*?\")|(?:\'.*?\'))*?[.;?!]\s*
