How can I identify references to Java classes using Perl?

I'm writing a Perl script and I've come to a point where I need to parse a Java source file line by line checking for references to a fully qualified Java class name. I know the class I'm looking for up front; also the fully qualified name of the source file that is being searched (based on its path).
For example find all valid references to inside the com/bob/is/ file.
At this moment the cases I can think of that it needs to account for are:
The file being parsed is in the same package as the search class.
find references in foo/bar/
It should ignore comments.
// this is a comment saying this method returns a or Baz instance
// it shouldn't count
/* a multiline comment as well
this shouldn't count
if I put or Baz in here either */
In-line fully qualified references. fb = new;
References based off an import statement.
Baz b = new Baz();
What would be the most efficient way to do this in Perl 5.8? Some fancy regex perhaps?
open F, $File::Find::name or die;
# these three things are already known
# $classToFind looking for references of this class
# $pkgToFind the package of the class you're finding references of
# $currentPkg package name of the file being parsed
# ... do work here
close F;
# the results are available here in some form

You also need to skip quoted strings (you can't even skip comments correctly if you don't also deal with quoted strings).
I'd probably write a fairly simple, efficient, and incomplete tokenizer very similar to the one I wrote in node 566467.
Based on that code I'd probably just dig through the non-comment/non-string chunks looking for \bimport\b and \b\Q$toFind\E\b matches. Perhaps similar to:
if( m[
| /(?![/*])
) {
my $code = substr( $_, $-[0], $+[0] - $-[0] );
my $imported = 0;
while( $code =~ /\b(import\s+)?\Q$package\E\b/g ) {
if( $1 ) {
... # Found importing of package
while( $code =~ /\b\Q$class\E\b/g ) {
... # Found mention of imported class
... # Found a package reference
} elsif( m[ \G ' (?: [^'\\]+ | \\. )* ' ]xgc
|| m[ \G " (?: [^"\\]+ | \\. )* " ]xgc
) {
# skip quoted strings
} elsif( m[\G//.*]gc ) {
# skip C++ comments

A Regex is probably the best solution for this, although I did find the following module in CPAN that you might be able to use
Java::JVM::Classfile - Parses compiled class files and returns info about them. You would have to compile the files before you could use this.
Also, remember that it can be tricky to catch all possible variants of a multi-line comment with a regex.

This is really just a straight grep for Baz (or for /(| )Baz/ if you're concerned about false positives from some.other.Baz), but ignoring comments, isn't it?
If so, I'd knock together a state engine to track whether you're in a multiline comment or not. The regexes needed aren't anything special. Something along the lines of (untested code):
my $in_comment;
my %matches;
my $line_num = 0;
my $full_target = '';
my $short_target = (split /\./, $full_target)[-1]; # segment after last . (Baz)
while (my $line = <F>) {
    $line_num++;
    if ($in_comment) {
        next unless $line =~ m|\*/|; # ignore line unless it ends the comment
        $line =~ s|.*\*/||; # delete everything prior to end of comment
        $in_comment = 0; # the comment has ended on this line
    } elsif ($line =~ m|/\*|) {
        if ($line =~ m|\*/|) { # catch /* and */ on same line
            $line =~ s|/\*.*\*/||;
        } else {
            $in_comment = 1;
            $line =~ s|/\*.*||; # clear from start of comment to end of line
        }
    }
    $line =~ s|//.*||; # remove single-line comments
    $matches{$line_num} = $line if $line =~ /$full_target| $short_target/;
}
for my $key (sort { $a <=> $b } keys %matches) { # numeric sort by line number
    print $key, ': ', $matches{$key}, "\n";
}
It's not perfect and the in/out of comment state can be messed up by nested multiline comments or if there are multiple multiline comments on the same line, but that's probably good enough for most real-world cases.
To do it without the state engine, you'd need to slurp into a single string, delete the /* ... */ comments, and split it back into separate lines, and grep those for non-//-comment hits. But you wouldn't be able to include line numbers in the output that way.

This is what I came up with; it works for all the different cases I've thrown at it. I'm still a Perl noob and it's probably not the fastest thing in the world, but it should work for what I need. Thanks for all the answers; they helped me look at it in different ways.
my $className = 'Baz';
my $searchPkg = '';
my (@potentialRefs, @confirmedRefs);
my $samePkg = 0;
my $imported = 0;
my $currentPkg = 'com.bob';
$currentPkg =~ s/\//\./g;
if($currentPkg eq $searchPkg){
$samePkg = 1;
my $inMultiLineComment = 0;
open F, $_ or die;
my $lineNum = 0;
s|^.*?\*/||; #get rid of the closing part of the multiline comment we're in
$inMultiLineComment = 0;
if(length($_) > 0){
s|"([^"\\]*(\\.[^"\\]*)*)"||g; #remove strings first since java cannot have multiline string literals
s|/\*.*?\*/||g; #remove any multiline comments that start and end on the same line
s|//.*$||; #remove the // comments from what's left
if (m|/\*.*$|){
$inMultiLineComment = 1; # any remaining /* means at least part of the next line is inside the multiline comment
next; #no sense continuing to process a blank string
if (/^\s*(import )?($searchPkg)?(.*)?\b$className\b/){
if($imported || $samePkg){
push(@confirmedRefs, $lineNum);
}else {
push(@potentialRefs, $lineNum);
$imported = 1;
} elsif($2){
push(@confirmedRefs, $lineNum);
close F;
for (@confirmedRefs){
print "$_\n";

If you are feeling adventurous enough you could have a look at Parse::RecDescent.


Regular expression for parsing JSON arrays in Java [duplicate]

Is it possible to write a regular expression that matches a nested pattern that occurs an unknown number of times? For example, can a regular expression match an opening and closing brace when there are an unknown number of open/close braces nested within the outer braces?
For example:
public MyMethod()
if (test)
// More { }
// More { }
} // End
Should match:
if (test)
// More { }
// More { }
No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.
You can match nested/paired elements up to a fixed depth, where the depth is only limited by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e. a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: O(n^3) vs. O(n), with n = length(input).
There are many parser generators available, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.
For more background: Automata Theory at Wikipedia
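As a rough illustration of the "memory" a finite automaton lacks, here is a minimal Java sketch that tracks nesting with a depth counter, which is the degenerate form of a push-down automaton's stack when there is only one kind of bracket. The class and method names are hypothetical, and the sketch assumes the braces are not hidden inside string literals or comments.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds balanced {...} blocks with a depth counter. */
final class BraceMatcher {

    /** Returns every outermost balanced {...} block found in the input. */
    static List<String> outermostBlocks(String input) {
        List<String> blocks = new ArrayList<>();
        int depth = 0;   // current nesting level; stands in for the stack
        int start = -1;  // index of the brace that opened the current outermost block
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '{') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) {
                    blocks.add(input.substring(start, i + 1));
                }
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println(outermostBlocks("f() { if (t) { a; } { b; } } g() { }"));
    }
}
```

The counter can grow without bound, which is exactly what a fixed set of automaton states cannot do. Unbalanced closing braces are simply ignored here; a real parser would report them.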
Using regular expressions to check for nested patterns is very easy.
Probably working Perl solution, if the string is on one line:
my $NesteD ;
$NesteD = qr/ \{( [^{}] | (??{ $NesteD }) )* \} /x ;
if ( $Stringy =~ m/\b( \w+$NesteD )/x ) {
print "Found: $1\n" ;
And one more thing by Torsten Marek (who had pointed out correctly, that it's not a regex anymore):
The Pumping lemma for regular languages is the reason why you can't do that.
The generated automaton will have a finite number of states, say k, so a string of k+1 opening braces is bound to have a state repeated somewhere (as the automaton processes the characters). The part of the string between the same state can be duplicated infinitely many times and the automaton will not know the difference.
In particular, if it accepts k+1 opening braces followed by k+1 closing braces (which it should) it will also accept the pumped number of opening braces followed by unchanged k+1 closing braces (which it shouldn't).
Yes, if it is the .NET regex engine. The .NET engine supports a finite state machine supplemented with an external stack (balancing groups).
Proper Regular expressions would not be able to do it as you would leave the realm of Regular Languages to land in the Context Free Languages territories.
Nevertheless the "regular expression" packages that many languages offer are strictly more powerful.
For example, Lua regular expressions have the "%b()" recognizer that will match balanced parentheses. In your case you would use "%b{}"
Another sophisticated tool similar to sed is gema, where you will match balanced curly braces very easily with {#}.
So, depending on the tools you have at your disposal your "regular expression" (in a broader sense) may be able to match nested parentheses.
...assuming that there is some maximum number of nestings you'd be happy to stop at.
Let me explain.
@torsten-marek is right that a regular expression cannot check for nested patterns like this, BUT it is possible to define a nested regex pattern which will allow you to capture nested structures like this up to some maximum depth. I created one to capture EBNF-style comments, like:
(* This is a comment (* this is nested inside (* another level! *) hey *) yo *)
The regex (for single-depth comments) is the following:
m{1} = \(+\*+(?:[^*(]|(?:\*+[^)*])|(?:\(+[^*(]))*\*+\)+
This could easily be adapted for your purposes by replacing the \(+\*+ and \*+\)+ with { and } and replacing everything in between with a simple [^{}]:
p{1} = \{(?:[^{}])*\}
To nest, just allow this pattern within the block itself:
p{2} = \{(?:(?:p{1})|(?:[^{}]))*\}
p{2} = \{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\}
To find triple-nested blocks, use:
p{3} = \{(?:(?:p{2})|(?:[^{}]))*\}
p{3} = \{(?:(?:\{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\})|(?:[^{}]))*\}
A clear pattern has emerged. To find comments nested to a depth of N, simply use the regex:
p{N} = \{(?:(?:p{N-1})|(?:[^{}]))*\}
where N > 1 and
p{1} = \{(?:[^{}])*\}
A script could be written to recursively generate these regexes, but that's beyond the scope of what I need this for. (This is left as an exercise for the reader. 😉)
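For what it's worth, generating these patterns takes only a few lines. The following is a minimal Java sketch, not the answerer's script: the class name and method are hypothetical, and it builds p{N} with a loop instead of recursion, which yields the same string.

```java
import java.util.regex.Pattern;

/** Hypothetical generator for the depth-limited brace pattern p{N}. */
final class NestedBraceRegex {

    /** Builds p{depth}: matches {...} blocks nested at most `depth` levels deep. */
    static String build(int depth) {
        if (depth < 1) {
            throw new IllegalArgumentException("depth must be >= 1");
        }
        String pattern = "\\{(?:[^{}])*\\}";  // p{1}
        for (int level = 2; level <= depth; level++) {
            // p{N} = \{(?:(?:p{N-1})|(?:[^{}]))*\}
            pattern = "\\{(?:(?:" + pattern + ")|(?:[^{}]))*\\}";
        }
        return pattern;
    }

    public static void main(String[] args) {
        Pattern p2 = Pattern.compile(build(2));
        System.out.println(p2.matcher("{a{b}c}").matches());    // two levels of nesting
        System.out.println(p2.matcher("{a{b{c}}}").matches());  // three levels of nesting
    }
}
```

The pattern roughly triples in length with each level, so this is only practical when the maximum depth is small and known in advance.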
Using recursive matching in the PHP regex engine is massively faster than procedural matching of brackets, especially with longer strings.
$patt = '!\( (?: (?: (?>[^()]+) | (?R) )* ) \)!x';
preg_match_all( $patt, $str, $m );
matchBrackets( $str );
function matchBrackets ( $str, $offset = 0 ) {
    $matches = array();
    list( $opener, $closer ) = array( '(', ')' );
    // Return early if there's no match
    if ( false === ( $first_offset = strpos( $str, $opener, $offset ) ) ) {
        return $matches;
    }
    // Step through the string one character at a time storing offsets
    $paren_score = -1;
    $inside_paren = false;
    $match_start = 0;
    $offsets = array();
    for ( $index = $first_offset; $index < strlen( $str ); $index++ ) {
        $char = $str[ $index ];
        if ( $opener === $char ) {
            if ( ! $inside_paren ) {
                $paren_score = 1;
                $match_start = $index;
            }
            else {
                $paren_score++;
            }
            $inside_paren = true;
        }
        elseif ( $closer === $char ) {
            $paren_score--;
        }
        // A score of zero means the outermost parenthesis has just closed
        if ( 0 === $paren_score ) {
            $inside_paren = false;
            $paren_score = -1;
            $offsets[] = array( $match_start, $index + 1 );
        }
    }
    while ( $offset = array_shift( $offsets ) ) {
        list( $start, $finish ) = $offset;
        $match = substr( $str, $start, $finish - $start );
        $matches[] = $match;
    }
    return $matches;
}
As zsolt mentioned, some regex engines support recursion. Of course, these are typically the ones that use a backtracking algorithm, so it won't be particularly efficient. Example: /(?>[^{}]*){(?>[^{}]*)(?R)*(?>[^{}]*)}/sm
No, you are getting into the realm of Context Free Grammars at that point.
This seems to work: /(\{(?:\{.*\}|[^\{])*\})/m

How to remove comments from a given String in Java?

How do I remove comments that start with "//" and with "/**", "/*", etc.? I haven't found any solutions on Stack Overflow that have helped me very much; a lot of them have been way above my head, and I'm still at the very basics.
What I have thought about so far:
for (int i = 0; i < length; i++) {
for (j = i; j < length; j++) {
if (obj.charAt(j) == '/' && obj.charAt(j + 1) == '/')
But I'm not really sure how to replace the words following those characters, or how to know when to stop the replacement for a "//" comment. With the /* comments, at least conceptually, I know I should replace all words until "*/" pops up, though again I'm not sure how to limit the replacement to that point. I thought of replacing each charAt after the second "/" with an empty string until... where? I cannot figure out where to "end" the replacing.
I have looked at a few implementations on Stack, but I really didn't get it. Any help is appreciated, especially if it's at a basic level and understandable for someone not well versed in programming!
I have done something similar with regex (Java 9+):
// Checks for
// 1) Single char literal '"'
// 2) Single char literal '\"'
// 3) Strings; termination ignores \", \\\", ..., but allows ", \\", \\\\", ...
// 4) Single-line comment // ... to first \n
// 5) Multi-line comments /*[*] ... */
Pattern regex = Pattern.compile(
// Assuming 'text' contains your java text
// Leaves 1,2,3) unchanged and replaces comments 4,5) with ""
// Need quoteReplacement to prevent matcher processing $ and \
String textWithoutComments = regex.matcher(text).replaceAll(
m -> == '/' ? "" : Matcher.quoteReplacement(;
If you don't have Java 9+ then you could use this replace function:
String textWithoutComments = replaceAll(regex, text,
m -> == '/' ? "" :;
public static String replaceAll(Pattern p, String s,
Function<MatchResult, String> replacer) {
Matcher m = p.matcher(s);
StringBuilder b = new StringBuilder();
int lastStart = 0;
while (m.find()) {
String replacement = replacer.apply(m);
b.append(s.substring(lastStart, m.start())).append(replacement);
lastStart = m.end();
return b.append(s.substring(lastStart)).toString();
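If a regex feels opaque, the character-by-character loop the question started with can also be finished as a small state machine. This is a minimal sketch and not a full Java lexer: the class name CommentStripper and the state names are hypothetical, and it assumes ordinary string and char literals only (no Java 15+ text blocks, no unicode escapes).

```java
/** Hypothetical character-by-character comment stripper (a sketch, not a full Java lexer). */
final class CommentStripper {

    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    /** Returns src with // and slash-star comments removed; literals are copied untouched. */
    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') { state = State.LINE_COMMENT; i += 2; continue; }
                    if (c == '/' && next == '*') { state = State.BLOCK_COMMENT; i += 2; continue; }
                    if (c == '"') state = State.STRING;
                    else if (c == '\'') state = State.CHAR;
                    out.append(c);
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    if (c == '\\' && i + 1 < src.length()) {
                        out.append(next);  // copy the escaped character verbatim
                        i += 2;
                        continue;
                    }
                    if ((state == State.STRING && c == '"') || (state == State.CHAR && c == '\'')) {
                        state = State.CODE;  // closing quote reached
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n') { state = State.CODE; out.append(c); }  // keep the line break
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') { state = State.CODE; i += 2; continue; }
                    break;
            }
            i++;
        }
        return out.toString();
    }
}
```

The "where do I stop?" question is answered by the state: a // comment ends at the next newline, a /* comment ends at the next */, and nothing inside quotes is ever treated as a comment start.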
I'm not sure if you're using an IDE like IntelliJ or Eclipse but you could do this without using code if you're just interested in removing all comments for the project. You can do this with "Replace in Path" tool. Notice how "Regex" is checked, allowing us to match lines based on regular expressions.
This configuration in the tool will delete all lines starting with a // and replace it with an empty line.
The command to get to this on a Mac is ctrl + shift + r.

Check string using regex

I have the following string:
String s = "http://www.[VP_ANY].com:8080/servlet/[VP_ALL]";
I need to check if this string has the words [VP_ANY] or [VP_ALL]. I tried something like this (and many combinations), but it doesn't work:
What am I doing wrong?
I tried the following:
s = "www.[VP_ANY].com:8080/servlet/[VP_ALL]";
System.out.println(s.replaceAll("\[VP_ANY\]", "A"));
The first 'System.out' returns false, and the second one returns the replacement correctly.
I'm escaping the "[" and "]" characters with 2 backslashes, but when I save the post just one is shown. But I'm using 2 ...
String s = "http://www.[VP_ANY].com:8080/servlet/[VP_ALL]";
^^ ^^ ^^ ^
Your regex will not work because there is no word boundary between . and [, between ] and ., or between / and [.
Additionally, I think your escaping is wrong: the word boundaries need one more backslash and the others two fewer.
So, since the word boundaries are not working, you should be fine with
Try this one
try {
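boolean foundMatch = Pattern.compile("(?i)\\bVP_(?:ANY|ALL)\\b").matcher(subjectString).find(); // find() searches for a substring; matches() would require the whole string to match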
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
or this
try {
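boolean foundMatch = Pattern.compile("(?i)\\[VP_(?:ANY|ALL)\\]").matcher(subjectString).find(); // find() searches for a substring; matches() would require the whole string to match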
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
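One point worth spelling out: in Java, String.matches (like Matcher.matches) only returns true when the entire string matches the pattern, while Matcher.find looks for the pattern anywhere inside the string. The following minimal sketch contrasts the two on the URL from the question; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical demo contrasting String.matches with Matcher.find. */
final class PlaceholderCheck {

    // Literal [VP_ANY] or [VP_ALL]; the brackets are escaped so they are not a character class.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[VP_(?:ANY|ALL)\\]");

    /** True if the input contains at least one placeholder anywhere. */
    static boolean containsPlaceholder(String s) {
        return PLACEHOLDER.matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "http://www.[VP_ANY].com:8080/servlet/[VP_ALL]";
        System.out.println(s.matches("\\[VP_(?:ANY|ALL)\\]"));  // whole-string match
        System.out.println(containsPlaceholder(s));              // substring search
    }
}
```

Compiling the pattern once into a constant also avoids recompiling it on every call, which String.matches does internally.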
Try This
My go at Java
try {
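boolean foundMatch = Pattern.compile("\\[VP_ANY\\]|\\[VP_ALL\\]").matcher("www.[VP_ANY].com:8080/servlet/[VP_ALL]").find(); // find() searches for a substring; matches() would require the whole string to match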
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
"http://www.[VP_ANY].com:8080/servlet/[VP_ALL]".replaceAll ("http://www.(\\[VP_ANY\\]).com:8080/servlet/(\\[VP_ALL\\])", "$1:$2")
res117: java.lang.String = [VP_ANY]:[VP_ALL]
If you're looking for a literal [, you have to mask it - else it will mean a group like [A-Z].
Now if you read the regex from a file or a JTextField at runtime, that's all. But if you write it to your source code, the compiler will see the \ and treat it as a general masking, which might be needed to mask quotes like in
char apo = '\'';
String quote = "He said: \"Whut?\"";
So you have to mask it again, because only "\\" means "\".
So during development, to avoid confusion, it is a good idea to have a simple GUI app with two or three text fields for testing regexps. Once a pattern works, you only have to add the second level of masking; while developing it, you can leave that level out.
Divide et impera, as the ancient Roman programmers told us.

ANTLR generating empty conditions

I'm trying to learn to use ANTLR, but I cannot figure out what's wrong with my code in this case. I hope this will be really easy for anyone with some experience with it. This is the grammar (really short).
grammar SmallTest;
@header {
package parseTest;
import java.util.ArrayList;
}
prog returns [ArrayList<ArrayList<String>> all]
:(stat { if ($all == null)
$all = new ArrayList<ArrayList<String>>();
} )+
stat returns [ArrayList<String> res]
:(element { if ($res == null)
$res = new ArrayList<String>();
element: ('a'..'z'|'A'..'Z')+ ;
NEWLINE:'\r'? '\n' ;
The problem is that when I generate the Java code there are some empty if conditions, and the compiler displays an error because of that, I could edit that manually, but that would probably be much worse. I guess something is wrong in this.
Sorry for asking, this has to be really stupid, but my example is so similar to those in the site that I cannot imagine a way to atomize the differences any more.
Thank you very much.
You should put the initialization of your lists inside the @init { ... } block of the rules, which gets executed before anything in the rule is matched.
Also, your element rule should not be a parser rule, but a lexer rule instead (it should start with a capital!).
And the entry point of your parser, the prog rule, should end with the EOF token; otherwise the parser might stop before all tokens are handled properly.
Finally, the @header { ... } section only applies to the parser (it is a short-hand for @parser::header { ... }), so you need to add the package declaration to the lexer as well.
A working demo:
grammar SmallTest;
@header {
  package parseTest;
  import java.util.ArrayList;
}
@lexer::header {
  package parseTest;
}
prog returns [ArrayList<ArrayList<String>> all]
@init {$all = new ArrayList<ArrayList<String>>();}
  : (stat {$all.add($stat.res);})+ EOF
  ;
stat returns [ArrayList<String> res]
@init {$res = new ArrayList<String>();}
  : (ELEMENT {$res.add($ELEMENT.text);})* NEWLINE
  ;
ELEMENT : ('a'..'z'|'A'..'Z')+ ;
NEWLINE : '\r'? '\n' ;
SPACE : ' ' {skip();};
package parseTest;
import org.antlr.runtime.*;
public class Main {
    public static void main(String[] args) throws Exception {
        SmallTestLexer lexer = new SmallTestLexer(new ANTLRStringStream("a bb ccc\ndddd eeeee\n"));
        SmallTestParser parser = new SmallTestParser(new CommonTokenStream(lexer));
        // print the nested list built by the prog rule
        System.out.println(parser.prog());
    }
}
And to run it all, do:
java -cp antlr-3.3.jar org.antlr.Tool parseTest/SmallTest.g
javac -cp .:antlr-3.3.jar parseTest/*.java
java -cp .:antlr-3.3.jar parseTest.Main
which yields:
[[a, bb, ccc], [dddd, eeeee]]
Try converting element into a token
ELEMENT: ('a'..'z'|'A'..'Z')+ ;

Regex to find commas that aren't inside "( and )"

I need some help to model this regular expression. I think it'll be easier with an example. I need a regular expression that matches a comma, but only if it's not inside this structure: "( )", like this:
Then the first five and the last four commas should match the expression, the two between xyz and inside the ( ) section shouldn't.
I tried a lot of combinations, but regular expressions are still a little foggy for me.
I want to use it with the split method in Java. The example is short, but it can be much longer and have more than one section between "( and )". The split method receives an expression, and any text that matches the expression (in this case the comma) is treated as the separator.
So, I want to do something like this:
String keys[] = row.split(expr);
System.out.println(keys[0]); // print a
System.out.println(keys[1]); // print b
System.out.println(keys[2]); // print c
System.out.println(keys[3]); // print d
System.out.println(keys[4]); // print "("x","y",z)"
System.out.println(keys[5]); // print e
System.out.println(keys[6]); // print f
System.out.println(keys[7]); // print g
You can do this with a negative lookahead. Here's a slightly simplified problem to illustrate the idea:
String text = "a;b;c;d;<x;y;z>;e;f;g;<p;q;r;s>;h;i;j";
String[] parts = text.split(";(?![^<>]*>)");
// _ _ _ _ _______ _ _ _ _________ _ _ _
// [a, b, c, d, <x;y;z>, e, f, g, <p;q;r;s>, h, i, j]
Note that instead of ,, the delimiter is now ;, and instead of "( and "), the parentheses are simply < and >, but the idea still works.
On the pattern
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
The * repetition specifier can be used to match "zero-or-more times" of the preceding pattern.
The (?!…) is a negative lookahead; it can be used to assert that a certain pattern DOES NOT match, looking ahead (i.e. to the right) of the current position.
The pattern [^<>]*> matches a sequence (possibly empty) of everything except parentheses, finally followed by a parenthesis which is of the closing type.
Putting all of the above together, we get ;(?![^<>]*>), which matches a ;, but only if we can't see the closing parenthesis as the first parenthesis to its right, because witnessing such phenomenon would only mean that the ; is "inside" the parentheses.
This technique, with some modifications, can be adapted to the original problem. Remember to escape regex metacharacters ( and ) as necessary, and of course " as well as \ in a Java string literal must be escaped by preceding with a \.
You can also make the * possessive to try to improve performance, i.e. ;(?![^<>]*+>).
References: Character class, Repetition, Lookarounds, Possessive
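Carried back to commas and parentheses, the same lookahead might look like the sketch below. It assumes plain, non-nested ( and ) as the delimiters (the question's exact "( ... )" quoting is not reproduced here), and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical demo: the same negative-lookahead idea with commas and parentheses. */
final class LookaheadSplit {

    /** Splits on commas whose next parenthesis to the right, if any, is not a closing one. */
    static String[] splitOutsideParens(String row) {
        // \\) in the Java literal becomes \) in the regex, i.e. a literal closing parenthesis
        return row.split(",(?![^()]*\\))");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOutsideParens("a,b,(x,y,z),e,f")));
    }
}
```

Inside a character class such as [^()] the parentheses need no escaping; only the standalone closing parenthesis does.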
Try this one:
It worked for me in my testing, grabbed all commas not inside parenthesis.
Edit: Gopi pointed out the need to escape the slashes in Java:
Edit: Alan Moore pointed out some unnecessary complexity. Fixed.
If the parens are paired correctly and cannot be nested, you can split the text first at parens, then process the chunks.
List<String> result = new ArrayList<String>();
String[] chunks = text.split("[()]");
for (int i = 0; i < chunks.length; i++) {
if ((i % 2) == 0) {
String[] atoms = chunks[i].split(",");
for (int j = 0; j < atoms.length; j++)
After some tests I found an answer that does what I need for now. At the moment, all items inside the "( ... )" block are inside "" too, as in "("a", "b", "c")", so the regex ((?<!\"),)|(,(?!\")) works great for what I want!
But I'm still looking for one that can find the commas even if there are no "" around the inner terms.
Thanks for the help, guys.
This should do what you want:
I didn't check in Java but if you test it with
the groups $1 and $2 contain the right values, so they match and you should get what you want.
This gets a little trickier if you have values more complex than a-z between the commas.
If I understand split correctly, don't use it; just fill your array with the backreference $0, which holds the values you are looking for.
Maybe a match function would be a better way, and working with the values directly is better, because you get a really simple regexp. The other solutions I see so far are very good, no question about that, but they are really complicated, and in two weeks you won't remember what the regexp even did exactly.
Inverting the problem often makes it simpler.
I had the same issue. I chose Adam Schmideg's answer and improved it.
I had to deal with these 3 strings, for example:
France (Grenoble, Lyon), Germany (Berlin, Munich)
Italy, Suede, Belgium, Portugal
France, Italy (Torino), Spain (Bercelona, Madrid), Austria
The idea was to have :
France (Grenoble, Lyon)
or Germany (Berlin, Munich)
Italy, Suede, Belgium, Portugal
France, Italy (Torino), Spain (Bercelona, Madrid), Austria
I chose not to rely on a single complex regex because I wanted to be 100% sure of what I was doing and that it would work in every case.
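String[] chunks = input.split("[()]");
for (int i = 0; i < chunks.length; i++) {
    if ((i % 2) != 0) {
        // odd chunks were inside parentheses: hide their commas from the later split
        chunks[i] = "(" + chunks[i].replaceAll(",", ";") + ")";
    }
}
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < chunks.length; i++) {
    buffer.append(chunks[i]);
}
String s = buffer.toString();
// each element of output still carries ';' in place of the protected commas
String[] output = s.split(",");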
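An alternative that avoids the temporary ; substitution is to walk the string once and keep a parenthesis depth counter, splitting only on commas seen at depth zero. The sketch below is a hypothetical helper, not part of the answer above; it trims each piece and, unlike the split-at-parens approach, also tolerates nested parentheses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits on commas that are outside any parentheses. */
final class TopLevelSplitter {

    static List<String> splitTopLevel(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;  // how many unclosed '(' we are currently inside
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
            }
            if (c == ',' && depth == 0) {
                parts.add(current.toString().trim());  // top-level comma: close the current piece
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString().trim());  // last piece after the final comma
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitTopLevel("France (Grenoble, Lyon), Germany (Berlin, Munich)"));
    }
}
```

Because no characters are rewritten, there is nothing to convert back afterwards, and an input that already contains ; is left alone.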

