Regular expression for parsing JSON arrays in Java [duplicate]

This question already has answers here:
Regular expression to match balanced parentheses
(21 answers)
Closed 3 years ago.
Is it possible to write a regular expression that matches a nested pattern that occurs an unknown number of times? For example, can a regular expression match an opening and closing brace when there are an unknown number of open/close braces nested within the outer braces?
For example:
public MyMethod()
{
if (test)
{
// More { }
}
// More { }
} // End
Should match:
{
if (test)
{
// More { }
}
// More { }
}

No. It's that easy. A finite automaton (the machine underlying a regular expression) has no memory apart from the state it is in, and arbitrarily deep nesting would require an arbitrarily large automaton, which contradicts the notion of a finite automaton.
You can match nested/paired elements up to a fixed depth, where the depth is limited only by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e. a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). Keep the worse runtime behavior in mind: general context-free parsing is O(n^3) in the worst case versus O(n) for a finite automaton, with n = length(input), although LL and LR parsers themselves run in linear time.
There are many parser generators available, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.
For more background: Automata Theory at Wikipedia
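For a single kind of bracket, the stack of a push-down automaton degenerates into a depth counter. Here is a minimal Java sketch of that idea, not a full parser: the class and method names are made up for illustration, and it assumes braces inside comments or string literals do not need special treatment.

```java
import java.util.Optional;

class BraceMatcher {
    // Returns the first balanced {...} block in the input, or empty if no block closes.
    // The depth counter plays the role of the push-down automaton's stack.
    static Optional<String> firstBalancedBlock(String input) {
        int depth = 0;
        int start = -1;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '{') {
                if (depth == 0) {
                    start = i;          // remember where the outermost block opens
                }
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) {
                    return Optional.of(input.substring(start, i + 1));
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String src = "public MyMethod() { if (test) { /* more */ } /* more */ } // End";
        System.out.println(firstBalancedBlock(src).orElse("no balanced block"));
    }
}
```

This runs in a single pass over the input. A real tool would still need a tokenizer to skip braces inside strings and comments.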

Using regular expressions to check for nested patterns is very easy.
'/(\((?>[^()]+|(?1))*\))/'

Probably working Perl solution, if the string is on one line:
my $NesteD ;
$NesteD = qr/ \{( [^{}] | (??{ $NesteD }) )* \} /x ;
if ( $Stringy =~ m/\b( \w+$NesteD )/x ) {
print "Found: $1\n" ;
}
HTH
EDIT: check:
http://dev.perl.org/perl6/rfc/145.html
ruby information: http://www.ruby-forum.com/topic/112084
more perl: http://www.perlmonks.org/?node_id=660316
even more perl: https://metacpan.org/pod/Text::Balanced
perl, perl, perl: http://perl.plover.com/yak/regex/samples/slide083.html
And one more thing by Torsten Marek (who had pointed out correctly, that it's not a regex anymore):
http://coding.derkeiler.com/Archive/Perl/comp.lang.perl.misc/2008-03/msg01047.html

The Pumping lemma for regular languages is the reason why you can't do that.
The generated automaton will have a finite number of states, say k, so a string of k+1 opening braces is bound to have a state repeated somewhere (as the automaton processes the characters). The part of the string between the same state can be duplicated infinitely many times and the automaton will not know the difference.
In particular, if it accepts k+1 opening braces followed by k+1 closing braces (which it should), it will also accept the pumped number of opening braces followed by the unchanged k+1 closing braces (which it shouldn't).

Yes, if it is the .NET regex engine. The .NET engine supports a finite state machine supplied with an external stack (its balancing groups feature).

Proper Regular expressions would not be able to do it as you would leave the realm of Regular Languages to land in the Context Free Languages territories.
Nevertheless the "regular expression" packages that many languages offer are strictly more powerful.
For example, Lua regular expressions have the "%b()" recognizer that will match balanced parentheses. In your case you would use "%b{}".
Another sophisticated tool similar to sed is gema, where you will match balanced curly braces very easily with {#}.
So, depending on the tools you have at your disposal, your "regular expression" (in a broader sense) may be able to match nested parentheses.

YES
...assuming that there is some maximum number of nestings you'd be happy to stop at.
Let me explain.
@torsten-marek is right that a regular expression cannot check for nested patterns like this, BUT it is possible to define a nested regex pattern which will allow you to capture nested structures like this up to some maximum depth. I created one to capture EBNF-style comments (try it out here), like:
(* This is a comment (* this is nested inside (* another level! *) hey *) yo *)
The regex (for single-depth comments) is the following:
m{1} = \(+\*+(?:[^*(]|(?:\*+[^)*])|(?:\(+[^*(]))*\*+\)+
This could easily be adapted for your purposes by replacing the \(+\*+ and \*+\)+ with { and } and replacing everything in between with a simple [^{}]:
p{1} = \{(?:[^{}])*\}
(Here's the link to try that out.)
To nest, just allow this pattern within the block itself:
p{2} = \{(?:(?:p{1})|(?:[^{}]))*\}
...or...
p{2} = \{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\}
To find triple-nested blocks, use:
p{3} = \{(?:(?:p{2})|(?:[^{}]))*\}
...or...
p{3} = \{(?:(?:\{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\})|(?:[^{}]))*\}
A clear pattern has emerged. To find comments nested to a depth of N, simply use the regex:
p{N} = \{(?:(?:p{N-1})|(?:[^{}]))*\}
where N > 1 and
p{1} = \{(?:[^{}])*\}
A script could be written to recursively generate these regexes, but that's beyond the scope of what I need this for. (This is left as an exercise for the reader. 😉)
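For what it's worth, here is a rough Java sketch of such a generator. It follows the recurrence p{N} = \{(?:(?:p{N-1})|(?:[^{}]))*\} given above, but builds the pattern with a loop instead of recursion. The class and method names are invented for this example.

```java
import java.util.regex.Pattern;

class NestedBraceRegex {
    // Builds p{depth} from the recurrence p{N} = \{(?:(?:p{N-1})|(?:[^{}]))*\}
    // starting at p{1} = \{(?:[^{}])*\}.
    static String build(int depth) {
        if (depth < 1) {
            throw new IllegalArgumentException("depth must be >= 1");
        }
        String p = "\\{(?:[^{}])*\\}";                    // p{1}
        for (int n = 2; n <= depth; n++) {
            p = "\\{(?:(?:" + p + ")|(?:[^{}]))*\\}";     // wrap p{n-1} to get p{n}
        }
        return p;
    }

    public static void main(String[] args) {
        Pattern p3 = Pattern.compile(build(3));
        System.out.println(p3.matcher("{a{b{c}}}").matches());      // nested 3 deep
        System.out.println(p3.matcher("{a{b{c{d}}}}").matches());   // nested 4 deep
    }
}
```

The pattern text roughly doubles in nesting with every level, so this only stays practical for small N, which is the limitation the other answers point out.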

Using the recursive matching in the PHP regex engine is massively faster than procedural matching of brackets, especially with longer strings.
http://php.net/manual/en/regexp.reference.recursive.php
e.g.
$patt = '!\( (?: (?: (?>[^()]+) | (?R) )* ) \)!x';
preg_match_all( $patt, $str, $m );
vs.
matchBrackets( $str );
function matchBrackets ( $str, $offset = 0 ) {
$matches = array();
list( $opener, $closer ) = array( '(', ')' );
// Return early if there's no match
if ( false === ( $first_offset = strpos( $str, $opener, $offset ) ) ) {
return $matches;
}
// Step through the string one character at a time storing offsets
$paren_score = -1;
$inside_paren = false;
$match_start = 0;
$offsets = array();
for ( $index = $first_offset; $index < strlen( $str ); $index++ ) {
$char = $str[ $index ];
if ( $opener === $char ) {
if ( ! $inside_paren ) {
$paren_score = 1;
$match_start = $index;
}
else {
$paren_score++;
}
$inside_paren = true;
}
elseif ( $closer === $char ) {
$paren_score--;
}
if ( 0 === $paren_score ) {
$inside_paren = false;
$paren_score = -1;
$offsets[] = array( $match_start, $index + 1 );
}
}
while ( $offset = array_shift( $offsets ) ) {
list( $start, $finish ) = $offset;
$match = substr( $str, $start, $finish - $start );
$matches[] = $match;
}
return $matches;
}

As zsolt mentioned, some regex engines support recursion. Of course, these are typically the ones that use a backtracking algorithm, so it won't be particularly efficient. Example: /(?>[^{}]*){(?>[^{}]*)(?R)*(?>[^{}]*)}/sm

No, you are getting into the realm of Context Free Grammars at that point.

This seems to work: /(\{(?:\{.*\}|[^\{])*\})/m

Related

BNF for Java Input Statements

I am writing a Java source code (.java) to pseudocode generator in Java using JDK 7. I want to display the pseudocode form of input statements as:
read n
like we see in Pascal.
However, there are myriad ways to take console input in Java. My pseudocode generator is nothing but a parser for the Java grammar, but I could not design a grammar to parse input statements.
So can anyone tell me how to write a BNF expression for Java input statements?
If my approach is wrong, please mention the correct approach.

If I understand what you want, you'd like to have Java input like
int k = new Scanner(System.in).nextInt() ;
int m = k * 2 ;
k = k + 1 ;
turn into pseudocode output like
var k
read k
var m := k * 2
k := k + 1
There are at least three phases where you might recognize the "input statements". One is during parsing. Another is between parsing and generation (i.e. an analysis phase that transforms or marks the tree). A third is during generation.
Of the three, I'd suggest that the second or third might be best. Parsing is already complex enough. But if you really want to do it during parsing, you can do it with a combination of semantic and syntactic lookahead.
void DeclOrInput() :
{}
{
LOOKAHEAD( Input() ) Input()
|
LocalDeclStatement()
}
void Input() :
{ }
{ <NEW>
LOOKAHEAD({ getToken(1).image.equals("Scanner") } )
<ID>
"("
LOOKAHEAD( { getToken(1).image.equals("System") } )
<ID>
"."
LOOKAHEAD( { getToken(1).image.equals("in") } )
<ID>
")" "."
LOOKAHEAD( { getToken(1).image.equals( "nextInt" ) } )
<ID>
"(" ")" ";"
}

disambiguate tokens without using tokenizer state

I cannot get JavaCC to properly disambiguate tokens by their place in a grammar. I have the following JJTree file (I'll call it bug.jjt):
options
{
LOOKAHEAD = 3;
CHOICE_AMBIGUITY_CHECK = 2;
OTHER_AMBIGUITY_CHECK = 1;
SANITY_CHECK = true;
FORCE_LA_CHECK = true;
}
PARSER_BEGIN(MyParser)
import java.util.*;
public class MyParser {
public static void main(String[] args) throws ParseException {
MyParser parser = new MyParser(new java.io.StringReader(args[0]));
SimpleNode root = parser.production();
root.dump("");
}
}
PARSER_END(MyParser)
SKIP:
{
" "
}
TOKEN:
{
<STATE: ("state")>
|<PROD_NAME: (["a"-"z"])+ >
}
SimpleNode production():
{}
{
(
<PROD_NAME>
<STATE>
<EOF>
)
{return jjtThis;}
}
Generate the parser code with the following:
java -cp C:\path\to\javacc.jar jjtree bug.jjt
java -cp C:\path\to\javacc.jar javacc bug.jj
Now after compiling this, you can run MyParser from the command line with a string to parse as the argument. It prints production if successful and spews an error if it fails.
I tried two simple inputs: foo state and state state. The first one parses, but the second one does not, since both state strings are tokenized as <STATE>. As I set LOOKAHEAD to 3, I expected it to use the grammar and see that one string state must be <STATE> and the other must be <PROD_NAME>. However, no such luck. I have tried changing the various lookahead parameters to no avail. I am also not able to use tokenizer states (where you define different tokens allowable in different states), as this example is part of a more complicated system that will probably have a lot of these types of ambiguities.
Can anyone tell me how to make JavaCC properly disambiguate these tokens, without using tokenizer states?

This is covered in the FAQ under question 4.19.
There are three strategies outlined there
Putting choices in the grammar. See Bart Kiers's answer.
Using semantic look ahead. For this approach you get rid of the production defining STATE and write your grammar like this
SimpleNode production():
{}
{
(
<PROD_NAME>
( LOOKAHEAD({getToken(1).kind == PROD_NAME && getToken(1).image.equals("state")})
<PROD_NAME>
...
|
...other choices...
)
)
{return jjtThis;}
}
If there are no other choices, then
SimpleNode production():
{}
{
(
<PROD_NAME>
( LOOKAHEAD({getToken(1).kind == PROD_NAME && getToken(1).image.equals("state")})
<PROD_NAME>
...
|
{ int[][] expTokSeqs = new int[][] { new int[] {STATE } } ;
throw new ParseException(token, expTokSeqs, tokenImage) ; }
)
)
{return jjtThis;}
}
But, in this case, you need a production for STATE, as it is mentioned in the initialization of expTokSeqs. So you need a production.
< DUMMY > TOKEN : { < STATE : "state" > }
where DUMMY is a state that is never gone to.
Using lexical states. The title of the OP's question suggests he doesn't want to do this, but not why. It can be done if the state switching can be contained in the token manager. Suppose a file is a sequence of productions and each production looks like this.
name state : "a" | "b" name ;
That is it starts with a name, then the keyword "state" a colon, some tokens and finally a semicolon. (I'm just making this up as I have no idea what sort of language the OP is trying to parse.) Then you can use three lexical states DEFAULT, S0, and S1.
In the DEFAULT any sequence of letters (including "state") is a PROD_NAME. In DEFAULT, recognizing a PROD_NAME switches the state to S0.
In S0 any sequence of letters except "state" is a PROD_NAME and "state" is a STATE. In S0, recognizing a STATE token causes the tokenizer to switch to state S1.
In S1 any sequence of letters (including "state") is a PROD_NAME. In S1, recognizing a SEMICOLON switches the state to DEFAULT.
So our example is tokenized like this
name state : "a" | "b" name ;
|__||______||_________________||_________
DEF- S0 S1 DEFAULT
AULT
The productions are written like this
<*> SKIP: { " " }
<S0> TOKEN: { <STATE: "state"> : S1 }
<DEFAULT> TOKEN: { <PROD_NAME: (["a"-"z"])+ > : S0 }
<S0,S1> TOKEN: { <PROD_NAME: (["a"-"z"])+ > }
<S1> TOKEN: { <SEMICOLON : ";" > : DEFAULT }
<S0, DEFAULT> TOKEN: { <SEMICOLON : ";" > }
<*> TOKEN: {
<COLON : ":">
| ...etc...
}
It is possible for the parser to send state switching commands back to the tokenizer, but it is tricky to get right and fragile. See question 3.12 of the FAQ.

Lookahead does not concern the lexer while it composes characters to tokens. It is used by the parser when it matches non-terminals as composed from terminals (tokens).
If you define "state" to result in a token STATE, well, then that's what it is.
I agree with you that tokenizer states aren't a good solution for permitting keywords to be used as identifiers. Is this really necessary? There's a good reason for HLLs not to permit this.
OTOH, if you can rewrite your grammar using just <PROD_NAME>s, you might postpone recognition of the keywords until semantic analysis.
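To illustrate the last point outside of JavaCC, here is a small hand-written Java sketch. It is a hypothetical stand-in for the generated parser, not JavaCC output: every word is lexed as a plain name, and only the position in the production decides whether a word is the keyword "state".

```java
import java.util.Arrays;
import java.util.List;

class ContextualKeyword {
    // Accepts inputs of the form: name "state" <end of input>.
    // The lexer step (split on spaces) never distinguishes keywords from names.
    static boolean parseProduction(String input) {
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        List<String> tokens = Arrays.asList(trimmed.split(" +"));
        if (tokens.size() != 2) {
            return false;
        }
        boolean firstIsName = tokens.get(0).matches("[a-z]+");   // any name, even "state"
        boolean secondIsState = tokens.get(1).equals("state");   // keyword decided by position
        return firstIsName && secondIsState;
    }

    public static void main(String[] args) {
        System.out.println(parseProduction("foo state"));
        System.out.println(parseProduction("state state"));
        System.out.println(parseProduction("foo bar"));
    }
}
```

The same idea carries over to the grammar: keep a single name token and check its image where the grammar expects the keyword.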

The LOOKAHEAD option only applies to the parser (production rules). The tokenizer is not affected by this: it will produce tokens without worrying what a production rule is trying to match. The input "state" will always be tokenized as a STATE, even if the parser is trying to match a PROD_NAME.
You could do something like this (untested, pseudo-ish grammar code ahead!):
SimpleNode production():
{}
{
(
prod_name_or_state()
<STATE>
<EOF>
)
{return jjtThis;}
}
SimpleNode prod_name_or_state():
{}
{
(
<PROD_NAME>
| <STATE>
)
{return jjtThis;}
}
which would match both "foo state" and "state state".
Or the equivalent, but more compact:
SimpleNode production():
{}
{
(
( <PROD_NAME> | <STATE> )
<STATE>
<EOF>
)
{return jjtThis;}
}

Validating an infix notation possibly using regex

I am thinking of validating an infix notation which consists of letters as operands and +-*/$ as operators [e.g. A+B-(C/D)$(E+F)] using regex in Java. Is there any better way? Is there any regex pattern which I can use?

I am not familiar with the language syntax of infix, but you can certainly do a first pass validation check which simply verifies that all of the characters in the string are valid (i.e. acceptable characters = A-Z, +, -, *, /, $, ( and )). Here is a Java program which checks for valid characters and also includes a function which checks for unbalanced (possibly nested) parentheses:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "A+B-(C/D)$(E+F)";
Pattern regex = Pattern.compile(
"# Verify that a string contains only specified characters.\n" +
"^ # Anchor to start of string\n" +
"[A-Z+\\-*/$()]+ # Match one or more valid characters\n" +
"$ # Anchor to end of string\n",
Pattern.COMMENTS);
Matcher m = regex.matcher(s);
if (m.find()) {
System.out.print("OK: String has only valid characters.\n");
} else {
System.out.print("ERROR: String has invalid characters.\n");
}
// Verify the string contains only balanced parentheses.
if (checkParens(s)) {
System.out.print("OK: String has no unbalanced parentheses.\n");
} else {
System.out.print("ERROR: String has unbalanced parentheses.\n");
}
}
// Checks whether the string contains any unbalanced parentheses.
public static boolean checkParens(String s) {
Pattern regex = Pattern.compile("\\(([^()]*)\\)");
Matcher m = regex.matcher(s);
// Loop removes matching nested parentheses from inside out.
while (m.find()) {
// quoteReplacement stops '$' and '\' in the group from being read as group references.
s = m.replaceFirst(Matcher.quoteReplacement(m.group(1)));
m.reset(s);
}
regex = Pattern.compile("[()]");
m = regex.matcher(s);
// Check if there are any erroneous parentheses left over.
if (m.find()) {
return false; // String has unbalanced parens.
}
return true; // String has balanced parens.
}
}
This does not validate the grammar, but may be useful as a first test to filter out obviously bad strings.

Possibly overkill, but you might consider using a fully fledged parser generator such as ANTLR (http://www.antlr.org/). With ANTLR you can create rules that will generate the java code for you automatically. Assuming you have only got valid characters in the input this is a syntax analysis problem, otherwise you would want to validate the character stream with lexical analysis first.
For syntax analysis you might have rules like:
PLUS : '+' ;
etc...
expression:
term ( ( PLUS | MINUS | MULTIPLY | DIVIDE )^ term )*
;
term:
constant
| OPENPAREN! expression CLOSEPAREN!
;
With constant being integers/reals whatever. If the ANTLR generated parser code can't match the input with your parser rules it will throw an exception so you can determine whether code is valid.

You probably could do it with recursive PCRE, but this may be a PITA.
Since you only want to validate it, you can keep it very simple: just use a stack, push the elements one by one, and remove valid expressions.
Define some rules, for example:
an operator is only allowed if there is a letter on top of the stack
a letter or parenthesis is only allowed if there is an operator on top of the stack
everything is allowed if the stack is empty
Then:
if you encounter a closing parenthesis, remove everything up to the opening parenthesis.
if you encounter a letter, remove the expression
after every removal of an expression, add a dummy letter. Repeat the previous steps.
If the result is a single letter, the expression is valid.
Or something like that.
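A validator along these lines can be sketched in Java. Note that this sketch uses recursive descent (the call stack stands in for the explicit stack) rather than the exact push/remove rules listed above, and it assumes single uppercase letters as operands and +-*/$ as binary operators. The class name is made up.

```java
class InfixValidator {
    private final String s;
    private int pos = 0;

    private InfixValidator(String s) {
        this.s = s;
    }

    // Valid only if one expression consumes the entire input.
    static boolean isValid(String input) {
        InfixValidator v = new InfixValidator(input);
        return v.expression() && v.pos == input.length();
    }

    // expression := term (operator term)*
    private boolean expression() {
        if (!term()) {
            return false;
        }
        while (pos < s.length() && "+-*/$".indexOf(s.charAt(pos)) >= 0) {
            pos++;                       // consume the operator
            if (!term()) {
                return false;            // operator without a right operand
            }
        }
        return true;
    }

    // term := letter | '(' expression ')'
    private boolean term() {
        if (pos >= s.length()) {
            return false;
        }
        char c = s.charAt(pos);
        if (c >= 'A' && c <= 'Z') {
            pos++;
            return true;
        }
        if (c == '(') {
            pos++;
            if (!expression()) {
                return false;
            }
            if (pos < s.length() && s.charAt(pos) == ')') {
                pos++;
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isValid("A+B-(C/D)$(E+F)"));
        System.out.println(isValid("A+*B"));
    }
}
```

It checks only the syntax. Operator precedence does not matter for validation, so all operators are treated alike.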

Regex to find commas that aren't inside "( and )"

I need some help to model this regular expression. I think it'll be easier with an example. I need a regular expression that matches a comma, but only if it's not inside this structure: "( )", like this:
,a,b,c,d,"("x","y",z)",e,f,g,
Then the first five and the last four commas should match the expression, the two between xyz and inside the ( ) section shouldn't.
I tried a lot of combinations, but regular expressions are still a little foggy for me.
I want to use it with the split method in Java. The example is short, but it can be much longer and have more than one section between "( and )". The split method receives an expression, and if some text (in this case the comma) matches the expression it will be the separator.
So, I want to do something like this:
String keys[] = row.split(expr);
System.out.println(keys[0]); // print a
System.out.println(keys[1]); // print b
System.out.println(keys[2]); // print c
System.out.println(keys[3]); // print d
System.out.println(keys[4]); // print "("x","y",z)"
System.out.println(keys[5]); // print e
System.out.println(keys[6]); // print f
System.out.println(keys[7]); // print g
Thanks!

You can do this with a negative lookahead. Here's a slightly simplified problem to illustrate the idea:
String text = "a;b;c;d;<x;y;z>;e;f;g;<p;q;r;s>;h;i;j";
String[] parts = text.split(";(?![^<>]*>)");
System.out.println(java.util.Arrays.toString(parts));
// _ _ _ _ _______ _ _ _ _________ _ _ _
// [a, b, c, d, <x;y;z>, e, f, g, <p;q;r;s>, h, i, j]
Note that instead of ,, the delimiter is now ;, and instead of "( and "), the parentheses are simply < and >, but the idea still works.
On the pattern
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
The * repetition specifier can be used to match "zero-or-more times" of the preceding pattern.
The (?!…) is a negative lookahead; it can be used to assert that a certain pattern DOES NOT match, looking ahead (i.e. to the right) of the current position.
The pattern [^<>]*> matches a sequence (possibly empty) of everything except parentheses, finally followed by a parenthesis of the closing type.
Putting all of the above together, we get ;(?![^<>]*>), which matches a ;, but only if we can't see the closing parenthesis as the first parenthesis to its right, because witnessing such a phenomenon would mean that the ; is "inside" the parentheses.
This technique, with some modifications, can be adapted to the original problem. Remember to escape regex metacharacters ( and ) as necessary, and of course " as well as \ in a Java string literal must be escaped by preceding with a \.
You can also make the * possessive to try to improve performance, i.e. ;(?![^<>]*+>).
References
regular-expressions.info/Character class, Repetition, Lookarounds, Possessive
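As a sketch of how the adaptation to the original problem might look, here is the same lookahead with , as the delimiter and ( ) as the brackets, using the possessive form. It assumes the bracketed sections are balanced and not nested. The class and method names are only for illustration.

```java
import java.util.Arrays;

class SplitOutsideParens {
    // A comma is a separator unless the first parenthesis to its right is ')'.
    static String[] splitOutsideParens(String row) {
        return row.split(",(?![^()]*+\\))");
    }

    public static void main(String[] args) {
        String row = ",a,b,c,d,\"(\"x\",\"y\",z)\",e,f,g,";
        System.out.println(Arrays.toString(splitOutsideParens(row)));
    }
}
```

Because the row starts with a comma, the first element of the result is an empty string, so a is at index 1 rather than 0. The trailing empty string is dropped by split.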

Try this one:
(?![^(]*\)),
It worked for me in my testing and grabbed all commas not inside parentheses.
Edit: Gopi pointed out the need to escape the slashes in Java:
(?![^(]*\\)),
Edit: Alan Moore pointed out some unnecessary complexity. Fixed.

If the parens are paired correctly and cannot be nested, you can split the text first at parens, then process the chunks.
List<String> result = new ArrayList<String>();
String[] chunks = text.split("[()]");
for (int i = 0; i < chunks.length; i++) {
if ((i % 2) == 0) {
String[] atoms = chunks[i].split(",");
for (int j = 0; j < atoms.length; j++)
result.add(atoms[j]);
}
else
result.add(chunks[i]);
}

Well,
After some tests I found an answer that does what I need for now. At the moment, all items inside the "( ... )" block are inside "" too, as in "("a", "b", "c")", so the regex ((?<!\"),)|(,(?!\")) works great for what I want!
But I am still looking for one that can find the commas even if there are no "" around the inner terms.
Thanks for the help, guys.

This should do what you want:
(".*")|([a-z])
I didn't check it in Java, but if you test it with http://www.fileformat.info/tool/regex.htm
the groups $1 and $2 contain the right values, so they match and you should get what you want.
This will get a little trickier if you have values more complex than a-z between the commas.
If I understand split correctly, don't use it; just fill your array with the backreference $0, which holds the values you are looking for.
Maybe a match function would be a better way, and working with the values is better, because you get this really simple regexp. The other solutions I have seen so far are very good, no question about that, but they are really complicated, and in two weeks you won't really know what the regexp even did.
Inverting the problem often makes it simpler.

I had the same issue. I chose Adam Schmideg's answer and improved it.
I had to deal with these 3 strings, for example:
France (Grenoble, Lyon), Germany (Berlin, Munich)
Italy, Suede, Belgium, Portugal
France, Italy (Torino), Spain (Bercelona, Madrid), Austria
The idea was to have :
France (Grenoble, Lyon)
or Germany (Berlin, Munich)
Italy, Suede, Belgium, Portugal
France, Italy (Torino), Spain (Bercelona, Madrid), Austria
I chose not to use regex because I wanted to be 100% sure of what I was doing and that it would work in any case.
String[] chunks = input.split("[()]");
for (int i = 0; i < chunks.length; i++) {
if ((i % 2) != 0) {
chunks[i] = "("+chunks[i].replaceAll(",", ";")+")";
}
}
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < chunks.length; i++) {
buffer.append(chunks[i]);
}
String s = buffer.toString();
String[] output = s.split(",");
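A single-pass alternative is to track the parenthesis depth while scanning, which also copes with nested parentheses. This is only a sketch under the assumption that the parentheses are balanced. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class DepthAwareSplitter {
    // Splits on commas that are outside all parentheses and trims each part.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        for (char c : input.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
            }
            if (c == ',' && depth == 0) {
                parts.add(current.toString().trim());   // top-level comma ends a part
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString().trim());
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("France (Grenoble, Lyon), Germany (Berlin, Munich)"));
        System.out.println(split("France, Italy (Torino), Spain (Bercelona, Madrid), Austria"));
    }
}
```

Unlike the replaceAll approach, this does not need a placeholder character such as ;, so it still works when the data itself contains semicolons.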

How can I identify references to Java classes using Perl?

I'm writing a Perl script and I've come to a point where I need to parse a Java source file line by line checking for references to a fully qualified Java class name. I know the class I'm looking for up front; also the fully qualified name of the source file that is being searched (based on its path).
For example find all valid references to foo.bar.Baz inside the com/bob/is/YourUncle.java file.
At this moment the cases I can think of that it needs to account for are:
The file being parsed is in the same package as the search class.
find foo.bar.Baz references in foo/bar/Boing.java
It should ignore comments.
// this is a comment saying this method returns a foo.bar.Baz or Baz instance
// it shouldn't count
/* a multiline comment as well
this shouldn't count
if I put foo.bar.Baz or Baz in here either */
In-line fully qualified references.
foo.bar.Baz fb = new foo.bar.Baz();
References based off an import statement.
import foo.bar.Baz;
...
Baz b = new Baz();
What would be the most efficient way to do this in Perl 5.8? Some fancy regex perhaps?
open F, $File::Find::name or die;
# these three things are already known
# $classToFind looking for references of this class
# $pkgToFind the package of the class you're finding references of
# $currentPkg package name of the file being parsed
while(<F>){
# ... do work here
}
close F;
# the results are available here in some form

You also need to skip quoted strings (you can't even skip comments correctly if you don't also deal with quoted strings).
I'd probably write a fairly simple, efficient, and incomplete tokenizer very similar to the one I wrote in node 566467.
Based on that code I'd probably just dig through the non-comment/non-string chunks looking for \bimport\b and \b\Q$toFind\E\b matches. Perhaps similar to:
if( m[
\G
(?:
[^'"/]+
| /(?![/*])
)+
]xgc
) {
my $code = substr( $_, $-[0], $+[0] - $-[0] );
my $imported = 0;
while( $code =~ /\b(import\s+)?\Q$package\E\b/g ) {
if( $1 ) {
... # Found importing of package
while( $code =~ /\b\Q$class\E\b/g ) {
... # Found mention of imported class
}
last;
}
... # Found a package reference
}
} elsif( m[ \G ' (?: [^'\\]+ | \\. )* ' ]xgc
|| m[ \G " (?: [^"\\]+ | \\. )* " ]xgc
) {
# skip quoted strings
} elsif( m[\G//.*]gc ) {
# skip C++ comments

A Regex is probably the best solution for this, although I did find the following module in CPAN that you might be able to use
Java::JVM::Classfile - Parses compiled class files and returns info about them. You would have to compile the files before you could use this.
Also, remember that it can be tricky to catch all possible variants of a multi-line comment with a regex.

This is really just a straight grep for Baz (or for /(foo.bar.| )Baz/ if you're concerned about false positives from some.other.Baz), but ignoring comments, isn't it?
If so, I'd knock together a state engine to track whether you're in a multiline comment or not. The regexes needed aren't anything special. Something along the lines of (untested code):
my $in_comment;
my %matches;
my $line_num = 0;
my $full_target = 'foo.bar.Baz';
my $short_target = (split /\./, $full_target)[-1]; # segment after last . (Baz)
while (my $line = <F>) {
$line_num++;
if ($in_comment) {
next unless $line =~ m|\*/|; # ignore line unless it ends the comment
$line =~ s|.*\*/||; # delete everything prior to end of comment
} elsif ($line =~ m|/\*|) {
if ($line =~ m|\*/|) { # catch /* and */ on same line
$line =~ s|/\*.*\*/||;
} else {
$in_comment = 1;
$line =~ s|/\*.*||; # clear from start of comment to end of line
}
}
$line =~ s|//.*||; # remove single-line comments
$matches{$line_num} = $line if $line =~ /$full_target| $short_target/;
}
for my $key (sort keys %matches) {
print $key, ': ', $matches{$key}, "\n";
}
It's not perfect and the in/out of comment state can be messed up by nested multiline comments or if there are multiple multiline comments on the same line, but that's probably good enough for most real-world cases.
To do it without the state engine, you'd need to slurp into a single string, delete the /* ... */ comments, and split it back into separate lines, and grep those for non-//-comment hits. But you wouldn't be able to include line numbers in the output that way.

This is what I came up with; it works for all the different cases I've thrown at it. I'm still a Perl noob and it's probably not the fastest thing in the world, but it should work for what I need. Thanks for all the answers; they helped me look at it in different ways.
my $className = 'Baz';
my $searchPkg = 'foo.bar';
my @potentialRefs, my @confirmedRefs;
my $samePkg = 0;
my $imported = 0;
my $currentPkg = 'com.bob';
$currentPkg =~ s/\//\./g;
if($currentPkg eq $searchPkg){
$samePkg = 1;
}
my $inMultiLineComment = 0;
open F, $_ or die;
my $lineNum = 0;
while(<F>){
$lineNum++;
if($inMultiLineComment){
if(m|^.*?\*/|){
s|^.*?\*/||; #get rid of the closing part of the multiline comment we're in
$inMultiLineComment = 0;
}else{
next;
}
}
if(length($_) > 0){
s|"([^"\\]*(\\.[^"\\]*)*)"||g; #remove strings first since java cannot have multiline string literals
s|/\*.*?\*/||g; #remove any multiline comments that start and end on the same line
s|//.*$||; #remove the // comments from what's left
if (m|/\*.*$|){
$inMultiLineComment = 1 ;#now if you have any occurence of /* then then at least some of the next line is in the multiline comment
s|/\*.*$||g;
}
}else{
next; #no sense continuing to process a blank string
}
if (/^\s*(import )?($searchPkg)?(.*)?\b$className\b/){
if($imported || $samePkg){
push(@confirmedRefs, $lineNum);
}else {
push(@potentialRefs, $lineNum);
}
if($1){
$imported = 1;
} elsif($2){
push(@confirmedRefs, $lineNum);
}
}
}
close F;
if($imported){
push(@confirmedRefs, @potentialRefs);
}
for (@confirmedRefs){
print "$_\n";
}

If you are feeling adventurous enough you could have a look at Parse::RecDescent.
