Complex Java Regular Expression with Nested Groupings - java

I am trying to get a regular expression written that will capture what I'm trying to match in Java, but can't seem to get it.
This is my latest attempt:
Pattern.compile( "[A-Za-z0-9]+(/[A-Za-z0-9]+)*/?" );
This is what I want to match:
hello
hello/world
hello/big/world
hello/big/world/
This is what I don't want matched:
/
/hello
hello//world
hello/big//world
I'd appreciate any insight into what I am doing wrong :)

Try this regex:
Pattern.compile( "^[A-Za-z0-9]+(/[A-Za-z0-9]+)*/?$" );

Doesn't your regex require a question mark at the end?
I always write unit tests for my regexes so I can fiddle with them until they pass.

// your exact regex:
final Pattern regex = Pattern.compile( "[A-Za-z0-9]+(/[A-Za-z0-9]+)*/?" );
// your exact examples:
final String[]
good = { "hello", "hello/world", "hello/big/world", "hello/big/world/" },
bad = { "/", "/hello", "hello//world", "hello/big//world"};
for (String goodOne : good) System.out.println(regex.matcher(goodOne).matches());
for (String badOne : bad) System.out.println(!regex.matcher(badOne).matches());
prints a solid column of true values.
Put another way: your regex is perfectly fine just as it is.
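The test above passes because matches() only succeeds when the whole input matches, so the pattern behaves as if it were anchored. With find(), the unanchored pattern can match a substring of a "bad" input, and that is where the ^...$ version makes a difference. A minimal sketch of the contrast (the class and method names are illustrative, not from the thread):

```java
import java.util.regex.Pattern;

class PathRegexDemo {
    // The asker's original pattern, without anchors.
    static final Pattern UNANCHORED = Pattern.compile("[A-Za-z0-9]+(/[A-Za-z0-9]+)*/?");
    // The same pattern with explicit anchors.
    static final Pattern ANCHORED = Pattern.compile("^[A-Za-z0-9]+(/[A-Za-z0-9]+)*/?$");

    // matches() requires the entire input to match the pattern.
    static boolean wholeMatch(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    // find() succeeds if any substring of the input matches.
    static boolean partialMatch(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"hello/big/world/", "/hello", "hello//world"}) {
            System.out.println(s
                + " matches=" + wholeMatch(UNANCHORED, s)
                + " find=" + partialMatch(UNANCHORED, s)
                + " anchoredFind=" + partialMatch(ANCHORED, s));
        }
    }
}
```

If the original code used find() (or String.replaceAll, which also searches for substrings), that would explain unwanted matches on inputs like /hello.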

It looks like what you're trying to 'capture' is being overwritten on each quantified iteration. Just change the parenthesis arrangement.
# "[A-Za-z0-9]+((?:/[A-Za-z0-9]+)*)/?"
[A-Za-z0-9]+
( # (1 start)
(?: / [A-Za-z0-9]+ )*
) # (1 end)
/?
Or, with no captures at all:
# "[A-Za-z0-9]+(?:/[A-Za-z0-9]+)*/?"
[A-Za-z0-9]+
(?: / [A-Za-z0-9]+ )*
/?
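To see the overwriting in practice, here is a small sketch that compares group 1 for the original pattern and for the rearranged one. The helper name groupOne is made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCaptureDemo {
    // Returns group 1 of a full match, or null if the input does not match.
    static String groupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "hello/big/world/";
        // The quantified group keeps only its last iteration.
        System.out.println(groupOne("[A-Za-z0-9]+(/[A-Za-z0-9]+)*/?", input));
        // The capturing group wraps the whole repetition, so it keeps everything.
        System.out.println(groupOne("[A-Za-z0-9]+((?:/[A-Za-z0-9]+)*)/?", input));
    }
}
```

The first call yields /world and the second yields /big/world.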

Related

Scala RegEx String extractors behaving inconsistently

I have two regular expression extractors.
One for .java files and the other is for .scala files
val JavaFileRegEx =
"""\S*
\s+
//
\s{1}
([^\.java]+)
\.java
""".replaceAll("(\\s)", "").r
val ScalaFileRegEx =
"""\S*
\s+
//
\s{1}
([^\.scala]+)
\.scala
""".replaceAll("(\\s)", "").r
I want to use these extractors above to extract a java file name and a scala file name from the example code below.
val string1 = " // Tester.java"
val string2 = " // Hello.scala"
string1 match {
case JavaFileRegEx(fileName1) => println(" Java file: " + fileName1)
case other => println(other + "--NO_MATCH")
}
string2 match {
case ScalaFileRegEx(fileName2) => println(" Scala file: " + fileName2)
case other => println(other + "--NO_MATCH")
}
I get this output indicating that the .java file matched but the .scala file did not.
Java file: Tester
// Hello.scala--NO_MATCH
How is it that the Java file matched but the .scala file did not?
NOTE
[] denotes character class. It matches only a single character.
[^] denotes match anything except the characters present in the character class.
In your first regex
\S*\s+//\s{1}([^\.java]+)\.java
\S* matches nothing, because the string starts with a space
\s+ matches the leading space
// matches // literally
\s{1} matches the next space
You are using [^\.java], which means "match any character except ., j, a or v" and can be written as [^.jav].
So the remaining string to be tested is
Tester.java
(Un)luckily, no character in Tester is ., j, a or v, so the class keeps matching until it reaches the dot. So Tester is captured, and then \.java matches .java.
In your second regex
\S*\s+//\s{1}([^\.scala]+)\.scala
\S* matches nothing, because the string starts with a space
\s+ matches the leading space
// matches // literally
\s{1} matches the next space
Now you are using [^\.scala], which means "match any character except ., s, c, a or l" and can be written as [^.scla].
You now have
Hello.scala
but (un)luckily Hello contains l, which the character class does not allow, so the regex fails.
How to correct it?
I will modify only a small part of your regex:
\S*\s+//\s{1}([^.]*)\.java
Here [^.]* matches any run of characters other than a dot.
You can also use \w here instead of [^.].
\S*\s+//\s{1}([^.]*)\.scala
There is no need for {1} in \s{1}. You can simply write \s, which matches exactly one whitespace character:
\S*\s+//\s([^.]*)\.java
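The same behaviour can be reproduced with java.util.regex, which is what Scala's Regex uses underneath. This is a sketch in Java rather than Scala; the extract helper is hypothetical and uses matches(), since the Scala extractor in the question also requires the whole string to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns group 1 if the whole input matches, otherwise null.
    static String extract(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Negated class built from the letters of "java": works by accident.
        System.out.println(extract("\\S*\\s+//\\s([^\\.java]+)\\.java", " // Tester.java"));
        // Negated class built from the letters of "scala": fails, Hello contains 'l'.
        System.out.println(extract("\\S*\\s+//\\s([^\\.scala]+)\\.scala", " // Hello.scala"));
        // Corrected version: capture everything up to the dot.
        System.out.println(extract("\\S*\\s+//\\s([^.]*)\\.scala", " // Hello.scala"));
    }
}
```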

Regex: match JUnit assertEquals?

I'm migrating quite a few tests from JUnit to Spock:
// before
assertEquals("John Doe", userDTO.getFirstName());
// after
userDTO.getFirstName() == "John Doe"
To help make things quicker I want to replace (most of) JUnit's assert expressions with Spock's via a regular expression - supervised and file-by-file. assertFalse, assertTrue and assertNotNull are easy, but assertEquals is not, since it has 2 parameters.
My current attempt is: assertEquals\(([^;]+),([^;]+)\);. But this doesn't work so well because it doesn't know whether a , separates an assertEquals parameter or not. How do I solve this?
My test cases are:
assertEquals(az, bz);
assertEquals(az(), bz);
assertEquals(az, bz());
assertEquals(az(), bz));
assertEquals(az, bz(cz, dz));
assertEquals(bz(cz, dz), az);
PS: Nested method calls are out of scope here.
Online: https://www.debuggex.com/r/aESv3YmNWsakNgI6/1
In general, matching arbitrarily nested structures with regexes is not something you should be doing. If we, however, limit your needs to the test cases you've listed here (removing the 4th, which is an error), then we can do something. You can also construct regexes for a variety of additional limited cases without making the thing too difficult.
I'll illustrate with Python, but the same things probably work in your IDE.
>>> import re
>>> import pprint
>>> t = ["assertEquals(az, bz);", \
... "assertEquals(az(), bz);", \
... "assertEquals(az, bz());", \
... "assertEquals(az, bz(dz));", \
... "assertEquals(bz(dz), az);", \
... "assertEquals(az, bz(cz, dz));", \
... "assertEquals(bz(cz, dz), az);"]
>>> var = r'([a-z]+(\(([a-z]+(\s*,\s*[a-z]+)*)?\))?)'
>>> res = [ \
... re.sub( \
... r'assertEquals\(\s*' + var + r'\s*,\s*' + var + r'\s*\)', \
... r'\1 == \5', s \
... ) \
... for s in t]
>>> pprint.pprint(res)
['az == bz;',
'az() == bz;',
'az == bz();',
'az == bz(dz);',
'bz(dz) == az;',
'az == bz(cz, dz);',
'bz(cz, dz) == az;']
The important part is var:
( # group the entire var before the comma
[a-z]+ # acceptable variable name
( # followed by an optional group
\( # containing a pair of matching parens
( # which contain, optionally
[a-z]+ # an acceptable variable name
( # followed by any number (0 or more)
\s*,\s*[a-z]+ # of commas followed by acceptable variable names
)*
)?
\)
)?
)
To get this to work on your actual code, you'll have to change [a-z] to something more reasonable, such as [a-zA-Z0-9_].
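For readers who would rather run the replacement from Java, here is a rough port of the same idea. It uses [a-zA-Z0-9_] as suggested and keeps the \1 == \5 argument order of the Python version. The class name and the rewrite method are invented for this sketch, and it shares the same limits: no nested calls and no string literals as arguments.

```java
import java.util.regex.Pattern;

class AssertEqualsRewriter {
    // One identifier made of word characters.
    static final String ID = "[a-zA-Z0-9_]+";
    // An identifier, optionally followed by (arg, arg, ...); contributes 4 groups.
    static final String VAR = "(" + ID + "(\\((" + ID + "(\\s*,\\s*" + ID + ")*)?\\))?)";
    // First VAR is group 1, second VAR is group 5.
    static final Pattern ASSERT_EQUALS =
        Pattern.compile("assertEquals\\(\\s*" + VAR + "\\s*,\\s*" + VAR + "\\s*\\)");

    static String rewrite(String line) {
        return ASSERT_EQUALS.matcher(line).replaceAll("$1 == $5");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("assertEquals(az, bz);"));
        System.out.println(rewrite("assertEquals(az, bz(cz, dz));"));
        System.out.println(rewrite("assertEquals(bz(cz, dz), az);"));
    }
}
```

Swapping the replacement to "$5 == $1" would put the actual value first, as in the Spock example from the question.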

Use java regex to find all strings that start with '#' and end with ' ' , and not include ' ' and '#'

I need to get all non-empty strings that start with # and end with ' ' (space) in the String below:
String s = "#test1 #test2 #test3 #test4 ## #test5";
I hope to get the strings "test1", "test2", "test3", "test4" and "test5".
How do I do this with a Java regex? Thanks a lot!
You can use the following regex
#\w+
\w is similar to [a-zA-Z\d_]
\w+ matches 1 to many characters which are from [a-zA-Z\d_]
The Java regex (?<=#)[^# ]+(?= ) should do the trick. According to Regex Planet's Java regex page that regex matches test1, test2, test3 and test4. (#test5 does not end with a space, so test5 is not matched.)
If you're OK with matching the leading #s and trailing spaces as well, you can get away with the simpler Java regex #[^# ]+ followed by a literal space.
Finally I solved it with the code below:
Pattern pattern = Pattern.compile("#\\p{L}+");
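To compare the suggestions on the sample string, here is a small sketch. The findAll helper is made up for this example. The second pattern, with a capturing group instead of lookarounds, is a variation that is not in the answers above; it drops the "must end with a space" requirement so that test5 is included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashTagDemo {
    // Collects the requested group from every match of regex in input.
    static List<String> findAll(String regex, int group, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "#test1 #test2 #test3 #test4 ## #test5";
        // Lookarounds: preceded by '#', followed by a space; test5 is skipped.
        System.out.println(findAll("(?<=#)[^# ]+(?= )", 0, s));
        // Capturing group: '#' is consumed but not captured; test5 is included.
        System.out.println(findAll("#([^# ]+)", 1, s));
    }
}
```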

Youtube complete Java Regex

I need to parse several pages to get all of their Youtube IDs.
I found many regular expressions on the web, but the Java ones are not complete (they either give me garbage in addition to the IDs, or they miss some IDs).
The one that I found that seems to be complete is hosted here. But it is written in JavaScript and PHP. Unfortunately I couldn't translate them into Java.
Can somebody help me rewrite this PHP regex or the following JavaScript one in Java?
'~
https?:// # Required scheme. Either http or https.
(?:[0-9A-Z-]+\.)? # Optional subdomain.
(?: # Group host alternatives.
youtu\.be/ # Either youtu.be,
| youtube\.com # or youtube.com followed by
\S* # Allow anything up to VIDEO_ID,
[^\w\-\s] # but char before ID is non-ID char.
) # End host alternatives.
([\w\-]{11}) # $1: VIDEO_ID is exactly 11 chars.
(?=[^\w\-]|$) # Assert next char is non-ID or EOS.
(?! # Assert URL is not pre-linked.
[?=&+%\w]* # Allow URL (query) remainder.
(?: # Group pre-linked alternatives.
[\'"][^<>]*> # Either inside a start tag,
| </a> # or inside <a> element text contents.
) # End recognized pre-linked alts.
) # End negative lookahead assertion.
[?=&+%\w]* # Consume any URL (query) remainder.
~ix'
/https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube\.com\S*[^\w\-\s])([\w\-]{11})(?=[^\w\-]|$)(?![?=&+%\w]*(?:['"][^<>]*>|<\/a>))[?=&+%\w]*/ig;
First of all, you need to insert an extra backslash \ for each backslash in the old regex. Otherwise Java treats the backslash as the start of a string-literal escape, which is not what you want.
https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*
Next when you compile your pattern you need to add the CASE_INSENSITIVE flag. Here's an example:
String pattern = "https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*";
Pattern compiledPattern = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = compiledPattern.matcher(link);
while(matcher.find()) {
System.out.println(matcher.group());
}
Marcus above has a good regex, but I found that it doesn't recognize YouTube links that have "www" but not "http(s)" in them,
for example www.youtube....
I have an update:
^(?:https?:\\/\\/)?(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*
It's the same except for the start.
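Since the question asks for the IDs rather than the whole URLs, here is a sketch that collects group 1 from each match. It uses the first pattern above, split across several string pieces for readability, and drops the backslashes before /, which Java does not need. The class name, method name and the two video IDs in the sample text are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YouTubeIdDemo {
    static final Pattern YOUTUBE_URL = Pattern.compile(
        "https?://(?:[0-9A-Z-]+\\.)?"                        // scheme, optional subdomain
        + "(?:youtu\\.be/|youtube\\.com\\S*[^\\w\\-\\s])"    // host alternatives
        + "([\\w\\-]{11})"                                   // group 1: the 11-char video ID
        + "(?=[^\\w\\-]|$)"                                  // next char is non-ID or end
        + "(?![?=&+%\\w]*(?:['\"][^<>]*>|</a>))"             // not already inside a link
        + "[?=&+%\\w]*",                                     // rest of the query string
        Pattern.CASE_INSENSITIVE);

    // Returns the video ID (group 1) of every URL found in the text.
    static List<String> extractIds(String text) {
        List<String> ids = new ArrayList<>();
        Matcher m = YOUTUBE_URL.matcher(text);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Both IDs below are invented placeholders.
        String text = "See https://www.youtube.com/watch?v=ABC-def_123 and http://youtu.be/abcdefghijk now";
        System.out.println(extractIds(text));
    }
}
```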

Need regex to format file in php

I have a java file that I want to post online. I am using php to format the file.
Does anyone know the regex to turn the comments blue?
INPUT:
/*****
*This is the part
*I want to turn blue
*for my class
*******************/
class MyClass{
String s;
}
Thanks.
Naive version:
$formatted = preg_replace('|(/\*.*?\*/)|s', '<span class="blue">$1</span>', $java_code_here);
... not tested, YMMV, etc...
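For comparison, the same naive idea can be sketched in Java. The key detail is that . must be allowed to match line breaks, otherwise a multi-line /* ... */ comment never matches; in Java that is Pattern.DOTALL. The class and method names are illustrative, and like the one-liner above this ignores comment markers inside string literals and does not HTML-escape the source.

```java
import java.util.regex.Pattern;

class NaiveCommentHighlighter {
    // DOTALL lets '.' cross line breaks; '.*?' stops at the first closing */.
    static final Pattern BLOCK_COMMENT = Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL);

    static String highlight(String source) {
        // $0 in the replacement refers to the whole matched comment.
        return BLOCK_COMMENT.matcher(source).replaceAll("<span class=\"blue\">$0</span>");
    }

    public static void main(String[] args) {
        String source = "/*****\n *This is the part\n *****/\nclass MyClass{\nString s;\n}";
        System.out.println(highlight(source));
    }
}
```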
In general, you won't be able to parse specific parts of a Java file using only regular expressions - Java is not a regular language. If your file has additional structure (such as "it always begins with a comment followed by a newline, followed by a class definition"), you can generate a regular expression for such a case. For instance, you'd match /\*+(.*?)\*+/$, where . is assumed to match multiple lines, and $ matches the end of a line.
In general, to make a regex work, you first define what patterns you want to find (rigorously, but in spoken language), and then translate that to standard regular expression notation.
Good luck.
A regex that can parse simple quotes should be able to find comments in C/C++ style languages.
I assume Java is of that type.
This is a Perl FAQ sample by someone else, although I added the part about // style comments (with or without line continuation) and reformatted it.
It basically does a global search and replace. Data is replaced verbatim if it is not a comment; otherwise the comment is replaced with your color formatting tags.
You should be able to adapt this to PHP, and it is expanded for clarity (maybe too much clarity though).
s{
## Comments, group 1:
(
/\* ## Start of /* ... */ comment
[^*]*\*+ ## Non-* followed by 1-or-more *'s
(?:
[^/*][^*]*\*+
)* ## 0-or-more things which don't start with /
## but do end with '*'
/ ## End of /* ... */ comment
|
// ## Start of // ... comment
(?:
[^\\] ## Any Non-Continuation character ^\
| ## OR
\\\n? ## Any Continuation character followed by 0-1 newline \n
)*? ## To be done 0-many times, stopping at the first end of comment
\n ## End of // comment
)
| ## OR, various things which aren't comments, group 2:
(
" (?: \\. | [^"\\] )* " ## Double quoted text
|
' (?: \\. | [^'\\] )* ' ## Single quoted text
|
. ## Any other char
[^/"'\\]* ## Chars which don't start a comment, string, escape
) ## or continuation (escape + newline)
}
{defined $2 ? $2 : "<some color>$1</some color>"}gxse;
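The core trick in the Perl snippet is to match string literals as well as comments, so that comment markers inside quotes are consumed as part of the string and left alone. Below is a simplified Java sketch of that idea, not a port of the Perl: it skips the line-continuation handling and the "any other char" branch, and it assumes Java 9+ for Matcher.replaceAll with a function. All names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentAwareHighlighter {
    static final Pattern TOKEN = Pattern.compile(
        "(/\\*.*?\\*/|//[^\\n]*)"                                   // group 1: comments
        + "|(\"(?:\\\\.|[^\"\\\\])*\"|'(?:\\\\.|[^'\\\\])*')",      // group 2: quoted text
        Pattern.DOTALL);

    static String highlight(String source) {
        // Wrap comments; put string literals back unchanged.
        return TOKEN.matcher(source).replaceAll(r -> Matcher.quoteReplacement(
            r.group(1) != null
                ? "<span class=\"blue\">" + r.group(1) + "</span>"
                : r.group()));
    }

    public static void main(String[] args) {
        System.out.println(highlight("String s = \"/* not a comment */\"; // real"));
    }
}
```

In the sample line, only // real is wrapped, because the quoted text is matched first as a string literal.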
