Regex for mismatched substrings

I am trying to match some Java code that has mismatched strings in it. For example, I have the following block of code that I want to match:
protected String methodName(String args[]) {
final String METHOD = "wrongMethodName";
...
}
And I don't want to match the following block of code:
protected String methodName(String args[]) {
final String METHOD = "methodName";
...
}
Right now, I have the following (not working) regular expression, which requires DOTALL enabled:
(\w+?)\(.*?\) ?{.*?METHOD *= *".*?";
If I try a negative lookbehind with the capture group, the regex doesn't compile because the size of the lookbehind isn't known beforehand.
java.util.regex.PatternSyntaxException:
Look-behind group does not have an obvious maximum length near index 39
Is there a way I can use the capture group in this regex to say I want to match strings that don't match the capture group?

I think you can use a negative lookahead instead (if I understood your problem correctly). Try:
(\b\w+?\b)\(.*?\) ?{.*?METHOD *= *"(?!\1).*?"
See it here on Regexr
I also used word boundaries around the first group; otherwise the match simply starts at the second letter of the method name.
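
A minimal Java sketch of the negative-lookahead idea, assuming DOTALL is enabled as in the question. The class name MethodNameCheck and the helper hasMismatch are made up for illustration. Two details differ from the pattern as posted and are assumptions of this sketch: the { is escaped, because java.util.regex rejects an unescaped { that does not begin a valid quantifier, and the lookahead is written (?!\1") so that a literal which merely starts with the method name is still reported.

```java
import java.util.regex.Pattern;

public class MethodNameCheck {
    // Group 1 captures the method name.
    // (?!\1") fails only when the string literal equals that name exactly.
    static final Pattern MISMATCH = Pattern.compile(
        "(\\b\\w+\\b)\\(.*?\\) ?\\{.*?METHOD *= *\"(?!\\1\")[^\"]*\"",
        Pattern.DOTALL);

    // Returns true when the source contains a method whose METHOD
    // constant does not equal the method name.
    static boolean hasMismatch(String source) {
        return MISMATCH.matcher(source).find();
    }

    public static void main(String[] args) {
        String wrong = "protected String methodName(String args[]) {\n"
                     + "final String METHOD = \"wrongMethodName\";\n}";
        String right = "protected String methodName(String args[]) {\n"
                     + "final String METHOD = \"methodName\";\n}";
        System.out.println(hasMismatch(wrong)); // true
        System.out.println(hasMismatch(right)); // false
    }
}
```

One caveat: with DOTALL, .*? can run past the end of one method into the next, so on a file with several methods it is probably safer to feed the pattern one method body at a time.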

Related

Use a regex to find a pattern somewhere between two words

Given the following string
{"type":"PrimaryParty","name":"Karen","id":"456789-9996"},
{"type":"SecondaryParty","name":"Juliane","id":"345678-9996"},
{"type":"SecondaryParty","name":"Ellen","id":"001234-9996"}
I am looking for strings matching the pattern \d{6}-\d{4}, but only when they follow the string "SecondaryParty". The processor is Java-based.
Using https://regex101.com/ I have come up with this, which works fine using the ECMAScript(JavaScript) Flavor.
(?<=SecondaryParty.*?)\d{6}-\d{4}(?=\"})
But as soon as I switch to Java, it says
* A quantifier inside a lookbehind makes it non-fixed width
? The preceding token is not quantifiable
When using it in java.util.regex, the error says
Look-behind group does not have an obvious maximum length near index 20
(?<=SecondaryParty.*?)\d{6}-\d{4}(?="})
                    ^
How do I overcome the "does not have an obvious maximum length" problem in Java?

You can get the value without using lookarounds by matching instead, and use a single capture group for the value that you want to get:
\"SecondaryParty\"[^{}]*\"(\d{6}-\d{4})\"
Explanation
\"SecondaryParty\" Match "SecondaryParty"
[^{}]*\" Match zero or more chars other than { and }, then match "
(\d{6}-\d{4}) Capture group 1, match 6 digits - 4 digits
\" Match "
See a regex101 demo and a Java demo.
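
For reference, a small Java sketch of the capture-group approach, run against the sample input from the question. The class name SecondaryPartyIds and the method extractIds are illustrative names, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SecondaryPartyIds {
    // No lookbehind needed: match the whole stretch and read group 1.
    private static final Pattern ID =
        Pattern.compile("\"SecondaryParty\"[^{}]*\"(\\d{6}-\\d{4})\"");

    // Collects every id that appears in the same object as "SecondaryParty".
    static List<String> extractIds(String input) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(input);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String input =
            "{\"type\":\"PrimaryParty\",\"name\":\"Karen\",\"id\":\"456789-9996\"},\n"
          + "{\"type\":\"SecondaryParty\",\"name\":\"Juliane\",\"id\":\"345678-9996\"},\n"
          + "{\"type\":\"SecondaryParty\",\"name\":\"Ellen\",\"id\":\"001234-9996\"}";
        System.out.println(extractIds(input)); // [345678-9996, 001234-9996]
    }
}
```

Because [^{}]* cannot cross a brace, the match stays inside the object that contains "SecondaryParty".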

You might use a curly braces quantifier as a workaround:
(?<=SecondaryParty.{0,255})\d{6}-\d{4}(?=\"})
The minimum and maximum inside the curly-brace quantifier depend on your actual data.
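
A short sketch of the bounded-lookbehind workaround in Java. The bound of 255 is taken from the pattern above and is only a guess about the data. As an assumption of this sketch, the dot is replaced with [^{}] so the lookbehind cannot reach back into an earlier object on the same line. BoundedLookbehind and findIds are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedLookbehind {
    // {0,255} gives the lookbehind an explicit maximum length,
    // which is what java.util.regex requires.
    private static final Pattern ID = Pattern.compile(
        "(?<=SecondaryParty[^{}]{0,255})\\d{6}-\\d{4}(?=\"\\})");

    static List<String> findIds(String input) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(input);
        while (m.find()) {
            ids.add(m.group()); // the whole match is the id itself
        }
        return ids;
    }

    public static void main(String[] args) {
        String input =
            "{\"type\":\"PrimaryParty\",\"name\":\"Karen\",\"id\":\"456789-9996\"},"
          + "{\"type\":\"SecondaryParty\",\"name\":\"Juliane\",\"id\":\"345678-9996\"}";
        System.out.println(findIds(input)); // [345678-9996]
    }
}
```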

You could use the regular expression (?<=SecondaryParty)(.*?)(\d{6}-\d{4})(?=\"}) and take the value of the second group, which matches the pattern \d{6}-\d{4} only when it follows the string "SecondaryParty".
Sample Java code
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class IdRegexMatcher {
public static void main(String[] args) {
String input ="{\"type\":\"PrimaryParty\",\"name\":\"Karen\",\"id\":\"456789-9996\"},\n" +
"{\"type\":\"SecondaryParty\",\"name\":\"Juliane\",\"id\":\"345678-9996\"},\n" +
"{\"type\":\"SecondaryParty\",\"name\":\"Ellen\",\"id\":\"001234-9996\"}";
Pattern pattern = Pattern.compile("(?<=SecondaryParty)(.*?)(\\d{6}-\\d{4})(?=\\\"})");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
String idStr = matcher.group(2);
System.out.println(idStr);
}
}
}
which gives the output
345678-9996
001234-9996
One possible optimization in the above regex could be to use [^0-9]*? instead of .*? under the assumption that the name wouldn't contain numbers.

Matching regex groups with Java

I am trying to split a line with regex by using groups, but it's not working as I expected.
I want to match for example this line:
Ex. #1: temp name(this is the data)
and also this:
Ex. #2: temp name()
I used this regex:
[\s]*temp[\s]+[\s]*([A-Za-z]+)[\s]*[(]\s*(.*)+[)]\s*[{]\s*
which means: grab anything that starts with temp then put in group #1 the "name" then grab whatever inside the bracket and put it in group #2.
However, group #2 is always empty.
This is my code to fetch the data:
Pattern PATTERN = Pattern.compile("[\\s]*temp[\\s]+[\\s]*([A-Za-z]+)[\\s]*[(]\\s*(.*)+[)]\\s*");
Matcher m = PATTERN.matcher("temp name(this is the data)");
m.matches();
String name = m.group(1);
String data = m.group(2); // always empty
What am I doing wrong?

Your pattern doesn't match because it requires an open curly brace at the end, but your input doesn't have one.
Ignoring that small problem, the main problem is the little + after your capture group (.*)+. The plus requires one or more matches of .* and the group returned is the last match of the many. The term .* is greedy, so it consumes everything up to the bracket. The only way to match again is to consume nothing. So the last match of group 2 is blank.
To fix it, remove the + after group 2:
Pattern PATTERN = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Note also how I removed other unnecessary characters from your regex, e.g. the single-character character classes: [\\s] is identical to \\s. And \\s+\\s* is identical to just \\s+, because + is greedy.
I also removed the trailing curly bracket, which you can restore if your input data actually has it (your question showed input of "temp name(this is the data)", which has no trailing curly bracket).
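
The effect of the stray + can be seen by running both versions of the pattern side by side. This is only an illustrative sketch; the class RepeatedGroupDemo and the helper dataGroup are invented names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Returns group 2 when the whole input matches, otherwise null.
    static String dataGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "temp name(this is the data)";
        // (.*)+ repeats the group; only the last (empty) repetition is kept.
        String repeated = "\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)+[)]\\s*";
        // (.*) is matched once, so the group keeps the text between the brackets.
        String single = "\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*";

        System.out.println("[" + dataGroup(repeated, input) + "]"); // []
        System.out.println("[" + dataGroup(single, input) + "]");   // [this is the data]
    }
}
```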

Your regex should be this:
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
You had (.*)+, which means one or more repetitions of (.*). The group keeps only its last repetition, which is the empty string, so group 2 comes back empty.
Testing:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Example {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Matcher m = pattern.matcher("temp name(this is the data)");
if(m.matches()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
}
}
Output:
name
this is the data

[\s] is equivalent to \s
[\s]+[\s]* is equivalent to \s+
[(] is equivalent to \( (same for [)] and [}])
This would leave your regexp as:
\s*temp\s+([A-Za-z]+)\s*\(\s*(.*)+\)\s*\{\s*
Assuming you actually want to match temp name(...) { (your regexp is looking for a {, while in your question you do not specify that):
(.*)+ is your problem. You're saying: "Match any number (including 0) of characters and put them in a capture group, then repeat that at least once".
Regexps are greedy by default (they consume as much as possible), so the capture group will first contain everything within the two brackets. Then the + will try to match the entire group again, and will match it with "" (the empty string), as this fulfils the capture group's pattern. This will leave your capture group empty.
What you want instead is \s*temp\s+([A-Za-z]+)\s*\(\s*(.*)\)\s*\{\s*

The reason you are getting empty groups is because you are creating multiple capture groups every time you put something between (), even if it is nested.
To make a group that doesn't capture, you can mark it as a non-capturing group with ?:. For example, (?:sometest(this is the value we want)) will return just one group, while (sometest(this is the value we want)) will return 2 groups.
For your particular regex, I have refined and simplified it, as you had capture groups you did not need.
Simple solution:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*\\{\\s*
given the input:
Ex. #1: temp name(this is the data) {
Ex. #2: temp name() {
$1 = name, $2 = data
Pay attention to the fact that your regex contains a trailing curly brace. You can modify the regex to match without it and it will result in this:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*
https://regex101.com/r/tD0tO0/1

Java regular expression for UserName with length range

I am writing a regular expression to validate UserName.
Here is the rule:
Length: 6 - 20 characters
Must start with letter a-zA-Z
Can contain a-zA-Z0-9 and dot(.)
Can't have 2 consecutive dots
Here is what I tried:
import java.util.regex.Pattern;
import static java.util.regex.Pattern.CASE_INSENSITIVE;
public class TestUserName {
private static String USERNAME_PATTERN = "[a-z](\\.?[a-z\\d]+)+";
private static Pattern pattern = Pattern.compile(USERNAME_PATTERN, CASE_INSENSITIVE);
public static void main(String[] args) {
System.out.println(pattern.matcher("user.name").matches()); // true
System.out.println(pattern.matcher("user.name2").matches()); // true
System.out.println(pattern.matcher("user2.name").matches()); // true
System.out.println(pattern.matcher("user..name").matches()); // false
System.out.println(pattern.matcher("1user.name").matches()); // false
}
}
The pattern I used works, but it has no length constraint.
I tried to append a {6,20} constraint to the pattern, but it failed.
"[a-z](\\.?[a-z\\d]+)+{6,20}" // failed pattern to validate length
Does anyone have any ideas?
Thanks!

You can use a lookahead regex for all the checks:
^[a-zA-Z](?!.*\.\.)[a-zA-Z.\d]{5,19}$
[a-zA-Z.\d]{5,19} is used because one character has already been matched by [a-zA-Z] at the start, which makes the total length fall in the range {6,20}
Negative lookahead (?!.*\.\.) will assert failure if there are 2 consecutive dots
Equivalent Java pattern will be:
Pattern p = Pattern.compile("^[a-zA-Z](?!.*\\.\\.)[a-zA-Z.\\d]{5,19}$");

Use a negative look ahead to prevent double dots:
"^(?!.*\\.\\.)(?i)[a-z][a-z\\d.]{5,19}$"
(?i) means case insensitive (so [a-z] means [a-zA-Z])
(?!.*\\.\\.) means there are no two consecutive dots anywhere in it
The rest is obvious.
See live demo.

I would use the following regex :
^(?=.{6,20}$)(?!.*\.\.)[a-zA-Z][a-zA-Z0-9.]+$
The (?=.{6,20}$) positive lookahead makes sure the text will contain 6 to 20 characters, while the (?!.*\.\.) negative lookahead makes sure the text will not contain .. at any point.
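
As a rough check of this two-lookahead pattern in Java, the sketch below wraps it in a helper and tries a few of the names from the question. UsernameValidator and isValid are illustrative names. The reason a plain {6,20} appended to the original group does not help is that a quantifier counts repetitions of the group, not characters, which is why the length is checked in a separate lookahead here.

```java
import java.util.regex.Pattern;

public class UsernameValidator {
    // (?=.{6,20}$)  total length between 6 and 20
    // (?!.*\.\.)    no two consecutive dots
    // [a-zA-Z]      first character is a letter
    // [a-zA-Z0-9.]+ remaining characters are letters, digits or dots
    private static final Pattern USERNAME =
        Pattern.compile("^(?=.{6,20}$)(?!.*\\.\\.)[a-zA-Z][a-zA-Z0-9.]+$");

    static boolean isValid(String name) {
        return USERNAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("user.name"));  // true
        System.out.println(isValid("user..name")); // false: consecutive dots
        System.out.println(isValid("1user.name")); // false: starts with a digit
        System.out.println(isValid("user"));       // false: shorter than 6
    }
}
```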

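For matching only, lookaheads alone are enough (use find() rather than matches(), since the pattern consumes no characters):
(?=^.{6,20}$)(?=^[A-Za-z])(?!.*\.\.)
To capture the matched text, you can use
(?=^.{6,20}$)(?=^[A-Za-z])(?!.*\.\.)(^.*$)
Note that neither pattern restricts the remaining characters to a-zA-Z0-9 and dot, so add that check if the rule is required.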

Why does this regex capture the excluded character?

I have a regex like this:
(?:(\\s| |\\A|^))(?:#)[A-Za-z0-9]{2,}
What I am trying to do is find a pattern that starts with a # and has two or more characters after it; however, it can't start in the middle of a word.
I'm new to regex but was under the impression that ?: matches but then excludes the character. However, my regex seems to match and include the characters. Ideally I'd like "#test" to return "test" and "test#test" to not match at all.
Can anyone tell me what I've done wrong?
Thanks.

Your understanding is incorrect. The difference between (...) and (?:...) is only that the former also creates a numbered match group which can be referred to with a backreference from within the regex, or as a captured match group from code following the match.
You could change the code to use lookbehinds, but the simple and straightforward fix is to put ([A-Za-z0-9]{2,}) inside regular parentheses, like I have done here, and retrieve the first matched group. (The # doesn't need any parentheses around it in this scenario, but the ones you have are harmless.)
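
A minimal sketch of the capturing-group fix in Java, with the leading alternatives reduced to \s or start of input as a simplifying assumption. HashtagCapture and extract are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagCapture {
    // The prefix and the # are matched but sit outside the capturing group,
    // so group 1 holds only the characters after the #.
    private static final Pattern TAG =
        Pattern.compile("(?:\\s|^)#([A-Za-z0-9]{2,})");

    // Returns the first tag without its #, or null when there is none.
    static String extract(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("#test"));     // test
        System.out.println(extract("test#test")); // null
    }
}
```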

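Try this: you could use a word-boundary check to express your condition. Note that \B (not a word boundary) is needed before the #, because # is not a word character: \b# would only match when a word character precedes the #.
public static void main(String[] args) {
String s1 = "#test";
String s2 = "test#test";
String pattern = "\\B#\\w{2,}\\b";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(s1);
m.find();
System.out.println(m.group());
}
Output:
#test
In the second case (s2) find() returns false, so m.group() throws `IllegalStateException`.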

How about:
\W#[\S]{2}[\S]*
The strings caught by this regular expression need to be trimmed and have their first character removed.

I think the following would serve you better:
(?<=(?<!\w)#)\w{2,}
Debuggex Demo
Don't forget to escape the backslashes in Java, since the pattern is written as a string literal:
(?<=(?<!\\w)#)\\w{2,}
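
A small sketch of how this lookbehind pattern might be used with Matcher.find() so that the match itself is the text after the #. The class name HashtagLookbehind and the method findTags are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagLookbehind {
    // The # and the "no word character before it" check are both inside
    // lookbehinds, so neither is part of the matched text.
    private static final Pattern TAG =
        Pattern.compile("(?<=(?<!\\w)#)\\w{2,}");

    static List<String> findTags(String input) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(findTags("#test"));     // [test]
        System.out.println(findTags("test#test")); // []
    }
}
```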

Regex matching in Java

I have a string in Java that I need to split using "<$" and "$>" as delimiters.
But if I have something that looks like "\<$something_we_dont_care_what$>", then we ignore it and move on.
I've been trying to write a regex doing this for a while but I keep failing and reading about regular expressions in Java is just making me more and more confused...
Can anyone tell me the right way to do this?
Thank you.

Suppose you have two strings, not in your code but read from a file or a JTextField:
s = "\<$foo$>";
p = "[^\\]?<\$[^\$]*\$>";
And you want to match the pattern against the string.
What I have done so far:
[^\\]? is a character class that excludes the backslash; the ? makes it optional.
<\$, where the dollar sign, being special in regex, has to be masked by a backslash, just like the backslash before it.
[^\$]* is a character class that excludes the dollar sign, repeated any number of times.
\$> is a dollar sign followed by a greater-than sign. Again, the dollar is masked.
A question for your domain is whether the foo part, or something_we_dont_care_what, might contain a dollar sign not followed by a >. I assumed not.
s.matches (p);
This should now return true or false, but the problem is how to get it into your code. Not only regex but Java itself treats the backslash as a masking character, so you have to double each of them:
p = "[^\\\\]?<\\$[^\\$]*\\$>";
If the test case is a literal text in your code too, the same applies to it:
"\\<$foo$>".matches (p);
Trying patterns out is often a good idea if you have a tool where you can omit the Java masking first: a simple GUI with two JTextFields, or code which reads the pattern from a properties file, which saves you from repeated recompiles.
public class PM
{
public static void main (String args[])
{
String bad = "\\<$foo$>";
String good = "<$foo$>";
String p = "[^\\\\]?<\\$[^\\$]*\\$>";
System.out.println ("bad:\t" + bad.matches (p));
System.out.println ("good:\t" + good.matches (p));
}
}

Never mind.
I've found a solution after a few hours of browsing and experimenting.
The regular expression that does exactly what I wanted is the following:
// char $ needs to be escaped because it has a different meaning in regular expressions
// <$
String leftDelimiter = "(<\\$)";
// $>
String rightDelimiter = "(\\$>)";
// leftDelimiter | rightDelimiter
// when used to split a string, would split it each time it detected those two patterns,
// and it would also split it in the case where I don't want a split:
// the "\<$foo$>" case, when the delimiters are "escaped" in the string
// to solve it we can try to match our leftDelimiter only if the char \ isn't before it
// matches all <$ that don't start with \
String fixedLeftDelimiter = "(?<!\\\\)"+leftDelimiter;
// whatCanBeInTags is everything that can be in our tags besides the $ sign
// we are using {0,"+(Integer.MAX_VALUE-3)+"}? instead of *? because of a limitation
// on the number of characters allowed in a lookbehind assertion
// (declared before betterRightDelimiter, which uses it; its closing parenthesis ends the lookbehind opened there)
String whatCanBeInTags = "[^\\$]{0,"+(Integer.MAX_VALUE-3)+"}?)";
// the problem presents itself with the rightDelimiter because it needs to check
// whether there had been a leftDelimiter before it that has been escaped
// the following takes care of that
// matches all $> that don't have a <$ starting with \
String betterRightDelimiter = "(?<!\\\\"+leftDelimiter+whatCanBeInTags+rightDelimiter;
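
To show how such delimiters could be combined in a single String.split call, here is a compact sketch. It is a simplified variant rather than the exact expressions above: the lookbehind bound is a small hypothetical constant (MAX_TAG_LENGTH = 100) instead of Integer.MAX_VALUE-3, and the delimiters are not wrapped in capture groups. EscapedDelimiterSplit and split are illustrative names.

```java
import java.util.Arrays;
import java.util.List;

public class EscapedDelimiterSplit {
    // Hypothetical cap on tag length, chosen only so the lookbehind stays bounded.
    static final int MAX_TAG_LENGTH = 100;

    static final String SPLIT_REGEX =
        // "<$" that is not preceded by a backslash
        "(?<!\\\\)<\\$"
        // "$>" that does not close a tag opened by an escaped "\<$"
        + "|(?<!\\\\<\\$[^$]{0," + MAX_TAG_LENGTH + "})\\$>";

    static List<String> split(String input) {
        return Arrays.asList(input.split(SPLIT_REGEX));
    }

    public static void main(String[] args) {
        // The escaped tag \<$d$> is left untouched.
        System.out.println(split("a<$b$>c\\<$d$>e")); // [a, b, c\<$d$>e]
    }
}
```

Tags longer than the chosen bound would not be recognised as escaped, so the constant has to be sized for the real data.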
