Use a regex to find a pattern somewhere between two words - java

Given the following string
{"type":"PrimaryParty","name":"Karen","id":"456789-9996"},
{"type":"SecondaryParty","name":"Juliane","id":"345678-9996"},
{"type":"SecondaryParty","name":"Ellen","id":"001234-9996"}
I am looking for strings matching the pattern \d{6}-\d{4}, but only if they follow the string "SecondaryParty". The processor is Java-based.
Using https://regex101.com/ I have come up with this, which works fine using the ECMAScript (JavaScript) flavor.
(?<=SecondaryParty.*?)\d{6}-\d{4}(?=\"})
But as soon as I switch to Java, it says
* A quantifier inside a lookbehind makes it non-fixed width
? The preceding token is not quantifiable
When using it in java.util.regex, the error says
Look-behind group does not have an obvious maximum length near index 20
(?<=SecondaryParty.*?)\d{6}-\d{4}(?="})
                    ^
How do I overcome the "does not have an obvious maximum length" problem in Java?

You can get the value without using lookarounds by matching instead, and use a single capture group for the value that you want to get:
\"SecondaryParty\"[^{}]*\"(\d{6}-\d{4})\"
Explanation
\"SecondaryParty\" Match "SecondaryParty"
[^{}]*\" Match zero or more chars other than { and }, then match "
(\d{6}-\d{4}) Capture group 1, match 6 digits - 4 digits
\" Match "
See a regex101 demo and a Java demo.
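Since the demo links are not reproduced here, the following is a minimal sketch of how the capture-group pattern above could be used from Java. The class name SecondaryPartyIds and the helper extract are made up for illustration; only the regex itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original answer.
class SecondaryPartyIds {
    // "SecondaryParty", then anything that stays inside the same {...} object,
    // then the quoted id, which is captured in group 1.
    private static final Pattern ID_AFTER_SECONDARY =
            Pattern.compile("\"SecondaryParty\"[^{}]*\"(\\d{6}-\\d{4})\"");

    static List<String> extract(String input) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID_AFTER_SECONDARY.matcher(input);
        while (m.find()) {
            ids.add(m.group(1)); // group 1 holds only the id, without the quotes
        }
        return ids;
    }

    public static void main(String[] args) {
        String input = "{\"type\":\"PrimaryParty\",\"name\":\"Karen\",\"id\":\"456789-9996\"},\n"
                + "{\"type\":\"SecondaryParty\",\"name\":\"Juliane\",\"id\":\"345678-9996\"},\n"
                + "{\"type\":\"SecondaryParty\",\"name\":\"Ellen\",\"id\":\"001234-9996\"}";
        System.out.println(extract(input)); // [345678-9996, 001234-9996]
    }
}
```

Because [^{}]* cannot cross a brace, the match cannot run from one object into the next, which is what keeps the PrimaryParty id out of the result.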

You might use a curly braces quantifier as a workaround:
(?<=SecondaryParty.{0,255})\d{6}-\d{4}(?=\"})
The minimum and maximum inside the curly-brace quantifier depend on your actual data.
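A small sketch of the bounded-lookbehind workaround in Java follows. It assumes the id is never more than 255 characters after "SecondaryParty" and that each object sits on its own line, as in the question; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the bounded-lookbehind workaround.
class BoundedLookbehind {
    // {0,255} gives the lookbehind an obvious maximum length, which Java requires.
    // Assumption: 255 is large enough for the data at hand.
    private static final Pattern ID =
            Pattern.compile("(?<=SecondaryParty.{0,255})\\d{6}-\\d{4}(?=\"\\})");

    static List<String> extract(String input) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(input);
        while (m.find()) {
            ids.add(m.group()); // the lookarounds are not part of the match
        }
        return ids;
    }

    public static void main(String[] args) {
        String input = "{\"type\":\"PrimaryParty\",\"name\":\"Karen\",\"id\":\"456789-9996\"},\n"
                + "{\"type\":\"SecondaryParty\",\"name\":\"Juliane\",\"id\":\"345678-9996\"},\n"
                + "{\"type\":\"SecondaryParty\",\"name\":\"Ellen\",\"id\":\"001234-9996\"}";
        System.out.println(extract(input)); // [345678-9996, 001234-9996]
    }
}
```

Note that . does not match line breaks by default, so here the lookbehind stays within one line. If several objects were on a single line, a PrimaryParty id within 255 characters after a "SecondaryParty" would also match, so this approach is more fragile than the capture-group one.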

You could use the regex (?<=SecondaryParty)(.*?)(\d{6}-\d{4})(?=\"}) and take the value of the second group, which matches the pattern \d{6}-\d{4} only when it follows the string "SecondaryParty".
Sample Java code
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdRegexMatcher {
    public static void main(String[] args) {
        String input = "{\"type\":\"PrimaryParty\",\"name\":\"Karen\",\"id\":\"456789-9996\"},\n" +
                "{\"type\":\"SecondaryParty\",\"name\":\"Juliane\",\"id\":\"345678-9996\"},\n" +
                "{\"type\":\"SecondaryParty\",\"name\":\"Ellen\",\"id\":\"001234-9996\"}";
        Pattern pattern = Pattern.compile("(?<=SecondaryParty)(.*?)(\\d{6}-\\d{4})(?=\\\"})");
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            String idStr = matcher.group(2);
            System.out.println(idStr);
        }
    }
}
which gives the output
345678-9996
001234-9996
One possible optimization in the above regex could be to use [^0-9]*? instead of .*? under the assumption that the name wouldn't contain numbers.

Related

Matching regex groups with Java

I am trying to split a line with regex by using groups, but it's not working as I expected.
I want to match for example this line:
Ex. #1: temp name(this is the data)
and also this:
Ex. #2: temp name()
I used this regex:
[\s]*temp[\s]+[\s]*([A-Za-z]+)[\s]*[(]\s*(.*)+[)]\s*[{]\s*
which means: grab anything that starts with temp then put in group #1 the "name" then grab whatever inside the bracket and put it in group #2.
However, group #2 is always empty.
This is my code to fetch the data:
Pattern PATTERN = Pattern.compile("[\\s]*temp[\\s]+[\\s]*([A-Za-z]+)[\\s]*[(]\\s*(.*)+[)]\\s*");
Matcher m = PATTERN.matcher("temp name(this is the data)");
m.matches();
String name = m.group(1);
String data = m.group(2); // always empty
What am I doing wrong?
Your pattern doesn't match because it requires an open curly brace at the end, but your input doesn't have one.
Ignoring that small problem, the main problem is the little + after your capture group (.*)+. The plus requires one or more matches of .* and the group returned is the last match of the many. The term .* is greedy, so it consumes everything up to the bracket. The only way to match again is to consume nothing. So the last match of group 2 is blank.
To fix it, remove the + after group 2:
Pattern PATTERN = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Note also how I removed other unnecessary characters from your regex, e.g. the single-character character classes: [\\s] is identical to \\s. And \\s+\\s* is identical to just \\s+, because + is greedy.
I also removed the trailing curly bracket, which you can restore if your input data actually has it (your question showed input of "temp name(this is the data)", which has no trailing curly bracket).
Your regex should be this:
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
You had (.*)+ which means one or more matches of .*. The group keeps only its last repetition, which is an empty match, so it ends up empty.
Testing:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Example {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
        Matcher m = pattern.matcher("temp name(this is the data)");
        if (m.matches()) {
            System.out.println(m.group(1));
            System.out.println(m.group(2));
        }
    }
}
Output:
name
this is the data
[\s] is equivalent with \s
[\s]+[\s]* is equivalent with \s+
[(] is equivalent with \( (same for [)] and [}])
This would leave your regexp as:
\s*temp\s+([A-Za-z]+)\s*\(\s*(.*)+\)\s*\{\s*
Assuming you actually want to match temp name(...) { (your regexp is looking for a {, while in your question you do not specify that):
(.*)+ is your problem. You're saying: "Match any number (including 0) of characters and put them in a capture group, repeat that at least once".
Regexps are greedy by default (= they consume as much as possible), so the capture group will first contain everything within the two brackets, then the + will try to match the entire group again, and will match it with "" (the empty string) as this fulfils the capture group's pattern. This will leave your capture group empty.
What you want instead is \s*temp\s+([A-Za-z]+)\s*\(\s*(.*)\)\s*\{\s*
The reason you are getting empty groups is because you are creating multiple capture groups every time you put something between (), even if it is nested.
To make a group that doesn't capture, you can designate it as a non-capturing group by using ?: at its start. For example, (?:sometest(this is the value we want)) will return just one group, while (sometest(this is the value we want)) will return 2 groups.
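The group counts mentioned above can be checked directly with Matcher.groupCount(), which reports how many capturing groups a pattern defines. This is only a small sketch; the class name and the shortened patterns are illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: counts capturing groups in a pattern.
class GroupCountDemo {
    static int groups(String regex) {
        // groupCount() depends only on the pattern, so an empty input is enough.
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groups("(?:sometest(value))")); // 1: the outer group does not capture
        System.out.println(groups("(sometest(value))"));   // 2: outer and inner both capture
    }
}
```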
For your particular regex, I have refined and simplified it, as you had capture groups you did not need.
Simple solution:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*\\{\\s*
given the input:
Ex. #1: temp name(this is the data) {
Ex. #2: temp name() {
$1 = name, $2 = data
Pay attention to the fact that your regex contains a trailing curly brace. You can modify the regex to match without it and it will result in this:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*
https://regex101.com/r/tD0tO0/1

Regex: Is it possible to skip repeating negative lookbehinds?

I've been trying to fix a simple regex that:
Matches all characters from beginning of line (^) to the first & character or to the end of line ($).
The match cannot start with a &.
Examples:
test should match test.
one&two should match one.
&test shouldn't match anything.
My current regex is the following:
^(?<!\&)(.+?)(?=\&|$)
(Regex101)
Currently, this regex fails example 3, where if I gave this regex &test it matches &test, but it shouldn't match anything.
I think it may be a problem with the negative lookbehind (?<!\&) and that &test matches because the character before it is not a &, but it doesn't account for any following & characters.
Is modifying the negative lookbehind to account for repeating & characters possible, and if so, how could I fix this regex?
(I know that Regex101 is using Python's Regex, but this question's Regex is intended to work with Java.)
You need to use a look-ahead instead of a look-behind, and instead of lazy dot matching with a lookahead, use a negated character class:
^[^&]+
See demo (note that \n is added just for a demo, if you test strings without newline characters, it won't be necessary).
Here, ^ asserts the position at the start of the string, and [^&]+ class matches 1 or more characters other than & (thus, no need to use (?=\&|$) look-ahead, if needed, the whole line will be matched).
See IDEONE demo
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Main
{
    public static void main(String[] args) throws java.lang.Exception
    {
        System.out.println(fetchMatch("test", 0));
        System.out.println(fetchMatch("one&test", 0));
        System.out.println(fetchMatch("&test", 0));
    }

    // Returns the requested group of the first match, or an error marker if nothing matches.
    public static String fetchMatch(String s, int groupId)
    {
        Pattern pattern = Pattern.compile("^[^&]+");
        Matcher matcher = pattern.matcher(s);
        if (matcher.find()) {
            return matcher.group(groupId);
        }
        return "ERROR: NOT MATCHED";
    }
}
Output:
test
one
ERROR: NOT MATCHED
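To illustrate why the original lookbehind has no effect, and what the "look-ahead instead of a look-behind" remark could look like if the rest of the question's regex is kept, here is a hedged sketch. At the start of the string there is no preceding character, so (?<!&) always succeeds; (?!&) checks the next character instead. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the question's regex with a lookahead variant.
class FirstSegment {
    // The question's pattern: the lookbehind inspects the (nonexistent) character before position 0.
    static final Pattern LOOKBEHIND = Pattern.compile("^(?<!&)(.+?)(?=&|$)");
    // Variant: the negative lookahead rejects inputs whose first character is '&'.
    static final Pattern LOOKAHEAD = Pattern.compile("^(?!&)(.+?)(?=&|$)");

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstMatch(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(LOOKBEHIND, "&test"));  // &test
        System.out.println(firstMatch(LOOKAHEAD, "&test"));   // null
        System.out.println(firstMatch(LOOKAHEAD, "one&two")); // one
        System.out.println(firstMatch(LOOKAHEAD, "test"));    // test
    }
}
```

The negated character class ^[^&]+ from the answer gives the same results for these inputs with a simpler pattern.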

Why does this regex capture the excluded character?

I have a regex like this:
(?:(\\s| |\\A|^))(?:#)[A-Za-z0-9]{2,}
What I am trying to do is find a pattern that starts with an # and has two or more characters after, however it can't start in the middle of a word.
I'm new to regex but was under the impression ?: matches but then excludes the character however my regex seems to match but include the characters. Ideally I'd like for "#test" to return "test" and "test#test" to not match at all.
Can anyone tell me what I've done wrong?
Thanks.
Your understanding is incorrect. The difference between (...) and (?:...) is only that the former also creates a numbered match group which can be referred to with a backreference from within the regex, or as a captured match group from code following the match.
You could change the code to use lookbehinds, but the simple and straightforward fix is to put ([A-Za-z0-9]{2,}) inside regular parentheses, like I have done here, and retrieve the first matched group. (The # doesn't need any parentheses around it in this scenario, but the ones you have are harmless.)
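As a sketch of the capture-group fix described above (the linked example is not reproduced here, so the simplified pattern and all names below are assumptions): the prefix stays in a non-capturing group, and only the characters after the # are captured and read back with group(1).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is a simplified form of the question's regex.
class HashTag {
    // (?:\s|^) still consumes the preceding whitespace, but group 1 holds only the tag text.
    private static final Pattern TAG = Pattern.compile("(?:\\s|^)#([A-Za-z0-9]{2,})");

    // Returns the tag without '#', or null when the input has no valid tag.
    static String tag(String s) {
        Matcher m = TAG.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(tag("#test"));      // test
        System.out.println(tag("say #hello")); // hello
        System.out.println(tag("test#test"));  // null
    }
}
```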
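Try this: you could use a non-word boundary (\B) before the # to express your condition. A plain \b would not work there, because \b needs a word character on one side, and neither the start of the string nor # is one.
public static void main(String[] args) {
    String s1 = "#test";
    String s2 = "test#test";
    String pattern = "\\B#\\w{2,}\\b";
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(s1);
    m.find();
    System.out.println(m.group());
}
Output:
#test
In the second case (s2), find() returns false, so m.group() throws an IllegalStateException.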
How about:
\W#[\S]{2}[\S]*
The strings caught by this regular expression need to be trimmed and have their first character removed.
I guess you would be better off with the following one:
(?<=(?<!\w)#)\w{2,}
Debuggex Demo
Don't forget to escape the backslashes in Java, since the regex is written in a string literal:
(?<=(?<!\\w)#)\\w{2,}
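A short sketch of how the escaped lookbehind pattern above might be used: since the # sits inside the lookbehind, the match itself is only the word after it. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the nested-lookbehind pattern.
class HashTagLookbehind {
    // A '#' must come right before the match, and that '#' must not follow a word character.
    private static final Pattern TAG = Pattern.compile("(?<=(?<!\\w)#)\\w{2,}");

    // Returns the matched word, or null when the input has no valid tag.
    static String tag(String s) {
        Matcher m = TAG.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(tag("#test"));     // test
        System.out.println(tag("test#test")); // null
    }
}
```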

Regex to match four repeated letters in a string using a Java pattern

I want to match something like aaaa, aaaad, adjjjjk. Something like ([a-z])\1+ was used to match the repeated characters, but I am not able to figure this out for four letters.
You want to match a single character and then that character repeated three more times:
([a-z])\1{3}
Note: In Java you need to escape the backslashes inside your regular expressions.
Update: The reason why it isn't doing what you want is because you are using the method matches which requires that the string exactly matches the regular expression, not just that it contains the regular expression. To check for containment you should instead use the Matcher class. Here is some example code:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Program
{
    public static void main(String[] args)
    {
        Pattern pattern = Pattern.compile("([a-z])\\1{3}");
        Matcher matcher = pattern.matcher("asdffffffasdf");
        System.out.println(matcher.find());
    }
}
Result:
true
Not knowing about the finite repetition syntax, your own problem solving skill should lead you to this:
([a-z])\1\1\1
Obviously it's not pretty, but:
It works
It exercises your own problem solving skill
It may lead you to deeper understanding of concepts
In this case, knowing the desugared form of the finite repetition syntax
I have a concern:
"ffffffff".matches("([a-z])\\1{3,}") = true
"fffffasdf".matches("([a-z])\\1{3,}") = false
"asdffffffasdf".matches("([a-z])\\1{3,}") = false
What can I do for the bottom two?
The problem is that in Java, matches needs to match the whole string; it is as if the pattern is surrounded by ^ and $.
Unfortunately there is no String.containsPattern(String regex), but you can always use this trick of surrounding the pattern with .*:
"asdfffffffffasf".matches(".*([a-z])\\1{3,}.*") // true!
// ^^ ^^
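As an alternative to wrapping the pattern in .*, a containment check can be sketched with Matcher.find(), which looks for the pattern anywhere in the input. The helper name containsRun is made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for a "contains four repeated letters" check.
class RepeatedLetters {
    // One lowercase letter followed by three more copies of the same letter.
    private static final Pattern FOUR_IN_A_ROW = Pattern.compile("([a-z])\\1{3}");

    static boolean containsRun(String s) {
        return FOUR_IN_A_ROW.matcher(s).find(); // find() does not require a full-string match
    }

    public static void main(String[] args) {
        System.out.println(containsRun("aaaa"));      // true
        System.out.println(containsRun("adjjjjk"));   // true
        System.out.println(containsRun("fffffasdf")); // true
        System.out.println(containsRun("aaab"));      // false
    }
}
```

Compiling the pattern once and reusing it also avoids recompiling the regex on every call, which String.matches does.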
You can put {n} after something to match it n times, so:
([a-z])\1{3}
The general regex syntax for a fixed number of repetitions is {n}, for example {4}.
Thus here ([a-z])\1{3} should match your 4 chars.

Java regular expression to match patterns and extract them

I tried writing a program in Java using regex to match a pattern and extract it. Given a string like "This is a link- #www.google.com# and this is another #google.com#" I should be able to get #www.google.com# and #google.com# strings extracted. Here is what I tried-
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseLinks {
    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";
        Pattern p = Pattern.compile("#.*#");
        Matcher matcher = p.matcher(message);
        while (matcher.find()) {
            String result = matcher.group();
            System.out.println(result);
        }
    }
}
This results in output- #www.google.com# and this is another #google.com#. But what I wanted is only the strings #www.google.com# and #google.com# extracted. Can I please know the regex for this?
#[^#]+#
Though thinking about it, a hash sign is a bad choice for delimiting URLs, for rather obvious reasons.
The reason why yours does not work is the greediness of the star (from regular-expressions.info):
[The star] repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.
Assuming Java regex supports it, use the non-greedy pattern .*? instead of the greedy .* so that it will end the capture as soon as possible instead of as late as possible.
If the language doesn't support it, then you can approximate it by simply checking for anything that's not an ending delimiter, like so:
#[^#]*#
Regular expressions are "greedy" by default, that is, they will match as much text as possible. In your example, the pattern "#.*#" translates to
match a "#"
match as many characters as possible such that you can still ...
... match a "#"
What you want is a "non-greedy" or "reluctant" pattern such as "*?". Try "#.*?#" in your case.
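The difference between the greedy pattern and the two suggested fixes can be seen side by side in a small sketch like the one below. The class and the extract helper are hypothetical; the three patterns are the ones discussed in the answers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy, reluctant, and negated-class patterns.
class HashDelimited {
    // Collects every match of the given regex in the input, in order.
    static List<String> extract(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";
        // Greedy: runs from the first '#' to the last one, giving a single long match.
        System.out.println(extract("#.*#", message));
        // Reluctant: stops at the nearest closing '#'.
        System.out.println(extract("#.*?#", message));   // [#www.google.com#, #google.com#]
        // Negated class: cannot cross a '#' at all.
        System.out.println(extract("#[^#]+#", message)); // [#www.google.com#, #google.com#]
    }
}
```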
