Java regular expression nongreedy issue - java

I would like to match the shortest substring that starts with d, ends with a, and contains o.
Example: "djswxaeqobdnoa" => "dnoa"
With this code:
Pattern pattern = Pattern.compile("d.*?o.*?a");
Matcher matcher = pattern.matcher("fondjswxaeqobdnoajezbpfrehanxi");
while (matcher.find()) {
System.out.println(matcher.group());
}
The longer match "djswxaeqobdnoa" is printed instead of just "dnoa". Why? How can I match the shortest one?
Here is a solution:
String shortest = null;
Pattern pattern = Pattern.compile("(?=(d.*?o.*?a))");
Matcher matcher = pattern.matcher("ondjswxaeqobdnoajezbpfrehanxi");
while (matcher.find()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
if (shortest == null || matcher.group(i).length() < shortest.length()) {
shortest = matcher.group(i);
}
}
}

djswxaeqobdnoa
d....*..o..*.a
That's one match of your regular expression consuming the full String.

You are matching the whole String, hence the whole String is returned by your group invocation.
If you want specific matches of each segment of your Pattern, you only need to group those segments.
For instance:
Pattern pattern = Pattern.compile("(d.*?)(o.*?)a");
Matcher matcher = pattern.matcher("djswxaeqobdnoa");
while (matcher.find()) {
System.out.println(matcher.group());
// specific groups are 1-indexed
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
}
Output
djswxaeqobdnoa
djswxaeq
obdno

Your regex is d.*?o.*?a and the string you are matching against is djswxaeqobdnoa.
The engine starts at the first d and, because .*? is non-greedy, takes the shortest run up to the next o, so it matches from d to the first o. The second .*? then takes the shortest run from that o to the next a. Laziness only limits how far each quantifier extends; it does not move the start of the match, so the match still begins at the first d and covers the whole string.
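To see that laziness does not change where a match starts, here is a minimal sketch. The class and method names are hypothetical; it relies on Matcher.find(int), which restarts the search at a given index.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LeftmostMatchDemo {
    // Returns the first match found when the search starts at index `from`, or null.
    static String firstMatchFrom(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(from) ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "djswxaeqobdnoa";
        // Searching from index 0, the match is anchored at the first 'd'.
        System.out.println(firstMatchFrom("d.*?o.*?a", input, 0)); // djswxaeqobdnoa
        // Searching from index 1 skips the first 'd', so the match starts at the second 'd'.
        System.out.println(firstMatchFrom("d.*?o.*?a", input, 1)); // dnoa
    }
}
```

The same lazy pattern gives both results; only the starting position differs, which is why trying every start position (as the lookahead version does) is needed to find the shortest candidate.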

Thanks, with this code it works:
String shortest = null;
Pattern pattern = Pattern.compile("(?=(d.*?o.*?a))");
Matcher matcher = pattern.matcher("ondjswxaeqobdnoajezbpfrehanxi");
while (matcher.find()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
if (shortest == null || matcher.group(i).length() < shortest.length()) {
shortest = matcher.group(i);
}
}
}
Which is better?

Related

m.find() returns false when it should return true

m.find() returns false when it should return true.
solrQueries[i] contains the string:
"fl=trending:0,id,business_attr,newarrivals:0,bestprice:0,score,mostviewed:0,primarySortOrder,fastselling:0,modelNumber&defType=pedismax&pf=&mm=2<70%&bgids=1524&bgboost=0.1&shards.tolerant=true&stats=true"
The code is:
Pattern p = Pattern.compile("&mm=(\\d+)&");
for(int i=0; i<solrQueries.length; i++) {
Matcher m = p.matcher(solrQueries[i].toLowerCase());
System.out.println(p.matcher(solrQueries[i].toLowerCase()));
if (m.find()) {
System.out.println(m.group(1));
mmValues[i] = m.group(1);
}
}
Oh,
Pattern p = Pattern.compile("(?i)&mm=(\\d+)");
works fine now.
Thank you, @Wiktor Stribiżew
You executed m.find() twice: first in System.out.println(m.find()); and then in if (m.find()). Since there is only one match, the second call finds nothing even though the regex matches.
Use
public String[] fetchMmValue(String[] solrQueries) {
String[] mmValues = new String[solrQueries.length];
Pattern p = Pattern.compile("(?i)&mm=(\\d+)");
for(int i=0; i<solrQueries.length; i++) {
Matcher m = p.matcher(solrQueries[i]);
if (m.find()) {
// System.out.println(m.group(1)); // this is just for debugging
mmValues[i] = m.group(1);
}
}
return mmValues;
}
If you want to get all chars other than & after &mm=, use another regex:
"&mm=([^&]+)"
where [^&]+ matches 1 or more chars other than &.
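As a rough, self-contained sketch of the [^&]+ variant (the class name, method name, and the shortened query string are made up for illustration), note that find() is called exactly once per matcher:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MmValueDemo {
    // Captures everything after "&mm=" up to the next '&' (or the end of the string).
    private static final Pattern MM = Pattern.compile("(?i)&mm=([^&]+)");

    // Returns the raw mm value, or null when no "&mm=" parameter is present.
    static String mmValue(String query) {
        Matcher m = MM.matcher(query);
        return m.find() ? m.group(1) : null; // single find() call
    }

    public static void main(String[] args) {
        String query = "defType=pedismax&pf=&mm=2<70%&bgids=1524&bgboost=0.1";
        System.out.println(mmValue(query)); // 2<70%
    }
}
```

Because the pattern requires a leading &, this sketch would not find mm if it were the very first parameter in the string.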

Finding the longest "number sequence" in a string using only a single regex

I want to find a single regex which matches the longest numerical string in a URL.
For example, for the URL http://stackoverflow.com/1234/questions/123456789/ask, I would like it to return 123456789.
I thought I could use ([\d]+)
However, this returns the first match from the left, not the longest.
Any ideas? :)
This regex will be used as an input to a strategy pattern, which extracts certain characteristics from urls:
public static String parse(String url, String regex) {
Pattern pattern = Pattern.compile(regex);
Matcher m = pattern.matcher(url);
if (m.find()) {
return m.group(1);
}
return null;
}
So it would be much tidier if I could use a single regex. :(
Don't use regex. Just iterate the characters:
String longest = "";
int i = 0;
while (i < str.length()) {
while (i < str.length() && !Character.isDigit(str.charAt(i))) {
++i;
}
int start = i;
while (i < str.length() && Character.isDigit(str.charAt(i))) {
++i;
}
if (i - start > longest.length()) {
longest = str.substring(start, i);
}
}
@Andy already gave a non-regex answer, which is probably faster, but if you want to use regex, you must, as @Jan points out, add logic, e.g.:
public String findLongestNumber(String input) {
String longestMatch = "";
int maxLength = 0;
Matcher m = Pattern.compile("([\\d]+)").matcher(input);
while (m.find()) {
String currentMatch = m.group();
int currentLength = currentMatch.length();
if (currentLength > maxLength) {
maxLength = currentLength;
longestMatch = currentMatch;
}
}
return longestMatch;
}
Not possible with pure regex; however, I would do it this way (using Stream max and regex):
String url = "http://stackoverflow.com/1234/questions/123456789/ask";
Pattern biggest = Pattern.compile("/(\\d+)/");
Matcher m = biggest.matcher(url);
List<String> matches = new ArrayList<>();
while(m.find()){
matches.add(m.group(1));
}
System.out.println(matches.parallelStream().max((String a, String b) -> Integer.compare(a.length(), b.length())).get());
This prints: 123456789
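A more compact variant of the same idea, as a sketch only: it assumes Java 9 or later for Matcher.results(), and the class and method names are hypothetical. It matches plain \d+ rather than /(\d+)/, so digit runs do not need to be surrounded by slashes.

```java
import java.util.Comparator;
import java.util.Optional;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

final class LongestDigitRun {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the longest run of digits, or an empty Optional when there are none.
    static Optional<String> longest(String input) {
        return DIGITS.matcher(input)
                .results()                 // Stream<MatchResult>, available since Java 9
                .map(MatchResult::group)
                .max(Comparator.comparingInt(String::length));
    }

    public static void main(String[] args) {
        String url = "http://stackoverflow.com/1234/questions/123456789/ask";
        System.out.println(longest(url).orElse("(none)")); // 123456789
    }
}
```

The selection of the longest match still happens in Java code, not in the regex itself.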

matcher to avoid words ending with s,ing, or words in the middle

I am trying to match a text against a glossary list. The problem is that my pattern behaves differently for one text.
For example, here is my text:
\nfor Sprints \nSprints \nSprinting \nAccount Accounts Accounting\nSprintsSprints
With the following pattern I try to find only exact word matches from the glossary and avoid words ending in s, ing, etc. It returns the right answer for the word "Account", but for "Sprint" it also returns Sprints, Sprinting, etc., which is not right:
Pattern findTerm = Pattern.compile("(" + item.getTerm() + ")(\\W)",Pattern.DOTALL);
And here is my code:
private static String findGlossaryTerms(String response, List<Glossary> glossary) {
StringBuilder builder = new StringBuilder();
for (int offset = 0; offset < response.length(); offset++) {
boolean match = false;
if (response.startsWith("<", offset)) {
String newString = response.substring(offset);
Pattern findHtmlTag = Pattern.compile("\\<.*?\\>");
Matcher matcher = findHtmlTag.matcher(newString);
if (matcher.find()) {
String htmlTag = matcher.group(0);
builder.append(htmlTag);
offset += htmlTag.length() - 1;
match = true;
}
}
for (Glossary item : glossary) {
if (response.startsWith(item.getTerm(), offset)) {
String textFromOffset = response.substring(offset - 1);
Pattern findTerm = Pattern.compile("(" + item.getTerm() + ")(\\W)",Pattern.DOTALL);
Matcher matcher = findTerm.matcher(textFromOffset);
if (matcher.find()) {
builder.append("<span class=\"term\">").append(item.getTerm()).append("</span>");
offset += item.getTerm().length() - 1;
match = true;
break;
}
}
}
if (!match)
builder.append(response.charAt(offset));
}
return builder.toString();
}
What is the \\W in your pattern good for? If it is just to ensure that the word ends, use word boundaries instead:
Pattern findTerm = Pattern.compile("(\\b" + item.getTerm() + "\\b)",Pattern.DOTALL);
Those word boundaries ensure that you match the complete word and don't get partial matches.
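A small sketch of the word-boundary check in isolation. The class and method names are hypothetical, and wrapping the term in Pattern.quote is an added assumption (that glossary terms may contain regex metacharacters), not part of the answer above.

```java
import java.util.regex.Pattern;

final class WholeWordDemo {
    // True when `term` occurs in `text` as a complete word.
    static boolean containsWholeWord(String text, String term) {
        // Pattern.quote treats the term as a literal (assumption: terms are not regexes).
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWholeWord("Sprints Sprinting SprintsSprints", "Sprint")); // false
        System.out.println(containsWholeWord("the next Sprint starts", "Sprint"));           // true
        System.out.println(containsWholeWord("Account Accounts Accounting", "Account"));     // true
    }
}
```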

Regex. How to find similar parts after some text

I have the string p[name]=[1111];[2222] and I need to take three parts from it: p[name]=, [1111] and [2222]. The string can vary, e.g. p[name]=[1111] or p[name]=[1111];[2222,[1,2,3],1];[3333]
I'm trying to use a regex for it, but can't find a working solution.
My regex is
(p\\[[a-zA-Z0-9]+\\]=)(?:(\\[.[^;]+\\]);?)+
When I run this code I get only two groups
Pattern p = Pattern.compile("(p\\[[a-zA-Z0-9^=]+\\]=)(?:;*(\\[.[^;]+\\]))+");
Matcher m = p.matcher("p[name]=[1111];[2222]");
if (m.find()) {
for(int i = 1, l = m.groupCount(); i <= l; ++i) {
System.out.println(m.group(i));
}
}
Result is
p[name]=
[2222]
Why not simply do this?
Pattern p = Pattern.compile("p\\[[a-z0-9]+]=|\\[[0-9]+]", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("p[name]=[1111];[2222]");
while(m.find()) {
System.out.println(m.group());
}
However, if you want to check the string structure at the same time, you can use this kind of pattern:
Pattern p = Pattern.compile("(p\\[[a-z0-9]+]=)|\\G(?<!^)(\\[[0-9]+])(?:;|$)", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("p[name]=[1111];[2222]");
while(m.find()) {
System.out.println(m.group(1) != null ? m.group(1) : m.group(2));
}
I can give you this regex. It should also work with p[name]=[1111];[2222,[1,2,3],1]:
Pattern p = Pattern.compile("([p]+)|\\[[a-z0-9,\\[\\]]+\\]");
Matcher m = p.matcher("p[name]=[1111];[2222]");
if (m.find()) {
for(int i = 1, l = m.groupCount(); i <= l; ++i) {
System.out.println(m.group(i));
}
}
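For the nested case p[name]=[1111];[2222,[1,2,3],1];[3333], a plain split may be simpler than a regex. The following is a sketch of that alternative, not one of the regex answers above. It assumes that ';' never appears inside the brackets and that the first '=' ends the prefix; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BracketParts {
    // Splits "p[name]=[a];[b];..." into the prefix "p[name]=" and the ';'-separated parts.
    static List<String> parts(String s) {
        List<String> out = new ArrayList<>();
        int eq = s.indexOf('=');
        if (eq < 0) {
            return out; // no prefix found: return an empty list
        }
        out.add(s.substring(0, eq + 1));
        for (String part : s.substring(eq + 1).split(";")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints each part on its own line.
        for (String part : parts("p[name]=[1111];[2222,[1,2,3],1];[3333]")) {
            System.out.println(part);
        }
    }
}
```

This does not validate the structure of the string; it only separates the pieces.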

How to iterate over regex expression

Let's say I have the following String:
name1=gil;name2=orit;
I want to find all matches of name=value and make sure that the whole string matches the pattern.
So I did the following:
Ensure that the whole pattern matches what I want.
Pattern p = Pattern.compile("^((\\w+)=(\\w+);)*$");
Matcher m = p.matcher(line);
if (!m.matches()) {
return false;
}
Iterate over the pattern name=value
Pattern p = Pattern.compile("(\\w+)=(\\w+);");
Matcher m = p.matcher(line);
while (m.find()) {
map.put(m.group(1), m.group(2));
}
Is there some way to do this with one regex?
You can validate and iterate over matches with one regex by:
Ensuring there are no unmatched characters between matches (e.g. name1=x;;name2=y;) by putting a \G at the start of the regex, which means "the end of the previous match".
Checking whether we've reached the end of the string on our last match by comparing the length of our string to Matcher.end(), which returns the offset after the last character matched.
Something like:
String line = "name1=gil;name2=orit;";
Pattern p = Pattern.compile("\\G(\\w+)=(\\w+);");
Matcher m = p.matcher(line);
int lastMatchPos = 0;
while (m.find()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
lastMatchPos = m.end();
}
if (lastMatchPos != line.length())
System.out.println("Invalid string!");
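The \G approach can be wrapped into one method that both validates and collects the pairs. This is a sketch under stated assumptions: the class and method names are hypothetical, and returning null for an invalid line is just one possible convention.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PairParser {
    // \G forces every match to begin exactly where the previous one ended.
    private static final Pattern PAIR = Pattern.compile("\\G(\\w+)=(\\w+);");

    // Returns the parsed pairs, or null when the line is not fully covered by name=value; pairs.
    static Map<String, String> parse(String line) {
        Map<String, String> map = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        int end = 0;
        while (m.find()) {
            map.put(m.group(1), m.group(2));
            end = m.end();
        }
        return end == line.length() ? map : null;
    }

    public static void main(String[] args) {
        System.out.println(parse("name1=gil;name2=orit;")); // {name1=gil, name2=orit}
        System.out.println(parse("name1=x;;name2=y;"));     // null
    }
}
```

An empty line yields an empty map, which mirrors the ^(...)*$ validation pattern in the question.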
You have to enable multiline-mode for "^" and "$" to work as expected.
Pattern p = Pattern.compile("^(?:(\\w+)=(\\w+);)*$", Pattern.MULTILINE);
while (m.find()) {
for (int i = 0; i < m.groupCount() - 2; i += 2) {
map.put(m.group(i + 1), m.group(i + 2));
}
}
The comments were right: you still have to iterate through the matching groups for each line and make the outer group a non-capturing group (?:...).
String example = "name1=gil;name2=orit;";
Pattern pattern = Pattern.compile("((name[0-9]+?=(.+?);))+?");
Matcher matcher = pattern.matcher(example);
// verifies full match
if (matcher.matches()) {
System.out.println("Whole String matched: " + matcher.group());
// resets matcher
matcher.reset();
// iterates over found
while (matcher.find()) {
System.out.println("\tFound: " + matcher.group(2));
System.out.println("\t--> name is: " + matcher.group(3));
}
}
Output:
Whole String matched: name1=gil;name2=orit;
Found: name1=gil;
--> name is: gil
Found: name2=orit;
--> name is: orit
