Match longest string in Regex OR in case of common substring - java

In a regex OR (alternation), when several alternatives share a common prefix, the regex matches the first alternative listed rather than the longest one.
For example, for the regular expression regex = (KA|KARNATAKA) and input = KARNATAKA, the output is 2 matches: match1 = KA and match2 = KA.
What I want is the longest possible match among the alternatives, which in my example is match1 = KARNATAKA.
What I am doing right now is sorting the alternatives in the regex OR by length in descending order.
My question is: can we specify in the regex itself that the longest possible string should be matched, or is sorting the only way to do it?
I have already referred to this question and I don't see a solution other than sorting.

You can use a word boundary (\b) to avoid matching prefixes.
For the case you mentioned, the following regex will only match KA or KARNATAKA when each appears as a whole word:
(\bKA\b|\bKARNATAKA\b)
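As a rough illustration of the word-boundary suggestion, the sketch below compares the plain alternation with the \b version. The class name WordBoundaryDemo and the helper findAll are made up for this example. Note that \b only helps when the alternatives occur as separate words in the input; it does not make the engine prefer the longest alternative in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class WordBoundaryDemo {
    // Collects every match that Matcher.find() reports for the given regex.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Plain alternation: the first alternative wins, so KA is found twice.
        System.out.println(findAll("(KA|KARNATAKA)", "KARNATAKA"));
        // With word boundaries, KA cannot match inside KARNATAKA.
        System.out.println(findAll("(\\bKA\\b|\\bKARNATAKA\\b)", "KARNATAKA"));
        // Both alternatives still match when they appear as separate words.
        System.out.println(findAll("(\\bKA\\b|\\bKARNATAKA\\b)", "KA KARNATAKA"));
    }
}
```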

You can create a helper method for this:
import java.util.Arrays;
import java.util.Comparator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public final class PatternHelper {
    // Reorders the alternatives of a parenthesised (a|b|...) group so that longer ones come first.
    public static Pattern compileSortedOr(String regex) {
        Matcher matcher = Pattern.compile("(.*)\\((.*\\|.*)\\)(.*)").matcher(regex);
        if (matcher.matches()) {
            String sortedConditions = Arrays.stream(matcher.group(2).split("\\|"))
                    .sorted(Comparator.comparingInt(String::length).reversed())
                    .collect(Collectors.joining("|"));
            return Pattern.compile(matcher.group(1) + "(" + sortedConditions + ")" + matcher.group(3));
        }
        // No alternation group found: compile the expression unchanged.
        return Pattern.compile(regex);
    }
}
Matcher matcher = PatternHelper.compileSortedOr("(KA|KARNATAKA)").matcher("KARNATAKA");
if (matcher.matches()) {
System.out.println(matcher.group(1));
}
Output:
KARNATAKA
P.S. This only works for simple expressions without nested brackets. You would need to tweak it if you expect more complex expressions.
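If the alternatives are available as a list of literal strings rather than as a finished regex, the sorting can happen before the pattern is built, which avoids parsing the regex at all. The following is only a sketch under that assumption; the class name LongestFirstAlternation and the method build are invented for illustration, and Pattern.quote is used so that the strings are treated as literals.

```java
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, assuming the alternatives are plain literal strings.
class LongestFirstAlternation {
    // Builds "(longest|...|shortest)" with every alternative quoted as a literal.
    static Pattern build(List<String> literals) {
        String alternation = literals.stream()
                .sorted(Comparator.comparingInt(String::length).reversed())
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "(", ")"));
        return Pattern.compile(alternation);
    }

    public static void main(String[] args) {
        Matcher m = build(List.of("KA", "KARNATAKA")).matcher("KARNATAKA");
        while (m.find()) {
            System.out.println(m.group(1)); // prints KARNATAKA
        }
    }
}
```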

Related

Searching characters with regular expressions

How do I search a string that can have a "<=", ">=" or a "="?
I've reached this point:
[<>][=]
so it finds the first two.
Is there any character that, inside the [<>], matches "nothing", so that I just get the [=] that follows?
To make a pattern optional (zero or one occurrence), use the ? quantifier:
[<>]?=
In Java, you can use it with matches() to check if a string contains <=, >= or just =:
if (s.matches("(?s).*[<>]?=.*")) {...}
Or use Matcher#find():
String s = "Some = equal sign";
Pattern pattern = Pattern.compile("[<>]?=");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println("Found " + matcher.group());
} // => Found =
An alternative to @stribizhev's suggestion to use ? is to explicitly enumerate the three cases:
(<=|>=|=)
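To see that the two suggestions behave the same on typical input, here is a small sketch that runs both patterns over one sample string. The class name ComparisonOperatorDemo, the helper findAll, and the sample text are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the optional-character and enumerated patterns.
class ComparisonOperatorDemo {
    // Collects every match that Matcher.find() reports for the given regex.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "a <= b, c >= d, e = f";
        System.out.println(findAll("[<>]?=", input));    // [<=, >=, =]
        System.out.println(findAll("(<=|>=|=)", input)); // [<=, >=, =]
    }
}
```

In the enumerated form the two-character operators are listed before the bare =, so the alternation finds <= and >= as a whole rather than only the = inside them.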

How to parse a range input in java

I want to parse a range of data (e.g. 100-2000) in Java. Is this code correct:
String patternStr = "^(\\\\d+)-(\\\\d+)$";
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(inputStr);
if(matcher.find()){
// Doing some parser
}
Too many backslashes, and you can use matches() without anchors (^$).
String inputStr = "100-2000";
String patternStr = "(\\d+)-(\\d+)";
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(inputStr);
if (matcher.matches()) {
System.out.println(matcher.group(1) + " - " + matcher.group(2));
}
As for your question "Is this code correct", all you had to do was wrap the code in a class with a main method and run it, and you'd get the answer: No.
No, you're double (well, quadruple)-escaping the digits.
It should be: "^(\\d+)-(\\d+)$".
Meaning:
Start of input: ^
Group 1: 1+ digit(s): (\\d+)
Hyphen literal: -
Group 2: 1+ digit(s): (\\d+)
End of input: $
Notes
The groups are useful for back-references and for reading the matched numbers via Matcher.group(). If you need neither, you can drop the parentheses around the \\d+ expressions.
You are parsing the representation of a range in this example.
If you want an actual range inside the regex itself, use a character class of the form [min-max], where "min" and "max" are single characters, for instance [0-9].
As mentioned by Andreas, you can use String.matches without the Pattern-Matcher idiom and the ^ and $, if you want to match the whole input.
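Putting the corrected pattern to work, a minimal sketch of a parser that turns the text into two numbers might look like this. The class name RangeParser, the method parse, and the choice to return null for invalid input are all assumptions for illustration, not something the answers above prescribe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser built around the corrected pattern (\d+)-(\d+).
class RangeParser {
    private static final Pattern RANGE = Pattern.compile("(\\d+)-(\\d+)");

    // Returns {min, max}, or null if the input is not of the form "digits-digits".
    // Long.parseLong can still throw NumberFormatException for numbers too large for a long.
    static long[] parse(String input) {
        Matcher m = RANGE.matcher(input);
        if (!m.matches()) { // matches() tests the whole input, so no ^ and $ are needed
            return null;
        }
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    public static void main(String[] args) {
        long[] range = parse("100-2000");
        System.out.println(range[0] + " .. " + range[1]); // 100 .. 2000
        System.out.println(parse("100 - 2000") == null);   // true: spaces are not allowed
    }
}
```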

Java regex matching each occurrence separately

I have this regex:
<a href(.*foo.bar.*)a>
For this string, it gives me only 1 match, but I need it to give 3 matches.
First RANDOM TEXT COULD BE HERE Second RANDOM TEXT COULD BE HERE Third
So each a href should be individual.
How could I accomplish this?
EDIT:
This code searches for matches:
Pattern pattern = Pattern.compile("<a href(.*foo.bar.*)a>");
Matcher matcher = pattern.matcher(body);
List<String> matches = new ArrayList<String>();
while (matcher.find()) {
matches.add(matcher.group());
}
Change to:
<a href(.*?foo\.bar.*?)a>
This removes the greediness. Literal dots should also be escaped as \.
Use .*? instead of .*. The greedy quantifier matches as many characters as possible, while the reluctant quantifier matches as few characters as possible in a single find operation.
Also, use foo\.bar if you intend to match the literal text "foo.bar".
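The difference between the greedy and the reluctant version can be checked by counting matches. The sketch below does that on a made-up sample; the exact markup in SAMPLE is an assumption (the original input is not shown in full), as are the class name GreedyVsReluctantDemo and the helper countMatches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo; SAMPLE is an assumed input with three links.
class GreedyVsReluctantDemo {
    static final String SAMPLE =
            "<a href=\"https://foo.bar/1\">First</a> RANDOM TEXT "
          + "<a href=\"https://foo.bar/2\">Second</a> RANDOM TEXT "
          + "<a href=\"https://foo.bar/3\">Third</a>";

    // Counts how many times Matcher.find() succeeds for the given regex.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Greedy: one match spanning from the first "<a href" to the last "a>".
        System.out.println(countMatches("<a href(.*foo.bar.*)a>", SAMPLE));     // 1
        // Reluctant: each link is matched on its own.
        System.out.println(countMatches("<a href(.*?foo\\.bar.*?)a>", SAMPLE)); // 3
    }
}
```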
I hope the code below helps:
int noOfTimefoundString = 0;
Pattern pattern = Pattern.compile("<a href=\"https://foo\\.bar");
Matcher matcher = pattern.matcher(body);
List<String> matches = new ArrayList<String>();
while (matcher.find()) {
    matches.add(matcher.group());
    noOfTimefoundString++;
}
// Print every match, then the total count.
for (String match : matches) {
    System.out.println(match);
}
System.out.println("No. of times search string found = " + noOfTimefoundString);

Reg expression - split string between matching strings

I am trying to get an array of strings from a lengthy string. The array should consist of the substrings found between two delimiter strings (??? and ??? in my case). I tried the following code, but it is not giving me the expected results:
Pattern pattern = Pattern.compile("\\?\\?\\?(.*?)\\?\\?\\?");
String[] arrayOfKeys = pattern.split("???label.missing???sdfjkhsjkdf sjkdghfjksdg ???some.label???sdjkhsdj");
for (String key : arrayOfKeys) {
System.out.println(key);
}
My expected result is:
["label.missing", "some.label"]
Use Pattern.matcher() to obtain a Matcher for the input string, then use Matcher.find() to find the pattern you want. Matcher.find() will find the substring(s) that match the Pattern provided.
Pattern pattern = Pattern.compile("\\?{3}(.*?)\\?{3}");
Matcher m = pattern.matcher(inputString);
while (m.find()) {
System.out.println(m.group(1));
}
Pattern.split() uses your pattern as a delimiter to split the string (the delimiter part is then discarded), which is not what you want in this case. Your regex is designed to match the text that you want to extract.
I shortened the pattern by using the quantifier {3}, which repeats exactly 3 times, instead of writing \? 3 times.
I would create a string input with what you're trying to split, and call input.split() on it.
String input = "???label.missing???sdfjkhsjkdf sjkdghfjksdg ???some.label???sdjkhsdj";
String[] split = input.split("\\?\\?\\?");
Try it here:
http://ideone.com/VAmCyu
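For reference, here is a small sketch of what splitting on the ??? delimiter returns for the sample input. The class name SplitDemo and the method splitOnDelimiter are invented for this example. The result contains the text outside the markers as well, so with this particular input the wanted keys sit at the odd indices.

```java
import java.util.Arrays;

// Hypothetical demo showing the raw result of splitting on "???".
class SplitDemo {
    static String[] splitOnDelimiter(String input) {
        return input.split("\\?\\?\\?");
    }

    public static void main(String[] args) {
        String input = "???label.missing???sdfjkhsjkdf sjkdghfjksdg ???some.label???sdjkhsdj";
        String[] parts = splitOnDelimiter(input);
        // The input starts with a delimiter, so the first element is an empty string.
        System.out.println(Arrays.toString(parts));
        // For this input the keys are parts[1] and parts[3].
        System.out.println(parts[1] + ", " + parts[3]);
    }
}
```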
Pattern pattern = Pattern.compile("\\?{3}(.+?)\\?{3}");
Matcher matcher = pattern.matcher("???label.missing???sdfjkhsjkdf sjkdghfjksdg ???some.label???sdjkhsdj");
List<String> aList = new ArrayList<String>();
while (matcher.find()) {
    aList.add(matcher.group(1));
}
for (String key : aList) {
    System.out.println(key);
}

Java regex patterns

I need help with this matter. Look at the following regex:
Pattern pattern = Pattern.compile("[A-Za-z]+(\\-[A-Za-z]+)");
Matcher matcher = pattern.matcher(s1);
I want to find words like "home-made" and "aaaa-bbb", but not "aaa - bbb" and not "aaa--aa--aaa". Basically, I want the following:
word - hyphen - word.
It works for everything, except that "aaa--aaa--aaa" passes and it shouldn't. What regex will work for this pattern?
You can remove the backslash from your expression:
"[A-Za-z]+-[A-Za-z]+"
The following code should work then
Pattern pattern = Pattern.compile("[A-Za-z]+-[A-Za-z]+");
Matcher matcher = pattern.matcher("aaa-bbb");
boolean match = matcher.matches();
Note that you can use Matcher.matches() instead of Matcher.find() in order to check the complete string for a match.
If instead you want to look inside a string using Matcher.find() you can use the expression
"(^|\\s)[A-Za-z]+-[A-Za-z]+(\\s|$)"
but note that then only words separated by whitespace will be found (i.e. no words like aaa-bbb.). To capture also this case you can then use lookbehinds and lookaheads:
"(?<![A-Za-z-])[A-Za-z]+-[A-Za-z]+(?![A-Za-z-])"
which will read
(?<![A-Za-z-]) // before the match there must not be a letter or a -
[A-Za-z]+ // the match itself starts with one or more letters
- // followed by a -
[A-Za-z]+ // followed by one or more letters
(?![A-Za-z-]) // and after the match there must not be a letter or a -
An example:
Pattern pattern = Pattern.compile("(?<![A-Za-z-])[A-Za-z]+-[A-Za-z]+(?![A-Za-z-])");
Matcher matcher = pattern.matcher("It is home-made.");
if (matcher.find()) {
System.out.println(matcher.group()); // => home-made
}
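To show what the lookarounds change when searching inside a longer text with Matcher.find(), the sketch below runs the original pattern and the lookaround pattern over the same sentence. The class name HyphenatedWordDemo, the helper findAll, and the sample sentence are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo comparing the plain pattern with the lookaround version.
class HyphenatedWordDemo {
    static final String PLAIN = "[A-Za-z]+(\\-[A-Za-z]+)";
    static final String GUARDED = "(?<![A-Za-z-])[A-Za-z]+-[A-Za-z]+(?![A-Za-z-])";

    // Collects every match that Matcher.find() reports for the given regex.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sentence = "It is home-made, but home--home-home is not.";
        // The plain pattern also picks up the tail of "home--home-home".
        System.out.println(findAll(PLAIN, sentence));   // [home-made, home-home]
        // The lookarounds reject anything that touches another letter or hyphen.
        System.out.println(findAll(GUARDED, sentence)); // [home-made]
    }
}
```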
Actually, I can't reproduce the problem with your expression if I use single words as the input string. As the discussion in the comments cleared up, though, the string s contains a whole sentence that is first tokenised into words, each of which is then matched.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegExp {
    private static void match(String s) {
        Pattern pattern = Pattern.compile("[A-Za-z]+(\\-[A-Za-z]+)");
        Matcher matcher = pattern.matcher(s);
        if (matcher.matches()) {
            System.out.println("'" + s + "' match");
        } else {
            System.out.println("'" + s + "' doesn't match");
        }
    }
    /**
     * @param args
     */
    public static void main(String[] args) {
        match(" -home-made");
        match("home-made");
        match("aaaa-bbb");
        match("aaa - bbb");
        match("aaa--aa--aaa");
        match("home--home-home");
    }
}
The output is:
' -home-made' doesn't match
'home-made' match
'aaaa-bbb' match
'aaa - bbb' doesn't match
'aaa--aa--aaa' doesn't match
'home--home-home' doesn't match
