Get all possible matches in a regex match (In Java)? - java

I am using a regex to match a few possible values in a string that comes with my objects, and I need to get all the possible values that match, as below.
My string value is "This is the code ABC : xyz use for something".
Here is the code I am using to extract the matches:
String my_regex = "(ABC|ABC :).*";
List<String> matchers = Pattern.compile(my_regex, Pattern.CASE_INSENSITIVE)
        .matcher(my_string)
        .results()
        .map(MatchResult::group)
        .collect(Collectors.toList());
I am expecting 2 list items as the output, {"ABC", "ABC :"}, but I am only getting one. Help would be highly appreciated.

What you describe just isn't how regex engines work. They do not report every possible variant of a match; they scan forward, consume input, and give you each match in turn. In other words, had you written:
String my_regex = "(ABC :|ABC)"; // note: no .*, and the longer alternative comes first
String myString = "This is the code ABC : xyz use for something ABC again";
Then you'd get 2 results back - ABC : and ABC.
Yes, the regex could just as easily match just the ABC part instead of the ABC : part and it would still be valid. However, the engine reports only one match per position: java.util.regex tries the alternatives from left to right and takes the first one that works (which is why the longer one is listed first above), and quantifiers are by default 'greedy' - they match as much as they can. For some operators (specifically, * and +) you can use the non-greedy variants: *? and +? which will match as little as possible.
In other words, given:
String regex = "(a*?)(a+)";
String myString = "aaaaa";
Then group 1 would match 0 a (that's the shortest string that can match (a*?) whilst still being able to match the entire regex to the input), and group 2 would be aaaaa.
If, on the other hand, you wrote (a*)(a+), then group 1 would be aaaa and group 2 would be a. It is not possible to ask the regexp engine to provide for you the combinatory explosion, with every possible length of 'a' - which appears to be what you want. The regexp API that ships with java does not have any option to do this, nor does any other regexp API I know of, so you'd have to write that yourself, perhaps. I admit I haven't scoured the web for every possible alternate regex engine impl for java, there are a bunch of third party libraries, perhaps one of them can do it.
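The greedy and non-greedy splits described above can be checked directly. This is a minimal sketch; the class and method names are made up for illustration, and List.of assumes Java 9 or later.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    // Matches the whole input and returns the text captured by groups 1 and 2.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match");
        }
        return List.of(m.group(1), m.group(2));
    }

    public static void main(String[] args) {
        // Reluctant a*? takes nothing, so a+ gets all five characters.
        System.out.println(groups("(a*?)(a+)", "aaaaa")); // [, aaaaa]
        // Greedy a* takes as much as it can while leaving one 'a' for a+.
        System.out.println(groups("(a*)(a+)", "aaaaa"));  // [aaaa, a]
    }
}
```

Either way there is exactly one way the groups are filled per match; the engine never enumerates the other possible splits.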
NB: I said at the start: Get rid of the .*. That's because otherwise it's still just the one match: ABC : xyz use for something ABC again is the longest possible match and given that regex engines are greedy, that's what you will get: It is a valid 'interpretation' of your string (1 match), consuming the most - that's how it works.
NB2: Greediness can never change whether a regex matches at all. It only changes which part of the input is assigned to which group and, when find()ing more than once (which .results() does: it find()s until no more matches are found), which matches you get.
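To see the effect of the trailing .* and of the order of the alternatives, here is a small sketch built on the question's own stream pipeline. The class and method names are hypothetical, and Matcher.results() requires Java 9 or later.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AlternationOrderDemo {
    // Returns every non-overlapping match of regex in input, left to right.
    static List<String> allMatches(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(input)
                .results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "This is the code ABC : xyz use for something ABC again";
        // Longer alternative first: it is tried first wherever it can match.
        System.out.println(allMatches("ABC :|ABC", input));     // [ABC :, ABC]
        // Shorter alternative first: it always wins, so "ABC :" never appears.
        System.out.println(allMatches("ABC|ABC :", input));     // [ABC, ABC]
        // With .* the first match swallows the rest of the string: one result.
        System.out.println(allMatches("(ABC|ABC :).*", input));
    }
}
```

In none of these cases does a single position produce both "ABC" and "ABC :"; getting both would mean running two separate patterns over the input.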

Related

Regular expression to match a sentence

I have used regular expression
(.*)(\s){1}([0-9]*min|lightning)\b
to match a string like the following:
Writing Fast Tests Against Enterprise Rails 60min
Clojure Ate Scala (on my project) 45min
Accounting-Driven Development 45min
The matcher groups will give:
Writing Fast Tests Against Enterprise Rails
Space
60min
But a string like ???????????????????? 60min also matches.
Can anyone help me out with this?
Updated
As per the answer, a regex like the one below will solve it, except for input 3:
^([^XX]*)(\s){1}([0-9]*min|lightning)\b$
I want to allow - so that input 3 matches. But then a string like the one below
will also match, which is not correct,
because group 1 should contain only alphabetic characters:
------------ 60min
Please see the link https://regex101.com/r/JH5sl1/6
Change (.*)(\s){1}([0-9]*min|lightning)\b to ^([\w ]*)(\s){1}([0-9]*min|lightning)\b$.
It will ensure that group 1 contains only word characters (letters, digits and the underscore) and spaces. See demo.
Since the question is evolving, let's take the issue on the other way. Change your regex to:
^([^XX]*)(\s){1}([0-9]*min|lightning)\b$
^^^^^
And change XX to whatever you don't want to match. This is called a negated character class, and it lets you "refuse" the characters it contains.
If you want to reject lines in which the same non-alphanumeric character appears twice in a row, you can use a negative lookahead:
^(?!.*(\W)\1)(\D*)\s(\d*min|lightning)$
Demo
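As a rough Java sketch of the lookahead pattern above (class and method names are invented for illustration): because the (\W) inside the lookahead counts as group 1, the title ends up in group 2 and the duration in group 3.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TalkLineDemo {
    // Group 1 is the (\W) inside the lookahead; title = group 2, duration = group 3.
    private static final Pattern TALK =
            Pattern.compile("^(?!.*(\\W)\\1)(\\D*)\\s(\\d*min|lightning)$");

    // Returns {title, duration}, or null when the line is rejected.
    static String[] parse(String line) {
        Matcher m = TALK.matcher(line);
        return m.matches() ? new String[] { m.group(2), m.group(3) } : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Writing Fast Tests Against Enterprise Rails 60min",
            "Clojure Ate Scala (on my project) 45min",
            "Accounting-Driven Development 45min",
            "???????????????????? 60min",
            "------------ 60min"
        };
        for (String line : lines) {
            String[] parts = parse(line);
            System.out.println(parts == null
                    ? "rejected: " + line
                    : parts[0] + " | " + parts[1]);
        }
    }
}
```

The first three lines are accepted and the last two are rejected, since they repeat the same punctuation character back to back.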

How to check that the line does not contradict a regular expression (regex)?

Sorry if this is a duplicate; I didn't find a similar question, but maybe I missed something...
I will explain what I want by example:
Suppose we have a simple regular expression for checking email
private static final String EMAIL_PATTERN = "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";
I start to enter an email (or not an email, just some string) into an input field, symbol by symbol, and check the whole line against the regex after each symbol.
Enter: u
Line: u
Check: true \\ because there is a chance that I will enter right symbols further
Enter: s
Line: us
Check: true
...
Enter: #
Line: username#
Check: true
Enter: #
Line: username##
Check: false \\ because there is no way to get correct line appending any symbols
Enter: d
Line: username#d
Check: true
Enter: .
Line: username#domain.
Check: true
Enter: .
Line: username#domain..
Check: false
In other words, I want to check a string against a regex and get a positive result if there is a possibility that appending symbols will give a correct string.
First things first...
Your E-Mail Regex is wrong
E-Mails are extremely hard to validate just on the basis of how the address looks... Most times people just do it wrong. You too. Sorry.
This is not really about regex, but about UX...
You are probably better off just allowing the user to enter whatever they want and telling them if their email is likely to be mistyped, rather than preventing them from entering it in the first place.
As to validating during input
If you still want to run with your regex, just make the later parts optional, so that incomplete input already matches the regex.
https://regex101.com/r/zO6nM7/1
/^[_a-z0-9-\+.-]+(\.[_a-z0-9-]+)*#?([a-z0-9+\-]+\.)*[a-z0-9+\-]*$/i
What you can use here is ?. That is the symbol for making the preceding element optional. Your example is bad, as email shouldn't be validated via regex, so I'll use something else.
Suppose you want to match the following
4 letters, then 2 digits, then 4 letters
So you can use a regex like
(?:\w{0,4})?(?:\d{0,2})?(?:\w{0,4})?
The construct below is called a non-capturing group. You can use a capturing group instead, but you shouldn't, for performance reasons, as you don't need the captured text.
(?: something)
Explanation
Basically, what I did was decide what the pieces of my string were in my initial specification of
4 letters ..
and then I turned each piece into a separate regex and made it optional, so that my regex is basically saying
Match 0 to 4 word characters optionally and match further
Match 0 to 2 digits optionally and match further
Match 0 to 4 word characters optionally and match further
The above is not foolproof. It can give false positives, but if the concern is only whether appending may give a correct result and you don't need absolutes, then you can use this approach.
For better, i.e. absolutely correct, results you can use lookbehinds. But be warned that this can become complicated. If you are looking for something simple, the approach above can work.
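A short sketch of how the optional-pieces pattern behaves, including one of the false positives mentioned above. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class OptionalPiecesDemo {
    // The pattern from the answer: every piece may be shorter than required or absent.
    private static final Pattern PARTIAL =
            Pattern.compile("(?:\\w{0,4})?(?:\\d{0,2})?(?:\\w{0,4})?");

    // True when the text typed so far does not yet contradict the pattern.
    static boolean couldBeValid(String typedSoFar) {
        return PARTIAL.matcher(typedSoFar).matches();
    }

    public static void main(String[] args) {
        System.out.println(couldBeValid("ab"));          // true: still typing the first piece
        System.out.println(couldBeValid("abcd12efgh"));  // true: the complete form
        System.out.println(couldBeValid("abcd12efghi")); // false: longer than 4 + 2 + 4
        System.out.println(couldBeValid("ab-"));         // false: '-' fits no piece
        // False positive: \w also matches digits, so ten digits are accepted.
        System.out.println(couldBeValid("1234567890"));  // true
    }
}
```

Replacing \w with [A-Za-z] would remove that particular false positive, though inputs that skip a piece entirely would still be accepted.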
You can use the regex.Pattern and regex.Matcher classes from the util package
import java.util.regex.Matcher;
import java.util.regex.Pattern;
//...
private static final String EMAIL_PATTERN = "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";
// String to be scanned to find the pattern.
String line = "Your line";
String pattern = EMAIL_PATTERN;
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
// A match must be attempted first; calling group() before find() or
// matches() throws IllegalStateException.
if (m.find()) {
    // Group 0 is the whole match; higher indexes are the capturing groups,
    // which you can use to partially check the line.
    String value = m.group(0);
} else {
    // find() returned false: the line does not match the pattern.
}
Compare http://www.tutorialspoint.com/java/java_regular_expressions.htm and Java Regex Capturing Groups.
You could then check the different groups: as soon as a group is not matched but a following group does have a match, you have the case in which no valid input is possible anymore (by appending only).
To make it work with your example, you can call, say, a validate function that implements the above code on every input event, e.g. a key-down event. But that depends on your UI framework.
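A different technique, not used in the answers above, is java.util.regex.Matcher.hitEnd(). After a failed matches() call it reports whether the engine ran into the end of the input, which means more input could still change the result. The following is only a sketch under that assumption: the class and method names are made up, and the pattern is copied from the question as posted, including its # separator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartialMatchSketch {
    // Pattern copied verbatim from the question (it uses '#' as the separator).
    private static final Pattern EMAIL = Pattern.compile(
            "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$");

    // True if the text already matches, or if the match attempt only failed
    // because the input ended too early.
    static boolean canStillMatch(String typedSoFar) {
        Matcher m = EMAIL.matcher(typedSoFar);
        return m.matches() || m.hitEnd();
    }

    public static void main(String[] args) {
        String[] steps = {
            "u", "username#", "username##", "username#d",
            "username#domain.", "username#domain.."
        };
        for (String step : steps) {
            System.out.println(step + " -> " + canStillMatch(step));
        }
    }
}
```

This keeps the original regex untouched, so there is no second "relaxed" pattern to maintain alongside it.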

Regular expression non-greedy but still

I have some larger text which in essence looks like this:
abc12..manycharshere...hi - abc23...manyothercharshere...jk
Obviously there are two items, each starting with "abc", the numbers (12 and 23) are interesting as well as the "hi" and "jk" at the end.
I would like to create a regular expression which allows me to parse out the numbers, but only if the two characters at the end match, i.e. I am looking for the number related to "jk", but the following regular expression matches the whole string and thus returns "12", not "23" even when non-greedy matching the area with the following:
abc([0-9]+).*?jk
Is there a way to construct a regular expression which matches text like the one above, i.e. retrieving "23" for items ending in "jk"?
Basically I would need something like "match abc followed by a number, but only if there is 'jk' at the end before another instance of 'abc followed by a number' appears".
Note: the texts/matches are an abstraction here; the actual text is more complicated, especially the things that can appear as "manyothercharshere". I simplified to show the underlying problem more clearly.
Use a regex like this. .*abc([0-9]+).*?jk
demo here
I think you want something like this,
abc([0-9]+)(?=(?:(?!jk|abc[0-9]).)*jk)
DEMO
You need to use negative lookahead here to make it work:
abc(?!.*?abc)([0-9]+).*?jk
RegEx Demo
Here (?!.*?abc) is a negative lookahead that makes sure abc is matched only where it is NOT followed by another abc, thus making sure the closest string between abc and jk is matched.
Being non-greedy does not change the rule, that the first match is returned. So abc([0-9]+).*?jk will find the first jk after “abcnumber” rather than the last one, but still match the first “abcnumber”.
One way to solve this is to tell that the dot should not match abc([0-9]+):
abc([0-9]+)((?!abc([0-9]+)).)*jk
If it is not important to have the entire pattern being an exact match you can do it simpler:
.*(abc([0-9]+).*?jk)
In this case, it’s group 1 which contains your intended match. The pattern uses a greedy matchall to ensure that the last possible “abcnumber” is matched within the group.
Assuming that hyphen separates "items", this regex will capture the numbers from the target item:
abc([0-9]+)[^-]*?jk
See demo
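The difference between the question's pattern and two of the suggested fixes can be checked with a small sketch like the one below. The helper name is invented for illustration, and the input is the abstract sample text from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClosestAbcDemo {
    // Returns group 1 of the first match of regex in text, or null if none.
    static String firstNumber(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "abc12..manycharshere...hi - abc23...manyothercharshere...jk";
        // Reluctant .*? still starts at the first "abc", so the number is 12.
        System.out.println(firstNumber("abc([0-9]+).*?jk", text));                 // 12
        // The dot may not step onto another "abc<number>", so the first item fails.
        System.out.println(firstNumber("abc([0-9]+)((?!abc([0-9]+)).)*jk", text)); // 23
        // Only an "abc" with no later "abc" after it is accepted.
        System.out.println(firstNumber("abc(?!.*?abc)([0-9]+).*?jk", text));       // 23
    }
}
```

The second pattern also works when several items end in "jk", whereas the third only ever finds the last item in the text.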

Regex matches with multiple patterns

I'm trying to compile one Java Regex pattern but have that pattern look for 3 different matches. I've learned that I can do that using the pipe (|) but I'm having trouble with the actual syntax of the regex.
I'm looking through XML data and trying to pull out 3 matches. The XML will look something like this:
<Element createdOn="1405358703367" updatedOn="1405358718804" url="http://www.someurl.com" />
The regex I'm trying looks like this so far:
((?<="url": ").*(?=")) | (createdOn="(\d)") | (updatedOn="(\d)")
In the end I need to get everything between the quotes in the XML (i.e. 1405358703367, 1405358718804, and http://www.someurl.com).
I had the URL regex working on its own earlier, but there seems to be no matches being made.
Thanks.
Get the matched group from index 2.
(url|createdOn|updatedOn)="([^"]*)"
DEMO
Here is sample code:
String string = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String patternString = "(url|createdOn|updatedOn)=\"([^\"]*)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println(matcher.group(2));
}
output:
1405358703367
1405358718804
http://www.someurl.com
Java doesn't have a library method that extracts the matches for you, but you only need one line:
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
This works by stripping off leading and trailing input up to/from the first/last quote, then splits on quote-nonquote-quote input, leaving the target matches as an array.
In action:
String input = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
System.out.println(Arrays.toString(matches));
Output:
[1405358703367, 1405358718804, http://www.someurl.com]
The pipe (|) is used to find a match that could be some-pattern OR some-other-pattern OR yet-another-pattern. It's not good at finding all occurrences of several patterns. To do that, if the patterns you're looking for aren't necessarily in a fixed order, you'll need to use a loop.
Here's a code example that starts with the pattern you tried, fixes some problems, and uses a loop to find the patterns:
Pattern p = Pattern.compile("((?<=url=\").*(?=\"))|(createdOn=\"(\\d+)\")|(updatedOn=\"(\\d+)\")");
Matcher m = p.matcher(source);
while (m.find()) {
    System.out.println("Found: " + m.group());
    System.out.println("Group 1: " + m.group(1));
    System.out.println("Group 3: " + m.group(3));
    System.out.println("Group 5: " + m.group(5));
}
(Some problems with your original pattern: You put space characters before and after each |, which are treated literally and mean the pattern has to match spaces that aren't there. I added + after \\d because you want to match more than one digit. There were some mistakes, like putting : after url instead of =.)
Now the code uses a loop to find each successive pattern that matches one of the patterns you're looking for. It matches either url=... or createdOn=... or updatedOn=..., but by using a loop we will find all of them. (Note that it doesn't care if it sees a url or a createdOn attribute twice in the source. You'll have to check that yourself.)
The group() method with no parameters will return whatever was matched by the pattern. group(1), group(3), and group(5) return certain subsections of the pattern; the numbers are determined by counting wherever you use ( in the pattern except for (?. So group 1 matches something using url as a lookbehind; group 2 starts with createdOn; group 3 is the sequence of digits following createdOn; group 4 starts with updatedOn, etc. The way the pattern is set up, not all of these will have values, since only one of the three alternatives separated by | will match. The rest will be null. As a result, the output of the above code will display null for two of the groups, and a useful value for the other. If you do things this way, you'll need to test for null to see which value actually got returned.
This would also be a case where named capturing groups could be useful. See http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html. Matcher has a group(name) function that takes a group name as a parameter.
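As a sketch of the named-group idea, the simpler pattern from the first answer can be given group names, so the code no longer depends on counting parentheses. The group names "name" and "value" and the class and method names are choices made here for illustration; named groups are available from Java 7.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupsDemo {
    // (?<name>...) and (?<value>...) are named capturing groups.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("(?<name>url|createdOn|updatedOn)=\"(?<value>[^\"]*)\"");

    // Collects the wanted attributes in the order they appear in the input.
    static Map<String, String> attributes(String xml) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(xml);
        while (m.find()) {
            found.put(m.group("name"), m.group("value"));
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<Element createdOn=\"1405358703367\" "
                + "updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
        System.out.println(attributes(xml));
    }
}
```

If an attribute occurs twice, the later value overwrites the earlier one in this sketch; checking for duplicates is left to the caller.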
This is one approach, but there are always multiple approaches to string parsing, and the other answers posted here are valid also. Plus there are already XML parsers to take care of things like this for you.
NOTE: This answer was meant to point out how | works. I don't recommend actually doing things this way, since it's overly complicated. If you're going to look separately for each attribute, it would be simpler just to set up three patterns and look for each one, one at a time. Or use @braj's suggestion in a loop, and perform a later check to make sure the createdOn and updatedOn values are numeric.

Possible Regular Expression Question

I have a simple program that looks up details of an IP you give it, and I will show you an example of some of my code
int regIndex = src.indexOf("Region:") + 16;
int endIndex = src.indexOf("<", regIndex);
String region = src.substring(regIndex, endIndex);
if(regIndex == 15) region = "None";
int counIndex = src.indexOf("Country:") + 17;
int couneIndex = src.indexOf(" <", counIndex);
String country = src.substring(counIndex, couneIndex);
As you can see, it is definitely not the most efficient way to do this. The website I am using gives the information like this: http://whatismyipaddress.com/ip/1.1.1.1
I have never really used Regular Expressions before, but it seems to me like there might be one that could really make this more efficient and easier to program, but I've been looking around and I'm pretty lost.
So basically my question is, how could I use a Regular Expression for this (Or if there is another more efficient way).
Any help would be great,
Thanks :)
You can do something like this:
String s = "bla Country: Australia <bla";
Pattern pattern = Pattern.compile("Country: (.*) [<]");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println("Country = " + matcher.group(1));
}
The source would look like this
<tr><th>Country:</th><td>Australia <img src="http://whatismyipaddress.com/images/flags/au.png" alt="au flag"> </td></tr>
To use regular expression means to match a pattern.
The pattern that identifies the data you want is pretty straightforward: Country:. You also need to match the tags that follow, <\/th><td>. The forward slash is escaped here; Java's regex syntax does not require that, but it is harmless. Then comes the data you are looking for. I would suggest matching everything that is not a <, so [^<]; this is a character class with a negation at the beginning, meaning any character that is not a <. To repeat it, add a + at the end, meaning at least one of the preceding character.
So, the complete thing should look like this:
Country:<\/th><td>\s*([^<]+)\s*<
I also added the parentheses; they put the matched text into a capturing group, so your result can be found in capturing group 1. I also added \s*, a whitespace character repeated 0 or more times, to match whitespace before or after your data, which I assume you don't need. Note that [^<] itself also matches whitespace, so group 1 can still end with spaces; trim the result if that matters.
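Applied in Java, the pattern above might be used as in the following sketch. The class and method names are hypothetical, the "None" fallback mirrors the default in the question's code, and trim() is added because [^<]+ also consumes the space before the img tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountryExtractDemo {
    private static final Pattern COUNTRY =
            Pattern.compile("Country:</th><td>\\s*([^<]+)\\s*<");

    // Returns the country name, or "None" when the page has no such row.
    static String country(String html) {
        Matcher m = COUNTRY.matcher(html);
        // Group 1 may carry trailing whitespace, hence the trim().
        return m.find() ? m.group(1).trim() : "None";
    }

    public static void main(String[] args) {
        String html = "<tr><th>Country:</th><td>Australia <img src=\""
                + "http://whatismyipaddress.com/images/flags/au.png\" alt=\"au flag\"> </td></tr>";
        System.out.println(country(html));            // Australia
        System.out.println(country("<html></html>")); // None
    }
}
```

Compared with the indexOf arithmetic in the question, a missing row is handled by find() returning false, without any magic offsets.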
Firstly, there are some online sites that can help you develop a regular expression. They let you enter some text and a regular expression, and then show you the result of applying the expression to the text. This saves you from having to write code while you develop the expression and expand your understanding. A good site I use a lot is FileFormat regex, because it allows me to test one expression against multiple test strings. A quick search also brought up regex Planet, RegExr and RegexPal. There are lots of others.
In terms of resources, the Java Pattern class reference is useful for Java development, and I quite like regular-expressions.info as well.
For your problem I used fileFormat.info and came up with this regex to match "http://whatismyipaddress.com/ip/1.1.1.1":
.*//([.\w]+)/.*/(\d+(?:\.\d+){3})
or as a java string:
".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})"
A quick breakdown: anything (.*), followed by two slashes (//), followed by one or more dots or word characters (([.\w]+)), followed by a slash, any number of characters and another slash (/.*/), followed by at least 1 digit (\d+), followed by 3 sets of a dot and at least one digit ((?:\.\d+){3}). The dot is escaped there so that it matches only a literal dot rather than any character. The sets of brackets around the server name part and the IP part are called capturing groups, and you can use methods on the Java Matcher class to return the contents of these sections. The ?: in the second part of the IP address tells it that we are using the brackets to group the characters but that the group is not to be treated as a capturing group.
This regex is not as strict or as flexible as it should be, but it's a starting point.
All of this can be researched on the above links.
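For illustration, the URL pattern could be applied like this. It is a sketch with made-up class and method names, and the dot before each digit group is escaped here so that it matches only a literal dot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlPartsDemo {
    // Group 1: server name. Group 2: dotted IP at the end of the URL.
    private static final Pattern URL_PARTS =
            Pattern.compile(".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})");

    // Returns {server, ip}, or null when the URL has a different shape.
    static String[] parts(String url) {
        Matcher m = URL_PARTS.matcher(url);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] p = parts("http://whatismyipaddress.com/ip/1.1.1.1");
        System.out.println(p[0]); // whatismyipaddress.com
        System.out.println(p[1]); // 1.1.1.1
    }
}
```

The non-capturing (?:...) group is why the IP is group 2 and not split across several groups.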
