How to regExp 'zero or one' groups which contain '.*' - java

I'm trying to get record1, record2, record3 from text:
"Record1 ANY TEXT 123 4 5 Record2 ANOTHER TEXT 90-8098 Record3 MORE TEXT ASD 123"
Each record appears ONE or ZERO times.
I use pattern:
(Record1.*)?(Record2.*)?(Record3.*)?
If each record appears,
matcher.group(1) == "Record1 ANY TEXT 123 4 5 Record2 ANOTHER TEXT 90-8098 Record3 MORE TEXT ASD 123"
matcher.group(2) == null
matcher.group(3) == null
If I use pattern:
(Record1.*)(Record2.*)(Record3.*)
matcher.group(1) == "Record1 ANY TEXT 123 4 5 "
matcher.group(2) == "Record2 ANOTHER TEXT 90-8098 "
matcher.group(3) == "Record3 MORE TEXT ASD 123"
It's exatly what I want, but each record can appear zero time and this regexp not suitable
What pattern should I use?

You want to make your quantifiers non-greedy, and you want to use anchors:
^.*?(Record1.*?)?(Record2.*?)?(Record3.*?)?$
In your original expression, your .* was basically consuming everything to the end of the string, because that's how regular expressions behave, by default (called greedy matching). Since the second and third groups were optional, there was no reason for the engine not to simply match everything with that first .*—it was the most efficient match.
By adding a ? after any quantifier, e.g. *? or +? or ?? or {m,n}?, you instruct the engine to match as little as possible, i.e. invoke non-greedy matching.
So, why the anchors? Well, if you invoke non-greedy matching, the engine's going to try to match as little as possible. So, it'd match nothing, since all your groups are optional! By forcing the whole expression to match the beginning, ^, as well as the end, $, you force to regular expression to find some way to match as few characters as possible via .*?, but still match as much as needed to get all the details.

If your text is tightly packed and is composed of just Record, why not use split
(if Java calls it split).
split regex:
# "(?:(?!Record)[\\S\\s])*(Record[\\S\\s]*?)(?=Record|$(?!\\n))"
(?:
(?! Record )
[\S\s]
)*
( Record [\S\s]*? )
(?=
Record
| $ (?! \n )
)

Related

Regex pattern matching with multiple strings

Forgive me. I am not familiarized much with Regex patterns.
I have created a regex pattern as below.
String regex = Pattern.quote(value) + ", [NnoneOoff0-9\\-\\+\\/]+|[NnoneOoff0-9\\-\\+\\/]+, "
+ Pattern.quote(value);
This regex pattern is failing with 2 different set of strings.
value = "207e/160";
Use Case 1 -
When channelStr = "207e/160, 149/80"
Then channelStr.matches(regex), returns "true".
Use Case 2 -
When channelStr = "207e/160, 149/80, 11"
Then channelStr.matches(regex), returns "false".
Not able to figure out why? As far I can understand it may be because of the multiple spaces involved when more than 2 strings are present with separated by comma.
Not sure what should be correct pattern I should write for more than 2 strings.
Any help will be appreciated.
If you print your pattern, it is:
\Q207e/160\E, [NnoneOoff0-9\-\+\/]+|[NnoneOoff0-9\-\+\/]+, \Q207e/160\E
It consists of an alternation | matching a mandatory comma as well on the left as on the right side.
Using matches(), should match the whole string and that is the case for 207e/160, 149/80 so that is a match.
Only for this string 207e/160, 149/80, 11 there are 2 comma's, so you do get a partial match for the first part of the string, but you don't match the whole string so matches() returns false.
See the matches in this regex demo.
To match all the values, you can use a repeating pattern:
^[NnoeOf0-9+/-]+(?:,\h*[NnoeOf0-90+/-]+)*$
^ Start of string
[NnoeOf0-9\\+/-]+
(?: Non capture group
,\h* Match a comma and optional horizontal whitespace chars
[NnoeOf0-90-9\\+/-]+ Match 1+ any of the listed in the character class
)* Close the non capture group and optionally repeat it (if there should be at least 1 comma, then the quantifier can be + instead of *)
$ End of string
Regex demo
Example using matches():
String channelStr1 = "207e/160, 149/80";
String channelStr2 = "207e/160, 149/80, 11";
String regex = "^[NnoeOf0-9+/-]+(?:,\\h*[NnoeOf0-90+/-]+)*$";
System.out.println(channelStr1.matches(regex));
System.out.println(channelStr2.matches(regex));
Output
true
true
Note that in the character class you can put - at the end not having to escape it, and the + and / also does not have to be escaped.
You can use regex101 to test your RegEx. it has a description of everything that's going on to help with debugging. They have a quick reference section bottom right that you can use to figure out what you can do with examples and stuff.
A few things, you can add literals with \, so \" for a literal double quote.
If you want the pattern to be one or more of something, you would use +. These are called quantifiers and can be applied to groups, tokens, etc. The token for a whitespace character is \s. So, one or more whitespace characters would be \s+.
It's difficult to tell exactly what you're trying to do, but hopefully pointing you to regex101 will help. If you want to provide examples of the current RegEx you have, what you want to match and then the strings you're using to test it I'll be happy to provide you with an example.
^(?:[NnoneOoff0-9\\-\\+\\/]+ *(?:, *(?!$)|$))+$
^ Start
(?: ... ) Non-capturing group that defines an item and its separator. After each item, except the last, the separator (,) must appear. Spaces (one, several, or none) can appear before and after the comma, which is specified with *. This group can appear one or more times to the end of the string, as specified by the + quantifier after the group's closing parenthesis.
Regex101 Test

Regular expression to determine if the String consists of more than 4 numbers

I want to extract URL strings from a log which looks like below:
<13>Mar 27 11:22:38 144.0.116.31 AgentDevice=WindowsDNS AgentLogFile=DNS.log PluginVersion=X.X.X.X Date=3/27/2019 Time=11:22:34 AM Thread ID=11BC Context=PACKET Message= Internal packet identifier=0000007A4843E100 UDP/TCP indicator=UDP Send/Receive indicator=Snd Remote IP=X.X.X.X Xid (hex)=9b01 Query/Response=R Opcode=Q Flags (hex)=8081 Flags (char codes)=DR ResponseCode=NOERROR Question Type=A Question Name=outlook.office365.com
I am looking to extract Name text which contains more that 5 digits.
A possible way suggested is (\d.*?){5,} but does not seem to work, kindly suggest another way get the field.
Example of string match:
outlook12.office345.com
outlook.office12345.com
You can look for the following expression:
Name=([^ ]*\d{5,}[^ ]*)
Explanation:
Name= look for anything that starts with "Name=", than capture if:
[^ ]* any number of characters which is not a space
\d{5,} then 5 digits in a row
[^ ]* then again, all digits up to a white space
This regular expression:
(?<=Name=).*\d{5,}.*?(?=\s|$)
would extract strings like outlook.office365666.com (with 5 or more consecutive digits) from your example input.
Demo: https://regex101.com/r/YQ5l2w/1
Try this pattern: (?=\b.*(?:\d[^\d\s]*){5,})\S*
Explanation:
(?=...) - positive lookahead, assures that pattern inside it is matched somewhere ahead :)
\b - word boundary
(?:...) - non-capturing group
\d[^\d\s]* - match digit \d, then match zero or more of any characters other than whitespace \s or digit \d
{5,} - match preceeding pattern 5 or more times
\S* - match zero or more of any characters other than space to match the string if assertion is true, but I think you just need assertion :)
Demo
If you want only consecutive numbers use simplified pattern (?=\b.*\d{5,})\S*.
Another demo
Of course, you have to add positive lookbehind: (?<=Name=) to assert that you have Name= string preceeding
Try this regex
([a-z0-9]{5,}.[a-z0-9]{5,})+.com
https://regex101.com/r/OzsChv/3
It Groups,
outlook.office365.com
outlook12.office345.com
also all url strings

Java regex shortest match

I have the following string, (a.1) (b.2) (c.3) (d.4). I want to change it to (1) (2) (3) (4). I use the following method.
str.replaceAll("\(.*[.](.*)\)","($1)"). And I only get (4). What is the correct method?
Thanks
Couple things here. First, your escapes for the parentheses are incorrect. In Java string literals, backslash itself is an escape character, meaning you need to use \\( to represent \( in regex.
I think your question is how to do non-greedy matches in regex. Use ? to specify non-greedy matching; e.g. *? means "zero or more times, but as few times as possible".
This doesn't negate other answers, but they depend on your test input being as simple as it is in your question. This gives me the correct output without changing the spirit of your original regex (that only the parentheses and dot delimiter are known to be present):
String test = "(a.1) (b.2) (c.3) (d.4)";
String replaced = test.replaceAll("\\(.*?[.](.*?)\\)", "($1)");
System.out.println(replaced); // "(1) (2) (3) (4)"
Root cause
You want to match ()-delimited substrings, but are using .* greedy dot pattern that can match any 0 or more chars (other than line break chars). The \(.*[.](.*)\) pattern will match the first ( in (a.1) (b.2) (c.3) (d.4), then .* will grab the whole string, and backtracking will start trying to accommodate text for the subsequent obligatory subpatterns. [.] will find the last . in the string, the one before the last digit, 4. Then, (.*) will again grab all the rest of the string, but since the ) is required right after, due to backtracking the last (.*) will only capture 4.
Why is lazy / reluctant .*? not a solution?
Even if you use \(.*?[.](.*?)\), if there are (xxx) like substrings inside the string, they will get matched together with expected matches, as . matches any char but line break chars.
Solution
.replaceAll("\\([^()]*\\.([^()]*)\\)", "($1)")
See the regex demo. The [^()] will only match any char BUT a ( and ).
Details
\( - a ( char
[^()]* - a negated character class matching 0 or more chars other than ( and )
\. - a dot
([^()]*) - Group 1 (its value is later referred to with $1 from the replacement pattern): any 0+ chars other than ( and )
\) - a ) char.
Java demo:
List<String> strs = Arrays.asList("(a.1) (b.2) (c.3) (d.4)", "(a.1) (xxxx) (b.2) (c.3) (d.4)");
for (String str : strs)
System.out.println("\"" + str.replaceAll("\\([^()]*\\.([^()]*)\\)", "($1)") + "\"");
Output:
"(1) (2) (3) (4)"
"(1) (xxxx) (2) (3) (4)"
try this one, it will match any alphabets, . and " and replace them all with empty ""
str.replaceAll("[a-zA-Z\\.\"]", "")
Edit:
You can use also [^\\d)(\\s] to match all characters that are not number, space and )( and replace them all with empty "" string
String str = "(a.1) (b.2) (c.3) (d.4)";
System.out.println(str.replaceAll("[^\\d)(\\s]",""));
Try this
str.replaceAll("[A-Za-z0-9]+\.","");
[A-Za-z0-9] will match the upper case, lower case and digits. If you want to match anything before the dot(.) you can use .+ or .* in the place of [A-Za-z0-9]+

Regular expression for both length with whitespaces

I am trying to write a regular expression with following conditions.
Allow empty at any position in string.
First three are characters-range (1-3)
Next six are numeric (must) -range (6)
Next optional to have characters - range (1-3)
After that optional to have numeric - range(0-2)
For this i tried lot of things nothing works.
^[a-zA-Z]{1,3}[0-9]{6}[a-zA-Z]{0,3}[0-9]{0,2}
This expression works fine for matching all criteria but it is not allowing empty strings. Thanks in advance.
I just want to validate the string like "AB 123456 ADF 12".
As i mentioned first point the string contains empty space at any position in given string like "AB 123 456 ADF 12".
You have to wrap your pattern in parentheses and make it optional using ?:
^(?:[a-zA-Z]{1,3}[0-9]{6}[a-zA-Z]{0,3}[0-9]{0,2})?$
^ Assert beginning of string
(?: Start of non-capturing group
[a-zA-Z]{1,3}[0-9]{6}[a-zA-Z]{0,3}[0-9]{0,2} Your pattern
)? End of NCG, optional
$ Assert end of string
If you want to match strings with whitespace characters add \\s (or \s treating literal) and remove ?:
^(?:[a-zA-Z]{1,3}[0-9]{6}[a-zA-Z]{0,3}[0-9]{0,2}|\s*)$
^^^^
Live demo
Update
Based on comment:
^(?:[a-zA-Z](?:\s*[a-zA-Z]){0,2}\s*\d(?:\s*\d){5}(?:\s*[a-zA-Z](?:\s*[a-zA-Z]){0,2})?\s*(?:\d\s*\d?)?)$
Live demo

Java regex capturing groups indexes

I have the following line,
typeName="ABC:xxxxx;";
I need to fetch the word ABC,
I wrote the following code snippet,
Pattern pattern4=Pattern.compile("(.*):");
matcher=pattern4.matcher(typeName);
String nameStr="";
if(matcher.find())
{
nameStr=matcher.group(1);
}
So if I put group(0) I get ABC: but if I put group(1) it is ABC, so I want to know
What does this 0 and 1 mean? It will be better if anyone can explain me with good examples.
The regex pattern contains a : in it, so why group(1) result omits that? Does group 1 detects all the words inside the parenthesis?
So, if I put two more parenthesis such as, \\s*(\d*)(.*): then, will be there two groups? group(1) will return the (\d*) part and group(2) return the (.*) part?
The code snippet was given in a purpose to clear my confusions. It is not the code I am dealing with. The code given above can be done with String.split() in a much easier way.
Capturing and grouping
Capturing group (pattern) creates a group that has capturing property.
A related one that you might often see (and use) is (?:pattern), which creates a group without capturing property, hence named non-capturing group.
A group is usually used when you need to repeat a sequence of patterns, e.g. (\.\w+)+, or to specify where alternation should take effect, e.g. ^(0*1|1*0)$ (^, then 0*1 or 1*0, then $) versus ^0*1|1*0$ (^0*1 or 1*0$).
A capturing group, apart from grouping, will also record the text matched by the pattern inside the capturing group (pattern). Using your example, (.*):, .* matches ABC and : matches :, and since .* is inside capturing group (.*), the text ABC is recorded for the capturing group 1.
Group number
The whole pattern is defined to be group number 0.
Any capturing group in the pattern start indexing from 1. The indices are defined by the order of the opening parentheses of the capturing groups. As an example, here are all 5 capturing groups in the below pattern:
(group)(?:non-capturing-group)(g(?:ro|u)p( (nested)inside)(another)group)(?=assertion)
| | | | | | || | |
1-----1 | | 4------4 |5-------5 |
| 3---------------3 |
2-----------------------------------------2
The group numbers are used in back-reference \n in pattern and $n in replacement string.
In other regex flavors (PCRE, Perl), they can also be used in sub-routine calls.
You can access the text matched by certain group with Matcher.group(int group). The group numbers can be identified with the rule stated above.
In some regex flavors (PCRE, Perl), there is a branch reset feature which allows you to use the same number for capturing groups in different branches of alternation.
Group name
From Java 7, you can define a named capturing group (?<name>pattern), and you can access the content matched with Matcher.group(String name). The regex is longer, but the code is more meaningful, since it indicates what you are trying to match or extract with the regex.
The group names are used in back-reference \k<name> in pattern and ${name} in replacement string.
Named capturing groups are still numbered with the same numbering scheme, so they can also be accessed via Matcher.group(int group).
Internally, Java's implementation just maps from the name to the group number. Therefore, you cannot use the same name for 2 different capturing groups.
For The Rest Of Us
Here is a simple and clear example of how this works:
( G1 )( G2 )( G3 )( G4 )( G5 )
Regex:([a-zA-Z0-9]+)([\s]+)([a-zA-Z ]+)([\s]+)([0-9]+)
String: "!* UserName10 John Smith 01123 *!"
group(0): UserName10 John Smith 01123
group(1): UserName10
group(2):
group(3): John Smith
group(4):
group(5): 01123
As you can see, I have created FIVE groups which are each enclosed in parentheses.
I included the !* and *! on either side to make it clearer. Note that none of those characters are in the RegEx and therefore will not be produced in the results. Group(0) merely gives you the entire matched string (all of my search criteria in one single line). Group 1 stops right before the first space because the space character was not included in the search criteria. Groups 2 and 4 are simply the white space, which in this case is literally a space character, but could also be a tab or a line feed etc. Group 3 includes the space because I put it in the search criteria ... etc.
Hope this makes sense.
Parenthesis () are used to enable grouping of regex phrases.
The group(1) contains the string that is between parenthesis (.*) so .* in this case
And group(0) contains whole matched string.
If you would have more groups (read (...) ) it would be put into groups with next indexes (2, 3 and so on).

Categories

Resources