Regex backreference (.) [duplicate] - java

Wishing to put some order into my knowledge of regular expressions I decided to go through a book about them, Introducing Regular Expressions. And I know it's silly but one of the introductory examples doesn't make sense to me.
(\d)\d\1
Sample text:
123-456-7890
(should capture the first number, 123)
Can anyone explain what is going on in here?
As far as I can figure out, the first \d captures the number 123. The \1 backreferences (marks) the group for later use. The parentheses limit the scope of the group. But what does the second \d do?
Simple explanations, like to a small child or a golden retriever, are preferred.

\d is just one digit.
This regular expression doesn't match anything in the "123-456-7890" string, but it would match "323" (which could be part of a longer string, for example "323-456-7890"):
(\d) : first digit ("3")
\d : another digit ("2")
\1 : first group (which was "3")
Now, if your book claims that (\d)\d\1 should capture "123" in "123-456-7890", then it probably contains an error.

(\d)\d\1 step by step:
The first \d matches one digit
And the parentheses () mark this as a capturing group - this is the first one, so the digit is remembered as "group 1"
The second \d says there is another digit
\1 says "here is the value from our previous group 1" - that is the digit that was matched in step 1.
So like dystroy already said: the regex should match a sequence of three digits of which the first and the third are equal.
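To check this behaviour yourself, here is a minimal Java sketch. The class name BackrefDemo and the helper firstMatch are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // (\d)\d\1 : one digit (captured), any digit, then the captured digit again
    static final Pattern PATTERN = Pattern.compile("(\\d)\\d\\1");

    // Hypothetical helper: returns the first matched substring, or null if nothing matches.
    static String firstMatch(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("123-456-7890")); // null
        System.out.println(firstMatch("323-456-7890")); // 323
    }
}
```

The book's sample text contains no run of three digits whose first and third digits are equal, which is why nothing is found there.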

Related

Problem coming up with appropriate Regex expression

I need to match text similar to the following text in an if statement.
REG#John Smith#14102245862#7 johns road new york#John Anthony Smith
The expression is meant to match a REG keyword at the beginning of the string then username followed by an account number composed of numbers with no specific restriction on the number of digits, then the address and lastly the name of the individual the address is registered to.
The regex I came up with is not working. It is below:
^REG\#\w\#[0-9]\#\w\#\w
Could you kindly show me where I went wrong and how to make it work?
Thank you in advance.
The problem is that you don't use quantifiers (* or +) and space is not included within \w which stands for [A-Za-z0-9_]. The character # does not need to be escaped (at least as far as I know in Java). Try the following Regex:
^REG#[\w ]+#\d+#[\w ]+#[\w ]+
^REG matches the literal text REG at the beginning of the string
# matches itself literally
[\w ]+ stands for at least one word character or space
\d+ stands for at least one digit
In Java, don't forget the double escaping:
String regex = "^REG#[\\w ]+#\\d+#[\\w ]+#[\\w ]+";
Try ^REG\#.*?\#[0-9]*?\#.*?\#.* instead. The *? operator is a lazy quantifier: it repeats as few times as possible, so each field stops at the next part of the expression, in this case \#.
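A small Java sketch comparing the three patterns from this thread against the sample line. The class name RegLineDemo and the helper found are only for illustration.

```java
import java.util.regex.Pattern;

class RegLineDemo {
    // The asker's pattern: each \w and [0-9] matches exactly one character.
    static final Pattern ORIGINAL = Pattern.compile("^REG\\#\\w\\#[0-9]\\#\\w\\#\\w");
    // Quantifiers added, and spaces allowed inside the text fields.
    static final Pattern FIXED = Pattern.compile("^REG#[\\w ]+#\\d+#[\\w ]+#[\\w ]+");
    // Lazy quantifiers: each field runs only as far as the next '#'.
    static final Pattern LAZY = Pattern.compile("^REG\\#.*?\\#[0-9]*?\\#.*?\\#.*");

    // Hypothetical helper: true if the pattern is found in the line.
    static boolean found(Pattern p, String line) {
        return p.matcher(line).find();
    }

    public static void main(String[] args) {
        String line = "REG#John Smith#14102245862#7 johns road new york#John Anthony Smith";
        System.out.println(found(ORIGINAL, line)); // false
        System.out.println(found(FIXED, line));    // true
        System.out.println(found(LAZY, line));     // true
    }
}
```

Note that the lazy version accepts any characters in the text fields, so it is looser than the [\w ]+ version.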

Simple Regex phantom matches? [duplicate]

I try to extract the error number from strings like "Wrong parameters - Error 1356":
Pattern p = Pattern.compile("(\\d*)");
Matcher m = p.matcher(myString);
m.find();
System.out.println(m.group(1));
And this does not print anything, which seemed strange to me, since according to Wikipedia * "matches the preceding element zero or more times".
I also went to www.regexr.com and regex101.com and tested it, and the result was the same: nothing for the expression \d*
Then I started to test some different things (all tests were made on the sites I mentioned):
(\d)* doesn't work
\d{0,} doesn't work
[\d]* doesn't work
[0-9]* doesn't work
\d{4} works
\d+ works
(\d+) works
[0-9]+ works
So, I started to search the web for an explanation. The best I could find was here on the Quantifier section, which states:
\d? Optional digit (one or none).
\d* Eat as many digits as possible (but none if necessary)
\d+ Eat as many digits as possible, but at least one.
\d*? Eat as few digits as necessary (possibly none) to return a match.
\d+? Eat as few digits as necessary (but at least one) to return a match.
The question
As English is not my primary language, I'm having trouble understanding the difference (mainly the "but none if necessary" part). So could you regex experts explain this in simple words, please?
The closest thing that I found to this question here on SO was this one: Regex: possessive quantifier for the star repetition operator, i.e. \d**, but the difference is not explained there.
The * quantifier matches zero or more occurrences.
In practice, this means that
\d*
will match every possible input, including the empty string. So your regex matches at the start of the input string and returns the empty string.
"But none if necessary" means that it will not break the regex pattern if there is no digit to match. So \d* means it will match zero or more occurrences of digits.
For example,
\d*[a-z]*
will match
abcdef
but \d+[a-z]*
will not match
abcdef
because \d+ implies that at least one digit is required.
\d* Eat as many digits as possible (but none if necessary)
\d* means it matches a digit zero or more times. In your input, the first match attempt happens at the start of the string, where there is no digit, so the pattern succeeds right there with zero digits (the empty string). So it prints nothing.
\d+
It matches a digit one or more times. So it should find and match a digit or a digit followed by more digits.
With the pattern \d+, at least one digit must be found, and then the match will include all subsequent digits until a non-digit character is reached.
\d* will match at all the positions where there are zero digits (as empty strings), as well as the digits themselves. The .NET regex parser will return all these empty-string matches in its set of matches.
Simply:
\d* implies zero or more times
\d+ means one or more times
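Here is a short Java sketch of the difference, using the string from the question. The class name StarVsPlusDemo and the helper firstFind are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarVsPlusDemo {
    // Hypothetical helper: text of the first match that find() reports, or null if none.
    static String firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "Wrong parameters - Error 1356";
        // \d* succeeds immediately at index 0 with zero digits.
        System.out.println("[" + firstFind("\\d*", s) + "]"); // []
        // \d+ has to keep scanning until it reaches a real digit.
        System.out.println("[" + firstFind("\\d+", s) + "]"); // [1356]
    }
}
```

So the "phantom" match is a genuine match. It is just empty.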

How to insert spaces after full stops at the end of sentences, but not in abbreviations or floating point numbers?

I have a JTextArea in which I want to replace all full stops without a space next to them e.g in "This is a sentence.This is another C.O.D sentence.This is yet another C.A.T. sentence." to "This is a sentence. This is another C.O.D sentence. This is yet another C.A.T. sentence.". But I don't want the abbreviations or floating point numbers to gain extra spaces e.g "This is a C.A.T. float 5.5" should not become "This is a C. A. T. float 5. 5"! I am using string.replaceAll(".",". ") for this which is not proving to be sufficient.
Keeping it simple, without negative look-behinds and such:
s = s.replaceAll("([^A-Z0-9.])\\.([^0-9 \t])", "$1. $2");
Replace the period when not:
after a capital itself (U.N.C. or M.Twain)
after a digit (1. - hoping the sentence does not end in a digit)
after a period (...)
before a digit (.5 - hoping the next sentence does not start with a digit)
before a space or tab
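You can use the regex
([^A-Z])\.(?!\d)
which matches every "." that is not followed by a digit and not preceded by an uppercase letter. Note that the preceding character is part of the match (group 1), so it has to be put back in the replacement.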
(You should edit your question to clearly state your requirement, e.g. handling of abbreviation)
You could replace (?<!\b[A-Z])\.(?!\d) with .<space>
Demonstration: https://regex101.com/r/g1g7Yg/1
Explanation:
(?<! ) : negative look-behind group
\b[A-Z] : word boundary followed by one uppercase character (i.e. a single uppercase character at the start of a word)
\. : a dot
(?!\d) : negative look-ahead group for a single digit
Which basically means: replace a dot if it is NOT preceded by a single uppercase character and NOT followed by a digit.
There are still some flaws: for example, it will not replace the dot in "Hello world.1 apple 1 day". It shouldn't be difficult to change the regex to fix this if you understand the above regex.
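As a quick check, here is a sketch that runs the first answer's replaceAll call on the two strings from the question. The class name PeriodSpacingDemo and the method addSpaces are just illustrative names.

```java
class PeriodSpacingDemo {
    // Inserts a space after a period unless it follows a capital, a digit or a period,
    // or is followed by a digit, a space or a tab (the first answer's pattern).
    static String addSpaces(String s) {
        return s.replaceAll("([^A-Z0-9.])\\.([^0-9 \t])", "$1. $2");
    }

    public static void main(String[] args) {
        System.out.println(addSpaces(
            "This is a sentence.This is another C.O.D sentence.This is yet another C.A.T. sentence."));
        // This is a sentence. This is another C.O.D sentence. This is yet another C.A.T. sentence.
        System.out.println(addSpaces("This is a C.A.T. float 5.5"));
        // This is a C.A.T. float 5.5
    }
}
```

The pattern needs a character after the period, so a period at the very end of the text is left alone.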

Regular expressions - capturing groups confusion

I am reading an Oracle tutorial on regular expressions. I am on the topic of capturing groups. Although the reference is excellent, apart from the fact that parentheses represent a group, I am having a lot of difficulty understanding the topic. Here are my points of confusion.
What is the significance of counting groups in an expression?
What are non-capturing groups?
Elaborating with examples would be nice.
One usually doesn't count groups other than to know which group has which number. E.g. ([abc])([def](\d+)) has three groups, so I know to refer to them as \1, \2 and \3. Note that group 3 is inside 2. They are numbered from the left by where they begin.
When searching with a regex to find something in a string (as opposed to matching, where you make sure the whole string matches the pattern), group 0 will give you just the matched substring, but not the text before or after it. Imagine, if you will, a pair of brackets around your whole regex. It's not part of the total count because it's not really considered a group.
Groups can be used for other things than capturing. E.g. (foo|bar) will match "foo" or "bar". If you're not interested in the contents of a group, you can make it non-capturing (e.g: (?:foo|bar) (varies by dialect)), so as not to "use up" the numbers assigned to groups. But you don't have to, it's just convenient sometimes.
Say I want to find a word that begins and ends in the same letter: \b([a-z])[a-z]*\1\b The \1 will then be the same as whatever the first group captured. Of course it can be used for much more powerful stuff, but I think you'll get the idea.
(Coming up with relevant examples is certainly the hardest part.)
Edit: I answered when the questions were:
What is the significance of counting groups in an expression?
There is a special group, called group 0, which means the entire expression. It is not reported by the groupCount() method. Why is that?
I don't understand what non-capturing groups are.
Why do we need back-references? What is the significance of back-references?
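To make the group numbering from this answer concrete, here is a small Java sketch. The class GroupNumberingDemo and the helper groups are invented for this example, and the input strings are my own.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Hypothetical helper: returns group 0..groupCount() of the first match (empty array if none).
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Group 3 is nested inside group 2; group 0 is only the matched text, not "xx " or " yy".
        System.out.println(Arrays.toString(groups("([abc])([def](\\d+))", "xx ae42 yy")));
        // [ae42, a, e42, 42]

        // A word that begins and ends with the same letter.
        System.out.println(Arrays.toString(groups("\\b([a-z])[a-z]*\\1\\b", "some level words")));
        // [level, l]
    }
}
```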
Say you have a string, abcabc, and you want to figure out whether the first part of the string matches the second part. You can do this with a single regex by using capturing groups and backreferences. Here is the regex I would use:
(.+)\1
The way this works is .+ matches any sequence of characters. Because it is in parentheses, it is caught in a group. \1 is a backreference to the 1st capturing group, so it is the equivalent of the text caught by the capturing group. After a bit of backtracking, the capturing group matches the first part of the string, abc. The backreference \1 is now the equivalent of abc, so it matches the second half of the string. The entire string is now matched, so it is confirmed that the first half of the string matches the second half.
Another use of backreferences is in replacing. Say you want to replace all {...} with [...], if the text inside { and } is only digits. You can easily do this with capturing groups and backreferences, using the regex
{(\d+)}
and replacing the match with [\1] (in Java's replaceAll, the group is written as $1 in the replacement string).
The regex matches {123} in the string abc {123} 456, and captures 123 in the first capturing group. The backreference \1 is now the equivalent of 123, so replacing {(\d+)} in abc {123} 456 with [\1] results in abc [123] 456.
The reason non-capturing groups exist is that groups in general have more uses than just capturing. The regex (xyz)+ matches a string that consists entirely of the group, xyz, repeated, such as xyzxyzxyz. A group is needed because xyz+ only matches xy and then z repeated, i.e. xyzzzzz. The problem with using capturing groups is that they are slightly less efficient compared to non-capturing groups, and they take up an index. If you have a complicated regex with a lot of groups in it, but you only need to reference a single one somewhere in the middle, it's a lot better to make the others non-capturing and just reference \1 rather than trying to count all the groups up to the one you want.
I hope this helps!
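The examples from this answer could be tried in Java roughly like this. The class and method names are made up. In the replacement string Java writes the group as $1 rather than \1, and I escape the braces in the pattern because, as far as I know, Java rejects an unescaped { at that position.

```java
class BackrefReplaceDemo {
    // True if the string is some non-empty text immediately followed by itself.
    static boolean halvesEqual(String s) {
        return s.matches("(.+)\\1");
    }

    // Replaces {digits} with [digits]; $1 is the text captured by group 1.
    static String bracesToBrackets(String s) {
        return s.replaceAll("\\{(\\d+)\\}", "[$1]");
    }

    public static void main(String[] args) {
        System.out.println(halvesEqual("abcabc"));             // true
        System.out.println(halvesEqual("abcabd"));             // false
        System.out.println(bracesToBrackets("abc {123} 456")); // abc [123] 456
        // Grouping changes what + repeats: the whole "xyz" versus only the "z".
        System.out.println("xyzxyzxyz".matches("(?:xyz)+"));   // true
        System.out.println("xyzxyzxyz".matches("xyz+"));       // false
    }
}
```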
Can't think of an appropriate example at the moment, but I'm assuming someone might need to know the number of sub matches in the RegEx.
Group 0 is always the entire base match. I'm assuming groupCount() just lets you know how many capture groups you've specified in the expression.
A non-capturing group (?:) would be used to, well, not capture a group. Ex. if you need to test if a string contains one of several words and don't want to capture the word in a new group: (?:hello|hi there) world !== hello|hi there world. The first matches "hello world" or "hi there world" but the second matches "hello" or "hi there world".
Backreferences can be used for a multitude of powerful purposes, such as testing whether or not a number is prime or composite. :) Or you could simply test that a search parameter isn't repeated, i.e. ^(\d)(?!.*\1)\d+$ would ensure the first digit does not occur again in the string.
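A rough Java sketch of the points in this answer. NonCapturingDemo and its method names are hypothetical, and the test strings are my own.

```java
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Number of capturing groups in a pattern (group 0 is not counted).
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // True if the first digit never appears again in an all-digit string.
    static boolean firstDigitUnique(String s) {
        return s.matches("^(\\d)(?!.*\\1)\\d+$");
    }

    public static void main(String[] args) {
        System.out.println(countGroups("(hello|hi there) world"));   // 1
        System.out.println(countGroups("(?:hello|hi there) world")); // 0

        // Without the group, the alternation splits the whole pattern in two.
        System.out.println("hello world".matches("(?:hello|hi there) world")); // true
        System.out.println("hello world".matches("hello|hi there world"));     // false

        System.out.println(firstDigitUnique("1234")); // true
        System.out.println(firstDigitUnique("1231")); // false
    }
}
```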

Add an identifier when a space is followed by four digits

I am dealing with a String,
30-Nov-2012 30-Nov-2012 United Kingdom, 31-Oct-2012 31-Oct-2012 United Arab Emirates, 29-Oct-2012 31-Oct-2012 India
I need to replace the space with a # every time a space appears after a four-digit number, i.e.:
30-Nov-2012#30-Nov-2012#United Kingdom, 31-Oct-2012#31-Oct-2012#United Arab Emirates, 29-Oct-2012#31-Oct-2012#India
I don't know how to write Regular Expressions, any help please?
Since you don't know how to write Regular Expressions, your best bet is learning how to use Regular Expressions.
Here's a pretty good tutorial.
http://www.vogella.com/articles/JavaRegularExpressions/article.html
The expression you want to write doesn't seem very complicated. You should be able to do it after going through this tutorial.
Edit: Here's another hint. Take a look at replaceAll
and also take a look at positive lookbehinds
Edit2: I'm bored, so here's the answer.
string.replaceAll("(?<=\\d{4})\\s", "#");
Try this regex:
(\d{4})\s+
replace with
$1#
and a sample code:
String result = inputString.replaceAll("(\\d{4})\\s+", "$1#");
Explanation:
\d : matches any decimal digit.
{n} : matches the previous element exactly n times.
\s : matches any white-space character.
+ : matches the previous element one or more times.
(subexpression) : captures the matched subexpression and assigns it an ordinal number, counting from 1 (group 0 is the whole match).
$number : substitutes the substring matched by group number.
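Both suggestions from this thread can be compared on the string from the question with a sketch like the following. The class name DateSeparatorDemo and the two method names are made up for illustration.

```java
class DateSeparatorDemo {
    // Lookbehind version: replaces a whitespace character that directly follows four digits.
    static String withLookbehind(String s) {
        return s.replaceAll("(?<=\\d{4})\\s", "#");
    }

    // Capturing version: matches the four digits plus the whitespace and puts the digits back.
    static String withCapture(String s) {
        return s.replaceAll("(\\d{4})\\s+", "$1#");
    }

    public static void main(String[] args) {
        String input = "30-Nov-2012 30-Nov-2012 United Kingdom, "
                + "31-Oct-2012 31-Oct-2012 United Arab Emirates, "
                + "29-Oct-2012 31-Oct-2012 India";
        System.out.println(withLookbehind(input));
        System.out.println(withCapture(input));
        // Both print:
        // 30-Nov-2012#30-Nov-2012#United Kingdom, 31-Oct-2012#31-Oct-2012#United Arab Emirates, 29-Oct-2012#31-Oct-2012#India
    }
}
```

The two versions only differ when several whitespace characters follow the year: the capturing version collapses them into a single #, while the lookbehind version replaces only the first one.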
