What would be the regular expression to find a duplicated set of digits in a numeric string?
Suppose
String s="0.1234523452345234";
From this string I need to obtain "2345". I tried the following regex-
String s="0.1234523452345234";
String regex="(\\d+)\\1+\\b";
Pattern p=Pattern.compile(regex);
Matcher m=p.matcher(s);
if(m.find())
{
System.out.println(m.group(0));
}
But the output is
523452345234
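While I need to print
2345
"(\\d+)\\1+\\b" matches any sequence of digits followed immediately by the same sequence at least once. It can be followed by multiple occurrences of the sequence (the + quantifier). The regex also enforces a word boundary after the last matching sequence. In your string the only word boundary after the digits is the end of the string, so the match has to end there, which is why you get 523452345234 (5234 repeated three times).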
I think what you are looking for is the following regex:
"(\\d+).*\\1" (no word boundary, anything allowed between the sequences, and only one repetition of the sequence). Example:
0.1234789897897123499
  ^^^^         ^^^^---- (\\d+) and \\1
      ^^^^^^^^^-------- .*
If the run needs to be followed immediately by its duplicate (no filler in between), then drop the .* from the regex.
group(0) will return the full match (e.g. 12347898978971234), group(1) will contain the first capturing group (e.g. 1234).
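To make the group(0) versus group(1) distinction concrete, here is a minimal sketch that runs the pattern above against the example string. The class and method names (RepeatWithFillerDemo, firstRepeat) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatWithFillerDemo {
    // Returns {group(0), group(1)} for the first match, or null when nothing matches.
    // Hypothetical helper, not part of the original answer.
    static String[] firstRepeat(String input) {
        Matcher m = Pattern.compile("(\\d+).*\\1").matcher(input);
        return m.find() ? new String[] { m.group(0), m.group(1) } : null;
    }

    public static void main(String[] args) {
        String[] result = firstRepeat("0.1234789897897123499");
        System.out.println(result[0]); // full match: first run, filler, repeated run
        System.out.println(result[1]); // only the captured run of digits
    }
}
```

Note that (\\d+) is greedy, so the engine tries the longest candidate run first and only shortens it when no repetition can be found later in the string.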
I tried this regular expression, which finds a run of digits that is immediately repeated once; m.group(1) returns the first occurrence:
String s="0.1234523452345234";
String regex="([0-9]+)\\1";
Pattern p=Pattern.compile(regex);
Matcher m=p.matcher(s);
if(m.find())
{
System.out.println(m.group(1));
}
Output :
2345
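As far as I can tell, the two patterns in this thread differ only in the trailing \\b (and in allowing more than one repetition), and that is what shifts the match. A small side-by-side sketch, with hypothetical names (BoundaryEffectDemo, firstMatch):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryEffectDemo {
    // Returns {group(0), group(1)} for the first match of regex in input, or null.
    // Hypothetical helper used only to compare the two patterns.
    static String[] firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new String[] { m.group(0), m.group(1) } : null;
    }

    public static void main(String[] args) {
        String s = "0.1234523452345234";
        // With \b the match must end at the end of the string.
        String[] withBoundary = firstMatch("(\\d+)\\1+\\b", s);
        // Without \b the leftmost repeated run wins.
        String[] withoutBoundary = firstMatch("([0-9]+)\\1", s);
        System.out.println(withBoundary[0] + " / " + withBoundary[1]);
        System.out.println(withoutBoundary[0] + " / " + withoutBoundary[1]);
    }
}
```

With the boundary, the engine has to skip the earlier start positions because none of them yields a repetition that ends exactly at the end of the string.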
I have a String like
String str = "305556710S or 100596269C OR CN111111111";
I just want to match the tokens in this string that consist of digits only, or of digits followed by an English letter,
then prefix each matched token with the two characters "??".
I wrote a pattern like
Pattern pattern = Pattern.compile("^[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
int start = matcher.start();
int end = matcher.end();
String matchStr = matcher.group();
System.err.println(matchStr);
}
But it only matches the first token, "305556710S".
If I modify the pattern to
Pattern pattern = Pattern.compile("[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
it matches "305556710S", "100596269C" and "111111111". But "111111111" is preceded by the letters "CN", so it is not what I want.
I only want to match "305556710S" and "100596269C" and add "??" before each match. Can somebody help me?
First, you should avoid the ^ in this particular regexp. As you noticed, it cannot return more than one result, because "^" means "match at the beginning of the string".
Using \b can be a solution, but you may get invalid results. For example
305556710S or -100596269C OR CN111111111
The regexp "\\b[0-9]{1,10}[A-Z]{0,}\\b" will still match 100596269C here (the hyphen is not a word character, so there is a word boundary between - and 1).
The following regexp matches exactly what you want: all numbers, that may be followed by some English chars, either at the beginning of the string or after a space, and either followed by a space or at the end of the string.
(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)
Explanations:
(?<=^| ) is a lookbehind. It makes sure that there is either ^ (string start) or a space behind actual location. Note that lookbehinds don't add matching chars to the result: the space won't be part of the result
[0-9]{1,10}[A-Z]* matches digits (at least one, up to ten), then zero or more letters.
(?= |$) is a lookahead. It makes sure that there will be either a space or $ (end of string) after this match. Like lookbehinds, the chars aren't added to the results and position remains the same : the space read here for example can also be read by the lookbehind of the next captured string
Examples: 305556710S or 100596269C OR CN111111111 matches at index 0 [305556710S] and at index 14 [100596269C]; 100596269C123 does not match.
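The question also asks for "??" to be put in front of every match. A minimal sketch of that step, assuming String-based replacement via Matcher.replaceAll is acceptable; the names PrefixTokensDemo and prefix are hypothetical:

```java
import java.util.regex.Pattern;

final class PrefixTokensDemo {
    // Lookaround pattern from the answer above: the token must be delimited
    // by the string start or a space on the left, and a space or the string end on the right.
    private static final Pattern TOKEN =
            Pattern.compile("(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)");

    // Hypothetical helper: prefixes every matched token with "??".
    // "$0" in the replacement refers to the whole match; '?' has no special meaning there.
    static String prefix(String input) {
        return TOKEN.matcher(input).replaceAll("??$0");
    }

    public static void main(String[] args) {
        System.out.println(prefix("305556710S or 100596269C OR CN111111111"));
    }
}
```

Because the lookarounds consume nothing, the spaces between tokens are left untouched by the replacement.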
I think you need to use word boundaries \b. Try this changed pattern:
"\\b[0-9]{1,10}[A-Z]{0,1}\\b"
This prints out:
305556710S
100596269C
Why it works:
The difference is that this pattern only accepts character sequences that sit between a pair of word boundaries. Your earlier pattern could also match a sequence from the middle of a word, which is why 11111... from CN1111... was matched.
A word boundary also matches the end of the string input. So, even if a candidate word appears at the end of the line, it will get picked up.
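Here is a small sketch of the word-boundary version collecting all matches into a list. It also runs the hyphenated input mentioned elsewhere in this thread, since \b treats the hyphen as a boundary. WordBoundaryDemo and findAll are names I made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundaryDemo {
    private static final Pattern TOKEN =
            Pattern.compile("\\b[0-9]{1,10}[A-Z]{0,1}\\b", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns every token accepted by the pattern, in order.
    static List<String> findAll(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(findAll("305556710S or 100596269C OR CN111111111"));
        // The hyphen is not a word character, so the second token is still matched.
        System.out.println(findAll("305556710S or -100596269C OR CN111111111"));
    }
}
```

Whether matching the token after a hyphen is acceptable depends on your data; if it is not, a lookaround-based pattern is the stricter choice.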
If more than one English letter can come at the end, remove the maximum occurrence count (1 in this case):
"\\b[0-9]{1,10}[A-Z]{0,}\\b"
I am trying to use a pattern to search for a Zip Code within a string. I cannot get it to work correctly.
A sample of the inputLine is
What is the weather in 75042?
What I am trying to use for a pattern is
public String getZipcode(String inputLine) {
Pattern pattern = Pattern.compile(".*weather.*([0-9]+).*");
Matcher matcher = pattern.matcher(inputLine);
if (matcher.find()) {
return matcher.group(1).toString();
}
return "Zipcode Not Found.";
}
If I am looking to get only 75042, what do I need to change? This only outputs the last digit in the number, 2. I am terribly confused and I do not completely understand the Javadocs for the Pattern class.
The reason is that the second .* is greedy: it consumes the leading digits and leaves only the last one for your capturing group, so you have to get rid of it.
A simpler pattern can be used here: \D+(\d+)\D+ which means
some non-digits \D+, then some digits to capture (\d+), then some non-digits \D+
public String getZipcode(String inputLine) {
Pattern pattern = Pattern.compile("\\D+(\\d+)\\D+");
Matcher matcher = pattern.matcher(inputLine);
if (matcher.find()) {
return matcher.group(1).toString();
}
return "Zipcode Not Found.";
}
The problem is that your middle .* is too greedy and eats away 7504. One easy fix is to add a space before the group: .*weather.* ([0-9]+).* (or use \\s). But the best fix is the non-greedy version .*?, so the regexp becomes .*weather.*?([0-9]+).*
Spaces are missing in your regex (\s). You can use \s* or \s+ based on your data
Pattern pattern = Pattern.compile("weather\\s*\\w+\\s*(\\d+)");
Matcher matcher = pattern.matcher(inputLine);
Your .*weather.*([0-9]+).* pattern grabs the whole line with the first .* and backtracks until it finds weather. The second .* then grabs the rest of the line and backtracks again, but only far enough to let [0-9]+ match, and a single digit is enough for that. So only the last digit ends up in capturing group 1. The final .* just consumes the rest of the line.
You may solve the issue by just using ".*weather.*?([0-9]+).*" (making the second .* lazy), but since you are using Matcher#find(), you can use a simpler regex:
Pattern pattern = Pattern.compile("weather\\D*(\\d+)");
And after getting a match, retrieve the value with matcher.group(1).
Pattern details
weather - the literal word weather
\\D* - 0+ chars other than digits
(\\d+) - Capturing group 1: one or more digits
See the Java demo:
String inputLine = "What is the weather in 75042?";
Pattern pattern = Pattern.compile("weather\\D*(\\d+)");
Matcher matcher = pattern.matcher(inputLine);
if (matcher.find()) {
System.out.println(matcher.group(1)); // => 75042
}
I think all you need is \\d+
public String getZipcode(String inputLine) throws Exception {
Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher(inputLine);
if (matcher.find()) {
return matcher.group();
}
//A good practice is to throw an exception if no result found
throw new NoSuchElementException("Zipcode Not Found.");
}
In regular expressions, quantifiers with no upper bound (*, +) are greedy by default.
Good solutions have already been suggested.
I'm just adding one that is very close to yours and addresses the problem in a more isolated way:
If you use the regex
".*weather.*?([0-9]+).*" ... instead of ...
".*weather.*([0-9]+).*"
... your solution will work perfectly well. The '?' after the asterisk instructs the regex compiler to treat the asterisk as non-greedy.
Greedy means consuming as many characters as possible (from left to right) while still allowing the remainder of the regex to match.
Non-greedy means consuming as few characters as possible while still allowing the remainder of the regex to match.
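The difference is easy to see by running both variants against the sample line from the question. This is only a sketch; GreedyVsLazyDemo and capture are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsLazyDemo {
    // Hypothetical helper: returns capturing group 1 of the first match, or null.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "What is the weather in 75042?";
        // Greedy: the middle .* keeps as much as it can and leaves one digit for the group.
        System.out.println(capture(".*weather.*([0-9]+).*", line));
        // Lazy: the middle .*? stops as soon as the digits can start matching.
        System.out.println(capture(".*weather.*?([0-9]+).*", line));
    }
}
```

The only change between the two calls is the ? after the second asterisk.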
I want to find numbers like
(123).234.4567
(123)-234-4567
123.234.4567
123-456-4567
These digits represent US phone numbers. They can be separated with - (dash) or . (dot) and must have 3 digits, followed by another 3 digits, followed by 4 more. The first 3 digits may or may not be enclosed in parentheses '()'.
My code is below:
Pattern regex = Pattern.compile("^(\\(\\d{3}\\))|^\\d{3}[.-]?\\d{3}[.-]?\\d{4}$");
Matcher matcher = regex.matcher("(425).882.8080 tel");
while(matcher.find()){
System.out.println(matcher.group());
}
But the result is :
(425)
What am I doing wrong? I want to print (425).882.8080 in one go.
You can use a backreference to make sure both separators are the same.
A backreference matches the same text as previously matched by a capturing group.
^(\(\d{3}\)|\d{3})([.-])\d{3}\2\d{4}$
Here ([.-]) is capturing group 2, and \2 matches the same text that group 2 captured.
A group is captured by enclosing it in parentheses (...) and can be referenced as \index.
Note: If you want to find a substring within a string, remove ^ and $, which anchor the match to the beginning and end of the string respectively.
Pattern explanation:
( group and capture to \1:
\( '('
\d{3} digits (0-9) (3 times)
\) ')'
| OR
\d{3} digits (0-9) (3 times)
) end of \1
( group and capture to \2:
[.-] any character of: '.', '-'
) end of \2
\d{3} digits (0-9) (3 times)
\2 what was matched by capture \2
\d{4} digits (0-9) (4 times)
Sample code:
String regex="(\\(\\d{3}\\)|\\d{3})([.-])\\d{3}\\2\\d{4}";
System.out.println("(123).234.4567".matches(regex)); // true
System.out.println("(123)-234-4567".matches(regex)); // true
System.out.println("123.234.4567".matches(regex)); // true
System.out.println("123-456-4567".matches(regex)); // true
System.out.println("(123)-234.4567".matches(regex)); // false
System.out.println("(123-234-4567".matches(regex)); // false
System.out.println("123.234-4567".matches(regex)); // false
System.out.println("123-456.4567".matches(regex)); // false
Sample code (based on yours):
Matcher matcher = Pattern.compile(regex).matcher("(425).882.8080 tel");
while (matcher.find()) {
String str = matcher.group();
System.out.println(str); // (425).882.8080
}
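In your pattern the | splits the whole expression in two: it matches either ^(\\(\\d{3}\\)) or ^\\d{3}[.-]?\\d{3}[.-]?\\d{4}$, so for your input the first alternative matches just (425). Keep the alternation inside a group:
Try
Pattern.compile("^((?:\\(\\d{3}\\)|\\d{3})[.-]?\\d{3}[.-]?\\d{4})");
this will provide you the whole matched number.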
If you instead need the parts separately, you have to add multiple capturing groups, like
Pattern.compile("^(?:\\((\\d{3})\\)|(\\d{3}))[.-]?(\\d{3})[.-]?(\\d{4})");
Btw. by using [.-]? you will also match 1232344567 (without dots/dashes). To fix this, drop the ? after [.-].
An optimized version could be:
Pattern.compile("^((\\((\\d{3})\\)|(\\d{3}))[.-](\\d{3})[.-](\\d{4}))");
This way you get the whole matched number as well as all included numbers separately.
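As a rough sketch of how the groups of that pattern can be read back: group 1 holds the whole number, groups 3 and 4 hold the area code with or without parentheses (only one of them is set), and groups 5 and 6 hold the remaining parts. PhonePartsDemo and parts are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhonePartsDemo {
    private static final Pattern PHONE =
            Pattern.compile("^((\\((\\d{3})\\)|(\\d{3}))[.-](\\d{3})[.-](\\d{4}))");

    // Hypothetical helper: returns {whole number, area code, prefix, line number},
    // or null when the input does not start with a phone number.
    static String[] parts(String input) {
        Matcher m = PHONE.matcher(input);
        if (!m.find()) {
            return null;
        }
        // Group 3 is set for "(425)", group 4 for a bare "425".
        String area = m.group(3) != null ? m.group(3) : m.group(4);
        return new String[] { m.group(1), area, m.group(5), m.group(6) };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", parts("(425).882.8080 tel")));
        System.out.println(String.join(" | ", parts("425-882-8080")));
    }
}
```

Since the separators are no longer optional here, an input such as 4258828080 is rejected.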
Another point: your initial regexp would also match 123.234-4567. If that's not desirable, another OR is needed to cover each separator.
E.g.
Pattern.compile("^((?:\\((\\d{3})\\)|(\\d{3}))(?:\\.(\\d{3})\\.(\\d{4})|-(\\d{3})-(\\d{4})))");
Updated for your last edit:
Pattern.compile("^((?:\\(\\d{3}\\)|\\d{3})(?:\\.\\d{3}\\.\\d{4}|-\\d{3}-\\d{4}))");
You don't need start and end anchors in your regex if the phone number can appear anywhere in the input.
\d{3}([.-])?\d{3}\1?\d{4}|\(\d{3}\)([.-])?\d{3}\2?\d{4}
As a Java string, the regex would be:
"\\d{3}([.-])?\\d{3}\\1?\\d{4}|\\(\\d{3}\\)([.-])?\\d{3}\\2?\\d{4}"
Example:
Pattern regex = Pattern.compile("\\d{3}([.-])?\\d{3}\\1?\\d{4}|\\(\\d{3}\\)([.-])?\\d{3}\\2?\\d{4}");
Matcher matcher = regex.matcher("(425).882.8080 tel");
while(matcher.find()){
System.out.println(matcher.group());
} // (425).882.8080
I'm trying to write a regex pattern that will match any sentence that begins with one or more tabs and/or spaces.
For example, I want my regex pattern to be able to match " hello there I like regex!"
but I'm scratching my head over how to match the words after "hello". So far I have this:
String REGEX = "(?s)(\\p{Blank}+)([a-z][ ])*";
Pattern PATTERN = Pattern.compile(REGEX);
Matcher m = PATTERN.matcher(" asdsada adf adfah.");
if (m.matches()) {
System.out.println("hurray!");
}
Any help would be appreciated. Thanks.
String regex = "^\\s+[A-Za-z,;'\"\\s]+[.?!]$";
^ means "begins with"
\\s means white space
+ means 1 or more
[A-Za-z,;'"\\s] means any letter, ,, ;, ', ", or whitespace character
$ means "ends with"
An example regex to match sentences by the definition "a sentence is a series of characters, starting with at least one whitespace character, that ends in one of ., ! or ?" is as follows:
\s+[^.!?]*[.!?]
Note that newline characters will also be included in this match.
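A quick sketch of that definition in Java, using find() in a loop so that every sentence in the input is returned. SentenceFindDemo and findAll are hypothetical names, and the sample text is my own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SentenceFindDemo {
    // Leading whitespace, then anything that is not a terminator, then one terminator.
    private static final Pattern SENTENCE = Pattern.compile("\\s+[^.!?]*[.!?]");

    // Hypothetical helper: returns every match, including its leading whitespace.
    static List<String> findAll(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            sentences.add(m.group());
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : findAll(" Hello there. How are you?")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

The leading whitespace is part of each match, so call trim() on the results if you only want the text.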
A sentence starts with a word boundary (hence \b) and ends with one or more terminators. Thus:
\b[^.!?]+[.!?]+
https://regex101.com/r/7DdyM1/1
This gives pretty accurate results. However, it will not handle fractional numbers. E.g. This sentence will be interpreted as two sentences:
The value of PI is 3.141...
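The split on fractional numbers can be reproduced with a few lines. This is just an illustration; BoundarySentenceDemo and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundarySentenceDemo {
    // Word boundary, a run of non-terminators, then one or more terminators.
    private static final Pattern SENTENCE = Pattern.compile("\\b[^.!?]+[.!?]+");

    // Hypothetical helper: returns every match in order.
    static List<String> findAll(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            sentences.add(m.group());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // The decimal point ends the first match, so the number is cut in two.
        System.out.println(findAll("The value of PI is 3.141..."));
        System.out.println(findAll("Hello there. How are you?"));
    }
}
```

Handling decimals would need extra logic, for example only treating a dot as a terminator when it is followed by whitespace or the end of the input.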
If you are looking to match all strings starting with whitespace, you can try the regular expression "^\s+.*".
This tool could help you test your regular expressions efficiently:
http://www.rubular.com/
Based upon what you desire and asked for, the following will work.
String s = " hello there I like regex!";
Pattern p = Pattern.compile("^\\s+[a-zA-Z\\s]+[.?!]$");
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.println("hurray!");
}
String regex = "(?<=^|(\\.|!|\\?) |\n|\t|\r|\r\n) *\\(?[A-Z][^.!?]*((\\.|!|\\?)(?! |\n|\r|\r\n)[^.!?]*)*(\\.|!|\\?)(?= |\n|\r|\r\n)";
This matches any sentence under the definition 'a sentence starts with a capital letter and ends with a ., ! or ? that is followed by whitespace'.
The below regex pattern matches sentences in a paragraph.
Pattern pattern = Pattern.compile("\\b[\\w\\p{Space}“”’\\p{Punct}&&[^.?!]]+[.?!]");
Reference: https://devsought.com/regex-pattern-to-match-sentence
I have the following string:
!date +10 (yyyy-MM-dd'T'HH:mm:ssz)
This string could also be (notice the minus instead of the plus):
!date -10 (yyyy-MM-dd'T'HH:mm:ssz)
I need a regex pattern that will extract the numeric digits after the + (or -). There could be more than one digit.
I also need a pattern to extract the contents of the parentheses ().
I've had a play around on regex pal, but couldn't get a working pattern.
Cheers.
To pick out the number & bracket content, you could do:
String str = "date +10 (yyyy-MM-dd'T'HH:mm:ssz)";
Matcher m = Pattern.compile(".*[+-](\\d+).*\\((.*)\\).*").matcher(str);
if (m.matches()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
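A variation on the snippet above, assuming the sign itself is also needed (the question only asks for the digits, so the extra group is my assumption). It uses find() instead of matches(), so no .* padding is required. DateDirectiveDemo and parse are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateDirectiveDemo {
    // Group 1: sign, group 2: digits, group 3: everything between the parentheses.
    private static final Pattern DIRECTIVE =
            Pattern.compile("([+-])(\\d+)\\s*\\(([^)]*)\\)");

    // Hypothetical helper: returns {signed offset, parenthesised content}, or null.
    static String[] parse(String input) {
        Matcher m = DIRECTIVE.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1) + m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] plus = parse("!date +10 (yyyy-MM-dd'T'HH:mm:ssz)");
        String[] minus = parse("!date -10 (yyyy-MM-dd'T'HH:mm:ssz)");
        System.out.println(plus[0] + " " + plus[1]);
        System.out.println(minus[0] + " " + minus[1]);
    }
}
```

Integer.parseInt accepts a leading + or -, so the first element can be converted to a number directly if needed.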
This regex should give you a match with the digits after the +/- and the contents of the parentheses in the first and second capturing group, respectively:
"!date\\s[+-](\\d+)\\s\\(([^)]*)\\)"
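The following regex also gives 2 capturing groups, but it only matches when the parentheses contain an actual timestamp (e.g. 2020-01-31'T'10:15:30z) rather than the format string itself: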
"!date\\s[+-](\\d+)\\s\\((\\d{4}-\\d{2}-\\d{2}'T'\\d{2}:\\d{2}:\\d{2}z)\\)"