Match all occurrences Regex Java - java

I'd like to find all "word-number-word" sequences in a string with the Java regex API.
For example, given "ABC-122-JDHFHG-456-MKJD", I'd like the output: [ABC-122-JDHFHG, JDHFHG-456-MKJD].
String test = "ABC-122-JDHFHG-456-MKJD";
Matcher m = Pattern.compile("(([A-Z]+)-([0-9]+)-([A-Z]+))+")
.matcher(test);
while (m.find()) {
System.out.println(m.group());
}
The code above returns only "ABC-122-JDHFHG".
Any ideas?

The last ([A-Z]+) matches and consumes JDHFHG, so the regex engine only "sees" -456-MKJD after the first match, and the pattern does not match this string remainder.
You want to get "whole word" overlapping matches.
Use
String test = "ABC-122-JDHFHG-456-MKJD";
Matcher m = Pattern.compile("(?=\\b([A-Z]+-[0-9]+-[A-Z]+)\\b)")
.matcher(test);
while (m.find()) {
System.out.println(m.group(1));
} // => [ ABC-122-JDHFHG, JDHFHG-456-MKJD ]
See the Java demo
Pattern details
(?= - start of a positive lookahead that matches a position that is immediately followed with
\\b - a word boundary
( - start of a capturing group (to be able to grab the value you need)
[A-Z]+ - 1+ ASCII uppercase letters
- - a hyphen
[0-9]+ - 1+ digits
- - a hyphen
[A-Z]+ - 1+ ASCII uppercase letters
) - end of the capturing group
\\b - a word boundary
) - end of the lookahead construct.

Here you go: overlap on the last word and collect capture group 1 into an array.
Basically, find three parts but consume only two, so that the next match attempt starts at the next possible word.
(?=(([A-Z]+-\d+-)[A-Z]+))\2
https://regex101.com/r/Sl5FgT/1
Formatted
(?= # Assert to find
( # (1 start), word,num,word
( # (2 start), word,num
[A-Z]+
-
\d+
-
) # (2 end)
[A-Z]+
) # (1 end)
)
\2 # Consume word,num
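A minimal Java sketch of the "find 3, consume 2" pattern above. The class and method names (OverlapDemo, findTriples) are made up for illustration; only the regex comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // The lookahead captures word-number-word in group 1 without consuming it;
    // the backreference \2 then consumes only the leading "word-number-" part.
    private static final Pattern TRIPLE =
            Pattern.compile("(?=(([A-Z]+-\\d+-)[A-Z]+))\\2");

    // Hypothetical helper: returns every overlapping word-number-word sequence.
    static List<String> findTriples(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TRIPLE.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ABC-122-JDHFHG, JDHFHG-456-MKJD]
        System.out.println(findTriples("ABC-122-JDHFHG-456-MKJD"));
    }
}
```

Because the trailing word is never consumed, it is still available as the leading word of the next match.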

Related

regex match two sets of digits from line

Matching two sets of numbers from a line. (2.66 and 34.3).
These numbers are variable in length but surrounded by whitespace, e.g.
Ox 2.66 abcda 34.3 abfdasd
I got 2.66 with \b(?:Ox)\s+(\d*\.*?\d+)
Any resources that can guide me in the right direction? I'm stuck on matching the second one separately.
cheers
You may continue the regex pattern and capture the second number after another word:
\bOx\s+(\d*\.?\d+)\s+\S+\s+(\d*\.?\d+)
See the regex demo. The second number will be in Group 2.
Details:
\b - a word boundary
Ox - a word Ox
\s+ - one or more whitespaces
(\d*\.?\d+) - Group 1: zero or more digits, an optional ., one or more digits
\s+ - one or more whitespaces
\S+ - one or more non-whitespaces
\s+ - one or more whitespaces
(\d*\.?\d+) - Group 2: zero or more digits, an optional ., one or more digits.
See a Java demo:
import java.util.*;
import java.util.regex.*;
class Test
{
public static void main (String[] args) throws java.lang.Exception
{
String s = "Ox 2.66 abcda 34.3 abfdasd";
Pattern pattern = Pattern.compile("\\bOx\\s+(\\d*\\.?\\d+)\\s+\\S+\\s+(\\d*\\.?\\d+)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(1)); // => 2.66
System.out.println(matcher.group(2)); // => 34.3
}
}
}
You could write the following regular expression: ([\d\D\s]*).
If you want numeric values only then ([\d\.]*).

Masking credit card number using regex

I am trying to mask a CC number so that only the third character and the last three characters remain unmasked.
For example, 7108898787654351 should become **0**********351.
I have tried (?<=.{3}).(?=.*...). It leaves the last three characters unmasked, but it also leaves the first three unmasked.
Can you give me some pointers on how to unmask the 3rd character alone?
You can use this regex with a lookahead and lookbehind:
str = str.replaceAll("(?<!^..).(?=.{3})", "*");
//=> **0**********351
RegEx Demo
RegEx Details:
(?<!^..): Negative lookbehind to assert that we don't have exactly 2 characters between the start of the string and the current position (to exclude the 3rd character from matching)
.: Match a character
(?=.{3}): Positive lookahead to assert that we have at least 3 characters ahead
I would suggest that regex isn't the only way to do this.
char[] m = new char[16]; // Or whatever length.
Arrays.fill(m, '*');
m[2] = cc.charAt(2);
m[13] = cc.charAt(13);
m[14] = cc.charAt(14);
m[15] = cc.charAt(15);
String masked = new String(m);
It might be more verbose, but it's a heck of a lot more readable (and debuggable) than a regex.
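The array approach above hard-codes a length of 16. A small sketch that derives the positions from the input length instead might look like this; CardMasker and mask are hypothetical names, and the sketch assumes the input has no separators such as dashes (they would be masked too).

```java
import java.util.Arrays;

class CardMasker {
    // Hypothetical helper: keeps the 3rd character and the last three, masks the rest.
    static String mask(String cc) {
        int n = cc.length();
        if (n < 3) {
            throw new IllegalArgumentException("need at least 3 characters");
        }
        char[] m = new char[n];
        Arrays.fill(m, '*');
        m[2] = cc.charAt(2);              // third character stays visible
        for (int i = n - 3; i < n; i++) { // last three characters stay visible
            m[i] = cc.charAt(i);
        }
        return new String(m);
    }

    public static void main(String[] args) {
        System.out.println(mask("7108898787654351"));
    }
}
```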
Here is another regular expression:
(?!(?:\D*\d){14}$|(?:\D*\d){1,3}$)\d
See the online demo
It may seem a bit unwieldy but since a credit card should have 16 digits I opted to use negative lookaheads to look for an x amount of non-digits followed by a digit.
(?! - Negative lookahead
(?: - Open 1st non capture group.
\D*\d - Match zero or more non-digits and a single digit.
){14} - Close 1st non capture group and match it 14 times.
$ - End string anchor.
| - Alternation/OR.
(?: - Open 2nd non capture group.
\D*\d - Match zero or more non-digits and a single digit.
){1,3} - Close 2nd non capture group and match it 1 to 3 times.
$ - End string anchor.
) - Close negative lookahead.
\d - Match a single digit.
This would now mask any digit other than the third and last three regardless of their position (due to delimiters) in the formatted CC-number.
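For reference, here is how that pattern could be applied from Java. This is only a sketch: DigitMasker and mask are illustrative names, and it assumes the input contains exactly 16 digits, with or without delimiters.

```java
class DigitMasker {
    // A digit is left alone when it is the 14th-from-last digit (the third of 16)
    // or one of the last three digits; every other digit is replaced.
    private static final String REGEX =
            "(?!(?:\\D*\\d){14}$|(?:\\D*\\d){1,3}$)\\d";

    // Hypothetical helper wrapping String.replaceAll.
    static String mask(String cc) {
        return cc.replaceAll(REGEX, "*");
    }

    public static void main(String[] args) {
        System.out.println(mask("7108898787654351"));
        System.out.println(mask("7108-8987-8765-4351")); // delimiters are kept as-is
    }
}
```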
Regardless of where the dashes are after the first 3 digits, leave the 3rd digit unmatched and make sure that there are always 3 digits at the end of the string:
(?<!^\d{2})\d(?=[\d-]*\d-?\d-?\d$)
Explanation
(?<! Negative lookbehind, assert what is on the left is not
^\d{2} Match 2 digits from the start of the string
) Close lookbehind
\d Match a digit
(?= Positive lookahead, assert what is on the right is
[\d-]* 0+ occurrences of either - or a digit
\d-?\d-?\d Match 3 digits with optional hyphens
$ End of string
) Close lookahead
Regex demo | Java demo
Example code
String regex = "(?<!^\\d{2})\\d(?=[\\d-]*\\d-?\\d-?\\d$)";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
String strings[] = { "7108898787654351", "7108-8987-8765-4351"};
for (String s : strings) {
Matcher matcher = pattern.matcher(s);
System.out.println(matcher.replaceAll("*"));
}
Output
**0**********351
**0*-****-****-*351
Don't think you should use a regex to do what you want. You could use StringBuilder to create the required string
String str = "7108-8987-8765-4351";
StringBuilder sb = new StringBuilder("*".repeat(str.length()));
for (int i = 0; i < str.length(); i++) {
if (i == 2 || i >= str.length() - 3) {
sb.replace(i, i + 1, String.valueOf(str.charAt(i)));
}
}
System.out.print(sb.toString()); // output: **0*************351
You may add a ^.{0,1} alternative to allow matching . when it is the first or second char in the string:
String s = "7108898787654351"; // **0**********351
System.out.println(s.replaceAll("(?<=.{3}|^.{0,1}).(?=.*...)", "*"));
// => **0**********351
The regex can be written as a PCRE compliant pattern, too: (?<=.{3}|^|^.).(?=.*...).
It is equal to
System.out.println(s.replaceAll("(?<!^..).(?=.*...)", "*"));
See the Java demo and a regex demo.
Regex details
(?<=.{3}|^.{0,1}) - there must be any three chars other than line break chars immediately to the left of the current location, or start of string, or a single char at the start of the string
(?<!^..) - a negative lookbehind that fails the match if the two chars immediately to the left of the current location (any chars other than line break chars) are the first two chars of the string
. - any char but a line break char
(?=.*...) - there must be any three chars other than line break chars immediately to the right of the current location.
If the CC number always has 16 digits, as it does in the example, and as do Visa and MasterCard CC's, matches of the following regular expression can be replaced with an asterisk.
\d(?!\d{0,2}$|\d{13}$)
Start your engine!
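In Java, that pattern could be used as sketched below. FixedLengthMasker and mask are made-up names, and the sketch relies on the stated assumption that the number is exactly 16 digits with no separators.

```java
class FixedLengthMasker {
    // A digit is masked unless it is followed by 0-2 digits and the end of the string
    // (one of the last three) or by exactly 13 digits and the end (the third of 16).
    private static final String REGEX = "\\d(?!\\d{0,2}$|\\d{13}$)";

    // Hypothetical helper wrapping String.replaceAll.
    static String mask(String cc) {
        return cc.replaceAll(REGEX, "*");
    }

    public static void main(String[] args) {
        System.out.println(mask("7108898787654351"));
    }
}
```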

Masking using regular expressions for below format

I am trying to write a regular expression to mask the below string. Example below.
Input
A1../D//FASDFAS--DFASD//.F
Output (skip the first five and last two alphanumerics)
A1../D//FA***********D//.F
I am trying using below regex
([A-Za-z0-9]{5})(.*)(.{2})
Any help would be highly appreciated.
You can solve your issue by using Pattern and Matcher with a regex that matches multiple groups:
String str = "A1../D//FASDFAS--DFASD//.F";
Pattern pattern = Pattern.compile("(.*?\\/\\/..)(.*?)(.\\/\\/.*)");
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
str = matcher.group(1)
+ matcher.group(2).replaceAll(".", "*")
+ matcher.group(3);
}
Detail
(.*?\\/\\/..) first group to match every thing until //
(.*?) second group to match every thing between group one and three
(.\\/\\/.*) third group to match every thing after the last character before the // until the end of string
Outputs
A1../D//FA***********D//.F
I think this solution is more readable.
If you want to do that with a single regex you may use
text = text.replaceAll("(\\G(?!^|(?:[0-9A-Za-z][^0-9A-Za-z]*){2}$)|^(?:[^0-9A-Za-z]*[0-9A-Za-z]){5}).", "$1*");
Or, using the POSIX character class Alnum:
text = text.replaceAll("(\\G(?!^|(?:\\p{Alnum}\\P{Alnum}*){2}$)|^(?:\\P{Alnum}*\\p{Alnum}){5}).", "$1*");
See the Java demo and the regex demo. If you plan to replace any code point rather than a single code unit with an asterisk, replace . with \P{M}\p{M}*+ ("\\P{M}\\p{M}*+").
To make . match line break chars, add (?s) at the start of the pattern.
Details
(\G(?!^|(?:[0-9A-Za-z][^0-9A-Za-z]*){2}$)|^(?:[^0-9A-Za-z]*[0-9A-Za-z]){5}) -
\G(?!^|(?:[0-9A-Za-z][^0-9A-Za-z]*){2}$) - a location after the successful match that is not followed with 2 occurrences of an alphanumeric char followed with 0 or more chars other than alphanumeric chars
| - or
^(?:[^0-9A-Za-z]*[0-9A-Za-z]){5} - start of string, followed with five occurrences of 0 or more non-alphanumeric chars followed with an alphanumeric char
. - any code unit other than line break characters (if you use \P{M}\p{M}*+ - any code point).
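To see the \G-based pattern in action, a self-contained sketch follows. The names MiddleMasker and mask are assumptions for illustration; the regex is the ASCII-class variant given above.

```java
class MiddleMasker {
    // Group 1 is either the leading run holding the first five alphanumerics
    // (first match) or empty (every later match, anchored by \G).
    private static final String REGEX =
            "(\\G(?!^|(?:[0-9A-Za-z][^0-9A-Za-z]*){2}$)"
            + "|^(?:[^0-9A-Za-z]*[0-9A-Za-z]){5}).";

    // Hypothetical helper: masks everything between the first five
    // and the last two alphanumeric characters.
    static String mask(String text) {
        return text.replaceAll(REGEX, "$1*");
    }

    public static void main(String[] args) {
        System.out.println(mask("A1../D//FASDFAS--DFASD//.F"));
    }
}
```

Each match replaces exactly one character, so the masked string keeps the length of the input.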
Usually, masking of characters in the middle of a string can be done using negative lookbehind (?<!) and positive lookahead groups (?=).
But in this case lookbehind group can't be used because it does not have an obvious maximum length due to unpredictable number of non-alphanumeric characters between first five alphanumeric characters (. and / in the A1../D//FA).
A substring method can be used as a workaround for the inability to use a negative lookbehind group:
String str = "A1../D//FASDFAS--DFASD//.F";
int start = str.replaceAll("^((?:\\W{0,}\\w{1}){5}).*", "$1").length();
String maskedStr = str.substring(0, start) +
str.substring(start).replaceAll(".(?=(?:\\W{0,}\\w{1}){2})", "*");
System.out.println(maskedStr);
// A1../D//FA***********D//.F
But the most straightforward way is to use java.util.regex.Pattern and java.util.regex.Matcher:
String str = "A1../D//FASDFAS--DFASD//.F";
Pattern pattern = Pattern.compile("^((?:\\W{0,}\\w{1}){5})(.+)((?:\\W{0,}\\w{1}){2})");
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
String maskedStr = matcher.group(1) +
"*".repeat(matcher.group(2).length()) +
matcher.group(3);
System.out.println(maskedStr);
// A1../D//FA***********D//.F
}
\W{0,} - 0 or more non-alphanumeric characters
\w{1} - exactly 1 alphanumeric character
(\W{0,}\w{1}){5} - 5 alphanumeric characters and any number of non-alphanumeric characters in between
(?:\W{0,}\w{1}){5} - do not capture as a group
^((?:\\W{0,}\\w{1}){5})(.+)((?:\\W{0,}\\w{1}){2})$ - substring with first five alphanumeric characters (group 1), everything else (group 2), substring with last 2 alphanumeric characters (group 3)

Java: Extracting a specific REGEXP pattern out of a string

How is it possible to extract only a time part of the form XX:YY out of a string?
For example - from a string like:
sdhgjhdgsjdf12:34knvxjkvndf, I would like to extract only 12:34.
( The surrounding chars can be spaces too of course )
Of course I can find the colon and get two chars before and two chars after, but it is bahhhhhh.....
You can use this look-around based regex for your match:
(?<!\d)\d{2}:\d{2}(?!\d)
RegEx Demo
In Java:
Pattern p = Pattern.compile("(?<!\\d)\\d{2}:\\d{2}(?!\\d)");
RegEx Breakup:
(?<!\d) # negative lookbehind to assert previous char is not a digit
\d{2} # match exact 2 digits
: # match a colon
\d{2} # match exact 2 digits
(?!\d) # negative lookahead to assert next char is not a digit
Full Code:
Pattern p = Pattern.compile("(?<!\\d)\\d{2}:\\d{2}(?!\\d)");
Matcher m = p.matcher(inputString);
if (m.find()) {
System.err.println("Time: " + m.group());
}
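If the input may contain more than one time, the same pattern can be looped over. This is a sketch with hypothetical names (TimeExtractor, extractTimes); note that the regex only checks the XX:YY shape and does not validate ranges, so 99:99 would also match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimeExtractor {
    // Two digits, a colon, two digits, not adjacent to any other digit.
    private static final Pattern TIME =
            Pattern.compile("(?<!\\d)\\d{2}:\\d{2}(?!\\d)");

    // Hypothetical helper: collects every XX:YY occurrence in the input.
    static List<String> extractTimes(String input) {
        List<String> times = new ArrayList<>();
        Matcher m = TIME.matcher(input);
        while (m.find()) {
            times.add(m.group());
        }
        return times;
    }

    public static void main(String[] args) {
        // prints [12:34]
        System.out.println(extractTimes("sdhgjhdgsjdf12:34knvxjkvndf"));
    }
}
```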

Extracting numbers from a String in Java by splitting on a regex

I want to extract numbers from Strings like this:
String numbers[] = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34".split(PATTERN);
From such String I'd like to extract these numbers:
0.286
-3.099
-0.44
-2.901
-0.436
123
0.123
.34
That is:
There can be garbage characters like "M", "c", "c"
The "-" sign is to include in the number, not to split on
A "number" can be anything that Float.parseFloat can parse, so .34 is valid
What I have so far:
String PATTERN = "([^\\d.-]+)|(?=-)";
Which works to some degree, but obviously far from perfect:
Doesn't skip the starting garbage "M" in the example
Doesn't handle consecutive garbage, like the ,,, in the middle
How to fix PATTERN to make it work?
You could use a regex like this:
([-.]?\d+(?:\.\d+)?)
Working demo
Match Information:
MATCH 1
1. [1-6] `0.286`
MATCH 2
1. [6-12] `-3.099`
MATCH 3
1. [12-17] `-0.44`
MATCH 4
1. [18-24] `-2.901`
MATCH 5
1. [25-31] `-0.436`
MATCH 6
1. [34-37] `123`
MATCH 7
1. [38-43] `0.123`
MATCH 8
1. [44-47] `.34`
Update
Jawee's approach
As Jawee pointed out in his comment, there is a problem with input such as .34.34, so you can use his regex, which fixes it. Thanks to Jawee for pointing that out.
(-?(?:\d+)?\.?\d+)
To get a graphical idea of what happens behind this regex, you can check it in Debuggex.
Engine explanation:
1st Capturing group (-?(?:\d+)?\.?\d+)
-? -> matches the character - literally zero or one time
(?:\d+)? -> \d+ matches a digit [0-9] one or more times (inside an optional non-capturing group)
\.? -> matches the character . literally zero or one time
\d+ -> matches a digit [0-9] one or more times
Try this one (-?(?:\d+)?\.?\d+)
Example as below:
Demo Here
Thanks a lot for nhahtdh's comments. That's true, we could update as below:
[-+]?(?:\d+(?:\.\d*)?|\.\d+)
Updated Demo Here
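A short sketch of how the updated pattern might be used from Java on the question's input. FloatExtractor and extract are illustrative names, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FloatExtractor {
    // Optional sign, then either digits with an optional fraction, or a bare fraction.
    private static final Pattern NUMBER =
            Pattern.compile("[-+]?(?:\\d+(?:\\.\\d*)?|\\.\\d+)");

    // Hypothetical helper: returns the matched number strings in order of appearance.
    static List<String> extract(String input) {
        List<String> numbers = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            numbers.add(m.group());
        }
        return numbers;
    }

    public static void main(String[] args) {
        String s = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";
        for (String n : extract(s)) {
            // Every match is in a form Float.parseFloat accepts.
            System.out.println(n + " -> " + Float.parseFloat(n));
        }
    }
}
```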
Actually, if we take all possible float input String format (e.g: Infinity, -Infinity, 00, 0xffp23d, 88F), then it could be a little bit complicated. However, we still could implement it as below Java code:
String sign = "[-+]?";
String hexFloat = "(?>0[xX](((\\p{XDigit}+)\\.?)|((\\p{XDigit}*)\\.(\\p{XDigit}+)))[pP]([-+])?(\\p{Digit}+)[fFdD]?)";
String nan = "(?>NaN)";
String inf = "(?>Infinity)";
String dig = "(?>\\d+(?:\\.\\d*)?|\\.\\d+)";
String exp = "(?:[eE][-+]?\\d+)?";
String suf = "[fFdD]?";
String digFloat = "(?>" + dig + exp + suf + ")";
String wholeFloat = sign + "(?>" + hexFloat + "|" + nan + "|" + inf + "|" + digFloat + ")";
String s = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123d,.34d.34.34M24.NaNNaN,Infinity,-Infinity00,0xffp23d,88F";
Pattern floatPattern = Pattern.compile(wholeFloat);
Matcher matcher = floatPattern.matcher(s);
int i = 0;
while (matcher.find()) {
String f = matcher.group();
System.out.println(i++ + " : " + f + " --- " + Float.parseFloat(f) );
}
Then the output is as below:
0 : 0.286 --- 0.286
1 : -3.099 --- -3.099
2 : -0.44 --- -0.44
3 : -2.901 --- -2.901
4 : -0.436 --- -0.436
5 : 123 --- 123.0
6 : 0.123d --- 0.123
7 : .34d --- 0.34
8 : .34 --- 0.34
9 : .34 --- 0.34
10 : 24. --- 24.0
11 : NaN --- NaN
12 : NaN --- NaN
13 : Infinity --- Infinity
14 : -Infinity --- -Infinity
15 : 00 --- 0.0
16 : 0xffp23d --- 2.13909504E9
17 : 88F --- 88.0
You can do it in one line (but with one less step than aioobe's answer!):
String[] numbers = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34"
.replaceAll("^[^.\\d-]+|[^.\\d-]+$", "") // remove junk from start/end
.split("[^.\\d-]+"); // split on anything not part of a number
Although fewer calls are made, aioobe's answer is easier to read and understand, which makes his the better code.
Using the regex you crafted yourself you can solve it as follows:
String[] numbers = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34"
.replaceAll(PATTERN, " ")
.trim()
.split(" +");
On the other hand, if I were you, I'd do the loop instead:
Matcher m = Pattern.compile("[.-]?\\d+(\\.\\d+)?").matcher(input);
List<String> matches = new ArrayList<>();
while (m.find())
matches.add(m.group());
I think this is exactly what you want:
String pattern = "[-+]?[0-9]*\\.?[0-9]+";
String line = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
List<String> numbers=new ArrayList<String>();
while(m.find()) {
numbers.add(m.group());
}
It's nice you put a bounty on this.
Unfortunately, as you probably already know, this can't be done using
Java's string split method directly.
If it can't be done directly, there is no reason to kludge it as it is, well .. a kludge.
The reasons are many, some related, some not.
To start off, you need to define a good regex as a base.
This is the only regex I know that will validate and extract a proper form:
# "((?=[+-]?\\d*\\.?\\d)[+-]?\\d*\\.?\\d*)"
( # (1 start)
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
) # (1 end)
So, looking at this base regex, it's clear that the form it matches is the form you want.
In the case of split, you don't want the form that this matches, because that's
where you want the breaks to be.
As I look at Java's split, I see that no matter what it matches, it will be excluded
from the resulting array.
So, presuming split usage, the first thing to match (and consume) is all the stuff that is not
this. That part will be something like this:
(?:
(?!
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)
.
)+
Since the only thing left is valid decimal numbers, the next break will be somewhere
between valid numbers. This part, added to the first part, will be something like this:
(?:
(?!
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)
.
)+
| # or,
(?<=
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)
(?=
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)
And all of a sudden, we have a problem: a variable-length lookbehind assertion.
So, it's game over for the whole thing.
Lastly and unfortunately, Java does not (as far as I can see) have a provision to include capture
group contents (matched in the regex) as an element in the resulting array.
Perl does, but I can't find that ability in Java.
If Java had that provision, the break sub expressions could be combined to do a seamless split.
Like this:
(?:
(?!
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)
.
)*
(
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)
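Since split discards whatever it matches, one workaround is to match the numbers directly with the base regex and build the array from the matches. The sketch below assumes Java 9 or later for Matcher.results(); the names SplitAlternative and numbers are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class SplitAlternative {
    // Base regex from above: the lookahead guarantees at least one digit,
    // so a match is never empty.
    private static final Pattern NUMBER =
            Pattern.compile("(?=[+-]?\\d*\\.?\\d)[+-]?\\d*\\.?\\d*");

    // Hypothetical helper: returns the matched numbers as an array,
    // which is the result a "seamless split" would have produced.
    static String[] numbers(String input) {
        return NUMBER.matcher(input)
                .results()                 // Stream<MatchResult>, Java 9+
                .map(MatchResult::group)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String s = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";
        System.out.println(Arrays.toString(numbers(s)));
    }
}
```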
