pattern split to get all values in a string representing object

I have Strings that represent rows in a table like this:
{failures=4, successes=6, name=this_is_a_name, p=40.00}
I made an expression that can be used with Pattern.split() to get me back all of the values in a String[]:
[\{\,](.*?)\=
In the online regex tester it works well with the exception of the ending }.
But when I actually run the pattern against my first row I get a String[] where the first element is an empty string. I only want the 4 values (not keys) from each row not the extra empty value.
Pattern getRowValues = Pattern.compile("[\\{\\,](.*?)\\=");
String[] row = getRowValues.split("{failures=4, successes=6, name=this_is_a_name, p=40.00}");
//CURRENT
//row[0]=> ""
//row[1]=>"4"
//row[2]=>"6"
//row[3]=>"this_is_a_name"
//row[4]=>"40.00}"
//WANT
//row[0]=>"4"
//row[1]=>"6"
//row[2]=>"this_is_a_name"
//row[3]=>"40.00"

// rowString holds the raw row text (the String, not the compiled Pattern)
String[] parts = rowString
    // Strip off the leading '{' and trailing '}'
    .replaceAll("^\\{|\\}$", "")
    // then just split on comma-space
    .split(", ");
If you want just the values:
String[] parts = rowString
    // Strip off the leading '{' and everything up to and including the first '=',
    // and the trailing '}'
    .replaceAll("^\\{[^=]*=|\\}$", "")
    // then split on comma-space plus everything up to and including the next '='
    .split(", [^=]*=");

Option 1
Modify your regular expression to [{,](.*?)=|[}] where I removed all the unnecessarily escaped characters in each of the [...] constructs and added the |[}]
See also Live Demo
Option 2
=([^,]*)[,}]
This regular expression will do the following:
capture all the substrings after the = and before the , or close }
Example
Live Demo
https://regex101.com/r/yF2gG7/1
Sample text
{failures=4, successes=6, name=this_is_a_name, p=40.00}
Capture groups
Each match gets the following capture groups:
Capture group 0 gets the entire substring from = to , or }
Capture group 1 gets just the value not including the =, ,, or } characters
Sample Matches
[0][0] = =4,
[0][1] = 4
[1][0] = =6,
[1][1] = 6
[2][0] = =this_is_a_name,
[2][1] = this_is_a_name
[3][0] = =40.00}
[3][1] = 40.00
Explanation
NODE         EXPLANATION
----------------------------------------------------------------------
=            '='
(            group and capture to \1:
  [^,]*      any character except: ',' (0 or more times (matching the most amount possible))
)            end of \1
[,}]         any character of: ',', '}'
----------------------------------------------------------------------
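Since Option 2 relies on a capture group rather than on split(), it is used with a Matcher loop. The following is a minimal sketch of that idea; the class name RowValues and the helper values() are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RowValues {
    // Option 2 pattern: capture everything between '=' and the next ',' or '}'.
    private static final Pattern VALUE = Pattern.compile("=([^,]*)[,}]");

    // Hypothetical helper: returns the values of one row, in order.
    static List<String> values(String row) {
        List<String> out = new ArrayList<>();
        Matcher m = VALUE.matcher(row);
        while (m.find()) {
            out.add(m.group(1)); // group 1 excludes the '=' and the trailing delimiter
        }
        return out;
    }

    public static void main(String[] args) {
        String row = "{failures=4, successes=6, name=this_is_a_name, p=40.00}";
        System.out.println(values(row)); // [4, 6, this_is_a_name, 40.00]
    }
}
```

Unlike split(), this never produces a leading empty element, because only the matched values are collected.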

Related

Regex to get everything but last chars of capture group

How do I use regex to select everything before last 4 char in a capture group?
Example:
String str = "{Index1=StudentData(studentName=Sam, idNumber=321231312), Index2=StudentData(studentName=Adam, idNumber=5675), Index3=StudentData(studentName=Lisa, idNumber=67124)}";
String regex = "(?<=idNumber=)[a-zA-Z1-9]+(?=\\))";
System.out.println(str.replaceAll(regex, "*"));
Current output:
{Index1=StudentData(studentName=Sam, idNumber=*), Index2=StudentData(studentName=Adam, idNumber=*), Index3=StudentData(studentName=Lisa, idNumber=*)}
Desired output:
{Index1=StudentData(studentName=Sam, idNumber=*****1312), Index2=StudentData(studentName=Adam, idNumber=5675), Index3=StudentData(studentName=Lisa, idNumber=*7124)}

You can use this regex in Java:
(\hidNumber=|(?!^)\G)[a-zA-Z1-9](?=[a-zA-Z1-9]{4,}\))
And replace with $1*.
RegEx Demo
Java Code:
final String re = "(\\hidNumber=|(?!^)\\G)[a-zA-Z1-9](?=[a-zA-Z1-9]{4,}\\))";
String r = s.replaceAll(re, "$1*");
Breakdown:
(: Start capture group #1
\h: Match a horizontal whitespace character
idNumber=: Match text idNumber=
|: OR
(?!^)\G: Start at the end of the previous match
): Close capture group #1
[a-zA-Z1-9]: Match an ASCII letter or digit 1-9
(?=[a-zA-Z1-9]{4,}\)): Make sure that ahead of current position we have at least 4 ASCII letters or digits 1-9 followed by )
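To see the \G technique from the breakdown in action, here is a small self-contained sketch. The class name MaskIdNumber and the helper mask() are invented for this example; the pattern is the one from the answer, including its [a-zA-Z1-9] class, which (like the question's) does not cover the digit 0.

```java
final class MaskIdNumber {
    // Either start right after " idNumber=" or continue from the previous match (\G),
    // and only consume a character when at least 4 more follow before ')'.
    private static final String RE =
        "(\\hidNumber=|(?!^)\\G)[a-zA-Z1-9](?=[a-zA-Z1-9]{4,}\\))";

    // Hypothetical helper: masks everything except the last 4 characters of each idNumber.
    static String mask(String s) {
        return s.replaceAll(RE, "$1*");
    }

    public static void main(String[] args) {
        String str = "{Index1=StudentData(studentName=Sam, idNumber=321231312), "
            + "Index2=StudentData(studentName=Adam, idNumber=5675), "
            + "Index3=StudentData(studentName=Lisa, idNumber=67124)}";
        System.out.println(mask(str));
    }
}
```

With the sample input, the ids should come out as *****1312, 5675 and *7124, matching the desired output in the question.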

Masking credit card number using regex

I am trying to mask a CC number so that only the third character and the last three characters are left unmasked.
For example, 7108898787654351 becomes **0**********351
I have tried (?<=.{3}).(?=.*...). It leaves the last three characters unmasked, but it also leaves the first three unmasked.
Can you give me some pointers on how to unmask the 3rd character alone?

You can use this regex with a lookahead and lookbehind:
str = str.replaceAll("(?<!^..).(?=.{3})", "*");
//=> **0**********351
RegEx Demo
RegEx Details:
(?<!^..): Negative lookbehind to assert that we don't have exactly 2 characters between the start and the current position (to exclude the 3rd character from matching)
.: Match a character
(?=.{3}): Positive lookahead to assert that we have at least 3 characters ahead

I would suggest that regex isn't the only way to do this.
char[] m = new char[16]; // Or whatever length.
Arrays.fill(m, '*');
m[2] = cc.charAt(2);
m[13] = cc.charAt(13);
m[14] = cc.charAt(14);
m[15] = cc.charAt(15);
String masked = new String(m);
It might be more verbose, but it's a heck of a lot more readable (and debuggable) than a regex.

Here is another regular expression:
(?!(?:\D*\d){14}$|(?:\D*\d){1,3}$)\d
See the online demo
It may seem a bit unwieldy but since a credit card should have 16 digits I opted to use negative lookaheads to look for an x amount of non-digits followed by a digit.
(?! - Negative lookahead
(?: - Open 1st non capture group.
\D*\d - Match zero or more non-digits and a single digit.
){14} - Close 1st non capture group and match it 14 times.
$ - End string anchor.
| - Alternation/OR.
(?: - Open 2nd non capture group.
\D*\d - Match zero or more non-digits and a single digit.
){1,3} - Close 2nd non capture group and match it 1 to 3 times.
$ - End string anchor.
) - Close negative lookahead.
\d - Match a single digit.
This would now mask any digit other than the third and last three regardless of their position (due to delimiters) in the formatted CC-number.
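A short sketch of how this digit-counting lookahead could be applied with replaceAll follows. The class name MaskCard and the helper mask() are assumptions made for the example; it also assumes the input contains exactly 16 digits, as the answer does.

```java
final class MaskCard {
    // A digit is masked unless exactly 14 digits (itself included) remain up to the end
    // (the 3rd of 16 digits), or only 1 to 3 digits remain (the last three).
    private static final String RE = "(?!(?:\\D*\\d){14}$|(?:\\D*\\d){1,3}$)\\d";

    // Hypothetical helper: masks digits only, so separators such as '-' are kept.
    static String mask(String cc) {
        return cc.replaceAll(RE, "*");
    }

    public static void main(String[] args) {
        System.out.println(mask("7108898787654351"));    // **0**********351
        System.out.println(mask("7108-8987-8765-4351")); // **0*-****-****-*351
    }
}
```

Because the lookahead counts digits rather than characters, the same pattern works with or without delimiters.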

Assuming any dashes come only after the first 3 digits, leave the 3rd digit unmatched and make sure that there are always 3 digits left at the end of the string:
(?<!^\d{2})\d(?=[\d-]*\d-?\d-?\d$)
Explanation
(?<! Negative lookbehind, assert what is on the left is not
^\d{2} Match 2 digits from the start of the string
) Close lookbehind
\d Match a digit
(?= Positive lookahead, assert what is on the right is
[\d-]* 0+ occurrences of either - or a digit
\d-?\d-?\d Match 3 digits with optional hyphens
$ End of string
) Close lookahead
Regex demo | Java demo
Example code
String regex = "(?<!^\\d{2})\\d(?=[\\d-]*\\d-?\\d-?\\d$)";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
String[] strings = { "7108898787654351", "7108-8987-8765-4351" };
for (String s : strings) {
    Matcher matcher = pattern.matcher(s);
    System.out.println(matcher.replaceAll("*"));
}
Output
**0**********351
**0*-****-****-*351

I don't think you need a regex for this. You could use a StringBuilder to build the required string:
String str = "7108-8987-8765-4351";
StringBuilder sb = new StringBuilder("*".repeat(str.length())); // String.repeat needs Java 11+
for (int i = 0; i < str.length(); i++) {
    // keep the 3rd character and the last three characters
    if (i == 2 || i >= str.length() - 3) {
        sb.setCharAt(i, str.charAt(i));
    }
}
System.out.print(sb.toString()); // output: **0*************351

You may add a ^.{0,1} alternative to allow matching . when it is the first or second char in the string:
String s = "7108898787654351"; // **0**********351
System.out.println(s.replaceAll("(?<=.{3}|^.{0,1}).(?=.*...)", "*"));
// => **0**********351
The regex can be written as a PCRE compliant pattern, too: (?<=.{3}|^|^.).(?=.*...).
It is equal to
System.out.println(s.replaceAll("(?<!^..).(?=.*...)", "*"));
See the Java demo and a regex demo.
Regex details
(?<=.{3}|^.{0,1}) - there must be any three chars other than line break chars immediately to the left of the current location, or start of string, or a single char at the start of the string
(?<!^..) - a negative lookbehind that fails the match if exactly two chars other than line break chars lie between the start of the string and the current location
. - any char but a line break char
(?=.*...) - there must be at least three chars other than line break chars to the right of the current location.

If the CC number always has 16 digits, as it does in the example, and as do Visa and MasterCard CC's, matches of the following regular expression can be replaced with an asterisk.
\d(?!\d{0,2}$|\d{13}$)
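Since that pattern only works under the 16-digit assumption, a cautious sketch might check the length first. The class name MaskSixteen and the helper mask() are hypothetical, and rejecting other inputs with an exception is just one possible choice.

```java
final class MaskSixteen {
    // Hypothetical helper: masks all digits except the 3rd and the last three.
    static String mask(String cc) {
        // The lookahead below counts trailing digits, so it is only meaningful for 16 digits.
        if (!cc.matches("\\d{16}")) {
            throw new IllegalArgumentException("expected exactly 16 digits");
        }
        // Skip a digit followed by 0-2 digits (last three) or by exactly 13 digits (the 3rd).
        return cc.replaceAll("\\d(?!\\d{0,2}$|\\d{13}$)", "*");
    }

    public static void main(String[] args) {
        System.out.println(mask("7108898787654351")); // **0**********351
    }
}
```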

How to Split text by Numbers and Group of words

Assuming I have a string containing
- some comma separated string
- and text
my_string = "2 Marine Cargo 14,642 10,528 16,016 more text 8,609 argA 2,106 argB"
I would like to extract them into an array that is split by "Numbers" and "group of words"
resultArray = {"2", "Marine Cargo", "14,642", "10,528", "16,016",
"more text", "8,609", "argA", "2,106", "argB"};
note 0: there might be multiple spaces between each entries, which should be ignored.
note 1: "Marine Cargo" and "more text" is not separated into different strings since they are a group of words without numbers separating them.
while argA and argB are separated because there's a number between them.

You can try matching (rather than splitting) with this regex:
([\d,]+|[a-zA-Z]+ *[a-zA-Z]*) //note the spacing between + and *.
[0-9,]+ // will search for one or more digits and commas
[a-zA-Z]+ *[a-zA-Z]* // will search for a word, followed by a space (if any), followed by another word (if any).
String regEx = "[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*";
You use it like this:
public static void main(String[] args) {
    String input = "2 Marine Cargo 14,642 10,528 16,016 more text 8,609 argA 2,106 argB";
    System.out.println("Return Value :");
    Pattern pattern = Pattern.compile("[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*");
    ArrayList<String> result = new ArrayList<String>();
    Matcher m = pattern.matcher(input);
    while (m.find()) {
        System.out.println(">" + m.group(0) + "<");
        result.add(m.group(0));
    }
}
The following is a detailed explanation of the regex, autogenerated by https://regex101.com
1st Alternative [0-9,]+
Match a single character present in the list below [0-9,]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
0-9 a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
, matches the character , literally (case sensitive)
2nd Alternative [a-zA-Z]+ *[a-zA-Z]*
Match a single character present in the list below [a-zA-Z]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
' ' matches the space character literally (case sensitive)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [a-zA-Z]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)

If spaces are your problem: String#split takes a regex as its parameter, so you could do this:
my_list = Arrays.asList(my_string.split("\\s+"));
But this won't solve all the problems, like those mentioned in the comments.

You could do something like so:
List<String> strings = new ArrayList<>();
String prev = null;
for (String w : my_string.split("\\s+")) {
    if (w.matches("\\d+(?:,\\d+)?")) {
        // a number ends the current group of words, if any
        if (prev != null) {
            strings.add(prev);
            prev = null;
        }
        strings.add(w);
    } else if (prev == null) {
        prev = w;
    } else {
        prev += " " + w;
    }
}
if (prev != null) {
    strings.add(prev);
}

I like Angel Koh's solution and want to add to it. His solution only matches a group of words if it consists of one or two words.
If you also want to capture groups of three or more words, you have to alter the regex a bit to: ([\d,]+|[a-zA-Z]+(?: *[a-zA-Z])*)
The non-capturing group (?: *[a-zA-Z]) repeats as many times as needed, so the second alternative captures a whole run of words, however long.
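As a quick check of the adjusted regex, here is a minimal sketch that collects the matches into a list. The class name SplitNumbersAndWords and the helper tokens() are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitNumbersAndWords {
    // Either a run of digits and commas, or words separated by any number of spaces.
    private static final Pattern TOKEN =
        Pattern.compile("[\\d,]+|[a-zA-Z]+(?: *[a-zA-Z])*");

    // Hypothetical helper: returns numbers and word groups in order of appearance.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "2 Marine Cargo 14,642 10,528 16,016 more text 8,609 argA 2,106 argB";
        System.out.println(tokens(s));
    }
}
```

Because each repetition of the group must end in a letter, a word group never carries a trailing space, so argA comes back without the space that follows it.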

Parse numbers and parentheses from a String?

Given a String containing numbers (possibly with decimals), parentheses and any amount of whitespace, I need to iterate through the String and handle each number and parenthesis.
The below works for the String "1 ( 2 3 ) 4", but does not work if I remove the whitespace between the parentheses and the numbers, as in "1 (2 3) 4".
Scanner scanner = new Scanner(expression);
while (scanner.hasNext()) {
    String token = scanner.next();
    // handle token ...
    System.out.println(token);
}

Scanner uses whitespace as its default delimiter. You can change this to use a different regex pattern, for example:
(?:\\s+)|(?<=[()])|(?=[()])
This pattern splits on one or more whitespace characters, and also at the zero-width positions just before or just after a parenthesis. Because the lookbehind and lookahead consume nothing, the parentheses themselves come back as tokens (as I think you want to include those in your parsing?), while the whitespace is discarded.
Here is an example of using this:
String test = "123(3 4)56(7)";
Scanner scanner = new Scanner(test);
scanner.useDelimiter("(?:\\s+)|(?<=[()])|(?=[()])");
while (scanner.hasNext()) {
    System.out.println(scanner.next());
}
Output:
123
(
3
4
)
56
(
7
)
Detailed Regex Explanation:
(?:\\s+)|(?<=[()])|(?=[()])
1st Alternative: (?:\\s+)
(?:\\s+) Non-capturing group
\\s+ match any white space character [\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
2nd Alternative: (?<=[()])
(?<=[()]) Positive Lookbehind - Assert that the regex below can be matched
[()] match a single character present in the list below
() a single character in the list () literally
3rd Alternative: (?=[()])
(?=[()]) Positive Lookahead - Assert that the regex below can be matched
[()] match a single character present in the list below
() a single character in the list () literally
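An alternative to changing the Scanner delimiter is to match the tokens directly, which also makes the decimal numbers mentioned in the question explicit. This is a sketch under that assumption, not the answer's method; the class name ParenTokenizer and the helper tokenize() are hypothetical, and any character that is neither part of a number nor a parenthesis is silently skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParenTokenizer {
    // A token is a number with an optional decimal part, or a single parenthesis.
    private static final Pattern TOKEN = Pattern.compile("\\d+(?:\\.\\d+)?|[()]");

    // Hypothetical helper: returns the tokens in order, ignoring whitespace.
    static List<String> tokenize(String expression) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(expression);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1 (2.5 3) 4")); // [1, (, 2.5, 3, ), 4]
    }
}
```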

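Scanner's .next() method uses whitespace as its delimiter. Luckily, we can change the delimiter!
For example, if you need the scanner to treat both whitespace and parentheses as delimiters, you could run this code immediately after constructing your Scanner:
scanner.useDelimiter("[\\s()]+");
Note that the delimiter is a regex, so the parentheses have to sit inside a character class, and that the parentheses are then skipped rather than returned as tokens.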

Regular Expression (RegEx) for User Name in Java

How to form the RegEx of user name string in Java?
Rules in Exercise :
Only 3 - 10 characters.
Only 'a'-'z', 'A'-'Z', '1'-'9', '_' and '.' are allowed.
'_' and '.' can only appear 0 to 2 times in total.
"abc_._" = false
"abc..." = false
"abc__" = true
"abc.." = true
"abc_." = true
It would be easier without regex.
Leaving '1'-'9' aside for now, I have tried the following regexes, but they do not work.
String username_regex = "[a-zA-Z||[_||.]{0,2}]{3,10}";
String username_regex = "[a-zA-Z]{3,10}||[_||.]{0,2}";
My function :
public static boolean isUserNameCorrect(String user_name) {
    String username_regex = "[a-zA-Z||[_]{0,2}]{3,10}";
    boolean isMatch = user_name.matches(username_regex);
    return isMatch;
}
What RegEx should I use?

A single regex can satisfy all three requirements (for example with a lookahead), but it quickly becomes hard to read. So I would make separate checks for each condition. For example, this regex checks conditions 1 and 2, and condition 3 is checked separately.
private static final Pattern usernameRegex = Pattern.compile("[a-zA-Z1-9._]{3,10}");
public static boolean isUserNameCorrect(String userName) {
    boolean isMatch = usernameRegex.matcher(userName).matches();
    // '_' and '.' share one budget of two, so "abc_._" is rejected
    return isMatch && countChar(userName, '.') + countChar(userName, '_') <= 2;
}
public static int countChar(String s, char c) {
    int count = 0;
    int index = s.indexOf(c, 0);
    while (index >= 0) {
        count++;
        index = s.indexOf(c, index + 1);
    }
    return count;
}
BTW, notice the precompiled Pattern constant, which lets you reuse a regex in Java (a performance gain, because compiling a regex is expensive).
Counting a bounded number of occurrences is still within the power of regular expressions; it is only unbounded counting or nesting that needs a context-free grammar. The separate checks are simply easier to read.

First off, || isn't necessary for this problem, and in fact doesn't do what you think it does. In regex, alternation uses a single | (for example, to match Hello or World you'd write (Hello|World) or (?:Hello|World)); a doubled || just adds an empty alternative.
Next, let me explain why each of the regex you have tried won't work.
String username_regex = "[a-zA-Z||[_||.]{0,2}]{3,10}";
Quantifiers such as {0,2} inside a character class aren't interpreted as quantifiers; they just represent the literal characters that make them up. In addition, nested character classes are simply combined. So this is effectively equal to:
String username_regex = "[a-zA-Z_|.{0,2}]{3,10}";
So it'll match some combination of 3-10 of the following: a-z, A-Z, 0, 2, {, }, the comma, ., |, and _.
And that's not what you wanted.
String username_regex = "[a-zA-Z]{3,10}||[_||.]{0,2}";
Outside a character class, | is alternation, so || introduces an empty alternative. With matches(), this accepts 3 to 10 of a-z or A-Z, or the empty string, or 0 to 2 characters from _, | and .. Also not what you wanted.
The easy way to do this is by splitting the requirements into two sections and creating two regex strings based off of those:
Only 3 - 10 characters, where only 'a'-'z', 'A'-'Z', '1'-'9', '_' and '.' are allowed.
'_' and '.' can only appear 0 to 2 times.
The first requirement is quite simple: we just need to create a character class including all valid characters and place limits on how many of those can appear:
"[a-zA-Z1-9_.]{3,10}"
Then I would validate that '_' and '.' appear at most 2 times in total, by rejecting any name that matches:
".*[._].*[._].*[._].*"
or by requiring a match of:
"[^._]*(?:[._][^._]*){0,2}" // Preferable to the regex above if the limit needs to be easy to configure.
I'm unfortunately not experienced enough to figure out what a single regex would look like... But these are at least quite readable.

May not be elegant but you may try this:
^(([A-Za-z0-9\._])(?!.*[\._].*[\._].*[\._])){3,10}$
Here is the explanation:
NODE                   EXPLANATION
--------------------------------------------------------------------------------
^                      the beginning of the string
(                      group and capture to \1 (between 3 and 10 times (matching the most amount possible)):
  (                    group and capture to \2:
    [A-Za-z0-9\._]     any character of: 'A' to 'Z', 'a' to 'z', '0' to '9', '\.', '_'
  )                    end of \2
  (?!                  look ahead to see if there is not:
    .*                 any character except \n (0 or more times (matching the most amount possible))
    [\._]              any character of: '\.', '_'
    .*                 any character except \n (0 or more times (matching the most amount possible))
    [\._]              any character of: '\.', '_'
    .*                 any character except \n (0 or more times (matching the most amount possible))
    [\._]              any character of: '\.', '_'
  )                    end of look-ahead
){3,10}                end of \1 (NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
$                      before an optional \n, and the end of the string
--------------------------------------------------------------------------------
This will satisfy your above-mentioned requirement. Hope it helps :)
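One caveat, as far as I can tell: because the lookahead in the pattern above runs only after each consumed character, a leading '.' or '_' is not counted, so a name such as "_._a" would slip through. The sketch below is a variant that places a single lookahead at the start of the string instead and uses '1'-'9' as the rules state; the class name UserNameCheck and the method isUserNameCorrect are illustrative, not taken from the answer.

```java
import java.util.regex.Pattern;

final class UserNameCheck {
    // (?!(?:[^._]*[._]){3}) rejects any name containing three or more of '.' and '_';
    // the character class and {3,10} enforce the allowed characters and the length.
    private static final Pattern USERNAME =
        Pattern.compile("^(?!(?:[^._]*[._]){3})[a-zA-Z1-9._]{3,10}$");

    static boolean isUserNameCorrect(String userName) {
        return USERNAME.matcher(userName).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUserNameCorrect("abc_._")); // false
        System.out.println(isUserNameCorrect("abc_."));  // true
    }
}
```

Compiling the pattern once as a constant keeps repeated validations cheap.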

Please try this:
[a-zA-Z0-9]*[._]?[a-zA-Z0-9]*[._]?[a-zA-Z0-9]*
Niko
EDIT :
You're right. Then use several regexes:
Regex1: ^[\w.]{3,10}$
Regex2: ^[a-zA-Z0-9]*[_.]?[a-zA-Z0-9]*[_.]?[a-zA-Z0-9]*$
I hope I forgot nothing!
