Split String using Pattern and Matcher until first occurrence of ','


I want to split the string below into three parts:
(1) Number
(2) String until the first occurrence of ','
(3) Rest of the string
For example, if the string is "12345 - electricity, flat no 1106 , Palash H , Pune"
the three parts should be
(1) 12345
(2) electricity
(3) flat no 1106 , Palash H , Pune
I am able to split it into 12345 and the rest of the string using the code below, but I am not able to separate the 2nd and 3rd parts as required.
Map<String, String> strParts = new HashMap<String, String>();
String text = "12345 - electricity, flat no 1106 , Palash 2E , Pune";
Pattern pttrnCrs = Pattern.compile("(.*)\\s\\W\\s(.*)");
Matcher matcher = pttrnCrs.matcher(text);
if (matcher.matches()) {
    strParts.put("NUM", matcher.group(1));
    strParts.put("REST", matcher.group(2));
}
Can anyone help?

You need to use a regex with 3 capturing groups:
^(\d+)\W*([^,]+)\h*,\h*(.*)$
In Java use:
final String regex = "(\\d+)\\W*([^,]+)\\h*,\\h*(.*)";
There is no need for anchors in Java if you use the Matcher#matches() method, which implicitly anchors the regex at both ends.
RegEx Breakup:
^ # start
(\d+) # match and group 1+ digits in group #1
\W* # match 0 or more non-word characters
([^,]+) # match and group 1+ characters that are not a comma in group #2
\h*,\h* # match a comma surrounded by optional horizontal whitespace
(.*) # match and group remaining characters in string in group #3
$ # end
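
To tie the pattern back to the map from the question, here is a minimal sketch. The class name, the helper method and the "NAME" key are made up for illustration; trim() is added because [^,]+ also swallows any spaces that sit directly before the comma.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitAtFirstComma {
    // Three groups: leading digits, text up to the first comma, the remainder.
    private static final Pattern PARTS =
            Pattern.compile("(\\d+)\\W*([^,]+)\\h*,\\h*(.*)");

    // Returns NUM / NAME / REST entries, or an empty map when the input does not match.
    // The key names are illustrative only.
    static Map<String, String> split(String text) {
        Map<String, String> parts = new LinkedHashMap<>();
        Matcher m = PARTS.matcher(text);
        if (m.matches()) { // matches() anchors the pattern at both ends
            parts.put("NUM", m.group(1));
            parts.put("NAME", m.group(2).trim());
            parts.put("REST", m.group(3));
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "12345 - electricity, flat no 1106 , Palash H , Pune";
        split(text).forEach((k, v) -> System.out.println(k + " = [" + v + "]"));
    }
}
```

If the input has no leading digits or no comma at all, matches() returns false and the map stays empty, so callers should check for that case.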

Related

Tokenize Words separated by non-word characters except single quote

I have the following method I'm trying to implement. It parses the input into “word tokens”: sequences of word characters separated by non-word characters. However, non-word characters can become part of a token if they are quoted (in single quotes).
I want to use a regex but have trouble getting my code just right:
public static List<String> wordTokenize(String input) {
    Pattern pattern = Pattern.compile("\\b(?:(?<=\')[^\']*(?=\')|\\w+)\\b");
    Matcher matcher = pattern.matcher(input);
    List<String> ans = new ArrayList<>();
    while (matcher.find()) {
        ans.add(matcher.group());
    }
    return ans;
}
My regex fails to recognize that a quote appearing in the middle of a word, with no space before it, does not start a new token. Examples:
The input: this-string 'has only three tokens' // works
The input:
"this*string'has only two#tokens'"
Expected :[this, stringhas only two#tokens]
Actual :[this, string, has only two#tokens]
The input: "one'two''three' '' four 'twenty-one'"
Expected :[onetwothree, , four, twenty-one]
Actual :[one, two, three, four, twenty-one]
How do I fix the spaces?

You want to match one or more occurrences of a word char or a substring between the closest single straight apostrophes, and remove all those apostrophes from the tokens.
Use the following regex and .replace("'", "") on the matches:
(?:\w|'[^']*')+
Details:
(?: - start of a non-capturing group
\w - a word char
| - or
' - a straight single quotation mark
[^']* - any 0+ chars other than a straight single quotation mark
' - a straight single quotation mark
)+ - end of the group, 1+ occurrences.
In Java:
// String s = "this*string'has only two#tokens'"; // => [this, stringhas only two#tokens]
String s = "one'two''three' '' four 'twenty-one'"; // => [onetwothree, , four, twenty-one]
Pattern pattern = Pattern.compile("(?:\\w|'[^']*')+", Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = pattern.matcher(s);
List<String> tokens = new ArrayList<>();
while (matcher.find()){
tokens.add(matcher.group(0).replace("'", ""));
}
Note that Pattern.UNICODE_CHARACTER_CLASS is passed so that \w matches all Unicode letters and digits, not only ASCII ones.
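
Wrapped into the wordTokenize method from the question, the approach could look like the sketch below. The class name is made up for illustration; the pattern and the quote-stripping step are the ones described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordTokenizer {
    // One token = a run of word chars and/or 'quoted' substrings with nothing between them.
    private static final Pattern TOKEN =
            Pattern.compile("(?:\\w|'[^']*')+", Pattern.UNICODE_CHARACTER_CLASS);

    static List<String> wordTokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            // Drop the quotes themselves; the quoted content stays in the token.
            tokens.add(matcher.group().replace("'", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(wordTokenize("this*string'has only two#tokens'"));
        System.out.println(wordTokenize("one'two''three' '' four 'twenty-one'"));
    }
}
```

One thing to keep in mind: an unpaired quote cannot be matched by '[^']*', so it simply acts as a separator. For example, "don't" would come out as the two tokens don and t.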

Match all occurrences Regex Java

I'd like to recognize all "word-number-word" sequences in a string with the Java regex API.
For example, given "ABC-122-JDHFHG-456-MKJD", I'd like the output [ABC-122-JDHFHG, JDHFHG-456-MKJD].
String test = "ABC-122-JDHFHG-456-MKJD";
Matcher m = Pattern.compile("(([A-Z]+)-([0-9]+)-([A-Z]+))+")
        .matcher(test);
while (m.find()) {
    System.out.println(m.group());
}
The code above returns only "ABC-122-JDHFHG".
Any ideas?

The last ([A-Z]+) matches and consumes JDHFHG, so the regex engine only "sees" -456-MKJD after the first match, and the pattern does not match this string remainder.
You want to get "whole word" overlapping matches.
Use
String test = "ABC-122-JDHFHG-456-MKJD";
Matcher m = Pattern.compile("(?=\\b([A-Z]+-[0-9]+-[A-Z]+)\\b)")
        .matcher(test);
while (m.find()) {
    System.out.println(m.group(1));
} // prints ABC-122-JDHFHG, then JDHFHG-456-MKJD
Pattern details
(?= - start of a positive lookahead that matches a position that is immediately followed with
\\b - a word boundary
( - start of a capturing group (to be able to grab the value you need)
[A-Z]+ - 1+ ASCII uppercase letters
- - a hyphen
[0-9]+ - 1+ digits
- - a hyphen
[A-Z]+ - 1+ ASCII uppercase letters
) - end of the capturing group
\\b - a word boundary
) - end of the lookahead construct.
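
As a self-contained sketch of the lookahead approach, the helper below collects the overlapping matches into a list. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingTriples {
    // The whole triple sits inside a lookahead, so each match is zero-width
    // and the next search can start inside the previous triple.
    private static final Pattern TRIPLE =
            Pattern.compile("(?=\\b([A-Z]+-[0-9]+-[A-Z]+)\\b)");

    static List<String> findAll(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TRIPLE.matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // group() itself is always empty here
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findAll("ABC-122-JDHFHG-456-MKJD"));
    }
}
```

Because nothing is consumed, the matcher retries at every following position; the word boundaries are what stop it from reporting partial words such as BC-122-JDHFHG.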

Here you go: overlap the last word.
Build an array out of capture group 1.
Basically, find 3, consume 2. This makes the next match start
at the next possible known word.
(?=(([A-Z]+-\d+-)[A-Z]+))\2
https://regex101.com/r/Sl5FgT/1
Formatted
(?= # Assert to find
( # (1 start), word,num,word
( # (2 start), word,num
[A-Z]+
-
\d+
-
) # (2 end)
[A-Z]+
) # (1 end)
)
\2 # Consume word,num
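
In Java the same "find 3, consume 2" idea could be sketched as follows. The class and method names are made up for illustration; the pattern is the one above with the backslashes doubled for a string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindThreeConsumeTwo {
    // Group 1 = word-num-word (seen via lookahead), group 2 = word-num- (then consumed by \2).
    private static final Pattern TRIPLE =
            Pattern.compile("(?=(([A-Z]+-\\d+-)[A-Z]+))\\2");

    static List<String> findAll(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TRIPLE.matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // the full triple; only group 2 was consumed
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findAll("ABC-122-JDHFHG-456-MKJD"));
    }
}
```

Since each match consumes the leading word and number, the next search begins exactly at the shared word, so no word boundaries are needed for this input.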

Parse string using Java Regex Pattern?

I have a Java string in the format below.
String s = "City: [name:NYK][distance:1100] [name:CLT][distance:2300] [name:KTY][distance:3540] Price:";
Using the Matcher and Pattern classes from the java.util.regex package, I have to produce an output string in the following format:
Output: [NYK:1100][CLT:2300][KTY:3540]
Can you suggest a RegEx pattern which can help me get the above output format?

You can use the regex \[name:([A-Z]+)\]\[distance:(\d+)\] with Pattern like this:
String regex = "\\[name:([A-Z]+)\\]\\[distance:(\\d+)\\]";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
StringBuilder result = new StringBuilder();
while (matcher.find()) {
    result.append("[");
    result.append(matcher.group(1));
    result.append(":");
    result.append(matcher.group(2));
    result.append("]");
}
System.out.println(result.toString());
Output
[NYK:1100][CLT:2300][KTY:3540]
\[name:([A-Z]+)\]\[distance:(\d+)\] captures two groups: the uppercase letters after [name: and the digits after [distance:.
Another solution, from @tradeJmark, uses named groups:
String regex = "\\[name:(?<name>[A-Z]+)\\]\\[distance:(?<distance>\\d+)\\]";
This way you can read each group by its name instead of its index:
while (matcher.find()) {
    result.append("[");
    result.append(matcher.group("name"));     // looked up by group name
    result.append(":");
    result.append(matcher.group("distance")); // looked up by group name
    result.append("]");
}
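
Put together as a runnable sketch (the class and method names are hypothetical), the named-group variant could look like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CityDistanceFormatter {
    // Named groups keep the loop readable if the pattern changes later.
    private static final Pattern ENTRY = Pattern.compile(
            "\\[name:(?<name>[A-Z]+)\\]\\[distance:(?<distance>\\d+)\\]");

    // Rewrites every [name:X][distance:N] pair as [X:N] and drops everything else.
    static String format(String input) {
        StringBuilder result = new StringBuilder();
        Matcher matcher = ENTRY.matcher(input);
        while (matcher.find()) {
            result.append('[')
                  .append(matcher.group("name"))
                  .append(':')
                  .append(matcher.group("distance"))
                  .append(']');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String s = "City: [name:NYK][distance:1100] [name:CLT][distance:2300] "
                + "[name:KTY][distance:3540] Price:";
        System.out.println(format(s));
    }
}
```

Unlike a fixed three-block pattern, this loop handles any number of pairs, and returns an empty string when none are found.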

If the format of the string is fixed and you always have exactly 3 [name:...][distance:...] pairs to deal with, you can define a block pattern that matches one pair and captures its 2 parts into separate groups, then use fairly simple code with .replaceAll:
String s = "City: [name:NYK][distance:1100] [name:CLT][distance:2300] [name:KTY][distance:3540] Price:";
String matchingBlock = "\\s*\\[name:([A-Z]+)]\\[distance:(\\d+)]";
String res = s.replaceAll(String.format(".*%1$s%1$s%1$s.*", matchingBlock),
"[$1:$2][$3:$4][$5:$6]");
System.out.println(res); // [NYK:1100][CLT:2300][KTY:3540]
The block pattern matches:
\\s* - 0+ whitespaces
\\[name: - a literal [name: substring
([A-Z]+) - Group n capturing 1 or more uppercase ASCII chars (\\w+ can also be used)
]\\[distance: - a literal ][distance: substring
(\\d+) - Group m capturing 1 or more digits
] - a ] symbol.
In the .*%1$s%1$s%1$s.* pattern, the groups will have 1 to 6 IDs (referred to with $1 - $6 backreferences from the replacement pattern) and the leading and final .* will remove start and end of the string (add (?s) at the start of the pattern if the string can contain line breaks).

Java: Extracting a specific REGEXP pattern out of a string

How can I extract only a time part of the form XX:YY out of a string?
For example, from a string like
sdhgjhdgsjdf12:34knvxjkvndf, I would like to extract only 12:34.
(The surrounding chars can be spaces too, of course.)
Of course I can find the colon and take the two chars before and the two chars after it, but that is ugly.

You can use this look-around based regex for your match:
(?<!\d)\d{2}:\d{2}(?!\d)
In Java:
Pattern p = Pattern.compile("(?<!\\d)\\d{2}:\\d{2}(?!\\d)");
RegEx Breakup:
(?<!\d) # negative lookbehind to assert previous char is not a digit
\d{2} # match exact 2 digits
: # match a colon
\d{2} # match exact 2 digits
(?!\d) # negative lookahead to assert next char is not a digit
Full Code:
Pattern p = Pattern.compile("(?<!\\d)\\d{2}:\\d{2}(?!\\d)");
Matcher m = p.matcher(inputString);
if (m.find()) {
    System.err.println("Time: " + m.group());
}
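
A self-contained sketch of the same idea follows. The class and method names are made up, and returning an Optional is just one way to signal "no time found".

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimeExtractor {
    // Exactly two digits, a colon, exactly two digits, with no digit on either side.
    private static final Pattern TIME =
            Pattern.compile("(?<!\\d)\\d{2}:\\d{2}(?!\\d)");

    // Returns the first XX:YY occurrence, or an empty Optional if there is none.
    static Optional<String> findTime(String input) {
        Matcher m = TIME.matcher(input);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(findTime("sdhgjhdgsjdf12:34knvxjkvndf").orElse("none"));
        System.out.println(findTime("123:45").orElse("none"));
    }
}
```

The lookarounds are what reject inputs such as 123:45 or 12:345, where a plain \d{2}:\d{2} would still find a two-digit pair inside the longer number. Note that the pattern checks only the shape, so a value like 99:99 is accepted as well.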

RegEx: Matching n-char long sequence of repeating character

I want to split off the leading part of a text string that might look like this:
(((Hello! --> ((( and Hello!
or
########No? --> ######## and No?
At the beginning I have the same special character repeated n times, and I want to match the longest possible sequence.
What I have at the moment is this regex:
([^a-zA-Z0-9])\\1+([a-zA-Z].*)
For the first example this returns
( (only 1 time) and Hello!
and for the second
# and No?
How do I tell the regex that I want the longest possible repetition of the matching character?
I am using the regex as part of a Java program, in case this matters.

I suggest the following solution with 2 regexps: use (?s)(\\W)\\1+\\w.* to check whether the string starts with the same non-word symbol repeated, and if it does, split with a simple (?<=\\W)(?=\\w) pattern (the position between a non-word and a word character). Otherwise, just return a list containing the whole string (as if not split):
String ptrn = "(?<=\\W)(?=\\w)";
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
    if (str.matches("(?s)(\\W)\\1+\\w.*")) {
        System.out.println(Arrays.toString(str.split(ptrn)));
    } else {
        System.out.println(Arrays.asList(str));
    }
}
Result:
[(((, Hello!]
[########, No?]
[$%^&^Hello!]
Also, your original regex can be modified to fit the requirement like this:
String ptrn = "(?s)((\\W)\\2+)(\\w.*)";
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
    Pattern p = Pattern.compile(ptrn);
    Matcher m = p.matcher(str);
    if (m.matches()) {
        System.out.println(Arrays.asList(m.group(1), m.group(3)));
    } else {
        System.out.println(Arrays.asList(str));
    }
}
That regex matches:
(?s) - DOTALL inline modifier (if the string has newline characters, .* will also match them).
((\\W)\\2+) - Group 1 captures the whole run: a non-word character (captured into Group 2) followed by 1 or more repetitions of that same character (via the backreference \2).
(\\w.*) - matches and captures into Group 3 a word character and then zero or more characters.
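
As a reusable sketch of the single-regex variant, the helper below returns the two parts as a list. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedPrefixSplitter {
    // Group 1 = the full run of one repeated non-word char, group 3 = the rest.
    private static final Pattern PREFIXED =
            Pattern.compile("(?s)((\\W)\\2+)(\\w.*)");

    // Returns [prefix, rest] when the input starts with a repeated symbol,
    // otherwise a one-element list holding the unchanged input.
    static List<String> split(String input) {
        Matcher m = PREFIXED.matcher(input);
        if (m.matches()) {
            return Arrays.asList(m.group(1), m.group(3));
        }
        return Collections.singletonList(input);
    }

    public static void main(String[] args) {
        for (String s : Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!")) {
            System.out.println(split(s));
        }
    }
}
```

Because \2+ needs at least one repetition, a string with a single leading symbol such as "(Hello!" is returned unsplit; changing \2+ to \2* would cover that case if it is wanted.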
