Java string split with regular expressions - java

I am far from mastering regular expressions, but I would like to split a string on its first and last underscore using a regular expression, e.g. split
"hello_5_9_2018_world"
into
"hello"
"5_9_2018"
"world"
I can split it on the last underscore with
String[] splitArray = subjectString.split("_(?=[^_]*$)");
but I am not able to figure out how to split on first underscore.
Could anyone show me how I can do this?
Thanks
David

You can achieve this without regex by finding the first and last index of _ and taking substrings based on them.
String s = "hello_5_9_2018_world";
int firstIndex = s.indexOf("_");
int lastIndex = s.lastIndexOf("_");
System.out.println(s.substring(0, firstIndex));
System.out.println(s.substring(firstIndex + 1, lastIndex));
System.out.println(s.substring(lastIndex + 1));
The above prints
hello
5_9_2018
world
Note:
If the string does not contain two underscores, you will get a StringIndexOutOfBoundsException.
To safeguard against it, check that the extracted indices are valid:
If firstIndex and lastIndex are both -1, the string does not have any underscores.
If firstIndex == lastIndex (and neither is -1), the string has just one underscore.
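The validity check described in the note can be sketched as follows. The class name UnderscoreSplit and the method splitFirstLast are made up for illustration, and returning null for input with fewer than two underscores is only one possible policy.

```java
import java.util.Arrays;

public class UnderscoreSplit {
    // Hypothetical helper: splits on the first and last underscore,
    // or returns null when the input has fewer than two underscores.
    static String[] splitFirstLast(String s) {
        int first = s.indexOf('_');
        int last = s.lastIndexOf('_');
        if (first == -1 || first == last) {
            return null; // zero or one underscore: no three parts to extract
        }
        return new String[] {
            s.substring(0, first),
            s.substring(first + 1, last),
            s.substring(last + 1)
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitFirstLast("hello_5_9_2018_world")));
        System.out.println(Arrays.toString(splitFirstLast("hello_world")));
    }
}
```

A caller that prefers an exception over null could throw an IllegalArgumentException in the guard instead.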

If you have always three parts as above, you can use
([^_]*)_(.*)_([^_]*)
and get the single elements as groups.
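A minimal sketch of reading the three parts as groups, assuming the input has at least two underscores. The pattern used here is ([^_]*)_(.*)_([^_]*), with a negated character class in the last group. The class name GroupSplit and the method parts are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupSplit {
    // Group 1 ends at the first underscore; group 3 starts after the last one.
    private static final Pattern PARTS = Pattern.compile("([^_]*)_(.*)_([^_]*)");

    // Hypothetical helper: returns the three parts, or an empty array on no match.
    static String[] parts(String s) {
        Matcher m = PARTS.matcher(s);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parts("hello_5_9_2018_world")));
    }
}
```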

Regular Expression
(?<first>[^_]+)_(?<middle>.+)_(?<last>[^_]+)
Java Code
final String str = "hello_5_9_2018_world";
Pattern pattern = Pattern.compile("(?<first>[^_]+)_(?<middle>.+)_(?<last>[^_]+)");
Matcher matcher = pattern.matcher(str);
if(matcher.matches()) {
String first = matcher.group("first");
String middle = matcher.group("middle");
String last = matcher.group("last");
}

I see that a lot of guys provided their solution, but I have another regex pattern for your question
You can achieve your goal with this pattern:
"([a-zA-Z]+)_(.*)_([a-zA-Z]+)"
The whole code looks like this:
String subjectString= "hello_5_9_2018_world";
Pattern pattern = Pattern.compile("([a-zA-Z]+)_(.*)_([a-zA-Z]+)");
Matcher matcher = pattern.matcher(subjectString);
if(matcher.matches()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println(matcher.group(3));
}
It outputs:
hello
5_9_2018
world

While the other answers are actually nicer and better, if you really want to use split, this is the way to go:
"hello_5_9_2018_world".split("((?<=^[^_]*)_)|(_(?=[^_]*$))")
==> String[3] { "hello", "5_9_2018", "world" }
This is a combination of your lookahead pattern (_(?=[^_]*$))
and the symmetrical lookbehind pattern ((?<=^[^_]*)_),
which matches the _ preceded by ^ (start of the string) and [^_]* (0..n non-underscore chars).

Related

How to parse string using regex

I'm pretty new to java, trying to find a way to do this better. Potentially using a regex.
String text = test.get(i).toString();
// text looks like this in string form:
// EnumOption[enumId=test,id=machine]
String checker = text.replace("[","").replace("]","").split(",")[1].split("=")[1];
// checker becomes machine
My goal is to parse that text string and just return back machine. Which is what I did in the code above.
But that looks ugly. I was wondering what kinda regex can be used here to make this a little better? Or maybe another suggestion?
Use a regex lookbehind:
(?<=\bid=)[^],]*
(?<= ) // Start matching only after what matches inside
\bid= // Match "\bid=" (= word boundary then "id="),
[^],]* // Match and keep the longest sequence without any ']' or ','
In Java, use it like this:
import java.util.regex.*;
class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(?<=\\bid=)[^],]*");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(0));
}
}
}
This results in
machine
Assuming you’re using the Polarion ALM API, you should use the EnumOption’s getId method instead of deparsing and re-parsing the value via a string:
String id = test.get(i).getId();
Using the replace and split functions doesn't take the structure of the data into account.
If you want to use a regex, you can just use a capturing group without any lookarounds, where enum can be any value except a ] and comma, and id can be any value except ].
The value of id will be in capture group 1.
\bEnumOption\[enumId=[^=,\]]+,id=([^\]]+)\]
Explanation
\bEnumOption Match EnumOption preceded by a word boundary
\[enumId= Match [enumId=
[^=,\]]+, Match 1+ times any char except = , and ]
id= Match literally
( Capture group 1
[^\]]+ Match 1+ times any char except ]
)\]
Pattern pattern = Pattern.compile("\\bEnumOption\\[enumId=[^=,\\]]+,id=([^\\]]+)\\]");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Output
machine
If there can be more comma separated values, you could also only match id making use of negated character classes [^][]* before and after matching id to stay inside the square bracket boundaries.
\bEnumOption\[[^][]*\bid=([^,\]]+)[^][]*\]
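In Java, escape the brackets inside the character classes, because an unescaped [ inside a class starts a nested class there:
String regex = "\\bEnumOption\\[[^\\]\\[]*\\bid=([^,\\]]+)[^\\]\\[]*\\]";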
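As a rough sketch, the id-only pattern can be applied to input with extra fields like this. The brackets inside the character classes are escaped here, since Java treats an unescaped [ inside a class as the start of a nested class. The class name EnumOptionId, the method extractId and the extra name=x field are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EnumOptionId {
    // Escaped brackets keep the classes portable to Java's regex dialect.
    private static final Pattern ID = Pattern.compile(
            "\\bEnumOption\\[[^\\]\\[]*\\bid=([^,\\]]+)[^\\]\\[]*\\]");

    // Hypothetical helper: returns the id value, or null if the input does not match.
    static String extractId(String text) {
        Matcher m = ID.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractId("EnumOption[enumId=test,id=machine]"));
        // The trailing "name=x" field is invented purely for illustration.
        System.out.println(extractId("EnumOption[enumId=test,id=machine,name=x]"));
    }
}
```

The word boundary before id keeps the pattern from matching the Id inside enumId, and the match is case-sensitive in any case.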
A regex can of course be used, but it is sometimes less performant, less readable and more bug-prone.
I would advise you to only use a regex that you came up with yourself, or at least understand completely.
PS: I think your solution is actually quite readable.
Here's another non-regex version:
String text = "EnumOption[enumId=test,id=machine]";
text = text.substring(text.lastIndexOf('=') + 1);
text = text.substring(0, text.length() - 1);
Not doing you a favor, but the downvote hurt, so here you go:
String input = "EnumOption[enumId=test,id=machine]";
Matcher matcher = Pattern.compile("EnumOption\\[enumId=(.+),id=(.+)\\]").matcher(input);
if(!matcher.matches()) {
throw new RuntimeException("unexpected input: " + input);
}
System.out.println("enumId: " + matcher.group(1));
System.out.println("id: " + matcher.group(2));

count substring occurrence in every combination

I have a string phahahahoto and I need to find how many times the string haha appears in it. If you look closely, it appears 2 times.
My code is below and I get the output 1 instead of 2.
Code is written in java.
Pattern pattern = Pattern.compile("haha");
Matcher matcher = pattern.matcher("phahahahoto");
int count = 0;
while (matcher.find()) {
count++;
}
System.out.println(count);
Use a lookahead in order to find overlapping matches. Notice that the occurrences of haha overlap in your string. If you pass haha as the regex, it won't find overlapping matches: the pattern consumes the first haha substring, which leaves only the last ha part to search. Lookarounds don't consume any characters, so (?=haha) matches only the position before each haha.
Pattern pattern = Pattern.compile("(?=haha)");
Matcher matcher = pattern.matcher("phahahahoto");
int count = 0;
while (matcher.find()) {
count++;
}
System.out.println(count);
Here it matches the boundary that exists before each haha.
You can get the count in one line like this also:
int count = "phahahahoto".split("(?=haha)").length - 1;
//=> 2
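The lookahead approach can be wrapped in a small reusable method. This is only a sketch: the class name OverlapCount and the method countOverlapping are hypothetical, and Pattern.quote is added so that the searched text is treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlapCount {
    // Hypothetical helper: counts occurrences of needle in text, overlaps included.
    static int countOverlapping(String text, String needle) {
        // Pattern.quote keeps any regex metacharacters in needle literal.
        Matcher m = Pattern.compile("(?=" + Pattern.quote(needle) + ")").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOverlapping("phahahahoto", "haha")); // 2
        System.out.println(countOverlapping("hahaha", "haha"));      // 2
    }
}
```

Counting with find also sidesteps an off-by-one in the split one-liner when the text starts with the searched string: on Java 8 and later, a zero-width match at index 0 does not produce a leading empty element, so the split-based count would come out one too low for an input such as hahaha.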

Regex to match 2 or more commas

I'm trying to write a regex that will identify whether a string has 2 or more consecutive commas. For example:
hello,,457
,,,,,
dog,,,elephant,,,,,
Can anyone help on what a valid regex would be?
String str ="hello,,,457";
Pattern pat = Pattern.compile("[,]{2,}");
Matcher matcher = pat.matcher(str);
if(matcher.find()){
System.out.println("contains 2 or more commas");
}
The regex below matches strings which have two or more consecutive commas:
^.*?,,+.*$
You don't need to include the start and end anchors when using the regex with the matches method.
System.out.println("dog,,,elephant,,,,,".matches(".*?,,+.*"));
Output:
true
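For comparison, here is a sketch of the same check with find instead of matches, run over the sample strings from the question. The class name CommaCheck, the method hasConsecutiveCommas and the extra a,b,c sample are assumptions for illustration.

```java
import java.util.regex.Pattern;

public class CommaCheck {
    private static final Pattern TWO_OR_MORE = Pattern.compile(",{2,}");

    // Hypothetical helper: true if s contains at least two commas in a row.
    static boolean hasConsecutiveCommas(String s) {
        // find() searches anywhere in the input, so no ".*" padding is needed.
        return TWO_OR_MORE.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = { "hello,,457", ",,,,,", "dog,,,elephant,,,,,", "a,b,c" };
        for (String s : samples) {
            System.out.println(s + " -> " + hasConsecutiveCommas(s));
        }
    }
}
```

Compiling the pattern once and reusing it avoids recompiling on every call, which String.matches does internally.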
Try:
int occurrences = StringUtils.countOccurrencesOf("dog,,,elephant,,,,,", ",,");
or
int count = StringUtils.countMatches("dog,,,elephant,,,,,", ",,");
depending on which library you use (the first is Spring's StringUtils, the second is from Apache Commons Lang).
Check the solution here: Java: How do I count the number of occurrences of a char in a String?

Find string after last underscore before dot extension

I need to find 20140809T0000Z in this string:
PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc
I tried the following to keep the string before the .nc:
(?<=_)(.*)(?=.nc)
I have the following to start from the last underscore:
/_[^_]*$/
How can I find string after last underscore before dot extension, using a regex?
RegEx is not always the best solution... :)
String pattern="PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";
int start=pattern.lastIndexOf("_") + 1;
int end=pattern.lastIndexOf(".");
if(start != 0 && end != -1 && end > start) {
System.out.println(pattern.substring(start, end));
}
You just need lookahead for this requirement.
You can use:
[^._]+(?=[^_]*$)
// matches and returns 20140809T0000Z
You could use the below regex,
(?<=_)[^_]*(?=\.nc)
In your pattern just replace .* with [^_]* so that it would match the inner string.
String s = "PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";
Pattern regex = Pattern.compile("(?<=_)[^_]*(?=\\.nc)");
Matcher regexMatcher = regex.matcher(s);
if (regexMatcher.find()) {
String ResultString = regexMatcher.group();
System.out.println(ResultString);
} //=> 20140809T0000Z
You could use a simpler pattern with a capturing group
.*_(.*)\.nc
By default the first .* will be "greedy" and consume as many characters as possible before the _, leaving just the desired string inside the (.*).
Demo: http://regex101.com/r/aI2xQ9/1
Java code:
String input = "PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";
Pattern pattern = Pattern.compile(".*_(.*)\\.nc");
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
String group = matcher.group(1);
// ...
}
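To see the effect of greediness described above, the following sketch runs the same pattern with a greedy and a reluctant leading quantifier. The class name GreedyDemo and the method firstGroup are hypothetical, and the reluctant variant is shown only for contrast.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Hypothetical helper: returns group 1 of the first match, or null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";
        // Greedy: the leading .* runs up to the last underscore.
        System.out.println(firstGroup(".*_(.*)\\.nc", input));
        // Reluctant: the leading .*? stops at the first underscore.
        System.out.println(firstGroup(".*?_(.*)\\.nc", input));
    }
}
```

The first call prints 20140809T0000Z, while the second prints F2-MARS3D-MENOR1200_20140809T0000Z.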
So, you need a sequence of non-underscore characters that immediately precede the period character.
Try [^_.]+(?=\.)
Demo: https://regex101.com/r/sLAnVs/2
Thanks to Cary Swoveland for pointing out that "no need to escape a period in a character class".
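In Java, that pattern might be used as in the sketch below. The class name LastToken and the method tokenBeforeDot are made up for illustration, and the sketch assumes the file name contains a single dot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastToken {
    private static final Pattern BEFORE_DOT = Pattern.compile("[^_.]+(?=\\.)");

    // Hypothetical helper: first run of non-underscore, non-dot characters
    // that is directly followed by a dot, or null if there is none.
    static String tokenBeforeDot(String fileName) {
        Matcher m = BEFORE_DOT.matcher(fileName);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(tokenBeforeDot("PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc"));
    }
}
```

With several dots in the name, find returns the first qualifying run rather than the one before the extension, so such input would need a stricter pattern.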

Why the string does not split?

While trying to split a string xyz213123kop234430099kpf4532 into tokens :
xyz213123
kop234430099
kpf4532
I wrote the following code
String s = "xyz213123kop234430099kpf4532";
String regex = "/^[a-zA-z]+[0-9]+$/";
String tokens[] = s.split(regex);
for(String t : tokens) {
System.out.println(t);
}
but instead of tokens, I get the whole string as one output. What is wrong with the regular expression I used ?
You can do that:
String s = "xyz213123kop234430099kpf4532";
String[] result = s.split("(?<=[0-9])(?=[a-z])");
The idea is to use zero-width assertions to find the places where to cut the string: a lookbehind (preceded by a digit [0-9]) and a lookahead (followed by a letter [a-z]).
These lookarounds are just checks and match nothing, thus the delimiter of the split is an empty string and no characters are removed from the result.
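A small sketch that makes the "no characters are removed" point visible: joining the tokens back together reproduces the input. The class name ZeroWidthSplit and the method tokens are hypothetical.

```java
import java.util.Arrays;

public class ZeroWidthSplit {
    // Hypothetical helper: cuts between a digit and a following lowercase letter.
    static String[] tokens(String s) {
        return s.split("(?<=[0-9])(?=[a-z])");
    }

    public static void main(String[] args) {
        String s = "xyz213123kop234430099kpf4532";
        String[] parts = tokens(s);
        System.out.println(Arrays.toString(parts));
        // The delimiter is empty, so the tokens concatenate back to the input.
        System.out.println(String.join("", parts).equals(s));
    }
}
```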
You could split at the boundary between a digit and a non-digit.
String s = "xyz213123kop234430099kpf4532";
String[] parts = s.split("(?<![^\\d])(?=\\D)");
for (String p : parts) {
System.out.println(p);
}
Output
xyz213123
kop234430099
kpf4532
There's nothing in your string that matches the regular expression, because your expression starts with ^ (beginning of string) and ends with $ (end of string). So it would either match the whole string, or nothing at all. (Also, Java has no /.../ regex literals, so the slashes in your pattern are treated as literal characters to match.) Because the pattern doesn't match the string, no delimiter is found when you split it into tokens. That's why you get just one big token.
You don't want to use split for that. The argument to split is the delimiter between tokens. You don't have that. Instead, you have a pattern that repeats and you want each match to the pattern. Try this instead:
String s = "xyz213123kop234430099kpf4532";
Pattern p = Pattern.compile("([a-zA-Z]+[0-9]+)");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group());
}
Output:
xyz213123
kop234430099
kpf4532
(I don't know by what logic you would have the second token be "3kop234430099" as in your posted question. I assume that the leading "3" is a typo.)
