How to match a string between two same delimiters? - java

some-string-test-moretext.csv
I want to extract the string test, which is always found between the 2nd and 3rd - delimiter.
The expression [-](.*?)[-] would match -string-. So it's probably close, but how can I move on to the next match?
If that matters, I'm using Java.

If you know the number of delimiters in advance, you can just split the String.
String[] test = {
"some-string-test-moretext.csv",
"another-string-test-andthensome.csv"
};
for (String s: test) {
System.out.println(s.split("-")[2]);
}
Output
test
test

This should give you quite a good head start:
[^-]+-[^-]+-(.*?)-[^-]+\.csv
https://regex101.com/r/YjWDkv/1
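For illustration, that pattern might be applied in Java roughly as follows. This is only a sketch: the class name ThirdFieldExtractor and the method thirdField are hypothetical names chosen for this example, and it assumes the file name has at least four hyphen-separated parts and ends in .csv.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThirdFieldExtractor {
    // Pattern from the answer above; group 1 is the text after the 2nd hyphen.
    private static final Pattern FIELD =
            Pattern.compile("[^-]+-[^-]+-(.*?)-[^-]+\\.csv");

    // Returns the captured field, or null if the name does not fit the pattern.
    static String thirdField(String fileName) {
        Matcher m = FIELD.matcher(fileName);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(thirdField("some-string-test-moretext.csv")); // test
        System.out.println(thirdField("no-delimiters.csv"));             // null
    }
}
```

Returning null for a non-matching name is a design choice of this sketch; throwing an exception or returning an Optional would work just as well.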

I would propose this short regex-based solution:
String str = "some-string-test-moretext.csv\n";
Matcher m = Pattern.compile("\\w+-\\w+-(\\w+).*").matcher(str);
String res = m.find() ? m.group(1) : "";
System.out.println(res);
Of course, String.split() is another way:
String res = str.split("-")[2];

In sed:
$ echo 'some-string-test-moretext.csv' | sed 's/[^-]*-[^-]*-\([^-]*\)-.*/\1/'
test
[^-]* means "zero or more occurrences of any char except -". Let's call that "notHyphen". So we're matching on notHyphen-notHyphen-\(notHyphen\)-.* and replacing the whole match with \1, that is, whatever is captured by the \(\).
In Java, you won't need to escape ( to \(, and the technique for extracting from capturing groups is different:
Pattern patt = Pattern.compile("[^-]*-[^-]*-([^-]*)-.*");
Matcher m = patt.matcher(filename);
String extracted = null;
if (m.matches()) {
extracted = m.group(1);
}

Related

How to parse string using regex

I'm pretty new to java, trying to find a way to do this better. Potentially using a regex.
String text = test.get(i).toString();
// text looks like this in string form:
// EnumOption[enumId=test,id=machine]
String checker = text.replace("[","").replace("]","").split(",")[1].split("=")[1];
// checker becomes machine
My goal is to parse that text string and just return back machine. Which is what I did in the code above.
But that looks ugly. I was wondering what kinda regex can be used here to make this a little better? Or maybe another suggestion?
Use a regex lookbehind:
(?<=\bid=)[^],]*
(?<= ) // Start matching only after what matches inside
\bid= // Match "\bid=" (= word boundary then "id="),
[^],]* // Match and keep the longest sequence without any ']' or ','
In Java, use it like this:
import java.util.regex.*;
class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(?<=\\bid=)[^],]*");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(0));
}
}
}
This results in
machine
Assuming you’re using the Polarion ALM API, you should use the EnumOption’s getId method instead of deparsing and re-parsing the value via a string:
String id = test.get(i).getId();
Using the replace and split functions doesn't take the structure of the data into account.
If you want to use a regex, you can just use a capturing group without any lookarounds, where enumId can be any value except =, a comma and ], and id can be any value except ].
The value of id will be in capture group 1.
\bEnumOption\[enumId=[^=,\]]+,id=([^\]]+)\]
Explanation
\bEnumOption Match EnumOption preceded by a word boundary
\[enumId= Match [enumId=
[^=,\]]+, Match 1+ times any char except =, a comma and ], then match a literal comma
id= Match literally
( Capture group 1
[^\]]+ Match 1+ times any char except ]
)\] Close group 1 and match ]
Pattern pattern = Pattern.compile("\\bEnumOption\\[enumId=[^=,\\]]+,id=([^\\]]+)\\]");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Output
machine
If there can be more comma separated values, you could also only match id making use of negated character classes [^][]* before and after matching id to stay inside the square bracket boundaries.
\bEnumOption\[[^][]*\bid=([^,\]]+)[^][]*\]
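In Java, escape the brackets inside the character class, because an unescaped [ there starts a nested character class:
String regex = "\\bEnumOption\\[[^\\]\\[]*\\bid=([^,\\]]+)[^\\]\\[]*\\]";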
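A minimal sketch of how the more general pattern could be used from Java. The class name EnumOptionIdParser, the method parseId and the extra name=foo field are hypothetical and only serve the example. Note that the brackets inside the character classes are escaped here, since Java's regex syntax treats an unescaped [ inside a class as the start of a nested class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EnumOptionIdParser {
    // Same idea as the pattern above, with [ and ] escaped inside the classes.
    private static final Pattern ID = Pattern.compile(
            "\\bEnumOption\\[[^\\]\\[]*\\bid=([^,\\]]+)[^\\]\\[]*\\]");

    // Returns the value of id, or null if the text is not in the expected form.
    static String parseId(String text) {
        Matcher m = ID.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parseId("EnumOption[enumId=test,id=machine]"));
        // Hypothetical input with an additional comma-separated value.
        System.out.println(parseId("EnumOption[enumId=test,id=machine,name=foo]"));
    }
}
```

Both calls should print machine, because the \b before id= prevents a match inside enumId.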
A regex can of course be used, but sometimes is less performant, less readable and more bug-prone.
I would advise you not to use any regex that you did not come up with yourself, or at least understand completely.
PS: I think your solution is actually quite readable.
Here's another non-regex version:
String text = "EnumOption[enumId=test,id=machine]";
text = text.substring(text.lastIndexOf('=') + 1);
text = text.substring(0, text.length() - 1);
Not doing you a favor, but the downvote hurt, so here you go:
String input = "EnumOption[enumId=test,id=machine]";
Matcher matcher = Pattern.compile("EnumOption\\[enumId=(.+),id=(.+)\\]").matcher(input);
if(!matcher.matches()) {
throw new RuntimeException("unexpected input: " + input);
}
System.out.println("enumId: " + matcher.group(1));
System.out.println("id: " + matcher.group(2));

Get a substring from string multiple times

I have a String whose length and characters I don't know in advance.
I want to search the string and get every substring found inside "".
I tried to use Pattern.compile, but it always returns an empty string:
Pattern p = Pattern.compile("\".\"");
Matcher m = p.matcher(mystring);
while(m.find()){
System.out.println(m.group().toString());
}
How can I do it?
Use .+? to get all characters inside the quotes:
Pattern p = Pattern.compile("\".+?\"");
The .+ specifies that you want one or more characters inside the quotation marks. The ? makes it a reluctant quantifier, which means it matches as few characters as possible, so each quoted section becomes a separate match instead of one match running from the first quote to the last.
Unit test example:
@Test
public void test() {
String test = "speak \"friend\" and \"enter\"";
Pattern p = Pattern.compile("\".+?\"");
Matcher m = p.matcher(test);
while(m.find()){
System.out.println(m.group().toString().replace("\"", ""));
}
}
Output:
friend
enter
That is because your regex actually searches for exactly one character between " and ". If you want to allow any number of characters, you should rewrite your regex to "\".*?\""
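As a sketch of the same idea, a capturing group can return the text without the surrounding quotes, so no replace call is needed afterwards. The class name QuotedStrings and the method findQuoted are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStrings {
    // Reluctant .*? stops at the next quote; group 1 is the text between the quotes.
    private static final Pattern QUOTED = Pattern.compile("\"(.*?)\"");

    // Collects the contents of every "..." section in the text.
    static List<String> findQuoted(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findQuoted("speak \"friend\" and \"enter\"")); // [friend, enter]
    }
}
```

Using .*? rather than .+? also accepts an empty pair of quotes; switch to .+? if empty strings should be skipped.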

multiple regex matches in a string

I have the following text:
bla [string1] bli [string2]
I would like to match string1 and string2 with a regex in a loop in Java.
How can I do that?
My code so far only matches the first string, string1, but not string2.
String sRegex="(?<=\\[).*?(?=\\])";
Pattern p = Pattern.compile(sRegex); // create the pattern only once,
Matcher m = p.matcher(sFormula);
if (m.find())
{
String sString1 = m.group(0);
String sString2 = m.group(1); // << no match
}
Your regex is not using any capturing groups, hence this call will throw an exception:
m.group(1);
You can just use:
String sRegex="(?<=\\[)[^]]*(?=\\])";
Pattern p = Pattern.compile(sRegex); // create the pattern only once,
Matcher m = p.matcher(sFormula);
while (m.find()) {
System.out.println( m.group() );
}
Also, the if should be replaced by a while so that find is called repeatedly and all matches are returned.
Your approach is confused. Either write your regex so that it matches two [....] sequences in the one pattern, or call find multiple times. Your current attempt has a regex that "finds" just one [...] sequence.
Try something like this:
Pattern p = Pattern.compile("\\[([^\\]]+)]");
Matcher m = p.matcher(formula);
if (m.find()) {
    String string1 = m.group(1); // group 1 is the text inside the brackets
    if (m.find()) { // continues searching after the previous match
        String string2 = m.group(1);
    }
}
Or generalize using a loop and an array of String for the extracted strings.
(You don't need any fancy look-behind patterns in this case. And ugly "Hungarian notation" is frowned upon in Java, so get out of the habit of using it.)
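The loop-based generalization could look roughly like this. It is a sketch that collects the matches into a List rather than an array; the class name BracketContents and the method extract are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketContents {
    // Group 1 captures one or more characters other than ] between [ and ].
    private static final Pattern BRACKETED = Pattern.compile("\\[([^\\]]+)\\]");

    // Returns the contents of every [...] sequence, in order of appearance.
    static List<String> extract(String formula) {
        List<String> found = new ArrayList<>();
        Matcher m = BRACKETED.matcher(formula);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("bla [string1] bli [string2]")); // [string1, string2]
    }
}
```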

Why the string does not split?

While trying to split a string xyz213123kop234430099kpf4532 into tokens :
xyz213123
kop234430099
kpf4532
I wrote the following code
String s = "xyz213123kop234430099kpf4532";
String regex = "/^[a-zA-z]+[0-9]+$/";
String tokens[] = s.split(regex);
for(String t : tokens) {
System.out.println(t);
}
but instead of tokens, I get the whole string as one output. What is wrong with the regular expression I used ?
You can do that:
String s = "xyz213123kop234430099kpf4532";
String[] result = s.split("(?<=[0-9])(?=[a-z])");
The idea is to use zero-width assertions to find the places where the string should be cut: a lookbehind (preceded by a digit [0-9]) and a lookahead (followed by a letter [a-z]).
These lookarounds are just checks and match nothing, thus the delimiter of the split is an empty string and no characters are removed from the result.
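To make the "no characters are removed" point concrete, the following sketch compares the lookaround split with a split on a pattern that consumes the digits. The class and method names are hypothetical.

```java
import java.util.Arrays;

class DigitLetterSplit {
    // Zero-width delimiter: cuts between a digit and a letter, removes nothing.
    static String[] splitKeepingDigits(String s) {
        return s.split("(?<=[0-9])(?=[a-z])");
    }

    // Consuming delimiter: the digit runs themselves are the delimiter and are dropped.
    static String[] splitDroppingDigits(String s) {
        return s.split("[0-9]+");
    }

    public static void main(String[] args) {
        String s = "xyz213123kop234430099kpf4532";
        System.out.println(Arrays.toString(splitKeepingDigits(s)));  // [xyz213123, kop234430099, kpf4532]
        System.out.println(Arrays.toString(splitDroppingDigits(s))); // [xyz, kop, kpf]
    }
}
```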
You could split at the boundary between a digit and a non-digit:
String s = "xyz213123kop234430099kpf4532";
String[] parts = s.split("(?<![^\\d])(?=\\D)");
for (String p : parts) {
System.out.println(p);
}
Output
xyz213123
kop234430099
kpf4532
There's nothing in your string that matches the regular expression. In Java, the / characters are not regex delimiters but literal slashes, and your string contains none. Even without them, the expression starts with ^ (beginning of string) and ends with $ (end of string), so it would either match the whole string or nothing at all. Because nothing matches, split finds no delimiter when it breaks the string into tokens. That's why you get just one big token.
You don't want to use split for that. The argument to split is the delimiter between tokens. You don't have that. Instead, you have a pattern that repeats and you want each match to the pattern. Try this instead:
String s = "xyz213123kop234430099kpf4532";
Pattern p = Pattern.compile("([a-zA-Z]+[0-9]+)");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group());
}
Output:
xyz213123
kop234430099
kpf4532
(I don't know by what logic you would have the second token be "3kop234430099" as in your posted question. I assume that the leading "3" is a typo.)

extract a substring in Java

I have the following string:
"hello this.is.a.test(MainActivity.java:47)"
and I want to be able to extract the MainActivity.java:47
(everything that is inside '(' and ')', and only the first occurrence).
I tried with regex but it seems that I am doing something wrong.
Thanks
You can do it yourself:
int pos1 = str.indexOf('(') + 1;
int pos2 = str.indexOf(')', pos1);
String result = str.substring(pos1, pos2);
Or you can use commons-lang which contains a very nice StringUtils class that has substringBetween()
I think a regex is a little bit of overkill here. I would use something like this:
String input = "hello this.is.a.test(MainActivity.java:47)";
String output = input.substring(input.lastIndexOf("(") + 1, input.lastIndexOf(")"));
This should work:
^[^\\(]*\\(([^\\)]+)\\)
The result is in the first group.
Another answer for your question :
String str = "hello this.is.a.test(MainActivity.java:47) another.test(MyClass.java:12)";
Pattern p = Pattern.compile("[a-z][\\w]+\\.java:\\d+", Pattern.CASE_INSENSITIVE);
Matcher m=p.matcher(str);
if(m.find()) {
System.out.println(m.group());
}
The RegExp explained :
[a-z][\w]+\.java:\d+
[a-z] > Check that we start with a letter ...
[\w]+ > ... followed by one or more letters, digits or underscores...
\.java: > ... followed exactly by the string ".java:"...
\d+ > ... ending by one or more digit(s)
Pseudo-code:
int p1 = location of '('
int p2 = location of ')', starting the search from p1
String s = extract string from p1 to p2
String.indexOf() and String.substring() are your friends.
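The pseudo-code above can be sketched in Java as follows. The class name ParenExtractor and the method firstInParens are hypothetical, and returning null when no pair of parentheses is found is an assumption of this example rather than part of the original answer.

```java
class ParenExtractor {
    // Returns the text inside the first (...) pair, or null if there is none.
    static String firstInParens(String s) {
        int open = s.indexOf('(');
        if (open < 0) {
            return null; // no opening parenthesis
        }
        int close = s.indexOf(')', open + 1); // search starts after the '('
        if (close < 0) {
            return null; // no closing parenthesis after the opening one
        }
        return s.substring(open + 1, close);
    }

    public static void main(String[] args) {
        System.out.println(firstInParens("hello this.is.a.test(MainActivity.java:47)"));
    }
}
```

The two index checks avoid the StringIndexOutOfBoundsException that substring would otherwise throw for input without parentheses.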
Try this:
String input = "hello this.is.a.test(MainActivity.java:47) (and some more text)";
Pattern p = Pattern.compile("[^\\)]*\\(([^\\)]*)\\).*");
Matcher m = p.matcher( input );
if(m.matches()) {
System.out.println(m.group( 1 )); //output: MainActivity.java:47
}
This also finds the first occurrence of text between ( and ) if there are more of them.
Note that Matcher.matches() requires the regex to match the entire input string, as if the expression were implicitly wrapped in ^ and $ (find() has no such requirement). Thus [^\\)]* at the beginning and .* at the end are necessary.
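To illustrate the difference between matching the entire input and searching within it, here is a minimal sketch using a shorter pattern without the leading and trailing parts. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    // Matches a single (...) pair; group 1 is the text inside.
    private static final Pattern PARENS = Pattern.compile("\\(([^)]*)\\)");

    // matches() succeeds only if the whole input is one (...) pair.
    static boolean wholeInputMatches(String input) {
        return PARENS.matcher(input).matches();
    }

    // find() searches for the first (...) pair anywhere in the input.
    static String firstGroupViaFind(String input) {
        Matcher m = PARENS.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "hello this.is.a.test(MainActivity.java:47)";
        System.out.println(wholeInputMatches(input));  // false
        System.out.println(firstGroupViaFind(input));  // MainActivity.java:47
    }
}
```

With find(), the padding around the group is unnecessary, which keeps the pattern shorter.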
