extract a substring in Java - java

I have the following string:
"hello this.is.a.test(MainActivity.java:47)"
and I want to be able to extract the MainActivity.java:47
(everything between '(' and ')', and only the first occurrence).
I tried with regex but it seems that I am doing something wrong.
Thanks

You can do it yourself:
int pos1 = str.indexOf('(') + 1;
int pos2 = str.indexOf(')', pos1);
String result = str.substring(pos1, pos2);
Or you can use commons-lang, which contains a very nice StringUtils class that has substringBetween().
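The indexOf approach above misbehaves when a parenthesis is missing (it either throws or returns the wrong text). A minimal sketch of a guarded version follows. The names ParenExtract and firstBetweenParens are made up for illustration, and returning null for "not found" is an assumption, not something the question asks for.

```java
class ParenExtract {
    // Returns the text between the first '(' and the next ')',
    // or null if either parenthesis is missing (assumed convention).
    static String firstBetweenParens(String str) {
        int open = str.indexOf('(');
        if (open < 0) {
            return null; // no opening parenthesis
        }
        int close = str.indexOf(')', open + 1);
        if (close < 0) {
            return null; // no closing parenthesis after the opening one
        }
        return str.substring(open + 1, close);
    }

    public static void main(String[] args) {
        System.out.println(firstBetweenParens("hello this.is.a.test(MainActivity.java:47)"));
    }
}
```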

I think regex is a bit of overkill. I would use something like this:
String input = "hello this.is.a.test(MainActivity.java:47)";
// Note: lastIndexOf picks the last '(' and ')'; use indexOf if you need the first pair.
String output = input.substring(input.lastIndexOf("(") + 1, input.lastIndexOf(")"));

This should work:
^[^\\(]*\\(([^\\)]+)\\)
The result is in the first group.
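For completeness, here is a sketch of how that pattern could be applied in Java. The class and method names (FirstGroupDemo, extract) are hypothetical, and returning null when nothing matches is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstGroupDemo {
    // Same pattern as above, written as a Java string literal.
    private static final Pattern FIRST_PARENS = Pattern.compile("^[^\\(]*\\(([^\\)]+)\\)");

    // Returns the content of the first (...) pair, or null if there is none.
    static String extract(String input) {
        Matcher m = FIRST_PARENS.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("hello this.is.a.test(MainActivity.java:47)"));
    }
}
```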

Another answer to your question:
String str = "hello this.is.a.test(MainActivity.java:47) another.test(MyClass.java:12)";
Pattern p = Pattern.compile("[a-z][\\w]+\\.java:\\d+", Pattern.CASE_INSENSITIVE);
Matcher m=p.matcher(str);
if(m.find()) {
System.out.println(m.group());
}
The regex explained:
[a-z][\w]+\.java:\d+
[a-z] > Check that we start with a letter ...
[\w]+ > ... followed by one or more letters, digits or underscores ...
\.java: > ... followed exactly by the string ".java:" ...
\d+ > ... ending with one or more digits

Pseudo-code:
int p1 = location of '('
int p2 = location of ')', starting the search from p1
String s = extract string from p1 to p2
String.indexOf() and String.substring() are your friends.

Try this:
String input = "hello this.is.a.test(MainActivity.java:47) (and some more text)";
Pattern p = Pattern.compile("[^\\)]*\\(([^\\)]*)\\).*");
Matcher m = p.matcher( input );
if(m.matches()) {
System.out.println(m.group( 1 )); //output: MainActivity.java:47
}
This also finds the first occurrence of text between ( and ) if there are more of them.
Note that Matcher.matches() behaves as if the expression were wrapped in ^ and $, i.e. the regex must match the entire input string. Thus [^\\)]* at the beginning and .* at the end are necessary. Matcher.find() has no such requirement.
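The whole-input requirement described above applies to Matcher.matches(); Matcher.find() searches for the pattern anywhere in the input. The sketch below contrasts the two on the same unanchored pattern. The class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    // Unanchored pattern: one (...) pair with its content in group 1.
    private static final Pattern PARENS = Pattern.compile("\\(([^\\)]*)\\)");

    // True only if the pattern covers the entire input.
    static boolean wholeInputMatches(String input) {
        return PARENS.matcher(input).matches();
    }

    // Returns group 1 of the first match found anywhere in the input, or null.
    static String firstViaFind(String input) {
        Matcher m = PARENS.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "hello this.is.a.test(MainActivity.java:47) (and some more text)";
        System.out.println(wholeInputMatches(input)); // matches() needs the whole string to fit
        System.out.println(firstViaFind(input));      // find() stops at the first occurrence
    }
}
```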

Related

Java string split with regular expressions

I am far from mastering regular expressions, but I would like to split a string on its first and last underscore with a regular expression, e.g. turn
"hello_5_9_2018_world"
to
"hello"
"5_9_2018"
"world"
I can split it on the last underscore with
String[] splitArray = subjectString.split("_(?=[^_]*$)");
but I am not able to figure out how to split on first underscore.
Could anyone show me how I can do this?
Thanks
David
You can achieve this without regex by finding the first and last index of _ and taking substrings based on them.
String s = "hello_5_9_2018_world";
int firstIndex = s.indexOf("_");
int lastIndex = s.lastIndexOf("_");
System.out.println(s.substring(0, firstIndex));
System.out.println(s.substring(firstIndex + 1, lastIndex));
System.out.println(s.substring(lastIndex + 1));
The above prints
hello
5_9_2018
world
Note:
If the string does not have two _ you will get a StringIndexOutOfBoundsException.
To safeguard against it, you can check if the extracted indices are valid.
If firstIndex == lastIndex == -1 then it means the string does
not have any underscores.
If firstIndex == lastIndex then the string has just one underscore.
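The index checks described in the note can be sketched as follows. The class and method names are hypothetical, and returning null when there are fewer than two underscores is an assumption; throwing an exception would be an equally reasonable choice.

```java
class UnderscoreSplit {
    // Returns {first, middle, last}, or null when the string
    // contains fewer than two underscores (assumed convention).
    static String[] splitFirstAndLast(String s) {
        int first = s.indexOf('_');
        int last = s.lastIndexOf('_');
        if (first < 0 || first == last) {
            return null; // zero or one underscore: nothing to split into three parts
        }
        return new String[] {
            s.substring(0, first),
            s.substring(first + 1, last),
            s.substring(last + 1)
        };
    }

    public static void main(String[] args) {
        for (String part : splitFirstAndLast("hello_5_9_2018_world")) {
            System.out.println(part);
        }
    }
}
```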
If you always have three parts as above, you can use
([^_]*)_(.*)_([^_]*)
and get the individual elements as groups.
Regular Expression
(?<first>[^_]+)_(?<middle>.+)_(?<last>[^_]+)
Java Code
final String str = "hello_5_9_2018_world";
Pattern pattern = Pattern.compile("(?<first>[^_]+)_(?<middle>.+)_(?<last>[^_]+)");
Matcher matcher = pattern.matcher(str);
if(matcher.matches()) {
String first = matcher.group("first");
String middle = matcher.group("middle");
String last = matcher.group("last");
}
I see that a lot of guys provided their solution, but I have another regex pattern for your question
You can achieve your goal with this pattern:
"([a-zA-Z]+)_(.*)_([a-zA-Z]+)"
The whole code looks like this:
String subjectString= "hello_5_9_2018_world";
Pattern pattern = Pattern.compile("([a-zA-Z]+)_(.*)_([a-zA-Z]+)");
Matcher matcher = pattern.matcher(subjectString);
if(matcher.matches()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println(matcher.group(3));
}
It outputs:
hello
5_9_2018
world
While the other answers are actually nicer and better, if you really want to use split, this is the way to go:
"hello_5_9_2018_world".split("((?<=^[^_]*)_)|(_(?=[^_]*$))")
==> String[3] { "hello", "5_9_2018", "world" }
This is a combination of your lookahead pattern (_(?=[^_]*$))
and the symmetrical lookbehind pattern ((?<=^[^_]*)_),
which matches the _ preceded by ^ (start of the string) and [^_]* (0..n non-underscore chars).

How to match a string between two same delimiters?

some-string-test-moretext.csv
I want to extract the string test, which is always found between the 2nd and 3rd - delimiters.
The expression [-](.*?)[-] would match -string-. So it's probably close, but how can I move on to the next match?
If that matters, I'm using java.
If you know the number of delimiters in advance, you can just split the String.
String[] test = {
"some-string-test-moretext.csv",
"another-string-test-andthensome.csv"
};
for (String s: test) {
System.out.println(s.split("-")[2]);
}
Output
test
test
This should give you quite a good head start:
[^-]+-[^-]+-(.*?)-[^-]+\.csv
https://regex101.com/r/YjWDkv/1
I would propose this, using regex, and very short :
String str = "some-string-test-moretext.csv\n";
Matcher m = Pattern.compile("\\w+-\\w+-(\\w+).*").matcher(str);
String res = m.find() ? m.group(1) : "";
System.out.println(res);
Of course, String.split() is another way:
String res = str.split("-")[2];
In sed:
$ echo 'some-string-test-moretext.csv' | sed 's/[^-]*-[^-]*-\([^-]*\)-.*/\1/'
test
[^-]* means "zero or more occurrences of any char except -". Let's call that "notHyphen". So we're matching on notHyphen-notHyphen-\(notHyphen\)-.* and replacing the whole match with \1, that is, whatever is captured by the \(\).
In Java, you won't need to escape ( to \(, and the technique for extracting from capturing groups is different:
Pattern patt = Pattern.compile("[^-]*-[^-]*-([^-]*)-.*");
Matcher m = patt.matcher(filename);
String extracted = null;
if (m.matches()) {
extracted = m.group(1);
}

Find string after last underscore before dot extension

I need to find 20140809T0000Z in this string:
PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc
I tried the following to keep the string before the .nc:
(?<=_)(.*)(?=.nc)
I have the following to start from the last underscore:
/_[^_]*$/
How can I find string after last underscore before dot extension, using a regex?
RegEx is not always the best solution... :)
String pattern="PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";
int start=pattern.lastIndexOf("_") + 1;
int end=pattern.lastIndexOf(".");
if(start != 0 && end != -1 && end > start) {
System.out.println(pattern.substring(start, end));
}
You just need lookahead for this requirement.
You can use:
[^._]+(?=[^_]*$)
// matches and returns 20140809T0000Z
You could use the below regex,
(?<=_)[^_]*(?=\.nc)
In your pattern just replace .* with [^_]* so that it would match the inner string.
String s = "PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";
Pattern regex = Pattern.compile("(?<=_)[^_]*(?=\\.nc)");
Matcher regexMatcher = regex.matcher(s);
if (regexMatcher.find()) {
String ResultString = regexMatcher.group();
System.out.println(ResultString);
} //=> 20140809T0000Z
You could use a simpler pattern with a capturing group
.*_(.*)\.nc
By default the first .* will be "greedy" and consume as many characters as possible before the _, leaving just the desired string inside the (.*).
Demo: http://regex101.com/r/aI2xQ9/1
Java code:
String input = "PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";
Pattern pattern = Pattern.compile(".*_(.*)\\.nc");
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
String group = matcher.group(1);
// ...
}
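To see why the greedy first .* matters, the sketch below runs the pattern above next to a variant whose first quantifier is reluctant (.*?). The reluctant variant is added here purely for contrast and is not part of the answer; the class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns group 1 of the first match, or null if the regex does not match.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";
        // Greedy: the first .* runs up to the LAST underscore.
        System.out.println(capture(".*_(.*)\\.nc", input));
        // Reluctant: the first .*? stops at the FIRST underscore, so group 1 is longer.
        System.out.println(capture(".*?_(.*)\\.nc", input));
    }
}
```

The first call yields 20140809T0000Z, while the second yields F2-MARS3D-MENOR1200_20140809T0000Z.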
So, you need a sequence of non-underscore characters that immediately precede the period character.
Try [^_.]+(?=\.)
Demo: https://regex101.com/r/sLAnVs/2
Thanks to Cary Swoveland for pointing out that "no need to escape a period in a character class".
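A short sketch of how this pattern might be used from Java follows. The names LastSegment and beforeDot are made up, and returning null on no match is an assumption. Note that find() returns the first qualifying run, so a name with several dots would yield the segment before the first dot that directly follows a run of non-underscore, non-dot characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastSegment {
    // Non-underscore, non-dot characters immediately followed by a dot.
    private static final Pattern BEFORE_DOT = Pattern.compile("[^_.]+(?=\\.)");

    // Returns the first run of such characters, or null if there is none.
    static String beforeDot(String fileName) {
        Matcher m = BEFORE_DOT.matcher(fileName);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(beforeDot("PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc"));
    }
}
```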

How to find the text between ( and )

I have a few strings which are like this:
text (255)
varchar (64)
...
I want to find out the number between ( and ) and store that in a string. That is, obviously, store these lengths in strings.
I have the rest of it figured out except for the regex parsing part.
I'm having trouble figuring out the regex pattern.
How do I do this?
The sample code is going to look like this:
Matcher m = Pattern.compile("<I CANT FIGURE OUT WHAT COMES HERE>").matcher("text (255)");
Also, I'd like to know if there's a cheat sheet for regex parsing, from where one can directly pick up the regex patterns
I would use a plain string match
String s = "text (255)";
int start = s.indexOf('(')+1;
int end = s.indexOf(')', start);
if (end < 0) {
// not found
} else {
int num = Integer.parseInt(s.substring(start, end));
}
You can use regex, as it sometimes makes your code simpler, but that doesn't mean you should in all cases. I suspect this is a case where a simple indexOf and substring will be not only faster and shorter but, more importantly, easier to understand.
You can use this pattern to match any text between parentheses:
\(([^)]*)\)
Or this to match just numbers (with possible whitespace padding):
\(\s*(\d+)\s*\)
Of course, to use this in a string literal, you have to escape the \ characters:
Matcher m = Pattern.compile("\\(\\s*(\\d+)\\s*\\)")...
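A complete, runnable sketch built around the escaped pattern above might look like this. The class and method names are hypothetical, and returning -1 when no number is found is an assumption made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenNumber {
    // \(\s*(\d+)\s*\) with every backslash doubled for the Java string literal.
    private static final Pattern LENGTH = Pattern.compile("\\(\\s*(\\d+)\\s*\\)");

    // Returns the number inside the parentheses, or -1 if none is found.
    // Integer.parseInt will throw for digit runs too large for an int.
    static int lengthOf(String columnType) {
        Matcher m = LENGTH.matcher(columnType);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(lengthOf("text (255)"));
        System.out.println(lengthOf("varchar ( 64 )"));
    }
}
```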
Here is some example code:
import java.util.regex.*;
class Main
{
public static void main(String[] args)
{
String txt="varchar (64)";
String re1=".*?"; // Non-greedy match on filler
String re2="\\((\\d+)\\)"; // Round Braces 1
Pattern p = Pattern.compile(re1+re2,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
if (m.find())
{
String rbraces1=m.group(1);
System.out.print("("+rbraces1.toString()+")"+"\n");
}
}
}
This will print out any (int) it finds in the input string, txt.
The regex is \((\d+)\) to match any numbers between ()
int index1 = string.indexOf("(");
int index2 = string.indexOf(")");
String intValue = string.substring(index1 + 1, index2); // the end index is exclusive
Matcher m = Pattern.compile("\\((\\d+)\\)").matcher("text (255)");
if (m.find()) {
int len = Integer.parseInt (m.group(1));
System.out.println (len);
}

Iterating through String with .find() in Java regex

I'm currently trying to solve a problem from codingbat.com with regular expressions.
I'm new to this, so step-by-step explanations would be appreciated. I could solve this with String methods relatively easily, but I am trying to use regular expressions.
Here is the prompt:
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
etc
My code thus far:
String regex = ".?" + word+ ".?";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
String newStr = "";
while(m.find())
newStr += m.group().replace(word, "");
return newStr;
The problem is that when there are multiple instances of word in a row, the program misses the character preceding the word because m.find() progresses beyond it.
For example: wordEnds("abc1xyz1i1j", "1") should return "cxziij", but my method returns "cxzij", not repeating the "i"
I would appreciate a non-messy solution with an explanation I can apply to other general regex problems.
This is a one-liner solution:
String wordEnds = input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
This matches your edge case as a look ahead within a non-capturing group, then matches the usual (consuming) case.
Note that your requirements don't call for iteration; only your question title assumes it is necessary.
Note also that, to be safe, you should escape word in case it contains special regex characters: unless you can guarantee it does not, use Pattern.quote(word) instead of word.
Here's a test of the usual case and the edge case, showing it works:
public static String wordEnds(String input, String word) {
word = Pattern.quote(word); // add this line to be 100% safe
return input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
}
public static void main(String[] args) {
System.out.println(wordEnds("abcXY123XYijk", "XY"));
System.out.println(wordEnds("abc1xyz1i1j", "1"));
}
Output:
c13i
cxziij
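To illustrate why Pattern.quote matters, the sketch below counts matches for a word containing a regex metacharacter, with and without quoting. The class name, method names and sample strings are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    // Counts non-overlapping matches of the given regex in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // The word is used as-is, so '.' means "any character".
    static int countUnquoted(String word, String input) {
        return count(word, input);
    }

    // The word is treated as a literal string.
    static int countQuoted(String word, String input) {
        return count(Pattern.quote(word), input);
    }

    public static void main(String[] args) {
        String input = "a.b axb a-b";
        System.out.println(countUnquoted("a.b", input)); // also counts "axb" and "a-b"
        System.out.println(countQuoted("a.b", input));   // counts only the literal "a.b"
    }
}
```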
Use positive lookbehind and positive lookahead, which are zero-width assertions:
(?<=(.)|^)1(?=(.)|$)
(?<=(.)|^) looks for a character just before the 1 and captures it in group 1. This is a zero-width assertion: it does not move forward in the input, it only tests, which is what lets us capture the value.
1 matches the literal 1; you can replace it with any word.
(?=(.)|$) looks for a character just after the 1 and captures it in group 2.
Groups 1 and 2 then contain your values. Keep calling find() until the end of the input.
So this should be like
String s1 = "abcXY123XYiXYjk";
String s2 = java.util.regex.Pattern.quote("XY");
String s3 = "";
String r = "(?<=(.)|^)"+s2+"(?=(.)|$)";
Pattern p = Pattern.compile(r);
Matcher m = p.matcher(s1);
while(m.find()) s3 += m.group(1)+m.group(2);
//s3 now contains c13iij
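One caveat with the loop above: when the word sits at the very start or end of the input, the corresponding group does not participate in the match and m.group(...) returns null, which string concatenation turns into the text "null". A sketch of a null-safe variant follows; the class name WordEndsLookaround is hypothetical, and the regex is the same one used above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordEndsLookaround {
    static String wordEnds(String str, String word) {
        // Lookbehind captures the char before the word (group 1),
        // lookahead captures the char after it (group 2); neither consumes input.
        Pattern p = Pattern.compile("(?<=(.)|^)" + Pattern.quote(word) + "(?=(.)|$)");
        Matcher m = p.matcher(str);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (m.group(1) != null) {
                sb.append(m.group(1)); // skip when the word starts the input
            }
            if (m.group(2) != null) {
                sb.append(m.group(2)); // skip when the word ends the input
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(wordEnds("abcXY123XYijk", "XY"));
        System.out.println(wordEnds("XY1XY", "XY"));
    }
}
```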
Use regex as follows:
// a is the input string, b is the word
Matcher m = Pattern.compile("(.|)" + Pattern.quote(b) + "(?=(.?))").matcher(a);
String c = "";
while (m.find()) c += m.group(1) + m.group(2);
