extract substring in java using regex - java

I need to extract "URPlus1_S2_3" from the string:
"Last one: http://abc.imp/Basic2#URPlus1_S2_3,"
using regular expression in Java language.
Can someone please help me? I am using regex for the first time.

Try
Pattern p = Pattern.compile("#([^,]*)");
Matcher m = p.matcher(myString);
if (m.find()) {
doSomethingWith(m.group(1)); // The matched substring
}

String s = "Last one: http://abc.imp/Basic2#URPlus1_S2_3,";
Matcher m = Pattern.compile("(URPlus1_S2_3)").matcher(s);
if (m.find()) System.out.println(m.group(1));
You gotta learn how to specify your requirements ;)

You haven't really defined what criteria you need to use to find that string, but here is one way to approach based on '#' separator. You can adjust the regex as necessary.
expr: .*#([^,]*)
extract: \1
Go here for syntax documentation:
http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html

String s = Last one: http://abc.imp/Basic2#URPlus1_S2_3,"
String result = s.replaceAll(".*#", "");
The above returns the full String in case there's no "#". There are better ways using regex, but the best solution here is using no regex. There are classes URL and URI doing the job.

Since it's the first time you use regular expressions I would suggest going another way, which is more understandable for now (until you master regular expressions ;) and it will be easily modified if you will ever need to:
String yourPart = new String().split("#")[1];

Here's a long version:
String url = "http://abc.imp/Basic2#URPlus1_S2_3,";
String anchor = null;
String ps = "#(.+),";
Pattern p = Pattern.compile(ps);
Matcher m = p.matcher(url);
if (m.matches()) {
anchor = m.group(1);
}
The main point to understand is the use of the parenthesis, they are used to create groups which can be extracted from a pattern. In the Matcher object, the group method will return them in order starting at index 1, while the full match is returned by the index 0.

If you just want everything after the #, use split:
String s = "Last one: http://abc.imp/Basic2#URPlus1_S2_3," ;
System.out.println(s.split("#")[1]);
Alternatively, if you want to parse the URI and get the fragment component you can do:
URI u = new URI("http://abc.imp/Basic2#URPlus1_S2_3,");
System.out.println(u.getFragment());

Related

Java split a string by using a regex [duplicate]

I have a string that has two single quotes in it, the ' character. In between the single quotes is the data I want.
How can I write a regex to extract "the data i want" from the following text?
mydata = "some string with 'the data i want' inside";
Assuming you want the part between single quotes, use this regular expression with a Matcher:
"'(.*?)'"
Example:
String mydata = "some string with 'the data i want' inside";
Pattern pattern = Pattern.compile("'(.*?)'");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find())
{
System.out.println(matcher.group(1));
}
Result:
the data i want
You don't need regex for this.
Add apache commons lang to your project (http://commons.apache.org/proper/commons-lang/), then use:
String dataYouWant = StringUtils.substringBetween(mydata, "'");
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
Pattern pattern = Pattern.compile(".*'([^']*)'.*");
String mydata = "some string with 'the data i want' inside";
Matcher matcher = pattern.matcher(mydata);
if(matcher.matches()) {
System.out.println(matcher.group(1));
}
}
}
There's a simple one-liner for this:
String target = myData.replaceAll("[^']*(?:'(.*?)')?.*", "$1");
By making the matching group optional, this also caters for quotes not being found by returning a blank in that case.
See live demo.
Since Java 9
As of this version, you can use a new method Matcher::results with no args that is able to comfortably return Stream<MatchResult> where MatchResult represents the result of a match operation and offers to read matched groups and more (this class is known since Java 1.5).
String string = "Some string with 'the data I want' inside and 'another data I want'.";
Pattern pattern = Pattern.compile("'(.*?)'");
pattern.matcher(string)
.results() // Stream<MatchResult>
.map(mr -> mr.group(1)) // Stream<String> - the 1st group of each result
.forEach(System.out::println); // print them out (or process in other way...)
The code snippet above results in:
the data I want
another data I want
The biggest advantage is in the ease of usage when one or more results is available compared to the procedural if (matcher.find()) and while (matcher.find()) checks and processing.
Because you also ticked Scala, a solution without regex which easily deals with multiple quoted strings:
val text = "some string with 'the data i want' inside 'and even more data'"
text.split("'").zipWithIndex.filter(_._2 % 2 != 0).map(_._1)
res: Array[java.lang.String] = Array(the data i want, and even more data)
String dataIWant = mydata.replaceFirst(".*'(.*?)'.*", "$1");
as in javascript:
mydata.match(/'([^']+)'/)[1]
the actual regexp is: /'([^']+)'/
if you use the non greedy modifier (as per another post) it's like this:
mydata.match(/'(.*?)'/)[1]
it is cleaner.
String dataIWant = mydata.split("'")[1];
See Live Demo
In Scala,
val ticks = "'([^']*)'".r
ticks findFirstIn mydata match {
case Some(ticks(inside)) => println(inside)
case _ => println("nothing")
}
for (ticks(inside) <- ticks findAllIn mydata) println(inside) // multiple matches
val Some(ticks(inside)) = ticks findFirstIn mydata // may throw exception
val ticks = ".*'([^']*)'.*".r
val ticks(inside) = mydata // safe, shorter, only gets the first set of ticks
Apache Commons Lang provides a host of helper utilities for the java.lang API, most notably String manipulation methods.
In your case, the start and end substrings are the same, so just call the following function.
StringUtils.substringBetween(String str, String tag)
Gets the String that is nested in between two instances of the same
String.
If the start and the end substrings are different then use the following overloaded method.
StringUtils.substringBetween(String str, String open, String close)
Gets the String that is nested in between two Strings.
If you want all instances of the matching substrings, then use,
StringUtils.substringsBetween(String str, String open, String close)
Searches a String for substrings delimited by a start and end tag,
returning all matching substrings in an array.
For the example in question to get all instances of the matching substring
String[] results = StringUtils.substringsBetween(mydata, "'", "'");
you can use this
i use while loop to store all matches substring in the array if you use
if (matcher.find())
{
System.out.println(matcher.group(1));
}
you will get on matches substring so you can use this to get all matches substring
Matcher m = Pattern.compile("[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+").matcher(text);
// Matcher mat = pattern.matcher(text);
ArrayList<String>matchesEmail = new ArrayList<>();
while (m.find()){
String s = m.group();
if(!matchesEmail.contains(s))
matchesEmail.add(s);
}
Log.d(TAG, "emails: "+matchesEmail);
add apache.commons dependency on your pom.xml
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-io</artifactId>
<version>1.3.2</version>
</dependency>
And below code works.
StringUtils.substringBetween(String mydata, String "'", String "'")
Some how the group(1) didnt work for me. I used group(0) to find the url version.
Pattern urlVersionPattern = Pattern.compile("\\/v[0-9][a-z]{0,1}\\/");
Matcher m = urlVersionPattern.matcher(url);
if (m.find()) {
return StringUtils.substringBetween(m.group(0), "/", "/");
}
return "v0";

Pattern Matcher Vs String Split, which should I use?

First time posting.
Firstly I know how to use both Pattern Matcher & String Split.
My questions is which is best for me to use in my example and why?
Or suggestions for better alternatives.
Task:
I need to extract an unknown NOUN between two known regexp in an unknown string.
My Solution:
get the Start and End of the noun (from Regexp 1&2) and substring to extract the noun.
String line = "unknownXoooXNOUNXccccccXunknown";
int goal = 12 ;
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
I need to locate the index position AFTER the first regex.
I need to locate the index position BEFORE the second regex.
A) I can use pattern matcher
Pattern p = Pattern.compile(regexp1);
Matcher m = p.matcher(line);
if (m.find()) {
int afterRegex1 = m.end();
} else {
throw new IllegalArgumentException();
//TODO Exception Management;
}
B) I can use String Split
String[] split = line.split(regex1,2);
if (split.length != 2) {
throw new UnsupportedOperationException();
//TODO Exception Management;
}
int afterRegex1 = line.indexOf(split[1]);
Which Approach should I use and why?
I don't know which is more efficient on time and memory.
Both are near enough as readable to myself.
I'd do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regex = "Xo+X(.*?)Xc+X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(line);
if (m.find()) {
String noun = m.group(1);
}
The (.*?) is used to make the inner match on the NOUN reluctant. This protects us from a case where our ending pattern appears again in the unknown portion of the string.
EDIT
This works because the (.*?) defines a capture group. There's only one such group defined in the pattern, so it gets index 1 (the parameter to m.group(1)). These groups are indexed from left to right starting at 1. If the pattern were defined like this
String regex = "(Xo+X)(.*?)(Xc+X)";
Then there would be three capture groups, such that
m.group(1); // yields "XoooX"
m.group(2); // yields "NOUN"
m.group(3); // yields "XccccccX"
There is a group 0, but that matches the whole pattern, and it's equivalent to this
m.group(); // yields "XoooXNOUNXccccccX"
For more information about what you can do with the Matcher, including ways to get the start and end positions of your pattern within the source string, see the Matcher JavaDocs
You should use String.split() for readability unless you're in a tight loop.
Per split()'s javadoc, split() does the equivalent of Pattern.compile(), which you can optimize away if you're in a tight loop.
It looks like you want to get a unique occurrence. For this do simply
input.replaceAll(".*Xo+X(.*)Xc+X.*", "$1")
For efficiency, use Pattern.matcher(input).replaceAll instead.
In case you input contains line breaks, use Pattern.DOTALL or the s modifier.
In case you want to use split, consider using Guava's Splitter. It behaves more sane and also accepts a Pattern which is good for speed.
If you really need the locations you can do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
Matcher m=Pattern.compile(regexp1).matcher(line);
if(m.find())
{
int start=m.end();
if(m.usePattern(Pattern.compile(regexp2)).find())
{
final int end = m.start();
System.out.println("from "+start+" to "+end+" is "+line.substring(start, end));
}
}
But if you just need the word in between, I recommend the way Ian McLaird has shown.

Matching everything after the first comma in a string

I am using java to do a regular expression match. I am using rubular to verify the match and ideone to test my code.
I got a regex from this SO solution , and it matches the group as I want it to in rubular, but my implementation in java is not matching. When it prints 'value', it is printing the value of commaSeparatedString and not matcher.group(1) I want the captured group/output of println to be "v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso"
String commaSeparatedString = "Vtest7,v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso";
//match everything after first comma
String myRegex = ",(.*)";
Pattern pattern = Pattern.compile(myRegex);
Matcher matcher = pattern.matcher(commaSeparatedString);
String value = "";
if (matcher.matches())
value = matcher.group(1);
else
value = commaSeparatedString;
System.out.println(value);
(edit: I left out that commaSeparatedString will not always contain 2 commas. Rather, it will always contain 0 or more commas)
If you don't have to solve it with regex, you can try this:
int size = commaSeparatedString.length();
value = commaSeparatedString.substring(commaSeparatedString.indexOf(",")+1,size);
Namely, the code above returns the substring which starts from the first comma's index.
EDIT:
Sorry, I've omitted the simpler version. Thanks to one of the commentators, you can use this single line as well:
value = commaSeparatedString.substring( commaSeparatedString.indexOf(",") );
The definition of the regex is wrong. It should be:
String myRegex = "[^,]*,(.*)";
You are yet another victim of Java's misguided regex method naming.
.matches() automatically anchors the regex at the beginning and end (which is in total contradiction with the very definition of "regex matching"). The method you are looking for is .find().
However, for such a simple problem, it is better to go with #DelShekasteh's solution.
I would do this like
String commaSeparatedString = "Vtest7,v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso";
System.out.println(commaSeparatedString.substring(commaSeparatedString.indexOf(",")+1));
Here is another approach with limited split
String[] spl = "Vtest7,v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso".split(",", 2);
if (spl.length == 2)
System.out.println(spl[1]);
Byt IMHO Del's answer is best for your case.
I would use replaceFirst
String commaSeparatedString = "Vtest7,v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso";
System.out.println(commaSeparatedString.replaceFirst(".*?,", ""));
prints
v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso
or you could use the shorter but obtuse
System.out.println(commaSeparatedString.split(",", 2)[1]);

Silly RegEx issue. What am I doing wrong?

String url = "hello world";
String p = "world";
Pattern pattern = Pattern.compile(p);
Matcher matcher = pattern.matcher(url);
if (matcher.matches()) {
int start = matcher.start();
int end = matcher.end();
}
What am I doing wrong? How comes the if statement never gets hit?
The matches() method attempts to match the entire string to the pattern. You want the find() method.
Try Matcher.find(). Matcher.matches() checks whether the whole string matches the pattern.
You need to use find because,
matches tries to match the patten against the entire string and
implicitly add a ^ at the start and $ at the end of your pattern.
So your pattern is equivalent to "^world$".
Try to change your pattern to ".*world.*":
String p = ".*world.*";
That way it'll match any string that contains "world".
I experienced same problem.
I don't know the reason.
If someone knows problem please post here.
I had solved problem with using find() repeatedly instead of matches().

The best way to find out that part of the string is potencial RegEx match

how would you do this:
I have a string and some regexes. Then I iterate over the string and in every iteration I need to know if the part (string index 0 to string currently iterated index) of that string is possible full match of one or more given regexes in next iterations.
Thank you for help.
What about a code like this:
// all of *greedy* regexs into a list
List<String> regex = new ArrayList<String>();
// here is my text
String mytext = "...";
String tmp = null;
// iterate over letters of my text
for (int i = 0; i < mytext.length(); i++) {
// substring from 0. position till i. index
tmp = mytext.substring(0, i);
// append regex on sub text
for (String reg : regex ) {
Pattern p = Pattern.compile(reg);
Matcher m = p.matcher(tmp);
// if found, do smt
if (m.find() ) { bingo.. do smt! }
}
}
You could use Matcher.lookingAt() to try to match as much as possible from a given input, but not requiring the whole input to match (.matches() would require the full input to match and .find() would not require the match to start at the beginning).
I don't believe the Java regular expression API provides such "incremental" or "step-by-step" search.
What you could do however, is to formulate your expression using reluctant quantifiers.
[...] The reluctant quantifiers, however, take the opposite approach: They start at the beginning of the input string, then reluctantly eat one character at a time looking for a match. The last thing they try is the entire input string. [...]
If this isn't viable in your case, you could use the Matcher.setRegion method to incrementally increase the region used by the matcher.
So I've been searching for alternatives to Java's standart RegEx library and found one that does the job well - JRegex

Categories

Resources