Pattern Matcher Vs String Split, which should I use? - java

First time posting.
Firstly I know how to use both Pattern Matcher & String Split.
My questions is which is best for me to use in my example and why?
Or suggestions for better alternatives.
Task:
I need to extract an unknown NOUN between two known regexp in an unknown string.
My Solution:
get the Start and End of the noun (from Regexp 1&2) and substring to extract the noun.
String line = "unknownXoooXNOUNXccccccXunknown";
int goal = 12 ;
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
I need to locate the index position AFTER the first regex.
I need to locate the index position BEFORE the second regex.
A) I can use pattern matcher
Pattern p = Pattern.compile(regexp1);
Matcher m = p.matcher(line);
if (m.find()) {
int afterRegex1 = m.end();
} else {
throw new IllegalArgumentException();
//TODO Exception Management;
}
B) I can use String Split
String[] split = line.split(regex1,2);
if (split.length != 2) {
throw new UnsupportedOperationException();
//TODO Exception Management;
}
int afterRegex1 = line.indexOf(split[1]);
Which Approach should I use and why?
I don't know which is more efficient on time and memory.
Both are near enough as readable to myself.

I'd do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regex = "Xo+X(.*?)Xc+X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(line);
if (m.find()) {
String noun = m.group(1);
}
The (.*?) is used to make the inner match on the NOUN reluctant. This protects us from a case where our ending pattern appears again in the unknown portion of the string.
EDIT
This works because the (.*?) defines a capture group. There's only one such group defined in the pattern, so it gets index 1 (the parameter to m.group(1)). These groups are indexed from left to right starting at 1. If the pattern were defined like this
String regex = "(Xo+X)(.*?)(Xc+X)";
Then there would be three capture groups, such that
m.group(1); // yields "XoooX"
m.group(2); // yields "NOUN"
m.group(3); // yields "XccccccX"
There is a group 0, but that matches the whole pattern, and it's equivalent to this
m.group(); // yields "XoooXNOUNXccccccX"
For more information about what you can do with the Matcher, including ways to get the start and end positions of your pattern within the source string, see the Matcher JavaDocs

You should use String.split() for readability unless you're in a tight loop.
Per split()'s javadoc, split() does the equivalent of Pattern.compile(), which you can optimize away if you're in a tight loop.

It looks like you want to get a unique occurrence. For this do simply
input.replaceAll(".*Xo+X(.*)Xc+X.*", "$1")
For efficiency, use Pattern.matcher(input).replaceAll instead.
In case you input contains line breaks, use Pattern.DOTALL or the s modifier.
In case you want to use split, consider using Guava's Splitter. It behaves more sane and also accepts a Pattern which is good for speed.

If you really need the locations you can do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
Matcher m=Pattern.compile(regexp1).matcher(line);
if(m.find())
{
int start=m.end();
if(m.usePattern(Pattern.compile(regexp2)).find())
{
final int end = m.start();
System.out.println("from "+start+" to "+end+" is "+line.substring(start, end));
}
}
But if you just need the word in between, I recommend the way Ian McLaird has shown.

Related

Java regex to match the start of the word?

Objective: for a given term, I want to check if that term exist at the start of the word. For example if the term is 't'. then in the sentance:
"This is the difficult one Thats it"
I want it to return "true" because of :
This, the, Thats
so consider:
public class HelloWorld{
public static void main(String []args){
String term = "t";
String regex = "/\\b"+term+"[^\\b]*?\\b/gi";
String str = "This is the difficult one Thats it";
System.out.println(str.matches(regex));
}
}
I am getting following Exception:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 7
/\bt[^\b]*?\b/gi
^
at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.escape(Pattern.java:2416)
at java.util.regex.Pattern.range(Pattern.java:2577)
at java.util.regex.Pattern.clazz(Pattern.java:2507)
at java.util.regex.Pattern.sequence(Pattern.java:2030)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at java.util.regex.Pattern.matches(Pattern.java:1128)
at java.lang.String.matches(String.java:2063)
at HelloWorld.main(HelloWorld.java:8)
Also the following does not work:
import java.util.regex.*;
public class HelloWorld{
public static void main(String []args){
String term = "t";
String regex = "\\b"+term+"gi";
//String regex = ".";
System.out.println(regex);
String str = "This is the difficult one Thats it";
System.out.println(str.matches(regex));
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
System.out.println(m.find());
}
}
Example:
{ This , one, Two, Those, Thanks }
for words This Two Those Thanks; result should be true.
Thanks
Since you're using the Java regex engine, you need to write the expressions in a way Java understands. That means removing trailing and leading slashes and adding flags as (?<flags>) at the beginning of the expression.
Thus you'd need this instead:
String regex = "(?i)\\b"+term+".*?\\b"
Have a look at regular-expressions.info/java.html for more information. A comparison of supported features can be found here (just as an entry point): regular-expressions.info/refbasic.html
In Java we don't surround regex with / so instead of "/regex/flags" we just write regex. If you want to add flags you can do it with (?flags) syntax and place it in regex at position from which flag should apply, for instance a(?i)a will be able to find aa and aA but not Aa because flag was added after first a.
You can also compile your regex into Pattern like this
Pattern pattern = Pattern.compile(regex, flags);
where regex is String (again not enclosed with /) and flag is integer build from constants from Pattern like Pattern.DOTALL or when you need more flags you can use Pattern.CASE_INSENSITIVE|Pattern.MULTILINE.
Next thing which may confuse you is matches method. Most people are mistaken by its name, because they assume that it will try to check if it can find in string element which can be matched by regex, but in reality, it checks if entire string can be matched by regex.
What you seem to want is mechanism to test of some regex can be found at least once in string. In that case you may either
add .* at start and end of your regex to let other characters which are not part of element you want to find be matched by regex engine, but this way matches must iterate over entire string
use Matcher object build from Pattern (representing your regex), and use its find() method, which will iterate until it finds match for regex, or will find end of string. I prefer this approach because it will not need to iterate over entire string, but will stop when match will be found.
So your code could look like
String str = "This is the difficult one Thats it";
String term = "t";
Pattern pattern = Pattern.compile("\\b"+term, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.find());
In case your term could contain some regex special characters but you want regex engine to treat them as normal characters you need to make sure that they will be escaped. To do this you can use Pattern.quote method which will add all necessary escapes for you, so instead of
Pattern pattern = Pattern.compile("\\b"+term, Pattern.CASE_INSENSITIVE);
for safety you should use
Pattern pattern = Pattern.compile("\\b"+Pattern.quote(term), Pattern.CASE_INSENSITIVE);
String regex = "(?i)\\b"+term;
In Java, the modifiers must be inserted between "(?" and ")" and there is a variant for turning them off again: "(?-" and ")".
For finding all words beginning with "T" or "t", you may want to use Matcher's find method repeatedly. If you just need the offset, Matcher's start method returns the offset.
If you need to match the full word, use
String regex = "(?i)\\b"+term + "\\w*";
String str = "This is the difficult one Thats it";
String term = "t";
Pattern pattern = Pattern.compile("^[+"+term+"].*",Pattern.CASE_INSENSITIVE);
String[] strings = str.split(" ");
for (String s : strings) {
if (pattern.matcher(s).matches()) {
System.out.println(s+"-->"+true);
} else {
System.out.println(s+"-->"+false);
}
}

Java split a string by using a regex [duplicate]

I have a string that has two single quotes in it, the ' character. In between the single quotes is the data I want.
How can I write a regex to extract "the data i want" from the following text?
mydata = "some string with 'the data i want' inside";
Assuming you want the part between single quotes, use this regular expression with a Matcher:
"'(.*?)'"
Example:
String mydata = "some string with 'the data i want' inside";
Pattern pattern = Pattern.compile("'(.*?)'");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find())
{
System.out.println(matcher.group(1));
}
Result:
the data i want
You don't need regex for this.
Add apache commons lang to your project (http://commons.apache.org/proper/commons-lang/), then use:
String dataYouWant = StringUtils.substringBetween(mydata, "'");
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
Pattern pattern = Pattern.compile(".*'([^']*)'.*");
String mydata = "some string with 'the data i want' inside";
Matcher matcher = pattern.matcher(mydata);
if(matcher.matches()) {
System.out.println(matcher.group(1));
}
}
}
There's a simple one-liner for this:
String target = myData.replaceAll("[^']*(?:'(.*?)')?.*", "$1");
By making the matching group optional, this also caters for quotes not being found by returning a blank in that case.
See live demo.
Since Java 9
As of this version, you can use a new method Matcher::results with no args that is able to comfortably return Stream<MatchResult> where MatchResult represents the result of a match operation and offers to read matched groups and more (this class is known since Java 1.5).
String string = "Some string with 'the data I want' inside and 'another data I want'.";
Pattern pattern = Pattern.compile("'(.*?)'");
pattern.matcher(string)
.results() // Stream<MatchResult>
.map(mr -> mr.group(1)) // Stream<String> - the 1st group of each result
.forEach(System.out::println); // print them out (or process in other way...)
The code snippet above results in:
the data I want
another data I want
The biggest advantage is in the ease of usage when one or more results is available compared to the procedural if (matcher.find()) and while (matcher.find()) checks and processing.
Because you also ticked Scala, a solution without regex which easily deals with multiple quoted strings:
val text = "some string with 'the data i want' inside 'and even more data'"
text.split("'").zipWithIndex.filter(_._2 % 2 != 0).map(_._1)
res: Array[java.lang.String] = Array(the data i want, and even more data)
String dataIWant = mydata.replaceFirst(".*'(.*?)'.*", "$1");
as in javascript:
mydata.match(/'([^']+)'/)[1]
the actual regexp is: /'([^']+)'/
if you use the non greedy modifier (as per another post) it's like this:
mydata.match(/'(.*?)'/)[1]
it is cleaner.
String dataIWant = mydata.split("'")[1];
See Live Demo
In Scala,
val ticks = "'([^']*)'".r
ticks findFirstIn mydata match {
case Some(ticks(inside)) => println(inside)
case _ => println("nothing")
}
for (ticks(inside) <- ticks findAllIn mydata) println(inside) // multiple matches
val Some(ticks(inside)) = ticks findFirstIn mydata // may throw exception
val ticks = ".*'([^']*)'.*".r
val ticks(inside) = mydata // safe, shorter, only gets the first set of ticks
Apache Commons Lang provides a host of helper utilities for the java.lang API, most notably String manipulation methods.
In your case, the start and end substrings are the same, so just call the following function.
StringUtils.substringBetween(String str, String tag)
Gets the String that is nested in between two instances of the same
String.
If the start and the end substrings are different then use the following overloaded method.
StringUtils.substringBetween(String str, String open, String close)
Gets the String that is nested in between two Strings.
If you want all instances of the matching substrings, then use,
StringUtils.substringsBetween(String str, String open, String close)
Searches a String for substrings delimited by a start and end tag,
returning all matching substrings in an array.
For the example in question to get all instances of the matching substring
String[] results = StringUtils.substringsBetween(mydata, "'", "'");
you can use this
i use while loop to store all matches substring in the array if you use
if (matcher.find())
{
System.out.println(matcher.group(1));
}
you will get on matches substring so you can use this to get all matches substring
Matcher m = Pattern.compile("[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+").matcher(text);
// Matcher mat = pattern.matcher(text);
ArrayList<String>matchesEmail = new ArrayList<>();
while (m.find()){
String s = m.group();
if(!matchesEmail.contains(s))
matchesEmail.add(s);
}
Log.d(TAG, "emails: "+matchesEmail);
add apache.commons dependency on your pom.xml
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-io</artifactId>
<version>1.3.2</version>
</dependency>
And below code works.
StringUtils.substringBetween(String mydata, String "'", String "'")
Some how the group(1) didnt work for me. I used group(0) to find the url version.
Pattern urlVersionPattern = Pattern.compile("\\/v[0-9][a-z]{0,1}\\/");
Matcher m = urlVersionPattern.matcher(url);
if (m.find()) {
return StringUtils.substringBetween(m.group(0), "/", "/");
}
return "v0";

How would I do this in Java Regex?

Trying to make a regex that grabs all words like lets just say, chicken, that are not in brackets. So like
chicken
Would be selected but
[chicken]
Would not. Does anyone know how to do this?
String template = "[chicken]";
String pattern = "\\G(?<!\\[)(\\w+)(?!\\])";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(template);
while (m.find())
{
System.out.println(m.group());
}
It uses a combination of negative look-behind and negative look-aheads and boundary matchers.
(?<!\\[) //negative look behind
(?!\\]) //negative look ahead
(\\w+) //capture group for the word
\\G //is a boundary matcher for marking the end of the previous match
(please read the following edits for clarification)
EDIT 1:
If one needs to account for situations like:
"chicken [chicken] chicken [chicken]"
We can replace the regex with:
String regex = "(?<!\\[)\\b(\\w+)\\b(?!\\])";
EDIT 2:
If one also needs to account for situations like:
"[chicken"
"chicken]"
As in one still wants the "chicken", then you could use:
String pattern = "(?<!\\[)?\\b(\\w+)\\b(?!\\])|(?<!\\[)\\b(\\w+)\\b(?!\\])?";
Which essentially accounts for the two cases of having only one bracket on either side. It accomplishes this through the | which acts as an or, and by using ? after the look-ahead/behinds, where ? means 0 or 1 of the previous expression.
I guess you want something like:
final Pattern UNBRACKETED_WORD_PAT = Pattern.compile("(?<!\\[)\\b\\w+\\b(?!])");
private List<String> findAllUnbracketedWords(final String s) {
final List<String> ret = new ArrayList<String>();
final Matcher m = UNBRACKETED_WORD_PAT.matcher(s);
while (m.find()) {
ret.add(m.group());
}
return Collections.unmodifiableList(ret);
}
Use this:
/(?<![\[\w])\w+(?![\w\]])/
i.e., consecutive word characters with no square bracket or word character before or after.
This needs to check both left and right for both a square bracket and a word character, else for your input of [chicken] it would simply return
hicke
Without look around:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class MatchingTest
{
private static String x = "pig [cow] chicken bull] [grain";
public static void main(String[] args)
{
Pattern p = Pattern.compile("(\\[?)(\\w+)(\\]?)");
Matcher m = p.matcher(x);
while(m.find())
{
String firstBracket = m.group(1);
String word = m.group(2);
String lastBracket = m.group(3);
if ("".equals(firstBracket) && "".equals(lastBracket))
{
System.out.println(word);
}
}
}
}
Output:
pig
chicken
A bit more verbose, sure, but I find it more readable and easier to understand. Certainly simpler than a huge regular expression trying to handle all possible combinations of brackets.
Note that this won't filter out input like [fence tree grass]; it will indicate that tree is a match. You cannot skip tree in that without a parser. Hopefully, this is not a case you need to handle.

The best way to find out that part of the string is potencial RegEx match

how would you do this:
I have a string and some regexes. Then I iterate over the string and in every iteration I need to know if the part (string index 0 to string currently iterated index) of that string is possible full match of one or more given regexes in next iterations.
Thank you for help.
What about a code like this:
// all of *greedy* regexs into a list
List<String> regex = new ArrayList<String>();
// here is my text
String mytext = "...";
String tmp = null;
// iterate over letters of my text
for (int i = 0; i < mytext.length(); i++) {
// substring from 0. position till i. index
tmp = mytext.substring(0, i);
// append regex on sub text
for (String reg : regex ) {
Pattern p = Pattern.compile(reg);
Matcher m = p.matcher(tmp);
// if found, do smt
if (m.find() ) { bingo.. do smt! }
}
}
You could use Matcher.lookingAt() to try to match as much as possible from a given input, but not requiring the whole input to match (.matches() would require the full input to match and .find() would not require the match to start at the beginning).
I don't believe the Java regular expression API provides such "incremental" or "step-by-step" search.
What you could do however, is to formulate your expression using reluctant quantifiers.
[...] The reluctant quantifiers, however, take the opposite approach: They start at the beginning of the input string, then reluctantly eat one character at a time looking for a match. The last thing they try is the entire input string. [...]
If this isn't viable in your case, you could use the Matcher.setRegion method to incrementally increase the region used by the matcher.
So I've been searching for alternatives to Java's standart RegEx library and found one that does the job well - JRegex

extract substring in java using regex

I need to extract "URPlus1_S2_3" from the string:
"Last one: http://abc.imp/Basic2#URPlus1_S2_3,"
using regular expression in Java language.
Can someone please help me? I am using regex for the first time.
Try
Pattern p = Pattern.compile("#([^,]*)");
Matcher m = p.matcher(myString);
if (m.find()) {
doSomethingWith(m.group(1)); // The matched substring
}
String s = "Last one: http://abc.imp/Basic2#URPlus1_S2_3,";
Matcher m = Pattern.compile("(URPlus1_S2_3)").matcher(s);
if (m.find()) System.out.println(m.group(1));
You gotta learn how to specify your requirements ;)
You haven't really defined what criteria you need to use to find that string, but here is one way to approach based on '#' separator. You can adjust the regex as necessary.
expr: .*#([^,]*)
extract: \1
Go here for syntax documentation:
http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
String s = Last one: http://abc.imp/Basic2#URPlus1_S2_3,"
String result = s.replaceAll(".*#", "");
The above returns the full String in case there's no "#". There are better ways using regex, but the best solution here is using no regex. There are classes URL and URI doing the job.
Since it's the first time you use regular expressions I would suggest going another way, which is more understandable for now (until you master regular expressions ;) and it will be easily modified if you will ever need to:
String yourPart = new String().split("#")[1];
Here's a long version:
String url = "http://abc.imp/Basic2#URPlus1_S2_3,";
String anchor = null;
String ps = "#(.+),";
Pattern p = Pattern.compile(ps);
Matcher m = p.matcher(url);
if (m.matches()) {
anchor = m.group(1);
}
The main point to understand is the use of the parenthesis, they are used to create groups which can be extracted from a pattern. In the Matcher object, the group method will return them in order starting at index 1, while the full match is returned by the index 0.
If you just want everything after the #, use split:
String s = "Last one: http://abc.imp/Basic2#URPlus1_S2_3," ;
System.out.println(s.split("#")[1]);
Alternatively, if you want to parse the URI and get the fragment component you can do:
URI u = new URI("http://abc.imp/Basic2#URPlus1_S2_3,");
System.out.println(u.getFragment());

Categories

Resources