How to extract string between two strings using java Pattern


I have a string /subscription/ffcc218c-985c-4ec8-82d7-751fdcac93f0/subscribe from which I want to extract the middle part, i.e. the <....> in /subscription/<....>/subscribe. I have written the code below to get it:
String subscriber = subscriberDestination.substring(1);
int startPos = subscriber.indexOf("/") + 2;
int destPos = startPos + subscriber.substring(startPos + 2).indexOf("/");
return subscriberDestination.substring(startPos, destPos + 2);
It gives back ffcc218c-985c-4ec8-82d7-751fdcac93f0
How can I use the Java Pattern library to write better code?

If you want to use a regular expression, a simple way would be:
return subscriberDestination.replaceAll("/.*/([^/]*)/.*", "$1");
/.*/ is for the /subscription/ bit
([^/]*) is a capturing group that matches all characters until the next /
/.* is for the /subscribe bit
The second argument of replaceAll says that we want to keep only the first group. Note that the expression expects the original string, including its leading /.
You can use a Pattern to improve efficiency by compiling the expression once:
Pattern p = Pattern.compile("/.*/([^/]*)/.*"); // store it outside the method to reuse it
Matcher m = p.matcher(subscriberDestination);
if (m.find()) return m.group(1);
else return "not found";

My two cents: I recommend using a Pattern to extract a substring with a known format:
public final class Foo {
private static final Pattern PATTERN = Pattern.compile(".*subscription\\/(?<uuid>[\\w-]+)\\/subscribe");
public static String getUuid(String url) {
Matcher matcher = PATTERN.matcher(url);
return matcher.matches() ? matcher.group("uuid") : null;
}
}

Performance can be improved by not creating intermediate substrings.
Also, indexOf(..) with a char should be faster than with a String:
final int startPos = subscriberDestination.indexOf('/',1) + 1 ;
final int destPos = subscriberDestination.indexOf('/',startPos+1);
return subscriberDestination.substring(startPos, destPos );
About using the Java Pattern library:
Do you expect any performance gain? I doubt you will get any by using the Pattern library, but I recommend profiling it to be absolutely sure.
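The indexOf approach can also guard against inputs that lack the expected slashes. A minimal sketch, assuming the wanted part always sits between the second and third / of the string (the class and method names are made up for illustration):

```java
final class SegmentByIndex {

    // Returns the text between the second and third '/' of the path,
    // or null when the path contains fewer than three slashes.
    static String secondSegment(String path) {
        int first = path.indexOf('/');
        int second = first < 0 ? -1 : path.indexOf('/', first + 1);
        int third = second < 0 ? -1 : path.indexOf('/', second + 1);
        return third < 0 ? null : path.substring(second + 1, third);
    }

    public static void main(String[] args) {
        // Prints the UUID from the question's example string.
        System.out.println(
            secondSegment("/subscription/ffcc218c-985c-4ec8-82d7-751fdcac93f0/subscribe"));
    }
}
```

Returning null for malformed input is a design choice of this sketch; throwing an exception would work just as well.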

Related

Splitting a String that has a particular structure

I have a string that goes something like this
"330 Daniel T92435"
Now I need to obtain the name "Daniel", and I could simply just type
string.substring(4,10);
But the position where a name ("Daniel") is placed could vary.
And I don't want to use the split() method.
I was wondering whether there is a way to make the substring method read data until a whitespace is found.

If input string always has the following string structure "someSymbols Name someSymbols" you can use the following regular expression to extract the name:
"[^\\s]+\\s+(\\p{Alpha}+)\\s+[^\\s]+"
\\p{Alpha} - alphabetic character;
\\s - white space;
[^\\s] - any symbol apart from the white space.
In the code below, Pattern is an object representing the regular expression. Matcher, in turn, is an object that navigates over the given string and finds the parts of it that match the pattern.
public static String findName(String source) {
Pattern pattern = Pattern.compile("[^\\s]+\\s+(\\p{Alpha}+)\\s+[^\\s]+");
Matcher matcher = pattern.matcher(source);
String result = "no match was found";
if (matcher.find()) {
result = matcher.group(1); // group 1 corresponds to the first element enclosed in parentheses (\\p{Alpha}+)
}
return result;
}
main()
public static void main(String[] args) {
System.out.println(findName("330 Daniel T92435"));
}
Output
Daniel

You can use the String.indexOf(" ") method.
int start = string.indexOf(" ")+1;
string.substring(start,start + 6);
Edit: You can use
int start = string.indexOf(" ")+1;
int end = string.indexOf(" ", start+1);
string.substring(start,end >= 0 ? end : string.length());
if you want to select the first word and don't know how long it will be.

Parsing text using Regex

So I am trying to parse a String that contains two key components. One tells me the timing options, and the other is position.
Here is what the text looks like
KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif
The {iiii} is the position and the {ttt} is the timing options.
I need to separate the {ttt} and {iiii} out so I can get a full file name: example, position 1 and time slice 1 = KB_H9Oct4GFP_20130305_p0000001t000000001z001c02.tif
So far here is how I am parsing them:
int startTimeSlice = 1;
int startTile = 1;
String regexTime = "([^{]*)\\{([t]+)\\}(.*)";
Pattern patternTime = Pattern.compile(regexTime);
Matcher matcherTime = patternTime.matcher(filePattern);
if (!matcherTime.find() || matcherTime.groupCount() != 3)
{
throw new IllegalArgumentException("Incorrect filePattern: " + filePattern);
}
String timePrefix = matcherTime.group(1);
int tCount = matcherTime.group(2).length();
String timeSuffix = matcherTime.group(3);
String timeMatcher = timePrefix + "%0" + tCount + "d" + timeSuffix;
String timeFileName = String.format(timeMatcher, startTimeSlice);
String regex = "([^{]*)\\{([i]+)\\}(.*)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(timeFileName);
if (!matcher.find() || matcher.groupCount() != 3)
{
throw new IllegalArgumentException("Incorrect filePattern: " + filePattern);
}
String prefix = matcher.group(1);
int iCount = matcher.group(2).length();
String suffix = matcher.group(3);
String nameMatcher = prefix + "%0" + iCount + "d" + suffix;
String fileName = String.format(nameMatcher, startTile);
Unfortunately my code is not working: it fails when checking whether the second matcher finds anything in timeFileName.
After the first regex check it gets the following as the timeFileName: 000000001z001c02.tif, so it is cutting off the beginning portion, including the {iiii}.
Unfortunately I cannot assume which group comes first ({iiii} or {ttt}), so I am trying to devise a solution that just handles {ttt} first and then processes {iiii}.
Also, here is another example of valid text that I am also trying to parse: F_{iii}_{ttt}.tif

Steps to follow:
Find the string {ttt...} in the file name
Form a number format based on the number of "t" characters in the string
Find the string {iiii...} in the file name
Form a number format based on the number of "i" characters in the string
Use the String.replace() method to replace the time and position placeholders
Here is the code:
String filePattern = "KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif";
int startTimeSlice = 1;
int startTile = 1;
Pattern patternTime = Pattern.compile("(\\{[t]*\\})");
Matcher matcherTime = patternTime.matcher(filePattern);
if (matcherTime.find()) {
String timePattern = matcherTime.group(0);// {ttt}
NumberFormat timingFormat = new DecimalFormat(timePattern.replaceAll("t", "0")
.substring(1, timePattern.length() - 1));// 000
Pattern patternPosition = Pattern.compile("(\\{[i]*\\})");
Matcher matcherPosition = patternPosition.matcher(filePattern);
if (matcherPosition.find()) {
String positionPattern = matcherPosition.group(0);// {iiii}
NumberFormat positionFormat = new DecimalFormat(positionPattern
.replaceAll("i", "0").substring(1, positionPattern.length() - 1));// 0000
System.out.println(filePattern.replace(timePattern,
timingFormat.format(startTimeSlice)).replace(positionPattern,
positionFormat.format(startTile)));
}
}

Okay, so after a bit of testing I found a way to handle the case:
For parsing the {ttt} I can use the regex: (.*)\\{t([t]+)\\}(.*)
Now this means I have to increment tCount by one to account for the t I grab from \\{t
Same goes for {iii}: (.*)\\{i([i]+)\\}(.*)

Your first pattern looks like this:
String regexTime = "([^{]*)\\{([t]+)\\}(.*)";
This finds a string consisting of a sequence of zero or more non-{ characters, followed by {t...t}, followed by other characters.
When your input is
KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif
the first substring that matches is
iiii}t00000{ttt}z001c02.tif
The { before the i's can't match, because you told it only to match non-{ characters. The result is that when you re-form the string to do the second match, it will start with iiii} and therefore won't match {iiii} like you're trying to do.
When you're looking for {ttt...}, I don't see any reason to exclude { or any other character from the first part of the string. So changing the regex to
"^(.*)\\{(t+)\\}(.*)$"
may be a simple way to fix this. Note that if you want to make sure you include the entire beginning of the string and the entire end of the string in your groups, you should include ^ and $ to match the beginning and end of the string, respectively; otherwise the matcher engine may decide not to include everything. In this case, it won't, but it's a good habit to get into anyway, because that makes things explicit and doesn't require anyone to know the difference between "greedy" and "reluctant" matching. Or use matches() instead of find(), since matches() automatically tries to match the entire string.
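To make this concrete, here is a hedged sketch that applies an anchored pattern of this shape, with the group around the letters closed, once for {ttt} and once for {iiii}. The helper name fill and its parameters are invented for this example, and it assumes the placeholder letter is a plain letter such as t or i:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderFill {

    // Replaces a {ccc...} run of the given letter with a zero-padded number whose
    // width equals the number of letters between the braces. Because the leading
    // (.*) is greedy, the last such run is used if the template contains several.
    static String fill(String template, char letter, int value) {
        Pattern p = Pattern.compile("^(.*)\\{(" + letter + "+)\\}(.*)$");
        Matcher m = p.matcher(template);
        if (!m.matches()) {
            throw new IllegalArgumentException("Incorrect filePattern: " + template);
        }
        int width = m.group(2).length();
        // Only the number goes through String.format, so a '%' elsewhere in the
        // file name cannot be misread as a format specifier.
        return m.group(1) + String.format("%0" + width + "d", value) + m.group(3);
    }

    public static void main(String[] args) {
        String filePattern = "KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif";
        // The order of the two calls does not matter.
        System.out.println(fill(fill(filePattern, 't', 1), 'i', 1));
        // prints KB_H9Oct4GFP_20130305_p000001t00000001z001c02.tif
    }
}
```

Since each call looks for its own placeholder anywhere in the string, the same code handles F_{iii}_{ttt}.tif and templates where {ttt} comes before {iiii}.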

Perhaps an easier way to do this (as confirmed by http://regex101.com/r/vG7kY7) is
(\{i+\}).*(\{t+\})
You don't need the [] around a single character you are matching. Keep it simple. i+ means "one or more i's", and as long as the placeholders appear in the order given, this expression will work (with the first group capturing {iiii} and the second {ttt}).
In a Java string literal you need to escape each backslash: "(\\{i+\\}).*(\\{t+\\})".

Pattern Matcher Vs String Split, which should I use?

First time posting.
Firstly I know how to use both Pattern Matcher & String Split.
My question is: which is best for me to use in my example, and why?
Or suggestions for better alternatives.
Task:
I need to extract an unknown NOUN between two known regexp in an unknown string.
My Solution:
get the Start and End of the noun (from Regexp 1&2) and substring to extract the noun.
String line = "unknownXoooXNOUNXccccccXunknown";
int goal = 12 ;
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
I need to locate the index position AFTER the first regex.
I need to locate the index position BEFORE the second regex.
A) I can use pattern matcher
Pattern p = Pattern.compile(regexp1);
Matcher m = p.matcher(line);
if (m.find()) {
int afterRegex1 = m.end();
} else {
throw new IllegalArgumentException();
//TODO Exception Management;
}
B) I can use String Split
String[] split = line.split(regexp1,2);
if (split.length != 2) {
throw new UnsupportedOperationException();
//TODO Exception Management;
}
int afterRegex1 = line.indexOf(split[1]);
Which Approach should I use and why?
I don't know which is more efficient on time and memory.
Both are near enough as readable to myself.

I'd do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regex = "Xo+X(.*?)Xc+X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(line);
if (m.find()) {
String noun = m.group(1);
}
The (.*?) is used to make the inner match on the NOUN reluctant. This protects us from a case where our ending pattern appears again in the unknown portion of the string.
EDIT
This works because the (.*?) defines a capture group. There's only one such group defined in the pattern, so it gets index 1 (the parameter to m.group(1)). These groups are indexed from left to right starting at 1. If the pattern were defined like this
String regex = "(Xo+X)(.*?)(Xc+X)";
Then there would be three capture groups, such that
m.group(1); // yields "XoooX"
m.group(2); // yields "NOUN"
m.group(3); // yields "XccccccX"
There is a group 0, but that matches the whole pattern, and it's equivalent to this
m.group(); // yields "XoooXNOUNXccccccX"
For more information about what you can do with the Matcher, including ways to get the start and end positions of your pattern within the source string, see the Matcher JavaDocs
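Since the question asks for index positions, here is a small sketch of the position methods mentioned above: Matcher.start(int) and Matcher.end(int) report where a capture group begins and ends in the source string. The class and method names are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupPositions {

    private static final Pattern NOUN = Pattern.compile("Xo+X(.*?)Xc+X");

    // Returns {start, end} of capture group 1 within the line (end is exclusive),
    // or null when the pattern is not found.
    static int[] nounBounds(String line) {
        Matcher m = NOUN.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(1), m.end(1) };
    }

    public static void main(String[] args) {
        String line = "unknownXoooXNOUNXccccccXunknown";
        int[] bounds = nounBounds(line);
        System.out.println(bounds[0] + " " + bounds[1] + " " + line.substring(bounds[0], bounds[1]));
        // prints 12 16 NOUN
    }
}
```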

You should use String.split() for readability unless you're in a tight loop.
Per split()'s javadoc, split() does the equivalent of Pattern.compile(), which you can optimize away if you're in a tight loop.
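If the split does end up in a tight loop, the compile step can be hoisted into a constant and Pattern.split(CharSequence, int) called directly. A minimal sketch, with the class and method names invented for this example:

```java
import java.util.regex.Pattern;

final class PrecompiledSplit {

    // Compiled once, instead of on every String.split() call.
    private static final Pattern START_MARKER = Pattern.compile("Xo+X");

    // Returns everything after the first start marker, or null if there is none.
    static String afterMarker(String line) {
        String[] parts = START_MARKER.split(line, 2);
        return parts.length == 2 ? parts[1] : null;
    }

    public static void main(String[] args) {
        System.out.println(afterMarker("unknownXoooXNOUNXccccccXunknown"));
        // prints NOUNXccccccXunknown
    }
}
```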

It looks like you want to get a unique occurrence. For this do simply
input.replaceAll(".*Xo+X(.*)Xc+X.*", "$1")
For efficiency, compile the Pattern once and use pattern.matcher(input).replaceAll("$1") instead.
In case your input contains line breaks, use Pattern.DOTALL or the (?s) flag.
If you want to use split, consider Guava's Splitter. It behaves more sanely and also accepts a Pattern, which is good for speed.
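A sketch of the precompiled replaceAll variant with Pattern.DOTALL, assuming the input contains exactly one start marker and one end marker (the class and method names are made up for illustration):

```java
import java.util.regex.Pattern;

final class UniqueOccurrence {

    // DOTALL lets '.' match line terminators, so multi-line input works too.
    private static final Pattern NOUN =
        Pattern.compile(".*Xo+X(.*)Xc+X.*", Pattern.DOTALL);

    // Replaces the whole input with the text captured between the two markers.
    // Input that does not match the pattern is returned unchanged.
    static String noun(String input) {
        return NOUN.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(noun("unknownXoooXNOUNXccccccXunknown")); // prints NOUN
    }
}
```

Because unmatched input comes back unchanged, callers that need to detect a missing marker should use a Matcher with matches() or find() instead.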

If you really need the locations you can do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
Matcher m=Pattern.compile(regexp1).matcher(line);
if(m.find())
{
int start=m.end();
if(m.usePattern(Pattern.compile(regexp2)).find())
{
final int end = m.start();
System.out.println("from "+start+" to "+end+" is "+line.substring(start, end));
}
}
But if you just need the word in between, I recommend the way Ian McLaird has shown.

How to find expression, evaluate and replace in Java?

I have the following expressions inside a String (that comes from a text file):
{gender=male#his#her}
{new=true#newer#older}
And I would like to:
1. Find the occurrences of the pattern {variable=value#if_true#if_false}.
2. Temporarily store those variables in fields such as variableName, variableValue, ifTrue, ifFalse as Strings.
3. Evaluate an expression based on variableName and variableValue according to local variables (like String gender = "male" and String new = "true").
4. And finally replace the pattern with ifTrue or ifFalse according to (3).
Should I use String.replaceAll() in some way, or how do I look for this expression and save the strings that are inside? Thanks for your help
UPDATE
It would be something like PHP's preg_match_all.
UPDATE 2
I solved this by using Pattern and Matcher as I post as an answer below.

If the strings always take this format, then string.split("#") is probably the way to go. This returns an array of strings split on the '#' separator (e.g. "{gender=male#his#her}".split("#") gives {"{gender=male", "his", "her}"}; use substring to remove the first and last character to get rid of the braces).
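A minimal sketch of the split approach, assuming the expression is already isolated and always has the form {variable=value#if_true#if_false}. The class and method names are hypothetical:

```java
final class ConditionalParts {

    // Splits "{variable=value#if_true#if_false}" into
    // {variable, value, if_true, if_false}.
    static String[] parse(String expr) {
        // Strip the surrounding braces first.
        String inner = expr.substring(1, expr.length() - 1);
        String[] parts = inner.split("#");
        int eq = parts.length == 3 ? parts[0].indexOf('=') : -1;
        if (eq < 0) {
            throw new IllegalArgumentException("Unexpected expression: " + expr);
        }
        return new String[] {
            parts[0].substring(0, eq),   // variable name
            parts[0].substring(eq + 1),  // value to compare against
            parts[1],                    // text used when the comparison is true
            parts[2]                     // text used when it is false
        };
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", parse("{gender=male#his#her}")));
        // prints gender, male, his, her
    }
}
```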

After struggling for a while I managed to get this working using Pattern and Matcher as follows:
// \{variable=value#if_true#if_false\}
Pattern pattern = Pattern.compile(Pattern.quote("\\{") + "([\\w\\s]+)=([\\w\\s]+)#([\\w\\s]+)#([\\w\\s]+)" + Pattern.quote("\\}"));
Matcher matcher = pattern.matcher(doc);
// if we'll make multiple replacements we should keep an offset
int offset = 0;
// perform the search
while (matcher.find()) {
// by default, replacement is the same expression
String replacement = matcher.group(0);
String field = matcher.group(1);
String value = matcher.group(2);
String ifTrue = matcher.group(3);
String ifFalse = matcher.group(4);
// verify if field is gender
if (field.equalsIgnoreCase("Gender")) {
replacement = value.equalsIgnoreCase("Female")?ifTrue:ifFalse;
}
// replace the string
doc = doc.substring(0, matcher.start() + offset) + replacement + doc.substring(matcher.end() + offset);
// adjust the offset
offset += replacement.length() - matcher.group(0).length();
}
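An alternative to tracking the offset by hand is Matcher.appendReplacement together with Matcher.appendTail, which build the output while iterating over the matches. The sketch below is not the code above rewritten one to one: it assumes plain, unescaped braces in the text, and it looks the variable up in a Map passed in by the caller (the variables parameter and the class name are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ConditionalReplace {

    // {variable=value#if_true#if_false} with literal braces.
    private static final Pattern EXPR =
        Pattern.compile("\\{([\\w\\s]+)=([\\w\\s]+)#([\\w\\s]+)#([\\w\\s]+)\\}");

    static String render(String doc, Map<String, String> variables) {
        Matcher m = EXPR.matcher(doc);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String actual = variables.get(m.group(1));
            String replacement;
            if (actual == null) {
                replacement = m.group(0);  // unknown variable: keep the expression as is
            } else {
                replacement = actual.equalsIgnoreCase(m.group(2)) ? m.group(3) : m.group(4);
            }
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);  // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> variables = new HashMap<>();
        variables.put("gender", "male");
        System.out.println(render("He took {gender=male#his#her} coat.", variables));
        // prints He took his coat.
    }
}
```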

How can I extract a value using regex in Java?

I need to extract only the number from this text. I currently use substring to extract it, but sometimes the number has fewer digits, so I end up with a wrong value.
example(16656);

Use Pattern to compile your regular expression and Matcher to get a particular captured group. The regex I'm using is:
example\((\d+)\)
which captures the digits (\d+) within the parentheses. So:
Pattern p = Pattern.compile("example\\((\\d+)\\)");
Matcher m = p.matcher(text);
if (m.find()) {
int i = Integer.valueOf(m.group(1));
...
}

Look at the Java regular expression samples here:
http://java.sun.com/developer/technicalArticles/releases/1.4regex/
Focus especially on the find method.

String yourString = "example(16656);";
Pattern pattern = Pattern.compile("\\w+\\((\\d+)\\);");
Matcher matcher = pattern.matcher(yourString);
if (matcher.matches())
{
int value = Integer.parseInt(matcher.group(1));
System.out.println("Your number: " + value);
}
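Note that matches() only succeeds when the pattern covers the entire input, whereas find() also locates the pattern inside a longer string. A small sketch of the find() variant, with the class and method names invented for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberInParens {

    private static final Pattern CALL = Pattern.compile("\\w+\\((\\d+)\\)");

    // Returns the first number found in the form name(digits).
    static int firstNumber(String text) {
        Matcher m = CALL.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("No number in parentheses: " + text);
        }
        // parseInt throws NumberFormatException if the digits do not fit in an int.
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("example(16656);"));           // prints 16656
        System.out.println(firstNumber("value: example(42); // ok")); // prints 42
    }
}
```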

I suggest writing your own logic for this. Using Pattern and Matcher is good practice, but they are general-purpose solutions and may not always be the most efficient choice. cletus provided a very neat solution, but behind the scenes it runs a pattern-matching algorithm to find the digits. I don't think you need pattern matching here; you just need to extract the digits from a string (like 123 from "a1b2c3"). The following code does this cleanly in O(n) without the extra work that the Pattern and Matcher classes perform (just copy, paste and run):
public class DigitExtractor {
/**
* @param args
*/
public static void main(String[] args) {
String sample = "sdhj12jhj345jhh6mk7mkl8mlkmlk9knkn0";
String digits = getDigits(sample);
System.out.println(digits);
}
private static String getDigits(String sample) {
StringBuilder out = new StringBuilder(10);
int stringLength = sample.length();
for(int i = 0; i <stringLength ; i++)
{
char currentChar = sample.charAt(i);
int charDiff = currentChar -'0';
boolean isDigit = ((9-charDiff)>=0&& (9-charDiff <=9));
if(isDigit)
out.append(currentChar);
}
return out.toString();
}
}
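The digit test in the loop can be written as a plain range comparison, and for comparison the same result is available as a regex one-liner. A sketch of both, assuming only the ASCII digits 0 to 9 are wanted (the class and method names are made up for illustration):

```java
final class Digits {

    // Single pass over the string; keeps only the ASCII digits '0'..'9'.
    static String byLoop(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Regex alternative: by default \D matches any character that is not 0-9.
    static String byRegex(String s) {
        return s.replaceAll("\\D", "");
    }

    public static void main(String[] args) {
        System.out.println(byLoop("a1b2c3"));  // prints 123
        System.out.println(byRegex("a1b2c3")); // prints 123
    }
}
```

Which of the two is faster for a given workload is best settled by profiling.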
