Extract specific data from string with regex - java

I want to capture multiple substrings that match specific patterns.
For example, my string looks like this:
String textData = "#1_Label for UK#2_Label for US#4_Label for FR#";
I want to get the text between two # characters that contains a given search string, such as UK.
The output should look like this:
If the search string is UK, then
the output should be 1_Label for UK
If the search string is label, then
the output should be 1_Label for UK, 2_Label for US and 4_Label for FR
If the search string is 1_, then
the output should be 1_Label for UK
I don't want to extract the data via an ArrayList, and the matching should be case insensitive.
Can you please help me with this problem?
Regards,
Ashish Mishra

You can use this regex for search:
#([^#]*?Label[^#]*)(?=#)
Replace Label with your search keyword.
Java Pattern:
Pattern p = Pattern.compile( "#([^#]*?" + Pattern.quote(keyword) + "[^#]*)(?=#)" );
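A minimal sketch of how that pattern could be used with Matcher.find(). The class and method names are hypothetical, and Pattern.CASE_INSENSITIVE is added here on the assumption that the case-insensitivity requirement from the question applies. The matches are joined into one string so the caller never handles a list.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashSegmentSearch {
    // Returns every #-delimited segment containing the keyword, joined by ", ".
    static String findSegments(String text, String keyword) {
        Pattern p = Pattern.compile(
                "#([^#]*?" + Pattern.quote(keyword) + "[^#]*)(?=#)",
                Pattern.CASE_INSENSITIVE); // assumption: flag added for the question's requirement
        Matcher m = p.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            if (out.length() > 0) {
                out.append(", ");
            }
            out.append(m.group(1)); // group 1 excludes the leading #
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String textData = "#1_Label for UK#2_Label for US#4_Label for FR#";
        System.out.println(findSegments(textData, "UK"));    // 1_Label for UK
        System.out.println(findSegments(textData, "label")); // 1_Label for UK, 2_Label for US, 4_Label for FR
        System.out.println(findSegments(textData, "1_"));    // 1_Label for UK
    }
}
```

Because the trailing # is only checked by the lookahead and not consumed, the next find() can start at that same #, so adjacent segments are not skipped.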

If the data always is between two hashes, try a regex like this: (?i)#.*your_match.*# where your_match would be UK, label, 1_ etc.
Then use this expression in conjunction with the Pattern and Matcher classes.
If you want to match multiple strings, you'd need to exclude the hashes from the match by using look-around methods as well as reluctant modifiers, e.g. (?i)(?<=#).*?label.*?(?=#).
Short breakdown:
(?i) will make the expression case insensitive
(?<=#) is a positive look-behind, i.e. the match must be preceded by a hash (but doesn't include the hash)
.*? matches any sequence of characters but is reluctant, i.e. it tries to match as few characters as possible
(?=#) is a positive look-ahead, which means the match must be followed by a hash (also not included in the match)
Without the look-around methods the hashes would be included in the match and thus using Matcher.find() you'd skip every other label in your test string, i.e. you'd get the matches #1_Label for UK# and #4_Label for FR# but not #2_Label for US#.
Without the reluctant modifiers the expression would match everything between the first and the last hash.
A better alternative is to replace .*? with [^#]*, which means that the match cannot contain any hash. This removes the need for reluctant modifiers and also fixes the problem that looking for US would match 1_Label for UK#2_Label for US.
So most probably the final regex you're after looks like this: (?i)(?<=#)[^#]*your_match[^#]*(?=#).
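The differences between these variants can be checked with a small sketch like the one below. The class name and the findAll helper are hypothetical; the regexes are the ones discussed in this answer, with label or us filled in as the search term.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundComparison {
    // Joins every match of the regex in the text with " | ".
    static String findAll(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            if (out.length() > 0) {
                out.append(" | ");
            }
            out.append(m.group());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "#1_Label for UK#2_Label for US#4_Label for FR#";
        // Hashes consumed by the match: the middle label is skipped.
        System.out.println(findAll("(?i)#.*?label.*?#", text));
        // Look-arounds with .*?: a search for "us" leaks across a hash.
        System.out.println(findAll("(?i)(?<=#).*?us.*?(?=#)", text));
        // Look-arounds with [^#]*: each match stays inside one segment.
        System.out.println(findAll("(?i)(?<=#)[^#]*us[^#]*(?=#)", text));
        System.out.println(findAll("(?i)(?<=#)[^#]*label[^#]*(?=#)", text));
    }
}
```

If the search term comes from user input, wrapping it in Pattern.quote() before concatenating it into the regex keeps characters such as . or ( from being interpreted as regex syntax.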

([^#]*UK[^#]*) for UK
([^#]*Label[^#]*) for Label
([^#]*1_[^#]*) for 1_
Try this. Grab the captures. See the demos.
http://regex101.com/r/kQ0zR5/3
http://regex101.com/r/kQ0zR5/4
http://regex101.com/r/kQ0zR5/5

I have solved this problem with the pattern below:
(?i)([^#]*?us[^#]*)(?=#)
Thank you so much Anubhava, VKS and Thomas for your replies.
Regards,
Ashish Mishra

Related

Regex to match user and user#domain

A user can log in as "user" or as "user#domain". I only want to extract "user" in both cases. I am looking for a matcher expression that fits, but I'm struggling.
final Pattern userIdPattern = Pattern.compile("(.*)[#]{0,1}.*");
final Matcher fieldMatcher = userIdPattern.matcher("user#test");
fieldMatcher.matches(); // a match must be attempted before group(1) can be read
final String userId = fieldMatcher.group(1);
userId returns "user#test". I tried various expressions but it seems that nothing fits my requirement :-(
Any ideas?
If you use "(.*)[#]{0,1}.*" pattern with .matches(), the (.*) grabs the whole line first, then, when the regex index is still at the end of the line, the [#]{0,1} pattern triggers and matches at the end of the line because it can match 0 # chars, and then .* again matches at that very location as it matches any 0+ chars. Thus, the whole line lands in your Group 1.
You may use
String userId = s.replaceFirst("^([^#]+).*", "$1");
Details
^ - start of string
([^#]+) - Group 1 (referred to with $1 from the replacement pattern): any 1+ chars other than #
.* - the rest of the string.
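As a rough illustration of both the problem and the fix, the sketch below runs the pattern from the question next to the replaceFirst call. The class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UserIdExtractor {
    // The pattern from the question: the greedy (.*) swallows the whole input.
    static String greedyGroup(String login) {
        Matcher m = Pattern.compile("(.*)[#]{0,1}.*").matcher(login);
        return m.matches() ? m.group(1) : null;
    }

    // The suggested fix: keep only the leading run of non-# characters.
    static String userPart(String login) {
        return login.replaceFirst("^([^#]+).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(greedyGroup("user#test")); // user#test
        System.out.println(userPart("user#test"));    // user
        System.out.println(userPart("user"));         // user
    }
}
```

Note that an input starting with # has no leading non-# characters, so the regex does not match and replaceFirst returns the input unchanged.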
A little bit of googling came up with this:
(.*?)(?=#|$)
This will match everything before an optional #.
I would suggest keeping it simple and not relying on regex here if you are using Java and your case is as simple as the one you described.
You could simply do something like this:
String userId = "user#test";
int hashIndex = userId.indexOf("#");
if (hashIndex != -1) {
    userId = userId.substring(0, hashIndex);
}
// from here on userId will be "user".
This will always either strip out the "#test" or just skip stripping it out when it is not there.
Using regex in most cases makes the code less maintainable by another dev in the future because most devs are not very good with regular expressions, at least in my experience.
You made the # optional, so the greedy (.*) grabs the longest possible user name. Since nothing in the pattern says that a username may not contain a #, it matched the whole string.
Just use:
[^#]*
as the matching subexpr for usernames (and use $0 to get the matched string)
Or you can use this one that can be used to find several matches (and to get both the user part and the domain part):
\b([^#\s]*)(#[^#\s]*)?\b
The \b anchors tie the match to word boundaries. The first group matches non-space, non-# characters (any number of them; it is better to use + instead of * there, as usernames must have at least one character), optionally followed by a # and another run of non-space, non-# characters. In this case, $0 matches the whole address, $1 matches the username part, and $2 the #domain part. You can refine this to capture only the domain part by adding another pair of parentheses, as in:
\b([^#\s]*)(#([^#\s]*))?\b
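A small sketch of reading the groups from that three-group pattern in Java. The class and method names are hypothetical, and the + variant recommended in the text is used for the user part.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LoginParts {
    // Group 1: user part, group 2: "#domain", group 3: bare domain.
    private static final Pattern LOGIN =
            Pattern.compile("\\b([^#\\s]+)(#([^#\\s]*))?\\b");

    // Returns the user part of the first login found, or null if none.
    static String user(String login) {
        Matcher m = LOGIN.matcher(login);
        return m.find() ? m.group(1) : null;
    }

    // Returns the domain without the #, or null when no domain is present.
    static String domain(String login) {
        Matcher m = LOGIN.matcher(login);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        System.out.println(user("user#test"));   // user
        System.out.println(domain("user#test")); // test
        System.out.println(domain("user"));      // null
    }
}
```

An optional group that did not take part in the match returns null from group(), so callers need a null check before using the domain.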

Java Regular expression validation

I want to validate a string in Java that may contain only alphanumeric characters,
a single dot character, and underscore characters.
String fileName = (String) request.getParameter("read");
I need to validate the fileName retrieved from the request, and it should
satisfy the above criteria.
I tried "^[a-zA-Z0-9_'.']*$", but this allows more than one dot character.
I need to validate my string against the following rules:
1. The file name contains only alphanumeric characters.
2. It allows only one dot character (.), for example: fileRead.pdf,
fileWrite.txt, etc.
3. It allows underscore characters. All other symbols should be
rejected.
Can anyone help me with this?
You can use the String.matches() method:
System.out.println("My_File_Name.txt".matches("\\w+\\.\\w+"));
You can also use the java.util.regex package:
java.util.regex.Pattern pattern =
java.util.regex.Pattern.compile("\\w+\\.\\w+");
java.util.regex.Matcher matcher = pattern.matcher("My_File_Name.txt");
System.out.println(matcher.matches());
For more information about regular expressions in Java, see this page:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
You could use two negative lookaheads here:
^((?!.*\..*\.)(?!.*_.*_)[A-Za-z0-9_.])*$
Each lookahead asserts that either a dot or an underscore does not occur two times, implying that it can occur at most once.
It wasn't completely clear whether you require one dot and/or underscore. I assumed not, but my regex could be easily modified to this requirement.
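To see how the lookahead pattern behaves on a few file names, here is a short sketch. The class and method names are hypothetical. The second pattern is a variation that is not part of this answer: it limits only the dot and leaves underscores unrestricted, on the assumption that names such as My_File_Name.txt should pass.

```java
import java.util.regex.Pattern;

class FileNameValidator {
    // The pattern from this answer: at most one dot and at most one underscore.
    private static final Pattern ONE_DOT_ONE_UNDERSCORE =
            Pattern.compile("^((?!.*\\..*\\.)(?!.*_.*_)[A-Za-z0-9_.])*$");

    // Assumed variation: at most one dot, any number of underscores, non-empty.
    private static final Pattern ONE_DOT_ONLY =
            Pattern.compile("^(?!.*\\..*\\.)[A-Za-z0-9_.]+$");

    static boolean strict(String name) {
        return ONE_DOT_ONE_UNDERSCORE.matcher(name).matches();
    }

    static boolean relaxed(String name) {
        return ONE_DOT_ONLY.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(strict("fileRead.pdf"));      // true
        System.out.println(strict("a.b.c"));             // false
        System.out.println(strict("My_File_Name.txt"));  // false
        System.out.println(relaxed("My_File_Name.txt")); // true
    }
}
```

The strict pattern also accepts the empty string because of the * quantifier; replacing * with + is one way to reject it.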
You can first count the special characters that have an occurrence limit.
Here is the code, using StringUtils.countOccurrencesOf from Spring:
int occurrences = StringUtils.countOccurrencesOf("123123..32131.3", ".");
or, using StringUtils.countMatches from Apache Commons Lang:
int count = StringUtils.countMatches("123123..32131.3", ".");
If the count does not meet your requirement, you can reject the string before the regex check.
If it does, you can then run the alphanumeric check on the string.
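The same count-then-check idea can be sketched with the JDK alone, assuming no external library is wanted. The class and method names are hypothetical, and the character class used for the second step is an assumption based on the rules in the question.

```java
class DotCountCheck {
    // Step 1: reject names with more than one dot.
    // Step 2: allow only letters, digits, underscores and dots.
    static boolean isValid(String name) {
        long dots = name.chars().filter(c -> c == '.').count();
        if (dots > 1) {
            return false;
        }
        return name.matches("[A-Za-z0-9_.]+");
    }

    public static void main(String[] args) {
        System.out.println(isValid("fileRead.pdf"));    // true
        System.out.println(isValid("123123..32131.3")); // false
        System.out.println(isValid("bad$name.txt"));    // false
    }
}
```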

Regex: Match group if present otherwise ignore and proceed with other matches

I have been trying to match a regex pattern within the following data:
String:
TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error
Words to match:
TestData
267467374736437-TestInfo
Regex pattern I am using:
(.+?\s)?.*(\s\d+-.*?\s)?
The scenario here is that the 2nd match (267467374736437-TestInfo) can be absent from the string to be matched. So I want it to be matched if it exists; otherwise the regex should proceed with the other matches. Because of this, I added the zero-or-one quantifier ? to the group in the pattern above. But then it ignores the 2nd group altogether.
If I use the pattern below:
(.+?\s)?.*(\s\d+-.*?\s)
it matches just fine, but it fails if the string "267467374736437-TestInfo" is missing from the input, because the group no longer has the ? quantifier.
Please help me understand where it is going wrong.
I would rather not use a complex regex, which will be ugly and a maintenance nightmare. Instead, one simple way would be to just split the string and grab the first term, and then use a smart regex to pinpoint the second term.
String input = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error";
String first = input.split(" ")[0];
String second = input.replaceAll(".*Save Error:\\s(\\S+).*", "$1");
An optional pattern at the end will almost never be matched if a more generic pattern precedes it. In your case, the greedy dot .* grabs the whole rest of the line up to the end, and since the last pattern is optional, the regex engine calls it a day and does not try to accommodate any text for it.
If you had a lazy dot .*?, the only position where it would work is right after the preceding subpattern, which is rarely the case.
Thus, you can only rely on a tempered greedy token:
^(\S+)(?:(?!\d+-\S).)*(\d+-\S+)?
Or an unrolled version:
^(\S+)\D*(?:\d(?!\d*-\S)\D*)*(\d+-\S+)?
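A minimal sketch of applying the tempered greedy token pattern with Matcher.find(). The class and method names are hypothetical. The second input in main is a made-up variant of the question's string with the optional term removed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalTrailingGroup {
    private static final Pattern P =
            Pattern.compile("^(\\S+)(?:(?!\\d+-\\S).)*(\\d+-\\S+)?");

    // Returns {first word, optional "digits-text" term}; the second entry is
    // null when the term is absent. Returns null if nothing matches at all.
    static String[] extract(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String with = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 "
                + "Save Error: 267467374736437-TestInfo send Error";
        String without = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 "
                + "Save Error: send Error";
        System.out.println(extract(with)[1]);    // 267467374736437-TestInfo
        System.out.println(extract(without)[1]); // null
    }
}
```

The tempered token stops right before the first position where digits followed by - and a non-space character begin, which is what gives the optional group a chance to match.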

Java String Regex replacement

Sample Input:
a:b
a.in:b
asds.sdsd:b
a:b___a.sds:bc___ab:bd
Sample Output:
a:replaced
a.in:replaced
asds.sdsd:replaced
a:replaced___a.sds:replaced___ab:replaced
The string that comes after : should be replaced using a custom function.
I have done this without regex. I feel it could be done with a regex, as we are trying to extract a string from a specific pattern.
For the first three cases, it's simple enough to extract the string after :, but I couldn't find a way to deal with the fourth case unless I split the string on ___, apply the approach for the first type of pattern to each part, and concatenate them again.
Replace only the letters that directly follow : with the string replaced.
string.replaceAll("(?<=:)[A-Za-z]+", "replaced");
or
If you also want to deal with digits, then add \d inside the char class.
string.replaceAll("(?<=:)[A-Za-z\\d]+", "replaced");
(:)[a-zA-Z]+
You can simply do this with string.replaceAll. Replace by $1replaced. See demo.
https://regex101.com/r/fX3oF6/18
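Since the question mentions a custom function, here is one possible sketch that feeds each value after : through a function instead of a fixed string. It uses the look-behind pattern with digits from the earlier answer. The class name, the method name and the Function parameter are assumptions made for this example.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ValueReplacer {
    private static final Pattern AFTER_COLON = Pattern.compile("(?<=:)[A-Za-z\\d]+");

    // Replaces each run of letters/digits that follows a ':' with fn(value).
    static String replaceValues(String input, Function<String, String> fn) {
        Matcher m = AFTER_COLON.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps $ and \ in the result from being
            // interpreted as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(fn.apply(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "a:b___a.sds:bc___ab:bd";
        System.out.println(replaceValues(input, v -> "replaced"));
        System.out.println(replaceValues(input, String::toUpperCase));
    }
}
```

The appendReplacement/appendTail loop is used here because it works on older Java versions too; it handles the ___-separated case without splitting the string first.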

How to use two types of regex in single regex?

I have a string field. I need to pass either a UUID string or a string of digits to that field.
So I want to validate the passed value using a regex.
sample :
stringField = "1af6e22e-1d7e-4dab-a31c-38e0b88de807";
stringField = "123654";
For UUID I can use,
"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
For digits I can use
"\\d+"
Is there any way to use the above two patterns in a single regex?
Yes, you can use | (OR) between those two regexes:
[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}|\\d+
try:
"(?:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})|(?:\\d+)"
You can group regular expressions with () and use | to allow alternatives.
So this will work:
(([0-9a-fA-F]){8}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){12})|(\\d+)
Note that I've adjusted your UUID regular expression a little to allow for upper case letters.
How are you applying the regex? If you use matches(), all you have to do is OR them together, as @Anirudh said:
return myString.matches(
"[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}|\\d+");
This works because matches() acts as if the regex were enclosed in a non-capturing group and anchored at both ends, like so:
"^(?:[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}|\\d+)$"
If you use Matcher's find() method, you have to add the group and the anchors yourself. That's because find() returns a positive result if any substring of the string matches the regex. For example, "xyz123<>&&" would match because the "123" matches the "\\d+" in your regex.
But I recommend you add the explicit group and anchors anyway, no matter what method you use. In fact, you probably want to add the inline modifier for case-insensitivity:
"(?i)^(?:[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}|\\d+)$"
This way, anyone who looks at the regex will be able to tell exactly what it's meant to do. They won't have to notice that you're using the matches() method and remember that matches() automatically anchors the match. (This will be especially helpful for people who learned regexes in a non-Java context. Almost every other regex flavor in the world uses the find() semantics by default, and has no equivalent for Java's matches(); that's what anchors are for.)
In case you're wondering, the group is necessary because alternation (the | operator) has the lowest precedence of all the regex constructs. Without the group, the regex below would match a string that starts with something that looks like a UUID or ends with one or more digits.
"^[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}|\\d+$" // WRONG

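The effect of the group can be checked with a sketch like the following, which runs the grouped and ungrouped regexes through find(). The class, constant and method names are hypothetical, and the -extra suffix is a made-up test input.

```java
import java.util.regex.Pattern;

class UuidOrDigits {
    private static final String UUID =
            "[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}";

    // Anchors apply to the whole alternation.
    private static final Pattern GROUPED =
            Pattern.compile("(?i)^(?:" + UUID + "|\\d+)$");

    // WRONG: ^ binds only to the UUID branch and $ only to the digits branch.
    private static final Pattern UNGROUPED =
            Pattern.compile("(?i)^" + UUID + "|\\d+$");

    static boolean isValid(String s) {
        return GROUPED.matcher(s).find();
    }

    static boolean looseMatch(String s) {
        return UNGROUPED.matcher(s).find();
    }

    public static void main(String[] args) {
        String uuid = "1af6e22e-1d7e-4dab-a31c-38e0b88de807";
        System.out.println(isValid(uuid));               // true
        System.out.println(isValid("123654"));           // true
        System.out.println(isValid("xyz123"));           // false
        System.out.println(looseMatch("xyz123"));        // true
        System.out.println(looseMatch(uuid + "-extra")); // true
    }
}
```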