Regex with prefix and optional suffix - java

This is maybe the 100+1 question regarding regex optional suffixes on SO, but I didn't find any, that could help me :(
I need to extract a part of string from the common pattern:
prefix/s/o/m/e/t/h/i/n/g/suffix
using a regular expression. The prefix is constant and the suffix may not appear at all, so prefix/(.+)/suffix doesn't meet my requirements. Pattern prefix/(.+)(?:/suffix)? returns s/o/m/e/t/h/i/n/g/suffix. The part (?:/suffix)? must be somehow more greedy.
I want to get s/o/m/e/t/h/i/n/g from these input strings:
prefix/s/o/m/e/t/h/i/n/g/suffix
prefix/s/o/m/e/t/h/i/n/g/
prefix/s/o/m/e/t/h/i/n/g
Thanks in advance!

Try
prefix\/(.+?)\/?(?:suffix|$)
The regex needs to know when the match is done, so match either suffix or the end of the line ($), and make the capture non-greedy.
See it here at regex101.
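A minimal Java sketch of the pattern above, assuming it is applied with Matcher.find() and that group 1 is the part you want. The class and method names (SuffixExtractor, extract) are made up for illustration. In a Java string literal the forward slash needs no escaping, so \/ is written as /.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SuffixExtractor {
    // Lazy capture, then an optional slash, then either the literal "suffix" or end of input.
    private static final Pattern P = Pattern.compile("prefix/(.+?)/?(?:suffix|$)");

    // Returns the captured middle part, or null when the input does not match.
    static String extract(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "prefix/s/o/m/e/t/h/i/n/g/suffix",
            "prefix/s/o/m/e/t/h/i/n/g/",
            "prefix/s/o/m/e/t/h/i/n/g"
        };
        for (String s : inputs) {
            System.out.println(extract(s));
        }
    }
}
```

All three inputs from the question should yield s/o/m/e/t/h/i/n/g.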

Try prefix(.*?)(?:/?(?:suffix|$)) if there are characters allowed before prefix or after suffix.
This requires the match to be as short as possible (reluctant quantifier) and to be followed by one of 3 things: a single slash right before the end of the input, /suffix, or the end of the input. That would match /s/o/m/e/t/h/i/n/g in the test cases you provided but would match more for input like prefix/s/o/m/e/t/h/i/n/g/suff (which is ok IMO since you don't know whether /suff is meant to be part of the match or a typo in the suffix).
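To see how this second pattern behaves, here is a small sketch under the same assumptions (Matcher.find(), group 1 is the result). LooseSuffixExtractor is a hypothetical name. Note that the captured text keeps its leading slash, because the slash after prefix is inside the group here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LooseSuffixExtractor {
    // Reluctant capture right after "prefix"; stops at "/suffix", "suffix",
    // a trailing slash at the end of input, or the end of input.
    private static final Pattern P = Pattern.compile("prefix(.*?)(?:/?(?:suffix|$))");

    static String extract(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("prefix/s/o/m/e/t/h/i/n/g/suffix"));
        System.out.println(extract("prefix/s/o/m/e/t/h/i/n/g/suff"));
        System.out.println(extract("xx prefix/a/suffix yy"));
    }
}
```

The truncated suffix in the second input ends up inside the capture, as described above.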

Related

Regex pattern for the server log

I have the following log message from the server and I am trying to identify regex pattern from the below message.
2015-10-01T03:14:49.000-07:00 lvn-d1-dev DevServer[9876]: INFO: [EVENT][SEQ=248717] 2015:10:01:03:14:49 101 sign-in_id=11111#psop.com ip_address=1.1.1.1 service_id=IP1234-NPB12345_00 result=RESULT_SUCCESconsole_id=0000000138e91b4e58236bf32besdafasdfasdfasdfsadf account_id=11111 platform=pik
I have used the following regex pattern
.+\[SEQ=\w+\]\s*(\d+:[\d\d:]+)\s(\d+)\s*.+\=(.+)
Using the above regex pattern, I am able to isolate the date(2015:10:01:03:14:49) and Id (101) but I am unable to get the email (11111#psop.com) and service id separately.
In my regex pattern string; '\=' is pointing to the last '=' match. Am I missing something here? Kindly help me in identifying the regex pattern.
Regex is by default greedy. That's why .+\= matched the entire remaining string until the last =.
Instead you can use the non-greedy version: .+?\= - note the ?.
The complete version would look like this:
.+\[SEQ=\w+\]\s*(\d+:[\d\d:]+)\s(\d+)\s*(.+?)\=(.+)
In addition, you shouldn't overcomplicate things. As already pointed out in @InternetUnexplorer's answer, you should use the names associated with the required values as anchors to simplify matching. As long as none of the names are repeated, something like
.+\[SEQ=\w+\]\s*(\d+:[\d\d:]+)\s(\d+)\s*sign-in_id\=(.+)
would work.
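The greedy versus non-greedy difference is easy to check in isolation. The sketch below uses a made-up two-field input ("a=1 b=2") rather than the real log line, and a hypothetical helper firstGroup, just to show which equals sign each variant stops at.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {
    // Returns group 1 of the first match, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a=1 b=2"; // illustrative input, not from the log

        // Greedy .+ runs to the end and backtracks to the LAST '='.
        System.out.println(firstGroup(".+=(.+)", input));

        // Lazy .+? stops at the FIRST '='.
        System.out.println(firstGroup(".+?=(.+)", input));
    }
}
```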
At the end of your regex is the problem: .+\=(.+).
+ matches as many characters as possible, only giving back as needed (greedy).
.+ was matching all of the characters it could, up until the point \=(.+) could no longer be satisfied. That is why it was matching the last equals sign.
Instead of just searching for any equals sign, try this:
.*\[SEQ=\d+\] (\d+:[\d:]+) (\d+) sign-in_id=(\S+) .* service_id=(\S+)
The ids are matched by name, which works much better.
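As a rough Java sketch of the name-anchored pattern above: the class name LogLineParser and the parse method are assumptions for illustration, and the sample line is a shortened copy of the one in the question (a few trailing fields dropped).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineParser {
    // Groups: 1 = timestamp, 2 = numeric id, 3 = sign-in id, 4 = service id.
    private static final Pattern P = Pattern.compile(
        ".*\\[SEQ=\\d+\\] (\\d+:[\\d:]+) (\\d+) sign-in_id=(\\S+) .* service_id=(\\S+)");

    // Returns the four captured fields, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String line = "2015-10-01T03:14:49.000-07:00 lvn-d1-dev DevServer[9876]: INFO: "
            + "[EVENT][SEQ=248717] 2015:10:01:03:14:49 101 sign-in_id=11111#psop.com "
            + "ip_address=1.1.1.1 service_id=IP1234-NPB12345_00 platform=pik";
        for (String field : parse(line)) {
            System.out.println(field);
        }
    }
}
```

Because \S+ stops at the first whitespace, each value ends where the next key begins, so no lazy quantifier is needed.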

Java regular expression, disallow forward-slash

I have to create a regular expression that validates whether a string contains a forward-slash. If a forward-slash is present the validation fails. The string has to pass even if it is empty.
This is what I have done so far:
"^[a-zA-Z0-9\\\\ ]*$"
You can use a negative lookahead to assert that there cannot be any forward slashes, that looks like (?!.*\/). From there, make sure you find at least one backward slash, or you find nothing (the end of the line):
^(?!.*\/)((?:.*\\.*)|$)
You can see it matching here. Note that there are two matches in the right hand column, one for the empty line, and one for the line that contains a backward slash.
Edit: If the requirement is only to make sure that the string does not have any forward slashes, then the regex is easier. You just take the negative lookahead from the above regex.
^(?!.*\/).*$
You can see that matching here.
If I understand the comments correctly, then you don't require a backslash for the expression to pass - you just want to make sure there are no forward slashes. In that case, you could simply use ^[^/]*$. You should not need any type of lookahead for this.
To interpret this expression a bit: this expression matches the beginning of the string (^) followed by zero or more non-slash characters ([^/]*), followed by end of string ($). The square brackets usually indicate that you want to match any character inside of them, but in this case, the leading ^ inverts that portion of the expression, so it will match any character that is NOT a slash. The * indicates that we want to attempt that match zero or more times as needed for the pattern to work.
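A short sketch of both variants in Java, assuming single-line input and String-level validation via Matcher.matches(). NoSlashValidator and its method names are hypothetical.

```java
import java.util.regex.Pattern;

class NoSlashValidator {
    // Negated character class: zero or more characters, none of them '/'.
    private static final Pattern NEGATED_CLASS = Pattern.compile("^[^/]*$");

    // Lookahead variant from the earlier answer: fail if a '/' appears anywhere ahead.
    private static final Pattern LOOKAHEAD = Pattern.compile("^(?!.*/).*$");

    static boolean isValid(String input) {
        return NEGATED_CLASS.matcher(input).matches();
    }

    static boolean isValidWithLookahead(String input) {
        return LOOKAHEAD.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid(""));            // empty input passes
        System.out.println(isValid("abc\\def 12")); // backslash is allowed
        System.out.println(isValid("a/b"));         // forward slash is rejected
    }
}
```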

regex with lookbehind weird behavior

I have been trying to resolve this for the past 2 days...
Please help me in understanding why this is happening. My intention is to just select the <HDR> that has a <DTL1 val="92">.....</HDR>
This is my regular expression
(?<=<HDR>).*?<DTL1\sval="3".*?</HDR>
And the input string is:
<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR><HDR><DTL1 val="3"><DTL2 val="4"></HDR>
But this regular expression selects
abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR>
Can anyone please help me?
A regex engine will always give you the leftmost match in a string (even if you use a non-greedy quantifier). This is exactly what you obtain.
So, a solution is to forbid the presence of another <HDR> in the parts described by .*? that is too permissive.
There are two techniques for doing that. You can replace the .*? with:
(?>[^<]+|<(?!/HDR))*
or with:
(?:(?!</HDR).)*+
Most of the time, the first technique performs better, but if your string contains a high density of <, the second way can give good results too.
The use of a possessive quantifier or an atomic group can reduce the number of steps to obtain a result in particular when the subpattern fails.
Example:
With the first way:
(?<=<HDR>)(?>[^<]+|<(?!/HDR))*<DTL1\sval="3"(?>[^<]+|<(?!/HDR))*</HDR>
or this variant:
(?<=<HDR>)(?:[^<]+|<(?!/HDR|DTL1))*+<DTL1\sval="3"(?:[^<]+|<(?!/HDR))*+</HDR>
With the second way:
(?<=<HDR>)(?:(?!</HDR).)*<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
or this variant:
(?<=<HDR>)(?:(?!</HDR|<DTL1).)*+<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
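Here is a small Java sketch comparing the original pattern with the first atomic-group pattern above, run against the input string from the question. HdrSelector and firstMatch are hypothetical names; the only change to the regexes is Java string escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HdrSelector {
    static final String INPUT =
        "<HDR>abc<DTL1 val=\"1\"><DTL2 val=\"2\"></HDR>"
        + "<HDR><DTL1 val=\"92\"><DTL2 val=\"55\"></HDR>"
        + "<HDR><DTL1 val=\"3\"><DTL2 val=\"4\"></HDR>";

    // Original pattern: .*? is free to run across </HDR> boundaries.
    static final String NAIVE = "(?<=<HDR>).*?<DTL1\\sval=\"3\".*?</HDR>";

    // Atomic groups that refuse to step over a closing </HDR.
    static final String BOUNDED =
        "(?<=<HDR>)(?>[^<]+|<(?!/HDR))*<DTL1\\sval=\"3\"(?>[^<]+|<(?!/HDR))*</HDR>";

    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(NAIVE, INPUT));   // starts at "abc", the leftmost candidate
        System.out.println(firstMatch(BOUNDED, INPUT)); // only the block containing val="3"
    }
}
```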
Casimir et Hippolyte already gave you a couple of good solutions. I want to elaborate on a few things.
First, why your regex fails to do what you want: (?<=<HDR>).*? tells it to match any number of characters starting with the first character preceded by <HDR>, until it encounters what follows the non-greedy quantifier (<DTL1...). Well, the first character that's preceded by <HDR> is the first a, so it matches everything starting from there until the fixed string <DTL1\sval="3" is encountered.
Casimir et Hippolyte's solutions are for the generalized case, where the contents of the <HDR> tags can be anything other than nested <HDR>'s. You could also do it with a negative look-ahead:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
However, if the string is guaranteed to be in the structure shown, where the <HDR> tags only contain one or more <DTL1 val="##"> tags, so you know there won't be any closing tags within, you could do it more efficiently by replacing the first .*? with [^/]*:
(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>
A negated character class is more efficient than a zero-width assertion, and if you're using a negated character class, a greedy quantifier becomes more efficient than a lazy one.
Note also that by using a lookbehind to match the opening <HDR>, you're excluding it from the match, but you're including the closing </HDR>. Are you sure that's what you want? You're matching this...
<DTL1 val="3"><DTL2 val="4"></HDR>
...when presumably you want this...
<HDR><DTL1 val="3"><DTL2 val="4"></HDR>
...or this...
<DTL1 val="3"><DTL2 val="4">
So, in the first case, don't use a lookbehind for the opening tag:
<HDR>(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
<HDR>[^/]*<DTL1\sval="3".*?</HDR>
In the second case, use a look-ahead for the closing tag:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?(?=</HDR>)
(?<=<HDR>)[^/]*<DTL1\sval="3".*?(?=</HDR>)
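To make the difference between the two cases concrete, this sketch runs the two [^/]* variants against the question's input. It assumes, as stated above, that no closing tags other than </HDR> occur in the data. HdrMatchBoundaries and its methods are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HdrMatchBoundaries {
    static final String INPUT =
        "<HDR>abc<DTL1 val=\"1\"><DTL2 val=\"2\"></HDR>"
        + "<HDR><DTL1 val=\"92\"><DTL2 val=\"55\"></HDR>"
        + "<HDR><DTL1 val=\"3\"><DTL2 val=\"4\"></HDR>";

    static String firstMatch(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? m.group() : null;
    }

    // First case: both tags are consumed, so both are part of the match.
    static String wholeElement() {
        return firstMatch("<HDR>[^/]*<DTL1\\sval=\"3\".*?</HDR>");
    }

    // Second case: lookbehind and lookahead assert the tags without consuming them.
    static String contentOnly() {
        return firstMatch("(?<=<HDR>)[^/]*<DTL1\\sval=\"3\".*?(?=</HDR>)");
    }

    public static void main(String[] args) {
        System.out.println(wholeElement());
        System.out.println(contentOnly());
    }
}
```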

validating input string "RX-EZ12345678912345B" using regex

I need to validate input string which should be in the below format:
<2_upper_case_letters><"-"><2_upper_case_letters><14-digit number><1_uppercase_letter>
Ex: RX-EZ12345678912345B
I tried something like this ^[IN]-?[A-Z]{0,2}?\\d{0,14}[A-Z]{0,1} but it's not giving the expected result.
Any help will be appreciated.
Thanks
Your biggest problem is the [IN] at the beginning, which matches only one letter, and only if it's I or N. If you want to match two of any letters, use [A-Z]{2}.
Once you fix that, your regex will still only match RX-E. That's because [A-Z]{0,2}? starts out trying to consume nothing, thanks to the reluctant quantifier, {0,2}?. Then \d{0,14} matches zero digits, and [A-Z]{0,1} greedily consumes the E.
If you want to match exactly 2 letters and 14 digits, use [A-Z]{2} and \d{14}. And since you're validating the string, you should end the regex with the end anchor, $. Result:
^[A-Z]{2}-[A-Z]{2}\d{14}[A-Z]$
...or, as a Java string literal:
"^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$"
As @nhahtdh observed, you don't really have to use the anchors if you're using Java's matches() method to apply the regex, but I recommend doing so anyway. It communicates your intent better, and it makes the regex portable, in case you have to use it in a different flavor/context.
EDIT: If the first two characters should be exactly IN, it would be
^IN-[A-Z]{2}\d{14}[A-Z]$
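A quick sketch that checks both claims above: the half-fixed regex (with [A-Z]{2} in place of [IN], but the loose quantifiers kept) stops after RX-E, while the strict version accepts the sample and rejects malformed input. CodeValidator and its method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodeValidator {
    // Strict format: 2 letters, '-', 2 letters, 14 digits, 1 letter.
    private static final Pattern STRICT =
        Pattern.compile("^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$");

    // Half-fixed version of the asker's regex, kept only to show what it matches.
    private static final Pattern LOOSE =
        Pattern.compile("^[A-Z]{2}-?[A-Z]{0,2}?\\d{0,14}[A-Z]{0,1}");

    static boolean isValid(String input) {
        return STRICT.matcher(input).matches();
    }

    // Returns the text the loose pattern actually consumes, or null if it finds nothing.
    static String looseMatch(String input) {
        Matcher m = LOOSE.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(isValid("RX-EZ12345678912345B"));
        System.out.println(isValid("RX-EZ1234B"));
        System.out.println(looseMatch("RX-EZ12345678912345B"));
    }
}
```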
Simply translating your requirements into a java regex:
"^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$"
This will allow you to use:
if (!input.matches("^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$")) {
    // do something because input is invalid
}
Not sure what you are trying to do at the beginning of your current regex.
"^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$"
The regex above will strictly match the input string as you specified. If you use the matches function, ^ and $ may be omitted.
Since you want an exact number of repetitions, you should specify it as {<number>} only. {<number>,<number>} is used for a variable number of repetitions. And ? specifies that the token before it may or may not appear - if it must be there, then specifying ? is incorrect.
^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$
This should solve your purpose. You can confirm it from here
This should solve your problem. Check out the validity here
^[A-Z]{2}-[A-Z]{2}[0-9]{14}[A-Z]$
^([A-Z]{2,2}[-]{1,1}[A-Z]{2,2}[0-9]{14,14}[A-Z]{1,1}){1,1}$

How do I make this regex more general, sometimes it works and sometimes it doesn't

I have the following regex that I am using in a java application. Sometimes it works correctly and sometimes it doesn't.
<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->
Sometimes I will have whitespace before/after it, sometimes there will be text. The same goes for the region within the tags.
The main problem is that name=(\".*\")?> sometimes matches more than it is supposed to. I am not sure if that is something that is obvious to solve, simply looking at this code.
XML is not a regular language, nor is HTML or any other language with "nesting" constructs. Don't try to parse it with regular expressions.
Choose an XML parser.
As others have pointed out, the greedy .* (dot-star) that matches the "name" attribute needs to be made non-greedy (.*?) or even better, replaced with a negated character class ([^"]*) so it can't match beyond the closing quotation mark no matter what happens in the rest of the regex. Once you've fixed that, you'll probably find you have the same problem with the other dot-star; you need to make it non-greedy too.
Pattern p = Pattern.compile(
"<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->" +
"(.*?)" +
"<!--\\s*</editable>\\s*-->",
Pattern.DOTALL);
I don't get the significance of your remarks about whitespace. If it's linefeeds and/or carriage returns you're talking about, the DOTALL modifier lets the dot match those--and of course, \s matches them as well.
I wrote this in the form of a Java string literal to avoid confusion about where you need backslashes and how many of them you need. In a "raw" regex, there would be only one backslash in each of the whitespace shorthands (\s*), and the quotation marks wouldn't need to be escaped ("[^"]*").
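For completeness, here is one way the compiled pattern above might be used to pull out every region. This is a sketch only: EditableRegionFinder, findRegions and the sample markup are made up for illustration, and it assumes region names are unique and regions are not nested.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EditableRegionFinder {
    private static final Pattern P = Pattern.compile(
        "<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->" +
        "(.*?)" +
        "<!--\\s*</editable>\\s*-->",
        Pattern.DOTALL);

    // Maps each region name (group 1) to its body (group 2), in document order.
    static Map<String, String> findRegions(String text) {
        Map<String, String> regions = new LinkedHashMap<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            regions.put(m.group(1), m.group(2));
        }
        return regions;
    }

    public static void main(String[] args) {
        String text =
            "<!-- <editable name=\"title\"> -->Hello<!-- </editable> -->\n"
            + "<!-- <editable name=\"body\"> -->line 1\nline 2<!-- </editable> -->";
        findRegions(text).forEach((name, body) ->
            System.out.println(name + " -> " + body));
    }
}
```

With the reluctant (.*?), each region ends at its own closing comment instead of running on to the last one in the text.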
I would replace that .* with [\w-]* for example if name is an identifier of some sort.
Or [^\"]* so it doesn't capture the end double quote.
Edit:
As mentioned in another post, you might consider going for a simple DOM traversal, XPath or XQuery based evaluation process instead of a plain regular expression. But note that you will still need a regex in the filtering process, because you can find the target comments only by testing their body against a regular expression (as I doubt the body is constant, judging from the sample).
Edit 2:
It might be that the leading, trailing or internal whitespaces of the comment body makes your regexp fail. Consider putting \s* in the beginning and at the end, plus \s+ before the attribute-like thing.
<!--\s*<editable\s+name=(\"[^\"]*\")?>\s*-->(.*)<!--\s*</editable>\s*-->
Or when you are filtering on XML based search:
"\\s*<editable\\s+name=(\"[^\"]*\")?>\\s*"
"\\s*</editable>\\s*"
Edit 3: Fixed the escapes twice. Thanks Alan M.
The * quantifier is "greedy" by default, meaning it matches as much as possible while still letting the overall pattern match.
You can disable this by using *?, so try:
(\".*?\")
